Data Parsing
Data Parsing in Epsilla involves selecting a method to process and extract content from different types of data sources, such as PDFs, CSVs, or JSONL files.
The Data Parsing option allows users to choose from various parsing modes, including Auto, which automatically detects the data format, or manual modes like PDF, CSV, and JSONL, depending on the source type. Additionally, advanced parsing options are available for processing tables and charts.
In most cases, the default Auto mode is sufficient. It detects the format of each data file and selects the matching file loader. If your data source contains multiple file types (such as PDF, DOC, TXT, JSON, CSV, or HTML), Auto is your best choice.
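To illustrate the idea behind Auto mode, here is a minimal sketch of format detection by file extension. It is not Epsilla's implementation: the `LOADERS` table, the loader names, and the `pick_loader` and `group_by_loader` helpers are all hypothetical, and a real detector may also inspect file contents.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a loader name.
# Epsilla's actual loaders and detection logic may differ.
LOADERS = {
    ".pdf": "pdf",
    ".csv": "csv",
    ".jsonl": "jsonl",
    ".json": "json",
    ".txt": "text",
    ".doc": "doc",
    ".html": "html",
}


def pick_loader(filename, default="text"):
    """Return the loader name for a file, based on its extension."""
    suffix = Path(filename).suffix.lower()
    return LOADERS.get(suffix, default)


def group_by_loader(filenames):
    """Group file names by the loader that would handle them."""
    groups = {}
    for name in filenames:
        groups.setdefault(pick_loader(name), []).append(name)
    return groups
```

With a mixed upload such as `["a.csv", "b.pdf", "c.csv"]`, `group_by_loader` would route the two CSV files to one loader and the PDF to another, which is why a single Auto setting can cover a knowledge base with several file types.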
If you only have one type of file in your knowledge base, you can optionally use PDF, CSV, or JSONL as your parsing option.
When you use CSV or JSONL as the parsing option, Epsilla automatically detects the schema of the uploaded files and creates an additional metadata field for each column in a CSV file or each object field in a JSON file, so that every data attribute is available in the knowledge base. You can also define custom semantic indices on these fields to support search and retrieval tailored to your needs.
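The schema detection described above can be pictured with a short sketch. This is an illustration only, not Epsilla's code: the `csv_fields` and `jsonl_fields` helpers are hypothetical, and the sketch assumes that the CSV has a header row and that each JSONL line holds one JSON object.

```python
import csv
import io
import json


def csv_fields(text):
    """Return the column names from the header row of CSV text."""
    reader = csv.reader(io.StringIO(text))
    return next(reader, [])


def jsonl_fields(text):
    """Map each top-level field seen in JSONL records to a type name."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)  # assumed to be a JSON object
        for key, value in record.items():
            # Keep the first type observed for each field.
            fields.setdefault(key, type(value).__name__)
    return fields
```

Each name returned by these helpers corresponds to one candidate metadata field. Fields that appear in only some JSONL records are still collected, since the sketch takes the union of keys across all lines.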
The advanced parsing option is currently available only for the Enterprise tier. It provides superior data extraction accuracy using a Large Vision Language Model (VLM) technique provided by CambioML. At present, it supports only PDF files, from which it can accurately extract text, nested tables, and charts in any layout. Read more in our white paper.
Talk to us if you want to try this technology without an Enterprise tier subscription. We'd love to enable it for you and support your use case!