Epsilla

Data Chunking


Last updated 7 months ago

Data Chunking in Epsilla means choosing how to divide data into smaller, manageable segments for efficient processing and retrieval. Because Large Language Models have a limited context window, the entire knowledge base cannot be passed to the LLM during generation. Instead, the data is chunked into smaller pieces, a semantic index is built on top of them (via embedding), and the most relevant pieces are retrieved for each LLM generation.

The Data Chunking option lets users choose among several chunking modes: Split by Sentences, Markdown Splitter, Recursive Splitting, and Smart Chunking.
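
To make the chunk, index, and retrieve flow concrete, here is a minimal Python sketch. It is not Epsilla's implementation: `embed` is a toy word-count stand-in for a real embedding model, and `embed`, `cosine`, and `retrieve` are hypothetical names chosen for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(chunks: list[str], query: str, top_k: int = 1) -> list[str]:
    """Index every chunk, then return the top_k chunks most similar to the query."""
    query_vector = embed(query)
    index = [(chunk, embed(chunk)) for chunk in chunks]
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


chunks = [
    "Invoices are sent monthly by email.",
    "Passwords can be reset from the login page.",
    "Support is available on weekdays.",
]
print(retrieve(chunks, "how do I reset my password"))
```

Only the retrieved chunks, rather than the whole knowledge base, would then be placed in the LLM prompt.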

Split by Sentences

The "Split by Sentences" option breaks the text into chunks that respect natural sentence boundaries, so that each chunk contains several complete sentences rather than sentences cut off in the middle.

The chunk size controls the maximum number of characters in a chunk, while the chunk overlap determines how many characters will overlap between chunks.
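
As an illustration of how chunk size and chunk overlap can interact with sentence boundaries, here is a minimal Python sketch. It is an assumption about one reasonable approach, not Epsilla's actual implementation; `split_by_sentences` is a hypothetical helper, and the regular expression is a simplistic sentence detector.

```python
import re


def split_by_sentences(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Pack whole sentences into chunks of at most chunk_size characters.

    Trailing sentences of the previous chunk are repeated at the start of the
    next one as long as they fit within chunk_overlap characters.
    A single sentence longer than chunk_size is kept whole (not cut).
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        if current and len(" ".join(current + [sentence])) > chunk_size:
            chunks.append(" ".join(current))
            # Collect trailing sentences that fit in the overlap budget.
            tail: list[str] = []
            used = 0
            for previous in reversed(current):
                if used + len(previous) > chunk_overlap:
                    break
                tail.insert(0, previous)
                used += len(previous) + 1
            # Drop overlap sentences that would push the new chunk over the limit.
            while tail and len(" ".join(tail + [sentence])) > chunk_size:
                tail.pop(0)
            current = tail
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "Alpha is first. Beta is second. Gamma is third. Delta is fourth."
for chunk in split_by_sentences(text, chunk_size=35, chunk_overlap=16):
    print(chunk)
```

With these settings each chunk holds two sentences, and the last sentence of one chunk reappears as the first sentence of the next.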

Markdown Splitter

The "Markdown Splitter" breaks Markdown documents into smaller chunks based on structural elements such as headings and paragraphs. This keeps the content organized for processing and makes it suitable for indexing, querying, and further transformations.

The chunk size controls the maximum number of characters in a chunk, while the chunk overlap determines how many characters will overlap between chunks.
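
The idea of splitting on headings first and paragraphs second can be sketched as follows. This is a simplified illustration under stated assumptions, not Epsilla's implementation: `split_markdown` is a hypothetical helper, chunk overlap is omitted for brevity, and `#` lines inside fenced code blocks are not treated specially.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+\S")


def split_markdown(markdown: str, chunk_size: int) -> list[str]:
    """Split on headings; split oversized sections further on blank lines."""
    # Step 1: every heading line starts a new section.
    sections: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        if HEADING.match(line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())

    # Step 2: keep small sections whole, pack paragraphs of large ones.
    chunks: list[str] = []
    for section in sections:
        if not section:
            continue
        if len(section) <= chunk_size:
            chunks.append(section)
            continue
        buffer = ""
        for paragraph in re.split(r"\n\s*\n", section):
            candidate = f"{buffer}\n\n{paragraph}" if buffer else paragraph
            if buffer and len(candidate) > chunk_size:
                chunks.append(buffer)
                buffer = paragraph
            else:
                buffer = candidate
        if buffer:
            chunks.append(buffer)
    return chunks


doc = "# Title\n\nIntro paragraph.\n\n## Section A\n\nFirst paragraph of A.\n\nSecond paragraph of A.\n"
for chunk in split_markdown(doc, chunk_size=40):
    print(repr(chunk))
```

In this sketch a single paragraph longer than the chunk size is still emitted whole; a fuller splitter would cut it further.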

Recursive Splitting

The "Recursive Splitting" option breaks the text into progressively smaller segments using a predefined list of separator characters, so that each chunk stays within the size limit while related text is kept together where possible.

The chunk size controls the maximum number of characters in a chunk, while the chunk overlap determines how many characters will overlap between chunks.
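
A minimal Python sketch of recursive splitting follows. It is one common way to implement the idea, not necessarily how Epsilla does it: `recursive_split` and its default separator list (blank line, newline, space) are assumptions, and chunk overlap is left out to keep the recursion easy to follow.

```python
def recursive_split(
    text: str,
    chunk_size: int,
    separators: tuple[str, ...] = ("\n\n", "\n", " "),
) -> list[str]:
    """Split text with the coarsest separator first, recursing on oversized parts."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, finer = separators[0], separators[1:]
    pieces: list[str] = []
    for part in text.split(separator):
        pieces.extend(recursive_split(part, chunk_size, finer))

    # Greedily merge neighbouring pieces back together while they still fit.
    chunks: list[str] = []
    buffer = ""
    for piece in pieces:
        candidate = f"{buffer}{separator}{piece}" if buffer else piece
        if len(candidate) <= chunk_size:
            buffer = candidate
        else:
            chunks.append(buffer)
            buffer = piece
    if buffer:
        chunks.append(buffer)
    return chunks


print(recursive_split("aaaa bbbb cccc\n\ndddd eeee", chunk_size=10))
```

Paragraph breaks are tried first, so text is only split at word level, or cut mid-word, when no coarser boundary yields a small enough piece.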

Smart Chunking

The "Smart Chunking" option is Epsilla's proprietary chunking method for Markdown documents with a hierarchical structure (titles, subtitles, etc.). It keeps chunks cohesive by trying to keep each section within a single chunk, unless that chunk would become too large. Additionally, it attaches the hierarchical titles of parent sections to each chunk to maintain context. It is best used in combination with Advanced Parsing with Tables/Charts to achieve optimal data processing performance.
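
Smart Chunking itself is proprietary, so the following Python sketch only illustrates the two ideas described above: keeping a section together while it fits, and prefixing each chunk with the titles of its parent sections. The function name `chunk_with_title_paths`, the ` > ` path separator, and the paragraph-level fallback are all assumptions made for illustration.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_with_title_paths(markdown: str, chunk_size: int) -> list[str]:
    """Emit one chunk per section body, prefixed with its heading path.

    chunk_size limits the body text only; the title prefix is added on top.
    """
    chunks: list[str] = []
    path: list[tuple[int, str]] = []  # (heading level, title) from root to current section
    body: list[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        body.clear()
        if not text:
            return
        prefix = " > ".join(title for _, title in path)
        buffer = ""
        # Keep the section whole if it fits, otherwise pack it paragraph by paragraph.
        for paragraph in re.split(r"\n\s*\n", text):
            candidate = f"{buffer}\n\n{paragraph}" if buffer else paragraph
            if buffer and len(candidate) > chunk_size:
                chunks.append(f"{prefix}\n{buffer}" if prefix else buffer)
                buffer = paragraph
            else:
                buffer = candidate
        if buffer:
            chunks.append(f"{prefix}\n{buffer}" if prefix else buffer)

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Leave sibling and deeper sections before entering the new one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks


doc = "# Guide\n\nOverview text.\n\n## Install\n\nRun the installer.\n\n## Usage\n\n### CLI\n\nUse the command line.\n"
for chunk in chunk_with_title_paths(doc, chunk_size=200):
    print(repr(chunk))
```

Because every chunk carries its heading path, a chunk retrieved on its own still indicates which part of the document it came from.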