# Data Chunking

**Data Chunking** in Epsilla involves selecting a method to divide data into smaller, manageable segments for efficient processing and retrieval. Because Large Language Models have a limited context window, the entire knowledge base cannot be passed to the LLM during generation. Instead, the data is chunked into smaller pieces, a semantic index is built on top of them (via embedding), and the most relevant pieces are retrieved for each LLM generation.
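The chunk, embed, and retrieve flow described above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: `embed`, `cosine`, and `retrieve` are hypothetical helpers, and the word-count "embedding" is only a toy stand-in for a real embedding model. It does not reflect how Epsilla builds or queries its index.

```python
import math
import re
from collections import Counter


def embed(text):
    # Toy stand-in for an embedding model: lowercase word counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, top_k=1):
    # "Index" every chunk, then rank chunks by similarity to the query.
    index = [(chunk, embed(chunk)) for chunk in chunks]
    query_vector = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```

Only the top-ranked chunks would then be placed in the LLM prompt, which is why chunk boundaries affect answer quality.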

The Data Chunking option allows users to choose from various chunking modes.

<figure><img src="/files/0CD008uFpbulXxIM6Fe1" alt="" width="563"><figcaption></figcaption></figure>

### Split by Sentences

<figure><img src="/files/oe8ktAbGSRezyr40Lf3f" alt="" width="563"><figcaption></figcaption></figure>

The "Split by Sentences" option breaks the text into chunks that respect natural sentence boundaries, so that each chunk contains several complete sentences rather than sentences cut off in the middle.

The chunk size controls the maximum number of characters in a chunk, while the chunk overlap determines how many characters will overlap between chunks.
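As a rough illustration of how chunk size and chunk overlap interact with sentence boundaries, here is a minimal sketch. It is not Epsilla's implementation: `split_by_sentences` is a hypothetical function, sentences are detected with a simple regular expression, and the overlap is rounded down to whole sentences, which is an assumption of this sketch.

```python
import re


def split_by_sentences(text, chunk_size=200, chunk_overlap=50):
    """Greedily pack whole sentences into chunks (illustrative only)."""
    # Naive sentence detection: split after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > chunk_size:
            chunks.append(" ".join(current))
            # Carry over trailing sentences that fit within chunk_overlap characters.
            carried, total = [], 0
            for previous in reversed(current):
                if total + len(previous) > chunk_overlap:
                    break
                carried.insert(0, previous)
                total += len(previous) + 1
            # Drop carried sentences if they would push the new chunk over the limit.
            while carried and len(" ".join(carried + [sentence])) > chunk_size:
                carried.pop(0)
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

In this sketch a single sentence longer than `chunk_size` is kept whole rather than cut, so the size limit is only a soft bound for very long sentences.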

### Markdown Splitter

<figure><img src="/files/xF8aT4j1HzQciQa3DYUY" alt="" width="563"><figcaption></figcaption></figure>

The "Markdown Splitter" breaks Markdown documents into smaller chunks based on structural elements such as headings and paragraphs. This keeps the content organized during processing and makes the chunks suitable for indexing, querying, and further transformations.

The chunk size controls the maximum number of characters in a chunk, while the chunk overlap determines how many characters will overlap between chunks.
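A structure-aware splitter of this kind can be sketched as follows. This is an assumption-laden illustration, not Epsilla's code: `split_markdown` is a hypothetical function that starts a new section at each ATX heading and falls back to packing paragraphs when a section exceeds the chunk size. Chunk overlap is omitted to keep the example short.

```python
import re


def split_markdown(text, chunk_size=500):
    """Split Markdown on headings, then on paragraphs (illustrative only)."""
    # Start a new section at every ATX heading line (#, ##, ...).
    # Note: headings inside fenced code blocks are not special-cased here.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= chunk_size:
            chunks.append(section)
            continue
        # Oversized section: pack its paragraphs up to chunk_size.
        current = ""
        for paragraph in re.split(r"\n\s*\n", section):
            candidate = f"{current}\n\n{paragraph}" if current else paragraph
            if current and len(candidate) > chunk_size:
                chunks.append(current)
                current = paragraph
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks
```

Splitting on headings first means a chunk rarely mixes text from two unrelated sections, at the cost of less uniform chunk sizes.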

### Recursive Splitting

<figure><img src="/files/piomcrXBsgPo3UCe5MHE" alt="" width="563"><figcaption></figcaption></figure>

The "Recursive Splitting" option repeatedly breaks the text into smaller segments using predefined separator characters, so that chunks stay consistent in size while related text is kept together.

The chunk size controls the maximum number of characters in a chunk, while the chunk overlap determines how many characters will overlap between chunks.
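The general idea of recursive splitting can be sketched as below. The separator list (blank line, newline, space, then a hard cut) and the function name `recursive_split` are assumptions made for illustration, since this page does not specify which separators Epsilla uses. Chunk overlap is left out for brevity.

```python
def recursive_split(text, chunk_size=200, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator first, recursing on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    separator, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = current + separator + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) > chunk_size:
            # The piece alone is too large: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

Trying coarse separators before fine ones keeps paragraphs and lines together whenever they fit, and only falls back to word or character cuts when necessary.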

### Smart Chunking

<figure><img src="/files/We5y8Fec8G0PXitEGVEe" alt="" width="563"><figcaption></figcaption></figure>

The "Smart Chunking" option is Epsilla's proprietary chunking method for Markdown documents with a hierarchical structure (titles, subtitles, etc.). It favors cohesive chunks by trying to keep each section within a single chunk unless the chunk would become too large. It also attaches the hierarchical titles of parent sections to each chunk to maintain context. It is best used in combination with [Advanced Parsing with Tables/Charts](/epsilladb/knowledge-base/advanced-settings/data-parsing.md#when-to-use-advanced-parsing-with-tables-charts) to achieve optimal data processing performance.
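Because Smart Chunking is proprietary, its actual algorithm is not documented here. The following is only a hypothetical sketch of the two ideas named above: keeping a section in one chunk when it fits, and attaching the titles of parent sections to each chunk. The function name `chunk_with_title_path`, the `" > "` title separator, and the output dictionary shape are all assumptions.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_with_title_path(markdown, chunk_size=500):
    """Emit one chunk per section, tagged with its parent titles (illustrative only)."""
    chunks, path, body = [], [], []

    def flush():
        # Turn the buffered section body into one or more chunks.
        text = "\n".join(body).strip()
        body.clear()
        if not text:
            return
        titles = " > ".join(title for _, title in path)
        current = ""
        for paragraph in re.split(r"\n\s*\n", text):
            candidate = f"{current}\n\n{paragraph}" if current else paragraph
            if current and len(candidate) > chunk_size:
                # Section too large for one chunk: split at a paragraph boundary.
                chunks.append({"titles": titles, "text": current})
                current = paragraph
            else:
                current = candidate
        chunks.append({"titles": titles, "text": current})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # Close the previous section under its own title path.
            level = len(match.group(1))
            while path and path[-1][0] >= level:
                path.pop()  # Leave sibling and deeper sections.
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks
```

With this approach, a chunk taken from a deeply nested subsection still carries its ancestry (for example `Guide > Setup > Linux`), which helps retrieval when the body text alone is ambiguous.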

