Hybrid Search

What is Hybrid Search

Hybrid search is an advanced retrieval technique that enhances search accuracy and relevance by integrating multiple search algorithms. Typically it combines traditional keyword-based search with the contextual understanding offered by vector semantic search.

Traditionally, search engines relied on keyword-based algorithms, which excel at exact keyword matching but falter with synonyms and typos, thereby missing contextually relevant results. The advent of machine learning introduced vector (semantic) search, which uses vector embeddings to understand and search data semantically, supports multilingual and multimodal queries, and is robust to typos. However, semantic search sometimes overlooks crucial keywords, and it depends on the quality of the embeddings, making it sensitive to out-of-domain terms.

Hybrid search leverages the strengths of both, employing sparse vectors (generated by algorithms like BM25 and SPLADE) for keyword precision and dense vectors (from models like OpenAI/text-embedding-ada-002 and sentence-transformers/all-MiniLM-L6-v2) for contextual understanding. For example, when someone searches "How to catch a Rainbow Trout," hybrid search leverages dense vectors to interpret "catch" within the context of fishing, so it is not confused with, say, the semantic meaning of "catching a basketball." Simultaneously, it employs sparse vectors to ensure terms like "Rainbow Trout" are matched with pinpoint accuracy. This approach not only bridges the gap between keyword specificity and semantic context but also ensures a nuanced search experience that accurately interprets complex queries.
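
To make the difference concrete, here is a minimal illustration (the numbers are made up, not produced by any real model) of how the same query could be represented by each kind of vector:

# Dense vector: a fixed-length array of floats from an embedding model,
# capturing the overall semantic meaning of the query. Real models produce
# hundreds to thousands of dimensions; only four are shown here.
dense_query = [0.0213, -0.1178, 0.0827, 0.0459]

# Sparse vector: mostly zeros, so only non-zero positions are stored.
# Each index corresponds to a vocabulary term (e.g. "rainbow", "trout"),
# and each value is that term's weight from an algorithm like BM25 or SPLADE.
sparse_query = {
    "indices": [32, 103, 2345, 10384],
    "values": [0.074163, 0.238575, 0.141831, 0.117338]
}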

Hybrid Search Architecture

Here is a typical hybrid search architecture. When a user query arrives, it triggers multiple retrievers concurrently, each retrieving the top K documents based on its own index: dense vector embeddings, sparse vector embeddings, and potentially other indices. Each retriever returns a list of candidate documents. The candidates then go through a reranker component, which computes a final ranking across all documents from all retrievers and produces the final retrieval result.
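
Conceptually, the flow can be sketched as follows (the retrieve and rerank helpers are hypothetical stand-ins, not Epsilla APIs):

from concurrent.futures import ThreadPoolExecutor

def hybrid_search(query, retrievers, reranker, top_k):
    # Fan the query out to every retriever (dense, sparse, and any other index) concurrently.
    with ThreadPoolExecutor() as pool:
        candidate_lists = list(pool.map(lambda r: r.retrieve(query, limit=top_k), retrievers))
    # Each retriever returns its own ranked candidate list; the reranker fuses
    # them (e.g. with Reciprocal Rank Fusion) into one final ranking.
    return reranker.rerank(candidate_lists)[:top_k]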

Epsilla Hybrid Search API

Here is a hybrid search example:

Python

search_engine = db.as_search_engine().add_retriever(
    table_name="MyTable",
    query_index="DenseIndex",
    limit=4,
).add_retriever(
    table_name="MyTable",
    query_field="SparseIndex",
    query_vector={
        "indices": [32, 103, 2345, 10384],
        "values": [0.074163, 0.238575, 0.141831, 0.117338]
    },
    limit=2,
).set_reranker(
    type="rrf"
)

search_result = search_engine.search(
    "Where can I find a serene environment, ideal for relaxation and introspection?"
)

JavaScript

const searchEngine = db
    .asSearchEngine()
    .addRetriever('MyTable', {
        queryIndex: 'DenseIndex',
        limit: 4,
    })
    .addRetriever('MyTable', {
        queryField: 'SparseIndex',
        queryVector: {
            indices: [32, 103, 2345, 10384],
            values: [0.074163, 0.238575, 0.141831, 0.117338]
        },
        limit: 2,
    })
    .setReranker('rrf');

const searchResults = await searchEngine.search(
    'Where can I find a serene environment, ideal for relaxation and introspection?'
);

Now let's break down the hybrid search API step by step.

Prerequisite: assume you have already connected to a vector database.

Python

# Open source
db = vectordb.Client(...)
db.use_db(db_name="...")

# Epsilla cloud
db = cloud.Client(...).vectordb(...)

JavaScript

// Open source
const db = new epsillajs.EpsillaDB(...);
db.useDB(...);

// Epsilla cloud
const cloud = new epsillajs.EpsillaCloud(...);
const db = new epsillajs.VectorDB(...);
await db.connect();

Step 1. Convert the vector database as a search engine

Python

search_engine = db.as_search_engine()

JavaScript

const searchEngine = db.asSearchEngine()

Step 2. Add retrievers

You can add one or many retrievers to the search engine. A retriever can retrieve documents against an index or against an embedding field (either a dense or a sparse embedding).

Add retriever on an index:

Python

search_engine.add_retriever(
  primary_key_field="ID",                # (Optional) Which field to use as the document's ID. Default value is "ID"
  table_name="MyTable",                  # The name of the table to retrieve against.
  query_index="MyIndex",                 # The index name to retrieve against.
  limit=2,                               # The top K results to return.
  response_fields=["Doc"],               # (Optional) Which fields to include in the
                                         # response. If not provided, all fields in
                                         # the table will be included.
  filter="ID < 6 AND Doc <> 'London'",   # (Optional) A boolean expression for filtering
                                         # out results.
)

JavaScript

searchEngine.addRetriever(
  'MyTable',                                 // The name of the table to retrieve against.
  {
    primaryKeyField: 'ID',                   // (Optional) Which field to use as the document's ID. Default value is 'ID'
    queryIndex: 'MyIndex',                   // The index name to retrieve against.
    limit: 2,                                // The top K results to return.
    response: ["Doc"],                       // (Optional) Which fields to include in the
                                             // response. If not provided, all fields in
                                             // the table will be included.
    filter: 'ID < 6 AND Doc <> \'London\'',  // (Optional) A boolean expression for filtering
                                             // out results.
  }
);

Add retriever on an embedding field:

Python

search_engine.add_retriever(
  primary_key_field="ID",                # (Optional) Which field to use as the document's ID. Default value is "ID"
  table_name="MyTable",                  # The name of the table to retrieve against.
  query_field="SparseIndex",             # The embedding field to retrieve against.
  query_vector={                         # The vector embedding of the query (either dense or sparse).
    "indices": [32, 103, 2345, 10384],
    "values": [0.074163, 0.238575, 0.141831, 0.117338]
  },
  limit=2,                               # The top K results to return.
  response_fields=["Doc"],               # (Optional) Which fields to include in the
                                         # response. If not provided, all fields in
                                         # the table will be included.
  filter="ID < 6 AND Doc <> 'London'",   # (Optional) A boolean expression for filtering
                                         # out results.
)

JavaScript

searchEngine.addRetriever(
  'MyTable',                                 // The name of the table to retrieve against.
  {
    primaryKeyField: 'ID',                   // (Optional) Which field to use as the document's ID. Default value is 'ID'
    queryField: 'SparseIndex',               // The embedding field to retrieve against.
    queryVector: {                           // The vector embedding of the query (either dense or sparse).
      indices: [32, 103, 2345, 10384],
      values: [0.074163, 0.238575, 0.141831, 0.117338]
    },
    limit: 2,                                // The top K results to return.
    response: ["Doc"],                       // (Optional) Which fields to include in the
                                             // response. If not provided, all fields in
                                             // the table will be included.
    filter: 'ID < 6 AND Doc <> \'London\'',  // (Optional) A boolean expression for filtering
                                             // out results.
  }
);

If you add only one retriever, the search reduces to a simple top-K query against that index or embedding field.
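
For example, a search engine with a single retriever behaves like an ordinary top-K query. Here is a minimal sketch reusing the table and index names from the example above (illustrative only):

search_engine = db.as_search_engine().add_retriever(
    table_name="MyTable",
    query_index="DenseIndex",
    limit=4,
)

# No reranker is needed when there is only one retriever.
search_result = search_engine.search("A quiet lakeside town for a weekend getaway")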

Step 3. Set reranker

If more than one retriever is added to the search engine, you must set a reranker. Here is the simplest example, which adds a Reciprocal Rank Fusion (rrf) reranker. See the Reranker section below to learn more.

Python

search_engine.set_reranker(type="rrf")

JavaScript

searchEngine.setReranker('rrf');

Step 4. Search

Now the search engine is ready to use. Call the search API to get the hybrid search results.

Python

search_result = search_engine.search(
    "Where can I find a serene environment, ideal for relaxation and introspection?"
)

JavaScript

const searchResults = await searchEngine.search(
    'Where can I find a serene environment, ideal for relaxation and introspection?'
);

Reranker

Epsilla supports a growing number of rerankers:

Reciprocal Rank Fusion (RRF)

RRF uses the following formula to determine the score for ranking each document:

score = 0.0
for i, retriever in enumerate(retrievers):
    if doc in retriever.results:
        score += weights[i] / (k + rank(retriever.results, doc))
return score

# k is a ranking constant, defaulting to 50
# i is the index of the retriever among all retrievers
# retriever.results is the list of documents retrieved by that retriever
# doc is the document whose final ranking score is being calculated
# weights[i] is the weight given to the i-th retriever
# rank(retriever.results, doc) is doc's rank within retriever.results, starting from 1

As you can see, the higher a document ranks within a retriever's results, the more that retriever contributes to its score. A document that ranks near the top in more retrievers will rank higher in the final reranking result.
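
As a worked example (the document IDs and ranks are made up), here is the RRF score computed by hand for candidates returned by two retrievers, using the default weights and k = 50:

# Candidate lists from two retrievers, ranked best-first.
dense_results  = ["doc_7", "doc_2", "doc_9"]
sparse_results = ["doc_2", "doc_5"]

retriever_results = [dense_results, sparse_results]
weights = [1, 1]   # default weights
k = 50             # default ranking constant

def rrf_score(doc):
    score = 0.0
    for i, results in enumerate(retriever_results):
        if doc in results:
            rank = results.index(doc) + 1          # ranks start from 1
            score += weights[i] / (k + rank)
    return score

# doc_2 appears in both lists (rank 2 and rank 1): 1/52 + 1/51 ≈ 0.0388, the highest score.
# doc_7 appears only in the dense list at rank 1: 1/51 ≈ 0.0196.
for doc in ["doc_7", "doc_2", "doc_9", "doc_5"]:
    print(doc, rrf_score(doc))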

Here is the full API for using Reciprocal Rank Fusion reranker:

Python

search_engine.set_reranker(
    type="rrf",          # or "reciprocal_rank_fusion"
    weights=[0.3, 0.7],  # (Optional) give a weight to each retriever, will use [1, 1, ...] if not provided
    k=50,                # (Optional) ranking constant, default to 50 if not provided.
    limit=3,             # (Optional) limit the top K results after reranking. If not provided, all documents from all retrievers will be returned
)

JavaScript

searchEngine.setReranker(
    'rrf',                    // or 'reciprocal_rank_fusion'
    {
        weights: [0.3, 0.7],  // (Optional) give a weight to each retriever, will use [1, 1, ...] if not provided
        k: 50,                // (Optional) ranking constant, default to 50 if not provided.
        limit: 3,             // (Optional) limit the top K results after reranking. If not provided, all documents from all retrievers will be returned
    }
);

Relative Score Fusion (RSF)

RSF normalizes and combines scores from multiple retrievers to rank each document. The process involves:

  1. Normalizing distances from each source to a score between 0 and 1, where 0 indicates the furthest and 1 the closest to the query among the retrieval results of one retriever.

  2. Aggregating these normalized scores for each document across all sources.

  3. Ranking documents based on their aggregated scores, with higher scores indicating higher relevance.
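
The following sketch (with made-up distance values and a simple min-max normalization) illustrates these steps:

def normalize(distances):
    # Map one retriever's distances to [0, 1]: the closest result gets 1, the furthest gets 0.
    lo, hi = min(distances.values()), max(distances.values())
    return {doc: (hi - d) / (hi - lo) for doc, d in distances.items()}

# Hypothetical distances from two retrievers (lower distance = closer to the query).
dense_distances  = {"doc_2": 0.12, "doc_7": 0.25, "doc_9": 0.40}
sparse_distances = {"doc_2": 3.1,  "doc_5": 5.6}

# Aggregate the normalized scores per document across all retrievers, then rank.
fused = {}
for normalized in (normalize(dense_distances), normalize(sparse_distances)):
    for doc, score in normalized.items():
        fused[doc] = fused.get(doc, 0.0) + score

ranking = sorted(fused, key=fused.get, reverse=True)   # highest aggregated score first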

Here is the full API for using Relative Score Fusion reranker:

Python

search_engine.set_reranker(
    type="rsf",          # or "relative_score_fusion"
    limit=3,             # (Optional) limit the top K results after reranking. If not provided, all documents from all retrievers will be returned
)

JavaScript

searchEngine.setReranker(
    'rsf',                    // or 'relative_score_fusion'
    {
        limit: 3,             // (Optional) limit the top K results after reranking. If not provided, all documents from all retrievers will be returned
    }
);

Distribution-Based Score Fusion (DBSF)

DBSF is a more advanced reranking algorithm than RSF: it takes the embedding distance distribution of each retriever's embedding model into account and normalizes scores within the 3-sigma range around the median of the embedding distances. Learn more here. To use DBSF, you need to provide a scale range for each embedding or index field. Talk to us if you need help setting up the parameters.

Here is the full API for using the Distribution-Based Score Fusion reranker:

Python

search_engine.set_reranker(
    type="dbsf",         # or "distribution_based_score_fusion"
    scale_ranges=[       # Provide the scale range for each retriever based on the embedding model used
        [0.4, 0.8],
        [0.18, 0.3]
    ],
    limit=3,             # (Optional) limit the top K results after reranking. If not provided, all documents from all retrievers will be returned
)

JavaScript

searchEngine.setReranker(
    'dbsf',                  // or 'distribution_based_score_fusion'
    {
        scaleRanges: [       // Provide the scale range for each retriever based on the embedding model used
            [0.4, 0.8],
            [0.18, 0.3]
        ],
        limit: 3,            // (Optional) limit the top K results after reranking. If not provided, all documents from all retrievers will be returned
    }
);
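
To illustrate what the scale ranges do, here is a rough sketch (hypothetical distances and a simplified normalization; the exact formula Epsilla uses may differ) of how DBSF maps each retriever's distances into comparable [0, 1] scores before fusing them:

def dbsf_normalize(distance, scale_range):
    # Clip the distance to the retriever's scale range (roughly the median ± 3 sigma
    # of typical distances for that embedding model), then map it linearly to [0, 1],
    # where 1 means closest to the query.
    lo, hi = scale_range
    clipped = min(max(distance, lo), hi)
    return (hi - clipped) / (hi - lo)

# Hypothetical distances, using the scale ranges from the example above.
dense_distances  = {"doc_2": 0.45, "doc_7": 0.62}     # scale range [0.4, 0.8]
sparse_distances = {"doc_2": 0.21, "doc_5": 0.28}     # scale range [0.18, 0.3]

fused = {}
for distances, scale_range in [(dense_distances, (0.4, 0.8)), (sparse_distances, (0.18, 0.3))]:
    for doc, d in distances.items():
        fused[doc] = fused.get(doc, 0.0) + dbsf_normalize(d, scale_range)

ranking = sorted(fused, key=fused.get, reverse=True)  # highest fused score first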
