
Hybrid Retrieval

Semantic + keyword search with Reciprocal Rank Fusion.

HyperSaaS uses a hybrid retrieval strategy combining vector similarity search and PostgreSQL full-text search, fused with Reciprocal Rank Fusion (RRF).

Search Function

def search_documents(
    query: str,
    session_id: str,
    workspace_id: str,
    top_k: int = 5,
    semantic_candidates: int = 20,
    keyword_candidates: int = 20,
) -> list[dict]:

Called by the search_knowledge_base agent tool during chat conversations.

Retrieval Flow

User query


1. Resolve active documents
    │  - Get KBs attached to ChatSession
    │  - Get Documents in those KBs with status="ready"
    │  - Scope to workspace

    ├──────────────────────────┐
    │                          │
    ▼                          ▼
2a. Semantic Search        2b. Keyword Search
    (pgvector cosine)          (PostgreSQL FTS)
    → top 20 candidates        → top 20 candidates
    │                          │
    └──────────┬───────────────┘


3. Reciprocal Rank Fusion
    → top 5 results


4. Return results with citations

Semantic Search

Embeds the query and finds the closest document chunks using pgvector's cosine distance:

from pgvector.django import CosineDistance

query_embedding = embeddings.embed_query(query)

chunks = (
    DocumentChunk.objects
    .filter(document_id__in=document_ids)
    .annotate(distance=CosineDistance("embedding", query_embedding))
    .order_by("distance")
    [:semantic_candidates]
)

Distance is converted to similarity: score = 1.0 - distance.
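The conversion holds because pgvector's cosine distance is defined as 1 minus cosine similarity, so subtracting the distance from 1.0 recovers the similarity. A standalone sketch in plain Python (no pgvector required):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance as pgvector computes it: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Identical vectors: distance 0.0, so score = 1.0 - 0.0 = 1.0
distance = cosine_distance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
score = 1.0 - distance
```

Identical vectors yield similarity 1.0, orthogonal vectors yield 0.0.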

The HNSW index (m=16, ef_construction=64, cosine ops) enables approximate nearest neighbor search — fast even with millions of chunks.
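For reference, the DDL equivalent of such an index looks roughly like this (the table and index names here are assumptions; the actual names come from the Django migration):

```sql
-- Hypothetical DDL for the HNSW index described above
CREATE INDEX documentchunk_embedding_hnsw
    ON documents_documentchunk
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
```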

Keyword Search

Uses PostgreSQL's built-in full-text search with websearch mode:

from django.contrib.postgres.search import SearchQuery, SearchVector, SearchRank

search_query = SearchQuery(query, search_type="websearch")
search_vector = SearchVector("content")

chunks = (
    DocumentChunk.objects
    .filter(document_id__in=document_ids)
    .annotate(rank=SearchRank(search_vector, search_query))
    .filter(rank__gt=0.0)
    .order_by("-rank")
    [:keyword_candidates]
)

The websearch search type accepts natural-language queries: terms are ANDed implicitly, "or" between terms becomes OR, a leading - excludes a term, and quoted text matches as a phrase. For example, install guide -windows matches chunks containing both "install" and "guide" but not "windows".

Reciprocal Rank Fusion

RRF merges the two ranked lists without a learned fusion model:

def _reciprocal_rank_fusion(
    semantic_results: list[dict],
    keyword_results: list[dict],
    k: int = 60,
    top_k: int = 10,
) -> list[dict]:
    scores: dict[str, float] = {}
    results_by_id: dict[str, dict] = {}
    for rank, result in enumerate(semantic_results):
        chunk_id = result["chunk_id"]
        results_by_id[chunk_id] = result
        scores[chunk_id] = scores.get(chunk_id, 0.0) + 1 / (k + rank + 1)
    for rank, result in enumerate(keyword_results):
        chunk_id = result["chunk_id"]
        results_by_id.setdefault(chunk_id, result)
        scores[chunk_id] = scores.get(chunk_id, 0.0) + 1 / (k + rank + 1)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [results_by_id[chunk_id] for chunk_id in ranked[:top_k]]

The constant k=60 (the value recommended in the original RRF paper) dampens the gap between adjacent ranks, so a single top placement in one list cannot dominate the fused score. Chunks appearing in both lists accumulate a contribution from each and therefore rank higher.

Example scores:

Chunk   Semantic Rank   Keyword Rank   RRF Score
A       #0              #0             1/61 + 1/61 = 0.0328
B       #2              —              1/63 = 0.0159
C       —               #1             1/62 = 0.0161
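The table values can be reproduced with a standalone sketch (plain Python, independent of the production code):

```python
def rrf_score(ranks: list[int], k: int = 60) -> float:
    """Sum the reciprocal-rank contributions for one chunk.

    `ranks` holds the chunk's zero-based rank in each list it appears in.
    """
    return sum(1 / (k + rank + 1) for rank in ranks)

print(round(rrf_score([0, 0]), 4))  # chunk A: 0.0328
print(round(rrf_score([2]), 4))     # chunk B: 0.0159
print(round(rrf_score([1]), 4))     # chunk C: 0.0161
```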

Result Format

Each result includes full citation metadata:

{
  "chunk_id": "uuid",
  "content": "The chunk text content...",
  "document_id": "uuid",
  "document_name": "Product Guide.pdf",
  "source_type": "file",
  "source_url": "",
  "chunk_index": 5,
  "page_number": 12,
  "section_heading": "Installation",
  "chunk_metadata": {},
  "score": 0.0328
}

For YouTube sources, source_url contains the video URL and chunk metadata includes timestamps.

Agent Tool Integration

The RAG tool wraps search_documents as a plain function, then each agent framework adds its own decorator:

# documents/rag_tool.py — framework-agnostic
def search_knowledge_base_impl(query: str, session) -> str:
    results = search_documents(
        query=query,
        session_id=str(session.id),
        workspace_id=str(session.workspace_id),
    )
    return json.dumps(results)

The agent decides when to call the search tool based on the user's question. Results are returned as context for the LLM to synthesize a response with citations.
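A minimal sketch of the binding pattern, with a stubbed session and search function standing in for the real objects (all names below are hypothetical, for illustration only):

```python
import json

# Hypothetical stand-ins for the real ChatSession and search_documents.
class FakeSession:
    id = "sess-1"
    workspace_id = "ws-1"

def fake_search_documents(query: str, session_id: str, workspace_id: str) -> list[dict]:
    # The real search_documents queries Postgres; this stub returns canned results.
    return [{"chunk_id": "abc", "content": "...", "score": 0.03}]

def make_search_tool(session, search_fn=fake_search_documents):
    """Bind a session into the framework-agnostic impl; each agent
    framework would wrap the returned callable with its own tool decorator."""
    def search_knowledge_base(query: str) -> str:
        results = search_fn(
            query=query,
            session_id=str(session.id),
            workspace_id=str(session.workspace_id),
        )
        return json.dumps(results)
    return search_knowledge_base

tool = make_search_tool(FakeSession())
```

Closing over the session keeps the tool signature down to the single `query` argument the LLM controls.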

URL Ingest Tool

The ingest_url tool allows the agent to add new content during conversation:

def ingest_url_impl(url: str, session) -> str:
    # Auto-detects YouTube vs web_url
    # Creates Document + auto-creates session KB
    # Dispatches Celery ingestion task
    return json.dumps({"status": "processing", "document_id": str(doc.id)})

This enables users to say "read this article" or "watch this video" and have it ingested into the session's knowledge base in real time.
