HyperSaaS Backend › Documents & RAG

Overview

Knowledge base management with document ingestion and RAG retrieval.

The documents module provides a complete Retrieval-Augmented Generation (RAG) system — from document upload and ingestion through chunking, embedding, and hybrid search.

Core Models

KnowledgeBase

class KnowledgeBase(BaseModel, WorkspaceAwareModel):
    id = models.UUIDField(primary_key=True)
    name = models.CharField(max_length=200)
    description = models.TextField(blank=True)
    created_by = models.ForeignKey(User, null=True, on_delete=models.SET_NULL)
    is_shared_with_workspace = models.BooleanField(default=False)
    documents = models.ManyToManyField("Document", through="KnowledgeBaseDocument")

Knowledge bases group documents together. They can be shared with an entire workspace or restricted to specific teams via KnowledgeBaseTeamAccess.
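The sharing rule can be sketched as a small predicate. `can_access_kb` and its parameters are illustrative names, not part of the codebase; the real check runs as an ORM query against KnowledgeBaseTeamAccess:

```python
def can_access_kb(is_shared_with_workspace: bool,
                  user_team_ids: set[str],
                  kb_team_ids: set[str]) -> bool:
    """Illustrative access rule: a KB is visible if it is shared with the
    whole workspace, or if the user belongs to a team that was granted
    access via KnowledgeBaseTeamAccess."""
    if is_shared_with_workspace:
        return True
    return bool(user_team_ids & kb_team_ids)
```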

Document

class Document(BaseModel, WorkspaceAwareModel):
    id = models.UUIDField(primary_key=True)
    name = models.CharField(max_length=500)
    uploaded_by = models.ForeignKey(User, null=True, on_delete=models.SET_NULL)

    # Source
    source_type = models.CharField(
        max_length=20,
        choices=[("file", "File"), ("web_url", "Web URL"), ("youtube", "YouTube")],
    )
    source_url = models.URLField(max_length=2048, blank=True)

    # S3 Storage
    s3_key = models.CharField(max_length=1024, blank=True)
    original_filename = models.CharField(max_length=500, blank=True)
    file_type = models.CharField(max_length=20, blank=True)   # "pdf", "docx", etc.
    mime_type = models.CharField(max_length=100, blank=True)
    file_size = models.PositiveBigIntegerField(default=0)

    # Processing
    processing_status = models.CharField(
        max_length=20,
        choices=[
            ("pending", "Pending"),
            ("processing", "Processing"),
            ("ready", "Ready"),
            ("failed", "Failed"),
        ],
        default="pending",
    )
    processing_error = models.TextField(blank=True)
    embedding_model = models.CharField(max_length=100, blank=True)
    chunk_count = models.PositiveIntegerField(default=0)

Documents support three source types:

  • file — Uploaded via S3 presigned URL (PDF, DOCX, TXT, CSV, MD, PPTX, XLSX)
  • web_url — Ingested from a web page using Trafilatura
  • youtube — Transcript extracted from YouTube videos
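For URL ingestion, the source type has to be decided before dispatch. A minimal sketch of that routing, assuming a hypothetical `classify_source` helper and host list (the real routing logic may differ):

```python
from urllib.parse import urlparse

# Hosts treated as YouTube sources (assumed list, not from the codebase).
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}

def classify_source(url: str) -> str:
    """Map an ingestion URL to a Document.source_type value:
    'youtube' for YouTube hosts, 'web_url' for everything else."""
    host = urlparse(url).netloc.lower()
    return "youtube" if host in YOUTUBE_HOSTS else "web_url"
```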

DocumentChunk

class DocumentChunk(BaseModel):
    id = models.UUIDField(primary_key=True)
    document = models.ForeignKey(Document, on_delete=models.CASCADE)
    chunk_index = models.PositiveIntegerField()
    content = models.TextField()

    # Vector embedding (pgvector)
    embedding = VectorField(dimensions=1536)
    embedding_model = models.CharField(max_length=100)

    # Citation metadata
    page_number = models.PositiveIntegerField(null=True)
    section_heading = models.CharField(max_length=500, blank=True)
    token_count = models.PositiveIntegerField(default=0)
    chunk_metadata = models.JSONField(default=dict)

    class Meta:
        indexes = [
            HnswIndex(
                name="chunk_embedding_hnsw_idx",
                fields=["embedding"],
                m=16,
                ef_construction=64,
                opclasses=["vector_cosine_ops"],
            ),
            models.Index(fields=["document", "chunk_index"]),
        ]

Each chunk stores its text content, a 1536-dimensional embedding vector, and metadata for citations (page number, section heading).
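In production the ranking runs inside PostgreSQL via the HNSW index, but the underlying metric (vector_cosine_ops, i.e. cosine distance) can be sketched in pure Python; `top_k_by_cosine` is an illustrative name, and the test vectors are tiny stand-ins for the 1536-dimensional embeddings:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance as used by vector_cosine_ops: 1 - cos(a, b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def top_k_by_cosine(query: list[float],
                    chunks: dict[str, list[float]],
                    k: int = 20) -> list[str]:
    """Rank chunk ids by ascending cosine distance to the query embedding."""
    return sorted(chunks, key=lambda cid: cosine_distance(query, chunks[cid]))[:k]
```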

Supporting Models

Model                          Purpose
KnowledgeBaseDocument          Through table for the KB ↔ Document M2M
KnowledgeBaseTeamAccess        Team-level read/write access to KBs
ChatSessionKnowledgeBase       Links chat sessions to KBs for RAG search
DocumentProcessingTask         Tracks Celery task status for ingestion

Processing Pipeline

Document created (pending)
    ↓
Celery task dispatched
    ├─ File → Download from S3 → Parse (Docling/PyMuPDF) → Chunk → Embed → Save
    ├─ Web URL → Extract (Trafilatura) → Chunk → Embed → Save
    └─ YouTube → Extract transcript → Chunk → Embed → Save
    ↓
Document status → "ready"
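The status field moves through these states in a fixed order, which can be sketched as a small transition table; the retry edge from "failed" back to "processing" is an assumption here, not a documented behavior:

```python
# Allowed processing_status transitions for a Document (sketch).
TRANSITIONS = {
    "pending": {"processing"},
    "processing": {"ready", "failed"},
    "ready": set(),
    "failed": {"processing"},  # assumption: failed ingestion can be retried
}

def advance(status: str, new_status: str) -> str:
    """Return the new status if the transition is legal, else raise."""
    if new_status not in TRANSITIONS.get(status, set()):
        raise ValueError(f"illegal transition {status} -> {new_status}")
    return new_status
```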

Retrieval Flow

User query in chat
    ↓
Resolve active KBs for session
    ↓
Get documents with status="ready"
    ├─ Semantic search (pgvector cosine similarity) → top 20
    └─ Keyword search (PostgreSQL full-text) → top 20
    ↓
Reciprocal Rank Fusion → top 5 results
    ↓
Return to agent as tool result
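The fusion step above can be sketched as follows. Each result scores 1/(k + rank) in every list it appears in, so items ranked well by both searches float to the top; k=60 is the conventional RRF constant, an assumption here rather than a documented setting:

```python
def reciprocal_rank_fusion(rankings: list[list[str]],
                           k: int = 60,
                           top_n: int = 5) -> list[str]:
    """Merge ranked result lists: each item scores sum(1 / (k + rank))
    over the lists it appears in; higher combined score ranks first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda c: -scores[c])[:top_n]
```

An item that appears in both the semantic and keyword lists outranks one that tops only a single list, which is the point of fusing the two searches.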

Configuration

Setting                        Default                            Description
DOCUMENT_EMBEDDING_MODEL       text-embedding-3-small             OpenAI embedding model
DOCUMENT_EMBEDDING_DIMENSIONS  1536                               Vector dimensions
DOCUMENT_CHUNK_MAX_TOKENS      512                                Max tokens per chunk
DOCUMENT_MAX_UPLOAD_SIZE       50 MB                              Max file upload size
DOCUMENT_ALLOWED_EXTENSIONS    pdf,docx,doc,txt,csv,md,pptx,xlsx  Allowed file types
DOCUMENT_S3_PREFIX             documents                          S3 key prefix
DOCUMENT_PRESIGNED_URL_EXPIRY  3600                               Presigned URL TTL (seconds)
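How the extension and size limits above are enforced can be sketched with a hypothetical `validate_upload` helper; the constants mirror the table defaults, but the real check lives in the upload view/serializer:

```python
# Defaults from the configuration table (sketch, not the actual settings object).
ALLOWED_EXTENSIONS = {"pdf", "docx", "doc", "txt", "csv", "md", "pptx", "xlsx"}
MAX_UPLOAD_SIZE = 50 * 1024 * 1024  # 50 MB

def validate_upload(filename: str, size: int) -> None:
    """Reject a file whose extension or size violates the configured limits."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"extension {ext!r} not allowed")
    if size > MAX_UPLOAD_SIZE:
        raise ValueError("file exceeds maximum upload size")
```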

Permissions

Action                 Who Can Access
Read document          Uploader, or workspace member (if KB is shared)
Write/delete document  Uploader, workspace admin, or workspace owner
Access KB              Workspace-shared, or team with KnowledgeBaseTeamAccess
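The write/delete row can be sketched as a predicate; `can_modify_document`, its parameters, and the role strings are illustrative names rather than the actual permission API:

```python
def can_modify_document(user_id: str,
                        uploaded_by_id: str,
                        workspace_role: str) -> bool:
    """Sketch of the write/delete rule: the uploader, a workspace admin,
    or the workspace owner may modify a document."""
    return user_id == uploaded_by_id or workspace_role in {"admin", "owner"}
```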
