Overview

The documents module provides a complete Retrieval-Augmented Generation (RAG) system that covers document upload, ingestion, chunking, embedding, and hybrid search.

Core Models

KnowledgeBase

class KnowledgeBase(BaseModel, WorkspaceAwareModel):
    id = models.UUIDField(primary_key=True)
    name = models.CharField(max_length=200)
    description = models.TextField(blank=True)
    created_by = models.ForeignKey(User, on_delete=models.SET_NULL, null=True)
    is_shared_with_workspace = models.BooleanField(default=False)
    documents = models.ManyToManyField(Document, through="KnowledgeBaseDocument")

Knowledge bases group documents together. They can be shared with an entire workspace or restricted to specific teams via KnowledgeBaseTeamAccess.

Document

class Document(BaseModel, WorkspaceAwareModel):
    id = models.UUIDField(primary_key=True)
    name = models.CharField(max_length=500)
    uploaded_by = models.ForeignKey(User, on_delete=models.SET_NULL, null=True)

    # Source
    source_type = models.CharField(choices=["file", "web_url", "youtube"])  # values only; labels omitted
    source_url = models.URLField(max_length=2048, blank=True)

    # S3 Storage
    s3_key = models.CharField(max_length=1024, blank=True)
    original_filename = models.CharField(max_length=500, blank=True)
    file_type = models.CharField(max_length=20, blank=True)   # "pdf", "docx", etc.
    mime_type = models.CharField(max_length=100, blank=True)
    file_size = models.PositiveBigIntegerField(default=0)

    # Processing
    processing_status = models.CharField(
        choices=["pending", "processing", "ready", "failed"]  # values only; labels omitted
    )
    processing_error = models.TextField(blank=True)
    embedding_model = models.CharField(max_length=100, blank=True)
    chunk_count = models.PositiveIntegerField(default=0)

Documents support three source types:

file: Uploaded via S3 presigned URL (PDF, DOCX, DOC, TXT, CSV, MD, PPTX, XLSX)
web_url: Ingested from a web page using Trafilatura
youtube: Transcript extracted from a YouTube video

DocumentChunk

class DocumentChunk(BaseModel):
    id = models.UUIDField(primary_key=True)
    document = models.ForeignKey(Document, on_delete=models.CASCADE)
    chunk_index = models.PositiveIntegerField()
    content = models.TextField()

    # Vector embedding (pgvector)
    embedding = VectorField(dimensions=1536)
    embedding_model = models.CharField(max_length=100)

    # Citation metadata
    page_number = models.PositiveIntegerField(null=True)
    section_heading = models.CharField(max_length=500, blank=True)
    token_count = models.PositiveIntegerField(default=0)
    chunk_metadata = models.JSONField(default=dict)

    class Meta:
        indexes = [
            HnswIndex(fields=["embedding"], m=16, ef_construction=64, opclasses=["vector_cosine_ops"]),
            models.Index(fields=["document", "chunk_index"]),
        ]

Each chunk stores its text content, a 1536-dimensional embedding vector, and metadata for citations (page number, section heading).
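
To make the ranking criterion concrete, here is a minimal sketch of cosine similarity in plain Python. The `vector_cosine_ops` operator class orders rows by cosine distance, which is one minus this value. The function name and the toy 3-dimensional vectors are illustrative assumptions, not the module's code; real chunks use 1536 dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy data (hypothetical): one query vector and two chunk vectors.
query = [1.0, 0.0, 0.0]
chunks = {"same direction": [2.0, 0.0, 0.0], "orthogonal": [0.0, 3.0, 0.0]}

# Rank chunk labels by similarity to the query, highest first.
ranked = sorted(
    chunks,
    key=lambda name: cosine_similarity(query, chunks[name]),
    reverse=True,
)
```

Vector length does not affect the score, only direction does, which is why the chunk pointing the same way as the query ranks first despite having a different magnitude.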

Supporting Models

Model	Purpose
`KnowledgeBaseDocument`	Through table for the KnowledgeBase ↔ Document many-to-many relation
`KnowledgeBaseTeamAccess`	Team-level read/write access to KBs
`ChatSessionKnowledgeBase`	Links chat sessions to KBs for RAG search
`DocumentProcessingTask`	Tracks Celery task status for ingestion

Processing Pipeline

Document created (pending)
    │
    ▼
Celery task dispatched
    │
    ├─ File → Download from S3 → Parse (Docling/PyMuPDF) → Chunk → Embed → Save
    ├─ Web URL → Extract (Trafilatura) → Chunk → Embed → Save
    └─ YouTube → Extract transcript → Chunk → Embed → Save
    │
    ▼
Document status → "ready"
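
The flow above could be sketched roughly as follows. This is an illustration under stated assumptions, not the module's Celery task: `Doc`, `chunk_text`, and `process_document` are hypothetical names, whitespace-separated words stand in for model tokens, the embedding and save steps are omitted, and the moments at which the status moves to "processing" or "failed" are assumed rather than documented.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Doc:
    """Hypothetical in-memory stand-in for the Document model."""
    source_type: str
    source: str
    processing_status: str = "pending"
    processing_error: str = ""
    chunk_count: int = 0
    chunks: List[str] = field(default_factory=list)


def chunk_text(text: str, max_tokens: int = 512) -> List[str]:
    """Split text into chunks of at most max_tokens whitespace-separated words."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]


def process_document(
    doc: Doc,
    extractors: Dict[str, Callable[[str], str]],
    max_tokens: int = 512,
) -> Doc:
    """Extract text for the document's source type, chunk it, and update status."""
    doc.processing_status = "processing"
    try:
        extract = extractors[doc.source_type]  # KeyError for unknown source types
        doc.chunks = chunk_text(extract(doc.source), max_tokens)
        doc.chunk_count = len(doc.chunks)
        doc.processing_status = "ready"
    except Exception as exc:
        # Assumed behaviour: record the error and mark the document as failed.
        doc.processing_status = "failed"
        doc.processing_error = str(exc)
    return doc


# Usage with a fake extractor in place of Trafilatura.
extractors = {"web_url": lambda url: "alpha beta gamma delta epsilon"}
ok = process_document(Doc("web_url", "https://example.com"), extractors, max_tokens=2)
bad = process_document(Doc("youtube", "https://example.com/v"), extractors)
```

Passing the extractors in as a dictionary keeps the branch per source type explicit and makes each branch easy to replace with a fake in tests.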

Retrieval Flow

User query in chat
    │
    ▼
Resolve active KBs for session
    │
    ▼
Get documents with status="ready"
    │
    ├─ Semantic search (pgvector cosine similarity) → top 20
    ├─ Keyword search (PostgreSQL full-text) → top 20
    │
    ▼
Reciprocal Rank Fusion → top 5 results
    │
    ▼
Return to agent as tool result
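
The fusion step can be illustrated with a short sketch of Reciprocal Rank Fusion, where each item scores the sum of `1 / (k + rank)` over the lists it appears in. The function name, the chunk IDs, and the constant `k = 60` (a commonly used value) are assumptions; this page does not state which constant the module uses.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]],
    k: int = 60,
    top_n: int = 5,
) -> List[Hashable]:
    """Fuse several ranked lists into one, best item first."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the first item of each list contributes 1 / (k + 1).
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties keep first-seen order because sorted() is stable.
    return sorted(scores, key=lambda item: scores[item], reverse=True)[:top_n]


# Hypothetical chunk IDs returned by the two searches.
semantic_hits = ["c1", "c2", "c3"]
keyword_hits = ["c2", "c4", "c1"]
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits], top_n=3)
```

Because only ranks are used, the semantic and keyword scores never need to be normalised against each other, and chunks that appear in both lists rise to the top.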

Configuration

Setting	Default	Description
`DOCUMENT_EMBEDDING_MODEL`	`text-embedding-3-small`	OpenAI embedding model
`DOCUMENT_EMBEDDING_DIMENSIONS`	`1536`	Vector dimensions
`DOCUMENT_CHUNK_MAX_TOKENS`	`512`	Max tokens per chunk
`DOCUMENT_MAX_UPLOAD_SIZE`	`50MB`	Max file upload size
`DOCUMENT_ALLOWED_EXTENSIONS`	`pdf,docx,doc,txt,csv,md,pptx,xlsx`	Allowed file types
`DOCUMENT_S3_PREFIX`	`documents`	S3 key prefix
`DOCUMENT_PRESIGNED_URL_EXPIRY`	`3600`	Presigned URL TTL (seconds)
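
As an illustration of how the upload limits might be enforced, the sketch below checks a file name and size against the defaults in the table. `validate_upload` is a hypothetical helper, and reading `50MB` as 50 × 1024 × 1024 bytes is an assumption; the module may interpret the limit differently.

```python
from pathlib import PurePosixPath

# Defaults copied from the configuration table.
ALLOWED_EXTENSIONS = frozenset("pdf,docx,doc,txt,csv,md,pptx,xlsx".split(","))
MAX_UPLOAD_SIZE = 50 * 1024 * 1024  # assumes "50MB" means 50 MiB


def validate_upload(filename: str, size_bytes: int) -> str:
    """Return the lower-cased extension, or raise ValueError if the upload is rejected."""
    ext = PurePosixPath(filename).suffix.lstrip(".").lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"file type {ext!r} is not allowed")
    if size_bytes > MAX_UPLOAD_SIZE:
        raise ValueError("file exceeds the maximum upload size")
    return ext


file_type = validate_upload("Report.PDF", 1024)
```

Only the final suffix is checked, so a name such as `archive.tar.gz` is evaluated as `gz` and rejected.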

Permissions

Action	Who Can Access
Read document	Uploader, or workspace member (if KB is shared)
Write/delete document	Uploader, workspace admin, or workspace owner
Access KB	Any workspace member (if KB is workspace-shared), or member of a team granted `KnowledgeBaseTeamAccess`
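
The rules in the table can be expressed as small predicates. This is a sketch only: `Member`, the role names, and the function names are hypothetical, and the caller is assumed to have already confirmed that the member belongs to the document's workspace.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Member:
    """Hypothetical workspace member; role names are assumptions."""
    user_id: str
    role: str = "member"  # "member", "admin", or "owner"
    team_ids: FrozenSet[str] = frozenset()


def can_read_document(member: Member, uploader_id: Optional[str], kb_is_shared: bool) -> bool:
    """Uploader, or any workspace member when the KB is shared."""
    return member.user_id == uploader_id or kb_is_shared


def can_write_document(member: Member, uploader_id: Optional[str]) -> bool:
    """Uploader, workspace admin, or workspace owner."""
    return member.user_id == uploader_id or member.role in {"admin", "owner"}


def can_access_kb(member: Member, kb_is_shared: bool, kb_team_ids: FrozenSet[str]) -> bool:
    """Workspace-shared KB, or membership in a team that was granted access."""
    return kb_is_shared or bool(member.team_ids & kb_team_ids)


alice = Member("alice")
bob = Member("bob", role="admin")
carol = Member("carol", team_ids=frozenset({"research"}))
```

`uploader_id` is optional because the uploader reference is cleared when the user is deleted; in that case only the role-based and sharing-based rules can grant access.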
