Documents & RAG
Overview
Knowledge base management with document ingestion and RAG retrieval.
The documents module provides a complete Retrieval-Augmented Generation (RAG) system, covering document upload and ingestion, chunking, embedding, and hybrid (semantic plus keyword) search.
Core Models
KnowledgeBase
class KnowledgeBase(BaseModel, WorkspaceAwareModel):
    id = models.UUIDField(primary_key=True)
    name = models.CharField(max_length=200)
    description = models.TextField(blank=True)
    # SET_NULL requires a nullable column, hence null=True
    created_by = models.ForeignKey(User, null=True, on_delete=models.SET_NULL)
    is_shared_with_workspace = models.BooleanField(default=False)
    documents = models.ManyToManyField(Document, through="KnowledgeBaseDocument")

Knowledge bases group documents together. They can be shared with an entire workspace or restricted to specific teams via KnowledgeBaseTeamAccess.
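The sharing rule described above can be illustrated with a small, framework-free sketch. The `KnowledgeBaseInfo` dataclass, the `team_ids_with_access` field, and the workspace check are assumptions made for illustration; the real check would query KnowledgeBaseTeamAccess rows and may include further rules (for example, access for the creator).

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeBaseInfo:
    """Plain stand-in for the KnowledgeBase fields relevant to access checks."""
    workspace_id: str
    is_shared_with_workspace: bool = False
    # Hypothetical: IDs of teams that hold a KnowledgeBaseTeamAccess row.
    team_ids_with_access: set = field(default_factory=set)


def can_access_kb(kb, user_workspace_id, user_team_ids):
    """Return True if the user may access the knowledge base."""
    # Assumption: knowledge bases are never visible outside their workspace.
    if kb.workspace_id != user_workspace_id:
        return False
    # Workspace-wide sharing grants access to every member.
    if kb.is_shared_with_workspace:
        return True
    # Otherwise the user needs to belong to at least one team with access.
    return bool(kb.team_ids_with_access & set(user_team_ids))
```

For example, a knowledge base with `team_ids_with_access={"t1"}` would be accessible to a member of team `t1` in the same workspace, but not to a member of team `t2`.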
Document
class Document(BaseModel, WorkspaceAwareModel):
    id = models.UUIDField(primary_key=True)
    name = models.CharField(max_length=500)
    # SET_NULL requires a nullable column, hence null=True
    uploaded_by = models.ForeignKey(User, null=True, on_delete=models.SET_NULL)

    # Source
    # Allowed values shown as shorthand; Django expects (value, label) pairs and a max_length
    source_type = models.CharField(choices=["file", "web_url", "youtube"])
    source_url = models.URLField(max_length=2048, blank=True)

    # S3 Storage
    s3_key = models.CharField(max_length=1024, blank=True)
    original_filename = models.CharField(max_length=500, blank=True)
    file_type = models.CharField(max_length=20, blank=True)  # "pdf", "docx", etc.
    mime_type = models.CharField(max_length=100, blank=True)
    file_size = models.PositiveBigIntegerField(default=0)

    # Processing
    # Allowed values shown as shorthand, as for source_type
    processing_status = models.CharField(
        choices=["pending", "processing", "ready", "failed"]
    )
    processing_error = models.TextField(blank=True)
    embedding_model = models.CharField(max_length=100, blank=True)
    chunk_count = models.PositiveIntegerField(default=0)

Documents support three source types:
- file — Uploaded via S3 presigned URL (PDF, DOCX, TXT, CSV, MD, PPTX, XLSX)
- web_url — Ingested from a web page using Trafilatura
- youtube — Transcript extracted from YouTube videos
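Each source type implies a different extraction path before chunking and embedding. One way to route a document to its extractor is a lookup table keyed by `source_type`. This is a minimal sketch: the three `extract_*` functions are hypothetical placeholders that return descriptive strings rather than real parsed text, and documents are represented as plain dicts.

```python
from typing import Callable, Dict


def extract_file(doc: dict) -> str:
    # Placeholder: a real version would download from S3 and parse the file.
    return f"parsed file {doc['s3_key']}"


def extract_web_url(doc: dict) -> str:
    # Placeholder: a real version would fetch the page and extract its text.
    return f"extracted page {doc['source_url']}"


def extract_youtube(doc: dict) -> str:
    # Placeholder: a real version would fetch the video transcript.
    return f"transcript of {doc['source_url']}"


# One extractor per supported source type.
EXTRACTORS: Dict[str, Callable[[dict], str]] = {
    "file": extract_file,
    "web_url": extract_web_url,
    "youtube": extract_youtube,
}


def extract_text(doc: dict) -> str:
    """Pick the extractor matching the document's source_type and run it."""
    source_type = doc.get("source_type")
    if source_type not in EXTRACTORS:
        raise ValueError(f"Unsupported source_type: {source_type!r}")
    return EXTRACTORS[source_type](doc)
```

Keeping the mapping in one dictionary means a new source type only needs one extractor function and one table entry.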
DocumentChunk
class DocumentChunk(BaseModel):
    id = models.UUIDField(primary_key=True)
    document = models.ForeignKey(Document, on_delete=models.CASCADE)
    chunk_index = models.PositiveIntegerField()
    content = models.TextField()

    # Vector embedding (pgvector)
    embedding = VectorField(dimensions=1536)
    embedding_model = models.CharField(max_length=100)

    # Citation metadata
    page_number = models.PositiveIntegerField(null=True)
    section_heading = models.CharField(max_length=500, blank=True)
    token_count = models.PositiveIntegerField(default=0)
    chunk_metadata = models.JSONField(default=dict)

    class Meta:
        indexes = [
            # Django also requires a name= argument when opclasses is set (omitted here)
            HnswIndex(fields=["embedding"], m=16, ef_construction=64, opclasses=["vector_cosine_ops"]),
            models.Index(fields=["document", "chunk_index"]),
        ]

Each chunk stores its text content, a 1536-dimensional embedding vector, and metadata for citations (page number, section heading).
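To make the chunk fields concrete, the following sketch splits text into pieces bounded by a token limit and records `chunk_index`, `content`, and `token_count` for each piece. It assumes whitespace-separated words as a stand-in for model tokens and ignores page and heading boundaries; the actual chunker is likely more sophisticated. The default of 512 mirrors DOCUMENT_CHUNK_MAX_TOKENS.

```python
def chunk_text(text, max_tokens=512):
    """Split text into chunks of at most max_tokens whitespace-delimited words."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")

    # Assumption: one word counts as one token.
    words = text.split()
    chunks = []
    for start in range(0, len(words), max_tokens):
        piece = words[start:start + max_tokens]
        chunks.append({
            "chunk_index": len(chunks),   # position of the chunk in the document
            "content": " ".join(piece),   # text that would be embedded
            "token_count": len(piece),    # size of the chunk in (approximate) tokens
        })
    return chunks
```

Calling `chunk_text("a b c d e", max_tokens=2)` yields three chunks with token counts 2, 2, and 1.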
Supporting Models
| Model | Purpose |
|---|---|
| KnowledgeBaseDocument | Through table for KB ↔ Document M2M |
| KnowledgeBaseTeamAccess | Team-level read/write access to KBs |
| ChatSessionKnowledgeBase | Links chat sessions to KBs for RAG search |
| DocumentProcessingTask | Tracks Celery task status for ingestion |
Processing Pipeline
Document created (pending)
│
▼
Celery task dispatched
│
├─ File → Download from S3 → Parse (Docling/PyMuPDF) → Chunk → Embed → Save
├─ Web URL → Extract (Trafilatura) → Chunk → Embed → Save
└─ YouTube → Extract transcript → Chunk → Embed → Save
│
▼
Document status → "ready"
Retrieval Flow
User query in chat
│
▼
Resolve active KBs for session
│
▼
Get documents with status="ready"
│
├─ Semantic search (pgvector cosine similarity) → top 20
├─ Keyword search (PostgreSQL full-text) → top 20
│
▼
Reciprocal Rank Fusion → top 5 results
│
▼
Return to agent as tool result
Configuration
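As an illustration of the fusion step in the retrieval flow above, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists of chunk IDs. The function name, the constant `k=60` (a commonly used default for RRF), and the alphabetical tie-break are assumptions; the values used by this system are not stated here.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=5):
    """Fuse several ranked lists of IDs into one list of the top_n IDs.

    Each ID scores 1 / (k + rank) in every list that contains it,
    with rank starting at 1, and the scores are summed across lists.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)

    # Highest fused score first; ties broken by ID for a stable order.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ordered[:top_n]


semantic_hits = ["c1", "c2", "c3"]   # e.g. from vector similarity search
keyword_hits = ["c3", "c1", "c4"]    # e.g. from full-text search
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits], top_n=3)
```

Chunks that appear in both lists accumulate two contributions, so here `c1` and `c3` rank ahead of `c2` and `c4`, which each appear in only one list.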
| Setting | Default | Description |
|---|---|---|
| DOCUMENT_EMBEDDING_MODEL | text-embedding-3-small | OpenAI embedding model |
| DOCUMENT_EMBEDDING_DIMENSIONS | 1536 | Vector dimensions |
| DOCUMENT_CHUNK_MAX_TOKENS | 512 | Max tokens per chunk |
| DOCUMENT_MAX_UPLOAD_SIZE | 50MB | Max file upload size |
| DOCUMENT_ALLOWED_EXTENSIONS | pdf,docx,doc,txt,csv,md,pptx,xlsx | Allowed file types |
| DOCUMENT_S3_PREFIX | documents | S3 key prefix |
| DOCUMENT_PRESIGNED_URL_EXPIRY | 3600 | Presigned URL TTL (seconds) |
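The upload-related settings suggest a validation step before a presigned URL is issued. The sketch below shows one possible check; the function name and error messages are hypothetical, and it assumes "50MB" means 50 × 1024 × 1024 bytes.

```python
import os

# Defaults copied from the settings table above.
ALLOWED_EXTENSIONS = {"pdf", "docx", "doc", "txt", "csv", "md", "pptx", "xlsx"}
MAX_UPLOAD_SIZE = 50 * 1024 * 1024  # assumption: "50MB" interpreted as 50 MiB


def validate_upload(filename, size_bytes):
    """Return a list of validation errors; an empty list means the upload is acceptable."""
    errors = []

    # Compare the extension case-insensitively and without the leading dot.
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    if extension not in ALLOWED_EXTENSIONS:
        errors.append(f"File type '{extension}' is not allowed")

    if size_bytes > MAX_UPLOAD_SIZE:
        errors.append(f"File exceeds the maximum size of {MAX_UPLOAD_SIZE} bytes")

    return errors
```

Returning a list rather than raising on the first problem lets a caller report every issue at once.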
Permissions
| Action | Who Can Access |
|---|---|
| Read document | Uploader, or workspace member (if KB is shared) |
| Write/delete document | Uploader, workspace admin, or workspace owner |
| Access KB | Workspace-shared, or team with KnowledgeBaseTeamAccess |
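The document rows of the table above can be expressed as two small predicates. This is a sketch under assumptions: the function names, the boolean inputs, and the role strings `"admin"` and `"owner"` are hypothetical, and the real checks would derive these values from the user, workspace, and knowledge base records.

```python
def can_read_document(user_id, uploader_id, is_workspace_member, kb_is_shared):
    """Uploader can always read; workspace members can read if the KB is shared."""
    if user_id == uploader_id:
        return True
    return is_workspace_member and kb_is_shared


def can_write_document(user_id, uploader_id, workspace_role):
    """Uploader, workspace admin, or workspace owner may write or delete."""
    if user_id == uploader_id:
        return True
    # Hypothetical role names for the workspace admin and owner.
    return workspace_role in {"admin", "owner"}
```

Under these rules, a regular workspace member could read a document in a shared knowledge base but could not modify or delete it unless they uploaded it.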