RAG Best Practices
This guide covers practical recommendations for building high-quality retrieval-augmented generation (RAG) systems with Tensoras Knowledge Bases. Each section addresses a specific aspect of the RAG pipeline with actionable guidance.
Chunking Strategy
How you split documents into chunks directly affects retrieval quality. Chunks that are too large dilute the relevant information; chunks that are too small lose context.
Recommended Settings
- Chunk size: 256-512 tokens per chunk is a good default for most use cases.
- Overlap: 10-15% overlap (25-50 tokens) helps preserve context across chunk boundaries; the sizing sketch after this list shows the arithmetic.
- Strategy: Use semantic chunking when available. It splits documents at natural boundaries (paragraphs, sections) rather than at fixed token counts.
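To see how these numbers interact, here is a rough sizing sketch (plain Python, not a Tensoras API): it derives the overlap token count from a percentage and estimates how many chunks a document will produce under fixed-size chunking.

import math

def estimate_chunks(doc_tokens: int, chunk_size: int = 400, overlap_pct: float = 0.10) -> int:
    """Rough chunk-count estimate for fixed-size chunking with overlap."""
    overlap = int(chunk_size * overlap_pct)  # e.g. 400 * 0.10 = 40 overlap tokens
    stride = chunk_size - overlap            # each new chunk advances by this many tokens
    return max(1, math.ceil((doc_tokens - overlap) / stride))

# A 10,000-token document with 400-token chunks and 10% overlap yields ~28 chunks.
print(estimate_chunks(10_000))

Actual counts will differ with semantic chunking, since splits follow document structure rather than a fixed stride.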
Configure Chunking
from tensoras import Tensoras
client = Tensoras()
kb = client.knowledge_bases.create(
    name="product-docs",
    description="Product documentation",
    chunking={
        "strategy": "semantic",  # "semantic", "fixed", or "sentence"
        "chunk_size": 400,       # target tokens per chunk
        "chunk_overlap": 40,     # overlap tokens between adjacent chunks
    },
)

import Tensoras from "tensoras";
const client = new Tensoras();
const kb = await client.knowledgeBases.create({
  name: "product-docs",
  description: "Product documentation",
  chunking: {
    strategy: "semantic", // "semantic", "fixed", or "sentence"
    chunkSize: 400,       // target tokens per chunk
    chunkOverlap: 40,     // overlap tokens between adjacent chunks
  },
});

Chunking Strategies Compared
| Strategy | How It Works | Best For |
|---|---|---|
| semantic | Splits at paragraph and section boundaries, respecting document structure | Long-form documents, technical docs, articles |
| fixed | Splits at a fixed token count with overlap | Uniform content, logs, raw text |
| sentence | Splits at sentence boundaries, grouping sentences to reach the target size | FAQ pages, short-form content |
See the Chunking Strategies reference for full details.
Embedding Model Selection
Tensoras uses bge-large-en-v1.5 as the default embedding model. It provides a good balance of quality and speed for English-language content.
| Model | Dimensions | Strengths |
|---|---|---|
| bge-large-en-v1.5 | 1024 | Strong general-purpose quality, fast inference |
The embedding model is set at Knowledge Base creation time. All documents in a Knowledge Base are embedded with the same model to ensure consistent vector similarity scoring.
kb = client.knowledge_bases.create(
    name="product-docs",
    embedding_model="bge-large-en-v1.5",  # default
)

Tip: If you need to change the embedding model, create a new Knowledge Base and re-ingest your documents. Mixing embeddings from different models in the same index produces poor retrieval results.
Hybrid Search
Tensoras Knowledge Bases support hybrid search, which combines vector similarity search with keyword (BM25) search. Hybrid search is enabled by default and is recommended for most use cases.
Why Hybrid Search Matters
- Vector search excels at finding semantically similar content, even when the exact words differ.
- Keyword search excels at matching specific terms, names, acronyms, and identifiers that embeddings may not capture well.
- Hybrid search combines both, giving you the best recall across a wider range of queries; the sketch after this list illustrates one common way the two result lists can be fused.
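The exact fusion method is internal to Tensoras and is not specified here. Purely as an illustration of how a vector-ranked list and a keyword-ranked list can be merged, the sketch below uses reciprocal rank fusion, one common approach; the chunk IDs and the k constant are arbitrary.

def reciprocal_rank_fusion(vector_ranked: list[str], keyword_ranked: list[str], k: int = 60) -> list[str]:
    """Merge two ranked lists of chunk IDs; chunks found by both searches rise to the top."""
    scores: dict[str, float] = {}
    for ranked in (vector_ranked, keyword_ranked):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "c1" appears near the top of both lists, so it wins the fused ranking.
print(reciprocal_rank_fusion(["c3", "c1", "c7"], ["c1", "c9", "c3"]))  # ['c1', 'c3', 'c9', 'c7']

Tensoras performs this kind of fusion for you when hybrid search is enabled; the sketch is only meant to build intuition for why hybrid recall is broader than either mode alone.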
Configure Search Mode
You can override the search mode per query:
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "user", "content": "What is error code E-4012?"},
    ],
    knowledge_bases=["kb_a1b2c3d4"],
    retrieval={
        "search_mode": "hybrid",  # "hybrid", "vector", or "keyword"
        "top_k": 10,              # number of chunks to retrieve
    },
)

const response = await client.chat.completions.create({
  model: "llama-3.3-70b",
  messages: [
    { role: "user", content: "What is error code E-4012?" },
  ],
  knowledgeBases: ["kb_a1b2c3d4"],
  retrieval: {
    searchMode: "hybrid", // "hybrid", "vector", or "keyword"
    topK: 10,             // number of chunks to retrieve
  },
});

For queries that contain specific identifiers (error codes, product SKUs, account numbers), keyword search adds significant value. For open-ended questions, vector search dominates. Hybrid gives you both.
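One practical way to verify this on your own data is to compare modes side by side with the standalone retrieval endpoint (covered under Evaluation and Monitoring below). Note that passing search_mode to retrieval.query is an assumption in this sketch, mirroring the chat retrieval options above.

# Compare how each search mode ranks results for an identifier-heavy query.
for mode in ("vector", "keyword", "hybrid"):
    results = client.retrieval.query(
        knowledge_base_id="kb_a1b2c3d4",
        query="What is error code E-4012?",
        top_k=3,
        search_mode=mode,  # assumed parameter; mirrors "search_mode" in chat retrieval
    )
    top = results.chunks[0]
    print(f"{mode:>8}: top score {top.score:.3f} | {top.text[:60]}...")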
Reranking
After the initial retrieval step returns candidate chunks, a reranking model re-scores them for relevance. This improves the precision of the final results by pushing the most relevant chunks to the top.
Enable Reranking
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "user", "content": "How do I configure SSO?"},
    ],
    knowledge_bases=["kb_a1b2c3d4"],
    retrieval={
        "top_k": 20,        # retrieve 20 candidates
        "rerank": True,     # re-score with reranker
        "rerank_model": "bge-reranker-v2-m3",
        "rerank_top_k": 5,  # keep top 5 after reranking
    },
)

const response = await client.chat.completions.create({
  model: "llama-3.3-70b",
  messages: [
    { role: "user", content: "How do I configure SSO?" },
  ],
  knowledgeBases: ["kb_a1b2c3d4"],
  retrieval: {
    topK: 20,       // retrieve 20 candidates
    rerank: true,   // re-score with reranker
    rerankModel: "bge-reranker-v2-m3",
    rerankTopK: 5,  // keep top 5 after reranking
  },
});

When to Use Reranking
- Recommended when precision matters more than latency (e.g., customer-facing Q&A, support bots).
- Skip when you need the lowest possible latency and are comfortable with vector-only retrieval (e.g., internal search autocomplete).
Reranking adds a small amount of latency (typically 50-150 ms) but significantly improves the quality of the top results. The cost is $0.02 per million tokens.
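As a rough worked example of that cost, using the 400-token chunks recommended above and the top_k of 20 from the example (and ignoring the query tokens themselves):

candidates = 20           # top_k candidates sent to the reranker
tokens_per_chunk = 400    # target chunk size from the chunking section
price_per_million = 0.02  # USD per million tokens

cost_per_query = candidates * tokens_per_chunk / 1_000_000 * price_per_million
print(f"${cost_per_query:.5f} per reranked query")  # $0.00016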
Metadata Filtering
Add metadata to your documents to enable filtered retrieval. This is useful when your Knowledge Base contains documents from multiple categories, teams, or time periods.
Add Metadata During Ingestion
Metadata can come from:
- File uploads: Set metadata explicitly when creating the data source
- S3 connector: Use folder prefixes or file naming conventions parsed during ingestion
- Database connector: Map columns to metadata fields using metadata_columns (a hedged sketch follows this list)
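For the database connector, the mapping might look like the sketch below. The metadata_columns shape and the type value are assumptions extrapolated from the file-upload pattern; connection settings are omitted here (see the Database Connector guide).

# Hypothetical sketch: map source table columns to metadata fields.
data_source = client.knowledge_bases.data_sources.create(
    knowledge_base_id=kb.id,
    type="database",  # assumed type value for illustration
    metadata_columns={
        "department": "dept_name",   # metadata field <- source column
        "doc_type": "document_type",
    },
)

For file uploads, set the metadata explicitly when creating the data source: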
data_source = client.knowledge_bases.data_sources.create(
    knowledge_base_id=kb.id,
    type="file_upload",
    file=open("engineering-handbook.pdf", "rb"),
    metadata={
        "department": "engineering",
        "doc_type": "handbook",
        "version": "2.1",
    },
)

Filter at Query Time
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "user", "content": "What is our code review process?"},
    ],
    knowledge_bases=["kb_a1b2c3d4"],
    retrieval={
        "filter": {
            "department": "engineering",
        },
    },
)

const response = await client.chat.completions.create({
  model: "llama-3.3-70b",
  messages: [
    { role: "user", content: "What is our code review process?" },
  ],
  knowledgeBases: ["kb_a1b2c3d4"],
  retrieval: {
    filter: {
      department: "engineering",
    },
  },
});

Metadata filtering narrows the search space before vector similarity is computed, which improves both relevance and speed.
Prompt Engineering for RAG
The system prompt plays a critical role in how the model uses retrieved context. A well-crafted system prompt reduces hallucination and improves answer quality.
Recommended System Prompt Template
system_prompt = """You are a helpful assistant that answers questions based on the provided context.
Instructions:
- Answer the user's question using ONLY the information from the retrieved context.
- If the context does not contain enough information to answer the question, say "I don't have enough information to answer that question."
- Do not make up information that is not in the context.
- When possible, cite the source document.
- Be concise and direct."""
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "What is our PTO policy?"},
    ],
    knowledge_bases=["kb_a1b2c3d4"],
)

Tips
- Be explicit about grounding: Tell the model to use “only the provided context.”
- Handle missing information: Instruct the model what to say when the context does not contain the answer.
- Request citations: Ask the model to reference source documents so users can verify.
- Set the tone: Match the system prompt to your application (formal for enterprise, conversational for consumer).
Evaluation and Monitoring
Track retrieval quality to identify issues and improve your RAG pipeline over time.
Relevance Scores
Every retrieval result includes a relevance score (0.0 to 1.0). Monitor these scores to detect degradation:
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "user", "content": "How do I reset my password?"},
    ],
    knowledge_bases=["kb_a1b2c3d4"],
)
for citation in response.citations:
    print(f"Source: {citation.source}, Score: {citation.score:.3f}")

Quality Indicators
| Signal | Healthy Range | Action if Outside Range |
|---|---|---|
| Top-1 relevance score | > 0.80 | Review chunking strategy, check for missing documents |
| Top-5 average score | > 0.60 | Try hybrid search, add reranking |
| User thumbs-down rate | < 10% | Examine low-rated responses, improve system prompt |
| “I don’t know” response rate | < 20% | Check for missing content in the Knowledge Base |
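To act on the first two rows of this table programmatically, a small check over the citation scores can flag weak retrievals. The field names follow the citations example above; the thresholds come from the table, and the alerting logic itself is just a sketch.

def retrieval_quality_warnings(citations) -> list[str]:
    """Flag responses whose citation scores fall outside the healthy ranges above."""
    if not citations:
        return ["no chunks retrieved"]
    scores = sorted((c.score for c in citations), reverse=True)
    warnings = []
    if scores[0] < 0.80:
        warnings.append(f"top-1 score {scores[0]:.2f} is below 0.80")
    top5 = scores[:5]
    if sum(top5) / len(top5) < 0.60:
        warnings.append(f"top-5 average {sum(top5) / len(top5):.2f} is below 0.60")
    return warnings

for warning in retrieval_quality_warnings(response.citations):
    print("Retrieval quality warning:", warning)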
Standalone Retrieval Testing
Use the retrieval endpoint directly to test search quality without running a full chat completion:
results = client.retrieval.query(
    knowledge_base_id="kb_a1b2c3d4",
    query="password reset",
    top_k=5,
)
for result in results.chunks:
    print(f"Score: {result.score:.3f} | {result.text[:100]}...")

This lets you iterate on chunking, search mode, and metadata filters without spending tokens on generation.
Scaling Large Knowledge Bases
For Knowledge Bases with hundreds of thousands or millions of documents:
Partition by Topic or Domain
Instead of one massive Knowledge Base, create multiple focused Knowledge Bases and route queries to the relevant one:
# Route based on user intent or metadata
if user_department == "engineering":
    kb_ids = ["kb_engineering"]
elif user_department == "sales":
    kb_ids = ["kb_sales"]
else:
    kb_ids = ["kb_engineering", "kb_sales"]  # search both

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": user_query}],
    knowledge_bases=kb_ids,
)

Use Metadata Filters to Narrow Search
Metadata filters reduce the search space, which improves both latency and relevance at scale. Always add metadata (department, document type, date range) when ingesting documents.
Monitor Ingestion Performance
For large-scale ingestion:
- Use the S3 connector for batch document uploads
- Monitor ingestion jobs via the Ingestion Jobs API (a hedged polling sketch follows this list)
- Stagger syncs across data sources to avoid overloading ingestion workers
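A polling loop over ingestion jobs might look like the sketch below. The method and field names here are hypothetical placeholders; consult the Ingestion Jobs API reference for the actual surface.

import time

# Hypothetical polling sketch; replace the method/field names with the real
# Ingestion Jobs API calls from its reference documentation.
job = client.knowledge_bases.ingestion_jobs.retrieve(
    knowledge_base_id=kb.id,
    job_id="ij_a1b2c3",  # placeholder job ID
)
while job.status in ("pending", "running"):
    time.sleep(30)
    job = client.knowledge_bases.ingestion_jobs.retrieve(
        knowledge_base_id=kb.id,
        job_id=job.id,
    )
print(f"Ingestion job finished with status: {job.status}")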
Quick Reference
| Aspect | Recommendation |
|---|---|
| Chunk size | 256-512 tokens |
| Chunk overlap | 10-15% |
| Chunking strategy | Semantic (default) |
| Embedding model | bge-large-en-v1.5 |
| Search mode | Hybrid (default) |
| Reranking | Enable for precision-critical use cases |
| Retrieval top_k | 10-20, then rerank to top 5 |
| Metadata | Always add; use for filtering |
| System prompt | Explicitly instruct grounding and citation behavior |
Next Steps
- RAG Overview — how Tensoras RAG works end to end
- Chunking Strategies — detailed chunking configuration
- Hybrid Search — vector + keyword search deep dive
- Citations — source attribution in responses
- S3 Connector — ingest documents from S3
- Database Connector — ingest data from PostgreSQL or MySQL