
RAG Best Practices

This guide covers practical recommendations for building high-quality retrieval-augmented generation (RAG) systems with Tensoras Knowledge Bases. Each section addresses a specific aspect of the RAG pipeline with actionable guidance.

Chunking Strategy

How you split documents into chunks directly affects retrieval quality. Chunks that are too large dilute the relevant information; chunks that are too small lose context.

  • Chunk size: 256–512 tokens per chunk is a good default for most use cases.
  • Overlap: 10–15% overlap (25–50 tokens) helps preserve context across chunk boundaries.
  • Strategy: Use semantic chunking when available. It splits documents at natural boundaries (paragraphs, sections) rather than at fixed token counts.

Configure Chunking

Python

from tensoras import Tensoras
 
client = Tensoras()
 
kb = client.knowledge_bases.create(
    name="product-docs",
    description="Product documentation",
    chunking={
        "strategy": "semantic",     # "semantic", "fixed", or "sentence"
        "chunk_size": 400,          # target tokens per chunk
        "chunk_overlap": 40,        # overlap tokens between adjacent chunks
    },
)

TypeScript
import Tensoras from "tensoras";
 
const client = new Tensoras();
 
const kb = await client.knowledgeBases.create({
  name: "product-docs",
  description: "Product documentation",
  chunking: {
    strategy: "semantic",     // "semantic", "fixed", or "sentence"
    chunkSize: 400,           // target tokens per chunk
    chunkOverlap: 40,         // overlap tokens between adjacent chunks
  },
});

Chunking Strategies Compared

Strategy | How It Works | Best For
semantic | Splits at paragraph and section boundaries, respecting document structure | Long-form documents, technical docs, articles
fixed | Splits at a fixed token count with overlap | Uniform content, logs, raw text
sentence | Splits at sentence boundaries, grouping sentences to reach the target size | FAQ pages, short-form content

See the Chunking Strategies reference for full details.
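
To build intuition for the fixed strategy, here is a minimal sketch of fixed-size chunking with overlap. This is purely illustrative (Tensoras performs chunking server-side, and real tokenization is model-specific); it treats a pre-tokenized list as input.

```python
def chunk_fixed(tokens, chunk_size=400, overlap=40):
    """Split a token sequence into fixed-size chunks; adjacent chunks share `overlap` tokens."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]
    # Drop a trailing chunk that is entirely contained in the previous one.
    if len(chunks) > 1 and len(chunks[-1]) <= overlap:
        chunks.pop()
    return chunks

doc = [f"tok{i}" for i in range(1000)]
chunks = chunk_fixed(doc, chunk_size=400, overlap=40)
print(len(chunks))  # → 3 (spans 0-399, 360-759, 720-999)
```

Note how the last 40 tokens of each chunk reappear at the start of the next one, so a sentence straddling a boundary is fully contained in at least one chunk.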

Embedding Model Selection

Tensoras uses bge-large-en-v1.5 as the default embedding model. It provides a good balance of quality and speed for English-language content.

Model | Dimensions | Strengths
bge-large-en-v1.5 | 1024 | Strong general-purpose quality, fast inference

The embedding model is set at Knowledge Base creation time. All documents in a Knowledge Base are embedded with the same model to ensure consistent vector similarity scoring.

kb = client.knowledge_bases.create(
    name="product-docs",
    embedding_model="bge-large-en-v1.5",  # default
)

Tip: If you need to change the embedding model, create a new Knowledge Base and re-ingest your documents. Mixing embeddings from different models in the same index produces poor retrieval results.

Hybrid Search

Tensoras Knowledge Bases support hybrid search, which combines vector similarity search with keyword (BM25) search. Hybrid search is enabled by default and is recommended for most use cases.

Why Hybrid Search Matters

  • Vector search excels at finding semantically similar content, even when the exact words differ.
  • Keyword search excels at matching specific terms, names, acronyms, and identifiers that embeddings may not capture well.
  • Hybrid search combines both, giving you the best recall across a wider range of queries.
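
One common way to fuse a vector result list with a keyword result list is reciprocal rank fusion (RRF). Tensoras does not document its exact fusion method, so treat this as an illustrative sketch of the general idea rather than the platform's implementation:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked result lists: each doc scores sum(1 / (k + rank)) over the lists it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]   # semantic matches
keyword_hits = ["doc_d", "doc_a", "doc_e"]  # exact-term matches (e.g. an error code)
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused[0])  # → doc_a (ranks high in both lists, so it wins the fused ranking)
```

A document that appears in both lists outscores one that appears in only one, which is exactly the behavior you want for queries that mix semantic intent with a specific identifier.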

Configure Search Mode

You can override the search mode per query:

Python

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "user", "content": "What is error code E-4012?"},
    ],
    knowledge_bases=["kb_a1b2c3d4"],
    retrieval={
        "search_mode": "hybrid",   # "hybrid", "vector", or "keyword"
        "top_k": 10,               # number of chunks to retrieve
    },
)

TypeScript
const response = await client.chat.completions.create({
  model: "llama-3.3-70b",
  messages: [
    { role: "user", content: "What is error code E-4012?" },
  ],
  knowledgeBases: ["kb_a1b2c3d4"],
  retrieval: {
    searchMode: "hybrid",   // "hybrid", "vector", or "keyword"
    topK: 10,               // number of chunks to retrieve
  },
});

For queries that contain specific identifiers (error codes, product SKUs, account numbers), keyword search adds significant value. For open-ended questions, vector search dominates. Hybrid gives you both.

Reranking

After the initial retrieval step returns candidate chunks, a reranking model re-scores them for relevance. This improves the precision of the final results by pushing the most relevant chunks to the top.

Enable Reranking

Python

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "user", "content": "How do I configure SSO?"},
    ],
    knowledge_bases=["kb_a1b2c3d4"],
    retrieval={
        "top_k": 20,                    # retrieve 20 candidates
        "rerank": True,                  # re-score with reranker
        "rerank_model": "bge-reranker-v2-m3",
        "rerank_top_k": 5,              # keep top 5 after reranking
    },
)

TypeScript
const response = await client.chat.completions.create({
  model: "llama-3.3-70b",
  messages: [
    { role: "user", content: "How do I configure SSO?" },
  ],
  knowledgeBases: ["kb_a1b2c3d4"],
  retrieval: {
    topK: 20,                    // retrieve 20 candidates
    rerank: true,                 // re-score with reranker
    rerankModel: "bge-reranker-v2-m3",
    rerankTopK: 5,               // keep top 5 after reranking
  },
});

When to Use Reranking

  • Recommended when precision matters more than latency (e.g., customer-facing Q&A, support bots).
  • Skip when you need the lowest possible latency and are comfortable with vector-only retrieval (e.g., internal search autocomplete).

Reranking adds a small amount of latency (typically 50–150 ms) but significantly improves the quality of the top results. The cost is $0.02 per million tokens.

Metadata Filtering

Add metadata to your documents to enable filtered retrieval. This is useful when your Knowledge Base contains documents from multiple categories, teams, or time periods.

Add Metadata During Ingestion

Metadata can come from:

  • File uploads: Set metadata explicitly when creating the data source
  • S3 connector: Use folder prefixes or file naming conventions parsed during ingestion
  • Database connector: Map columns to metadata fields using metadata_columns

data_source = client.knowledge_bases.data_sources.create(
    knowledge_base_id=kb.id,
    type="file_upload",
    file=open("engineering-handbook.pdf", "rb"),
    metadata={
        "department": "engineering",
        "doc_type": "handbook",
        "version": "2.1",
    },
)

Filter at Query Time

Python
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "user", "content": "What is our code review process?"},
    ],
    knowledge_bases=["kb_a1b2c3d4"],
    retrieval={
        "filter": {
            "department": "engineering",
        },
    },
)

TypeScript
const response = await client.chat.completions.create({
  model: "llama-3.3-70b",
  messages: [
    { role: "user", content: "What is our code review process?" },
  ],
  knowledgeBases: ["kb_a1b2c3d4"],
  retrieval: {
    filter: {
      department: "engineering",
    },
  },
});

Metadata filtering narrows the search space before vector similarity is computed, which improves both relevance and speed.
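
The filter-then-rank order can be illustrated with a toy in-memory index. This is purely illustrative (the real filtering and similarity scoring run server-side), using a plain dot product as a stand-in for vector similarity:

```python
def retrieve(index, query_vec, metadata_filter=None, top_k=5):
    """Apply the metadata filter first, then score only the surviving chunks."""
    candidates = [
        c for c in index
        if not metadata_filter
        or all(c["metadata"].get(k) == v for k, v in metadata_filter.items())
    ]
    scored = sorted(
        candidates,
        key=lambda c: sum(q * x for q, x in zip(query_vec, c["vector"])),  # dot product
        reverse=True,
    )
    return scored[:top_k]

index = [
    {"id": 1, "vector": [1.0, 0.0], "metadata": {"department": "engineering"}},
    {"id": 2, "vector": [0.9, 0.1], "metadata": {"department": "sales"}},
    {"id": 3, "vector": [0.5, 0.5], "metadata": {"department": "engineering"}},
]
hits = retrieve(index, query_vec=[1.0, 0.0], metadata_filter={"department": "engineering"})
print([h["id"] for h in hits])  # → [1, 3]; the sales chunk never enters scoring
```

Because the sales chunk is excluded before scoring, it can never crowd out an engineering chunk from the top-k, no matter how similar its vector is.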

Prompt Engineering for RAG

The system prompt plays a critical role in how the model uses retrieved context. A well-crafted system prompt reduces hallucination and improves answer quality.

system_prompt = """You are a helpful assistant that answers questions based on the provided context.
 
Instructions:
- Answer the user's question using ONLY the information from the retrieved context.
- If the context does not contain enough information to answer the question, say "I don't have enough information to answer that question."
- Do not make up information that is not in the context.
- When possible, cite the source document.
- Be concise and direct."""
 
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "What is our PTO policy?"},
    ],
    knowledge_bases=["kb_a1b2c3d4"],
)

Tips

  • Be explicit about grounding: Tell the model to use “only the provided context.”
  • Handle missing information: Instruct the model what to say when the context does not contain the answer.
  • Request citations: Ask the model to reference source documents so users can verify.
  • Set the tone: Match the system prompt to your application (formal for enterprise, conversational for consumer).

Evaluation and Monitoring

Track retrieval quality to identify issues and improve your RAG pipeline over time.

Relevance Scores

Every retrieval result includes a relevance score (0.0 to 1.0). Monitor these scores to detect degradation:

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "user", "content": "How do I reset my password?"},
    ],
    knowledge_bases=["kb_a1b2c3d4"],
)
 
for citation in response.citations:
    print(f"Source: {citation.source}, Score: {citation.score:.3f}")

Quality Indicators

Signal | Healthy Range | Action if Outside Range
Top-1 relevance score | > 0.80 | Review chunking strategy, check for missing documents
Top-5 average score | > 0.60 | Try hybrid search, add reranking
User thumbs-down rate | < 10% | Examine low-rated responses, improve system prompt
"I don't know" response rate | < 20% | Check for missing content in the Knowledge Base
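
These checks are easy to automate. A minimal sketch, assuming you already aggregate citation scores and user-feedback rates per response (the function and its thresholds mirror the table above; the input shapes are assumptions):

```python
def health_report(top1_scores, top5_avgs, thumbs_down_rate, idk_rate):
    """Flag quality signals that fall outside the healthy ranges."""
    issues = []
    if sum(top1_scores) / len(top1_scores) < 0.80:
        issues.append("top-1 relevance low: review chunking, check for missing documents")
    if sum(top5_avgs) / len(top5_avgs) < 0.60:
        issues.append("top-5 average low: try hybrid search, add reranking")
    if thumbs_down_rate > 0.10:
        issues.append("thumbs-down rate high: examine low-rated responses")
    if idk_rate > 0.20:
        issues.append("'I don't know' rate high: check for missing content")
    return issues

issues = health_report([0.91, 0.85], [0.55, 0.58], thumbs_down_rate=0.04, idk_rate=0.08)
print(issues)  # only the top-5 average signal fires
```

Running a report like this on a rolling window (daily or weekly) catches slow degradation, such as a stale Knowledge Base drifting away from what users are asking about.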

Standalone Retrieval Testing

Use the retrieval endpoint directly to test search quality without running a full chat completion:

results = client.retrieval.query(
    knowledge_base_id="kb_a1b2c3d4",
    query="password reset",
    top_k=5,
)
 
for result in results.chunks:
    print(f"Score: {result.score:.3f} | {result.text[:100]}...")

This lets you iterate on chunking, search mode, and metadata filters without spending tokens on generation.

Scaling Large Knowledge Bases

For Knowledge Bases with hundreds of thousands or millions of documents:

Partition by Topic or Domain

Instead of one massive Knowledge Base, create multiple focused Knowledge Bases and route queries to the relevant one:

# Route based on user intent or metadata
if user_department == "engineering":
    kb_ids = ["kb_engineering"]
elif user_department == "sales":
    kb_ids = ["kb_sales"]
else:
    kb_ids = ["kb_engineering", "kb_sales"]  # search both
 
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": user_query}],
    knowledge_bases=kb_ids,
)

Metadata filters reduce the search space, which improves both latency and relevance at scale. Always add metadata (department, document type, date range) when ingesting documents.

Monitor Ingestion Performance

For large-scale ingestion:

  • Use the S3 connector for batch document uploads
  • Monitor ingestion jobs via the Ingestion Jobs API
  • Stagger syncs across data sources to avoid overloading ingestion workers

Quick Reference

Aspect | Recommendation
Chunk size | 256–512 tokens
Chunk overlap | 10–15%
Chunking strategy | Semantic (default)
Embedding model | bge-large-en-v1.5
Search mode | Hybrid (default)
Reranking | Enable for precision-critical use cases
Retrieval top_k | 10–20, then rerank to top 5
Metadata | Always add; use for filtering
System prompt | Explicitly instruct grounding and citation behavior

Next Steps