RAG Best Practices
This guide covers practical recommendations for building high-quality retrieval-augmented generation (RAG) systems with Tensoras Knowledge Bases. Each section addresses a specific aspect of the RAG pipeline with actionable guidance.
Chunking Strategy
How you split documents into chunks directly affects retrieval quality. Chunks that are too large dilute the relevant information; chunks that are too small lose context.
Recommended Settings
- Chunk size: 256-512 tokens per chunk is a good default for most use cases.
- Overlap: 10-15% overlap (25-50 tokens) helps preserve context across chunk boundaries; the sizing sketch after this list shows the arithmetic.
- Strategy: Use semantic chunking when available. It splits documents at natural boundaries (paragraphs, sections) rather than at fixed token counts.
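To see how these numbers interact, here is a rough sizing sketch (plain Python, not a Tensoras API): it derives the overlap token count from a percentage and estimates how many chunks a document will produce under fixed-size chunking.

import math

def estimate_chunks(doc_tokens: int, chunk_size: int = 400, overlap_pct: float = 0.10) -> int:
    """Rough chunk-count estimate for fixed-size chunking with overlap."""
    overlap = int(chunk_size * overlap_pct)  # e.g. 400 * 0.10 = 40 overlap tokens
    stride = chunk_size - overlap            # each new chunk advances by this many tokens
    return max(1, math.ceil((doc_tokens - overlap) / stride))

# A 10,000-token document with 400-token chunks and 10% overlap yields ~28 chunks.
print(estimate_chunks(10_000))

Actual counts will differ with semantic chunking, since splits follow document structure rather than a fixed stride.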
Configure Chunking
from tensoras import Tensoras
client = Tensoras()
kb = client.knowledge_bases.create(
    name="product-docs",
    description="Product documentation",
    chunking={
        "strategy": "semantic",  # "semantic", "fixed", or "sentence"
        "chunk_size": 400,       # target tokens per chunk
        "chunk_overlap": 40,     # overlap tokens between adjacent chunks
    },
)

import Tensoras from "tensoras";
const client = new Tensoras();
const kb = await client.knowledgeBases.create({
  name: "product-docs",
  description: "Product documentation",
  chunking: {
    strategy: "semantic", // "semantic", "fixed", or "sentence"
    chunkSize: 400,       // target tokens per chunk
    chunkOverlap: 40,     // overlap tokens between adjacent chunks
  },
});

Chunking Strategies Compared
| Strategy | How It Works | Best For |
|---|---|---|
| semantic | Splits at paragraph and section boundaries, respecting document structure | Long-form documents, technical docs, articles |
| fixed | Splits at a fixed token count with overlap | Uniform content, logs, raw text |
| sentence | Splits at sentence boundaries, grouping sentences to reach the target size | FAQ pages, short-form content |
See the Chunking Strategies reference for full details.
Embedding Model Selection
Tensoras uses bge-large-en-v1.5 as the default embedding model. It provides a good balance of quality and speed for English-language content.
| Model | Dimensions | Strengths |
|---|---|---|
| bge-large-en-v1.5 | 1024 | Strong general-purpose quality, fast inference |
The embedding model is set at Knowledge Base creation time. All documents in a Knowledge Base are embedded with the same model to ensure consistent vector similarity scoring.
kb = client.knowledge_bases.create(
    name="product-docs",
    embedding_model="bge-large-en-v1.5",  # default
)

Tip: If you need to change the embedding model, create a new Knowledge Base and re-ingest your documents. Mixing embeddings from different models in the same index produces poor retrieval results.
Hybrid Search
Tensoras Knowledge Bases support hybrid search, which combines vector similarity search with keyword (BM25) search. Hybrid search is enabled by default and is recommended for most use cases.
Why Hybrid Search Matters
- Vector search excels at finding semantically similar content, even when the exact words differ.
- Keyword search excels at matching specific terms, names, acronyms, and identifiers that embeddings may not capture well.
- Hybrid search combines both, giving you the best recall across a wider range of queries; the sketch after this list illustrates one common way the two result lists can be fused.
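The exact fusion method is internal to Tensoras and is not specified here. Purely as an illustration of how a vector-ranked list and a keyword-ranked list can be merged, the sketch below uses reciprocal rank fusion, one common approach; the chunk IDs and the k constant are arbitrary.

def reciprocal_rank_fusion(vector_ranked: list[str], keyword_ranked: list[str], k: int = 60) -> list[str]:
    """Merge two ranked lists of chunk IDs; chunks found by both searches rise to the top."""
    scores: dict[str, float] = {}
    for ranked in (vector_ranked, keyword_ranked):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "c1" appears near the top of both lists, so it wins the fused ranking.
print(reciprocal_rank_fusion(["c3", "c1", "c7"], ["c1", "c9", "c3"]))  # ['c1', 'c3', 'c9', 'c7']

Tensoras performs this kind of fusion for you when hybrid search is enabled; the sketch is only meant to build intuition for why hybrid recall is broader than either mode alone.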
Configure Search Mode
You can override the search mode per query:
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "user", "content": "What is error code E-4012?"},
    ],
    knowledge_bases=["kb_a1b2c3d4"],
    retrieval={
        "search_mode": "hybrid",  # "hybrid", "vector", or "keyword"
        "top_k": 10,              # number of chunks to retrieve
    },
)

const response = await client.chat.completions.create({
  model: "llama-3.3-70b",
  messages: [
    { role: "user", content: "What is error code E-4012?" },
  ],
  knowledgeBases: ["kb_a1b2c3d4"],
  retrieval: {
    searchMode: "hybrid", // "hybrid", "vector", or "keyword"
    topK: 10,             // number of chunks to retrieve
  },
});

For queries that contain specific identifiers (error codes, product SKUs, account numbers), keyword search adds significant value. For open-ended questions, vector search dominates. Hybrid gives you both.
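One practical way to verify this on your own data is to compare modes side by side with the standalone retrieval endpoint (covered under Evaluation and Monitoring below). Note that passing search_mode to retrieval.query is an assumption in this sketch, mirroring the chat retrieval options above.

# Compare how each search mode ranks results for an identifier-heavy query.
for mode in ("vector", "keyword", "hybrid"):
    results = client.retrieval.query(
        knowledge_base_id="kb_a1b2c3d4",
        query="What is error code E-4012?",
        top_k=3,
        search_mode=mode,  # assumed parameter; mirrors "search_mode" in chat retrieval
    )
    top = results.chunks[0]
    print(f"{mode:>8}: top score {top.score:.3f} | {top.text[:60]}...")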
Reranking
After the initial retrieval step returns candidate chunks, a reranking model re-scores them for relevance. This improves the precision of the final results by pushing the most relevant chunks to the top.
Enable Reranking
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "user", "content": "How do I configure SSO?"},
    ],
    knowledge_bases=["kb_a1b2c3d4"],
    retrieval={
        "top_k": 20,        # retrieve 20 candidates
        "rerank": True,     # re-score with reranker
        "rerank_model": "bge-reranker-v2-m3",
        "rerank_top_k": 5,  # keep top 5 after reranking
    },
)

const response = await client.chat.completions.create({
  model: "llama-3.3-70b",
  messages: [
    { role: "user", content: "How do I configure SSO?" },
  ],
  knowledgeBases: ["kb_a1b2c3d4"],
  retrieval: {
    topK: 20,       // retrieve 20 candidates
    rerank: true,   // re-score with reranker
    rerankModel: "bge-reranker-v2-m3",
    rerankTopK: 5,  // keep top 5 after reranking
  },
});

When to Use Reranking
- Recommended when precision matters more than latency (e.g., customer-facing Q&A, support bots).
- Skip when you need the lowest possible latency and are comfortable with vector-only retrieval (e.g., internal search autocomplete).
Reranking adds a small amount of latency (typically 50-150 ms) but significantly improves the quality of the top results. The cost is $0.02 per million tokens.
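As a rough worked example of that cost, using the 400-token chunks recommended above and the top_k of 20 from the example (and ignoring the query tokens themselves):

candidates = 20           # top_k candidates sent to the reranker
tokens_per_chunk = 400    # target chunk size from the chunking section
price_per_million = 0.02  # USD per million tokens

cost_per_query = candidates * tokens_per_chunk / 1_000_000 * price_per_million
print(f"${cost_per_query:.5f} per reranked query")  # $0.00016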
Metadata Filtering
Add metadata to your documents to enable filtered retrieval. This is useful when your Knowledge Base contains documents from multiple categories, teams, or time periods.
Add Metadata During Ingestion
Metadata can come from:
- File uploads: Set metadata explicitly when creating the data source
- S3 connector: Use folder prefixes or file naming conventions parsed during ingestion
- Database connector: Map columns to metadata fields using metadata_columns (a hedged sketch follows this list)
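For the database connector, the mapping might look like the sketch below. The metadata_columns shape and the type value are assumptions extrapolated from the file-upload pattern; connection settings are omitted here (see the Database Connector guide).

# Hypothetical sketch: map source table columns to metadata fields.
data_source = client.knowledge_bases.data_sources.create(
    knowledge_base_id=kb.id,
    type="database",  # assumed type value for illustration
    metadata_columns={
        "department": "dept_name",   # metadata field <- source column
        "doc_type": "document_type",
    },
)

For file uploads, set the metadata explicitly when creating the data source: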
data_source = client.knowledge_bases.data_sources.create(
    knowledge_base_id=kb.id,
    type="file_upload",
    file=open("engineering-handbook.pdf", "rb"),
    metadata={
        "department": "engineering",
        "doc_type": "handbook",
        "version": "2.1",
    },
)

Filter at Query Time
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "user", "content": "What is our code review process?"},
    ],
    knowledge_bases=["kb_a1b2c3d4"],
    retrieval={
        "filter": {
            "department": "engineering",
        },
    },
)

const response = await client.chat.completions.create({
  model: "llama-3.3-70b",
  messages: [
    { role: "user", content: "What is our code review process?" },
  ],
  knowledgeBases: ["kb_a1b2c3d4"],
  retrieval: {
    filter: {
      department: "engineering",
    },
  },
});

Metadata filtering narrows the search space before vector similarity is computed, which improves both relevance and speed.
Prompt Engineering for RAG
The system prompt plays a critical role in how the model uses retrieved context. A well-crafted system prompt reduces hallucination and improves answer quality.
Recommended System Prompt Template
system_prompt = """You are a helpful assistant that answers questions based on the provided context.
Instructions:
- Answer the user's question using ONLY the information from the retrieved context.
- If the context does not contain enough information to answer the question, say "I don't have enough information to answer that question."
- Do not make up information that is not in the context.
- When possible, cite the source document.
- Be concise and direct."""
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "What is our PTO policy?"},
    ],
    knowledge_bases=["kb_a1b2c3d4"],
)

Tips
- Be explicit about grounding: Tell the model to use “only the provided context.”
- Handle missing information: Instruct the model what to say when the context does not contain the answer.
- Request citations: Ask the model to reference source documents so users can verify.
- Set the tone: Match the system prompt to your application (formal for enterprise, conversational for consumer).
Evaluation and Monitoring
Track retrieval quality to identify issues and improve your RAG pipeline over time.
Relevance Scores
Every retrieval result includes a relevance score (0.0 to 1.0). Monitor these scores to detect degradation:
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "user", "content": "How do I reset my password?"},
    ],
    knowledge_bases=["kb_a1b2c3d4"],
)
for citation in response.citations:
    print(f"Source: {citation.source}, Score: {citation.score:.3f}")

Quality Indicators
| Signal | Healthy Range | Action if Outside Range |
|---|---|---|
| Top-1 relevance score | > 0.80 | Review chunking strategy, check for missing documents |
| Top-5 average score | > 0.60 | Try hybrid search, add reranking |
| User thumbs-down rate | < 10% | Examine low-rated responses, improve system prompt |
| “I don’t know” response rate | < 20% | Check for missing content in the Knowledge Base |
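To act on the first two rows of this table programmatically, a small check over the citation scores can flag weak retrievals. The field names follow the citations example above; the thresholds come from the table, and the alerting logic itself is just a sketch.

def retrieval_quality_warnings(citations) -> list[str]:
    """Flag responses whose citation scores fall outside the healthy ranges above."""
    if not citations:
        return ["no chunks retrieved"]
    scores = sorted((c.score for c in citations), reverse=True)
    warnings = []
    if scores[0] < 0.80:
        warnings.append(f"top-1 score {scores[0]:.2f} is below 0.80")
    top5 = scores[:5]
    if sum(top5) / len(top5) < 0.60:
        warnings.append(f"top-5 average {sum(top5) / len(top5):.2f} is below 0.60")
    return warnings

for warning in retrieval_quality_warnings(response.citations):
    print("Retrieval quality warning:", warning)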
Standalone Retrieval Testing
Use the retrieval endpoint directly to test search quality without running a full chat completion:
results = client.retrieval.query(
    knowledge_base_id="kb_a1b2c3d4",
    query="password reset",
    top_k=5,
)
for result in results.chunks:
    print(f"Score: {result.score:.3f} | {result.text[:100]}...")

This lets you iterate on chunking, search mode, and metadata filters without spending tokens on generation.
Scaling Large Knowledge Bases
For Knowledge Bases with hundreds of thousands or millions of documents:
Partition by Topic or Domain
Instead of one massive Knowledge Base, create multiple focused Knowledge Bases and route queries to the relevant one:
# Route based on user intent or metadata
if user_department == "engineering":
    kb_ids = ["kb_engineering"]
elif user_department == "sales":
    kb_ids = ["kb_sales"]
else:
    kb_ids = ["kb_engineering", "kb_sales"]  # search both

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": user_query}],
    knowledge_bases=kb_ids,
)

Use Metadata Filters to Narrow Search
Metadata filters reduce the search space, which improves both latency and relevance at scale. Always add metadata (department, document type, date range) when ingesting documents.
Monitor Ingestion Performance
For large-scale ingestion:
- Use the S3 connector for batch document uploads
- Monitor ingestion jobs via the Ingestion Jobs API (a hedged polling sketch follows this list)
- Stagger syncs across data sources to avoid overloading ingestion workers
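A polling loop over ingestion jobs might look like the sketch below. The method and field names here are hypothetical placeholders; consult the Ingestion Jobs API reference for the actual surface.

import time

# Hypothetical polling sketch; replace the method/field names with the real
# Ingestion Jobs API calls from its reference documentation.
job = client.knowledge_bases.ingestion_jobs.retrieve(
    knowledge_base_id=kb.id,
    job_id="ij_a1b2c3",  # placeholder job ID
)
while job.status in ("pending", "running"):
    time.sleep(30)
    job = client.knowledge_bases.ingestion_jobs.retrieve(
        knowledge_base_id=kb.id,
        job_id=job.id,
    )
print(f"Ingestion job finished with status: {job.status}")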
Quick Reference
| Aspect | Recommendation |
|---|---|
| Chunk size | 256-512 tokens |
| Chunk overlap | 10-15% |
| Chunking strategy | Semantic (default) |
| Embedding model | bge-large-en-v1.5 |
| Search mode | Hybrid (default) |
| Reranking | Enable for precision-critical use cases |
| Retrieval top_k | 10-20, then rerank to top 5 |
| Metadata | Always add; use for filtering |
| System prompt | Explicitly instruct grounding and citation behavior |
Next Steps
- RAG Overview — how Tensoras RAG works end to end
- Chunking Strategies — detailed chunking configuration
- Hybrid Search — vector + keyword search deep dive
- Citations — source attribution in responses
- S3 Connector — ingest documents from S3
- Database Connector — ingest data from PostgreSQL or MySQL