Prompt Caching
Tensoras automatically caches prompt prefixes, reducing latency for requests that share the same leading tokens. This is especially beneficial for applications that reuse system prompts, few-shot examples, or long context prefixes across many requests.
How It Works
Tensoras uses a radix-tree-based KV cache that stores computed attention states for token prefixes. When a new request arrives, the system checks whether any prefix of the input has already been computed:
- Cache hit: Computation is skipped for the cached prefix and only the new tokens are processed. This reduces time-to-first-token (TTFT) significantly.
- Cache miss: The full prompt is processed normally, and the prefix is cached for future requests.
The cache operates at the token level and is shared across all requests to the same model. Longer shared prefixes yield larger latency savings.
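The lookup can be sketched as a trie keyed by token IDs. This is a simplified stand-in for the production radix tree, and the token IDs below are made up for illustration:

```python
class PrefixCacheNode:
    """One node per token ID; a real radix tree compresses runs of tokens."""
    def __init__(self):
        self.children = {}   # token id -> PrefixCacheNode
        self.has_kv = False  # True if KV state is stored for this prefix

class PrefixCache:
    def __init__(self):
        self.root = PrefixCacheNode()

    def insert(self, tokens):
        """Record that KV states for every prefix of `tokens` are cached."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, PrefixCacheNode())
            node.has_kv = True

    def longest_cached_prefix(self, tokens):
        """Return how many leading tokens already have cached KV states."""
        node, hit = self.root, 0
        for t in tokens:
            node = node.children.get(t)
            if node is None or not node.has_kv:
                break
            hit += 1
        return hit

cache = PrefixCache()
cache.insert([1, 2, 3, 4])                       # first request: computed, then cached
hit = cache.longest_cached_prefix([1, 2, 3, 9])  # second request shares 3 leading tokens
print(hit)  # 3 -> only token 9 onward needs fresh computation
```

A cache hit skips attention computation for the first `hit` tokens; a miss (`hit == 0`) means the whole prompt is computed and inserted for future requests.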
Benefits
- Reduced latency — Time-to-first-token drops substantially when a prefix is cached. For a 2,000-token system prompt, expect 50-80% TTFT reduction on cache hits.
- Higher throughput — By reusing computation, the system can handle more concurrent requests.
- Lower cost — Cached prefix tokens are billed at a 90% discount (see Billing below).
No API Changes Required
Prompt caching is fully automatic. There are no parameters to set, no headers to pass, and no SDK changes to make. If your requests share a common prefix, you benefit from caching transparently.
Optimizing for Cache Hits
To maximize prompt cache hit rates, follow these guidelines:
- Put static content first. Place your system prompt, instructions, and few-shot examples at the beginning of the message list. The cache matches on token prefixes, so content at the start of the prompt has the highest reuse.
- Keep prefixes stable across requests. Avoid inserting dynamic content (timestamps, random IDs, etc.) into the system prompt. Even a single changed token invalidates the cache from that point onward.
- Use the same model for related requests. Prefix caches are per-model. If you split traffic across models, each model maintains a separate cache.
- Batch similar requests together. Requests with the same prefix arriving in close succession are more likely to find the prefix warm in the cache.
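To see why a single early token change hurts, here is an illustrative sketch. It splits on whitespace as a stand-in for real tokenization; the actual cache matches on model tokenizer output:

```python
def shared_prefix_len(a, b):
    """Length of the longest common leading run of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

stable = "You are a helpful assistant. Answer concisely.".split()

# A timestamp near the front changes early tokens on every request:
monday  = "Generated 2024-01-01. You are a helpful assistant. Answer concisely.".split()
tuesday = "Generated 2024-01-02. You are a helpful assistant. Answer concisely.".split()

print(shared_prefix_len(stable, stable))   # full reuse: identical prefix
print(shared_prefix_len(monday, tuesday))  # reuse stops at the first differing token
```

Moving the timestamp to the end of the prompt (or out of the prompt entirely) keeps the shared prefix, and therefore the cache hit, intact.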
Python
```python
from tensoras import Tensoras

client = Tensoras(api_key="tns_your_key_here")

# Both of these requests share the same system prompt prefix.
# The second request will benefit from prompt caching automatically.
system_prompt = (
    "You are a senior software engineer specializing in distributed systems. "
    "You provide detailed, accurate technical answers with code examples. "
    "Always consider edge cases, error handling, and performance implications."
)

# Request 1
response_1 = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Explain consistent hashing."},
    ],
)

# Request 2 -- same system prompt, different user message.
# The system prompt prefix is already cached, so TTFT is lower.
response_2 = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "How does Raft consensus work?"},
    ],
)

# Check cache status from the response
usage = response_2.usage
if usage.prompt_tokens_details and usage.prompt_tokens_details.cached_tokens > 0:
    print(f"Cache hit! {usage.prompt_tokens_details.cached_tokens} tokens served from cache")
```

Node.js
```javascript
import Tensoras from "tensoras";

const client = new Tensoras({ apiKey: "tns_your_key_here" });

const systemPrompt =
  "You are a senior software engineer specializing in distributed systems. " +
  "You provide detailed, accurate technical answers with code examples. " +
  "Always consider edge cases, error handling, and performance implications.";

// Request 1
const response1 = await client.chat.completions.create({
  model: "llama-3.3-70b",
  messages: [
    { role: "system", content: systemPrompt },
    { role: "user", content: "Explain consistent hashing." },
  ],
});

// Request 2 -- same system prompt, TTFT is lower due to prefix caching
const response2 = await client.chat.completions.create({
  model: "llama-3.3-70b",
  messages: [
    { role: "system", content: systemPrompt },
    { role: "user", content: "How does Raft consensus work?" },
  ],
});

// Check cache status
const cachedTokens = response2.usage?.prompt_tokens_details?.cached_tokens ?? 0;
if (cachedTokens > 0) {
  console.log(`Cache hit! ${cachedTokens} tokens served from cache`);
}
```

What Gets Cached
The cache works on token prefixes — the longest leading sequence of tokens that matches a previously seen request. This means:
- System prompts are the most common beneficiary. If every request to your application starts with the same system prompt, all requests after the first get a cache hit on that prefix.
- Few-shot examples appended after the system prompt are also cached, as long as they appear in the same order.
- Multi-turn conversations benefit incrementally — each new turn extends the prefix, and the entire conversation history up to that point can be cached.
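The incremental multi-turn behavior can be sketched as follows, using made-up token IDs in place of real tokenizer output. Because each chat request resends the full history, request N's token sequence is a prefix of request N+1's:

```python
turn_1 = [10, 11, 12]                  # system prompt + first user message
turn_2 = turn_1 + [20, 21] + [30, 31]  # + assistant reply + second user message
turn_3 = turn_2 + [40] + [50, 52]      # + next reply + third user message

def cached_portion(prev_seen, request):
    """Length of the longest previously computed prefix (simplified model)."""
    n = 0
    for a, b in zip(prev_seen, request):
        if a != b:
            break
        n += 1
    return n

print(cached_portion(turn_1, turn_2))  # all of turn 1's tokens are reused
print(cached_portion(turn_2, turn_3))  # the cached prefix grows each turn
```

Only the newly appended turn needs fresh computation, which is why long conversations see a steady per-turn TTFT benefit.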
When Caching Helps Most
| Scenario | Benefit |
|---|---|
| Long system prompts reused across many requests | High — large prefix reuse |
| Multi-turn conversations with growing history | Medium — incremental caching per turn |
| Few-shot prompts with the same examples | High — examples become part of the cached prefix |
| Every request has a completely unique prompt | Low — no prefix overlap to cache |
API Response Fields
When a request benefits from prompt caching, you can see the details in two places:
Usage Object
The usage field in the response includes prompt_tokens_details.cached_tokens:
```json
{
  "usage": {
    "prompt_tokens": 1024,
    "completion_tokens": 256,
    "total_tokens": 1280,
    "prompt_tokens_details": {
      "cached_tokens": 800
    }
  }
}
```

In this example, 800 of the 1,024 prompt tokens were served from the prefix cache.
X-Cache-Status Header
The response includes an X-Cache-Status header:
- HIT — at least some prompt tokens were served from the prefix cache.
- MISS — no prefix cache match; the full prompt was computed from scratch.
Billing
Cached prompt tokens are billed at a 90% discount. Only the non-cached portion of the prompt is billed at the full input token rate.
For example, with llama-3.3-70b ($0.20 per million input tokens):
| Tokens | Rate | Cost |
|---|---|---|
| 200 non-cached input tokens | $0.20 / M | $0.00004 |
| 800 cached input tokens | $0.02 / M (90% discount) | $0.000016 |
| 256 output tokens | $0.60 / M | $0.000154 |
| Total | | $0.00021 |
Without caching, the same request would cost $0.000354, so caching saves approximately 40% on this request.
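The arithmetic above can be reproduced with a small helper. This is an illustration of the billing formula only, not an official cost API; the default rates are the llama-3.3-70b example rates:

```python
def request_cost(prompt_tokens, cached_tokens, completion_tokens,
                 input_rate=0.20, output_rate=0.60, cache_discount=0.90):
    """Cost in dollars; rates are dollars per million tokens."""
    per_m = 1_000_000
    non_cached = prompt_tokens - cached_tokens
    cost = non_cached * input_rate / per_m                       # full-rate input
    cost += cached_tokens * input_rate * (1 - cache_discount) / per_m  # discounted input
    cost += completion_tokens * output_rate / per_m              # output
    return cost

with_cache = request_cost(1000, 800, 256)
without_cache = request_cost(1000, 0, 256)
print(round(with_cache, 5))    # ~$0.00021
print(round(without_cache, 6)) # ~$0.000354
```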
Cache Lifetime
The prefix cache is maintained in GPU memory and is managed automatically. Frequently accessed prefixes stay warm; infrequently accessed prefixes are evicted to make room for new ones. There is no manual cache management, TTL configuration, or eviction API.
Admin Cache Statistics
Platform administrators can monitor prompt cache performance using the admin stats endpoint. This requires an API key with the admin scope.
GET /v1/admin/cache/stats
Returns aggregate prompt cache statistics:
```shell
curl https://api.tensoras.ai/v1/admin/cache/stats \
  -H "Authorization: Bearer tns_admin_key_here"
```

Response:
```json
{
  "hit_count": 15234,
  "miss_count": 4812,
  "hit_rate": 0.7600,
  "cached_tokens_total": 28450000,
  "memory_usage_mb": null,
  "entries": 1024,
  "evictions": 87,
  "uptime_seconds": 86400
}
```

| Field | Description |
|---|---|
| hit_count | Total number of requests that matched a cached prefix |
| miss_count | Total number of requests with no prefix match |
| hit_rate | hit_count / (hit_count + miss_count) |
| cached_tokens_total | Total tokens served from the cache across all requests |
| memory_usage_mb | Estimated GPU memory used by the prefix cache (when available) |
| entries | Number of distinct cached prefixes |
| evictions | Number of prefixes evicted from the cache |
| uptime_seconds | Seconds since the cache stats were last reset |
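As a sanity check, hit_rate can always be recomputed from the raw counters, as in this short sketch using the example response above:

```python
stats = {"hit_count": 15234, "miss_count": 4812}

total = stats["hit_count"] + stats["miss_count"]
hit_rate = stats["hit_count"] / total if total else 0.0
print(f"hit_rate: {hit_rate:.4f}")  # matches the hit_rate field in the response
```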
POST /v1/admin/cache/reset
Reset all statistics counters to zero:
```shell
curl -X POST https://api.tensoras.ai/v1/admin/cache/reset \
  -H "Authorization: Bearer tns_admin_key_here"
```

Monitoring Cache Performance
Track these metrics to understand your caching efficiency:
- Hit rate — A hit rate above 70% is typical for applications with stable system prompts. If your hit rate is low, check whether your prompts have dynamic content at the beginning.
- Cached tokens — Shows the total volume of computation saved. Higher is better.
- Evictions — A high eviction count relative to entries may indicate that your working set exceeds available GPU memory. Consider reducing the number of distinct prefixes or upgrading GPU resources.
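These checks can be automated with a small helper over the stats payload. The thresholds below are illustrative starting points, not official guidance:

```python
def cache_health(stats, min_hit_rate=0.70, max_eviction_ratio=0.5):
    """Flag potential problems from /v1/admin/cache/stats output."""
    warnings = []
    total = stats["hit_count"] + stats["miss_count"]
    if total and stats["hit_count"] / total < min_hit_rate:
        warnings.append("low hit rate: check for dynamic content at prompt start")
    if stats["entries"] and stats["evictions"] / stats["entries"] > max_eviction_ratio:
        warnings.append("high eviction pressure: working set may exceed GPU memory")
    return warnings

stats = {"hit_count": 15234, "miss_count": 4812, "entries": 1024, "evictions": 87}
print(cache_health(stats))  # [] -> healthy by these thresholds
```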
Related
- Streaming — prompt caching reduces TTFT for streamed responses too
- Chat Completions API — full endpoint reference
- RAG Overview — RAG queries with system prompts also benefit from caching
- Billing — pricing details and cost management