Prompt Caching
Tensoras automatically caches prompt prefixes, reducing latency for requests that share the same leading tokens. This is especially beneficial for applications that reuse system prompts, few-shot examples, or long context prefixes across many requests.
How It Works
Tensoras uses a radix-tree-based KV cache that stores computed attention states for token prefixes. When a new request arrives, the system checks whether any prefix of the input has already been computed:
- Cache hit: Computation is skipped for the cached prefix and only the new tokens are processed. This reduces time-to-first-token (TTFT) significantly.
- Cache miss: The full prompt is processed normally, and the prefix is cached for future requests.
The cache operates at the token level and is shared across all requests to the same model. Longer shared prefixes yield larger latency savings.
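The lookup can be sketched as a trie keyed by token IDs. This is a simplified stand-in for the production radix tree, and the token IDs below are made up for illustration:

```python
class PrefixCacheNode:
    """One node per token ID; a real radix tree compresses runs of tokens."""
    def __init__(self):
        self.children = {}   # token id -> PrefixCacheNode
        self.has_kv = False  # True if KV state is stored for this prefix

class PrefixCache:
    def __init__(self):
        self.root = PrefixCacheNode()

    def insert(self, tokens):
        """Record that KV states for every prefix of `tokens` are cached."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, PrefixCacheNode())
            node.has_kv = True

    def longest_cached_prefix(self, tokens):
        """Return how many leading tokens already have cached KV states."""
        node, hit = self.root, 0
        for t in tokens:
            node = node.children.get(t)
            if node is None or not node.has_kv:
                break
            hit += 1
        return hit

cache = PrefixCache()
cache.insert([1, 2, 3, 4])                       # first request: computed, then cached
hit = cache.longest_cached_prefix([1, 2, 3, 9])  # second request shares 3 leading tokens
print(hit)  # 3 -> only token 9 onward needs fresh computation
```

A cache hit skips attention computation for the first `hit` tokens; a miss (`hit == 0`) means the whole prompt is computed and inserted for future requests.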
Benefits
- Reduced latency — Time-to-first-token drops substantially when a prefix is cached. For a 2,000-token system prompt, expect 50-80% TTFT reduction on cache hits.
- Higher throughput — By reusing computation, the system can handle more concurrent requests.
- Lower cost — Cached prefix tokens are billed at a 90% discount (see Billing below).
No API Changes Required
Prompt caching is fully automatic. There are no parameters to set, no headers to pass, and no SDK changes to make. If your requests share a common prefix, you benefit from caching transparently.
Optimizing for Cache Hits
To maximize prompt cache hit rates, follow these guidelines:
- Put static content first. Place your system prompt, instructions, and few-shot examples at the beginning of the message list. The cache matches on token prefixes, so content at the start of the prompt has the highest reuse.
- Keep prefixes stable across requests. Avoid inserting dynamic content (timestamps, random IDs, etc.) into the system prompt. Even a single changed token invalidates the cache from that point onward.
- Use the same model for related requests. Prefix caches are per-model. If you split traffic across models, each model maintains a separate cache.
- Batch similar requests together. Requests with the same prefix arriving in close succession are more likely to find the prefix warm in the cache.
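To see why a single early token change hurts, here is an illustrative sketch. It splits on whitespace as a stand-in for real tokenization; the actual cache matches on model tokenizer output:

```python
def shared_prefix_len(a, b):
    """Length of the longest common leading run of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

stable = "You are a helpful assistant. Answer concisely.".split()

# A timestamp near the front changes early tokens on every request:
monday  = "Generated 2024-01-01. You are a helpful assistant. Answer concisely.".split()
tuesday = "Generated 2024-01-02. You are a helpful assistant. Answer concisely.".split()

print(shared_prefix_len(stable, stable))   # full reuse: identical prefix
print(shared_prefix_len(monday, tuesday))  # reuse stops at the first differing token
```

Moving the timestamp to the end of the prompt (or out of the prompt entirely) keeps the shared prefix, and therefore the cache hit, intact.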
Python
```python
from tensoras import Tensoras

client = Tensoras(api_key="tns_your_key_here")

# Both of these requests share the same system prompt prefix.
# The second request will benefit from prompt caching automatically.
system_prompt = (
    "You are a senior software engineer specializing in distributed systems. "
    "You provide detailed, accurate technical answers with code examples. "
    "Always consider edge cases, error handling, and performance implications."
)

# Request 1
response_1 = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Explain consistent hashing."},
    ],
)

# Request 2 -- same system prompt, different user message.
# The system prompt prefix is already cached, so TTFT is lower.
response_2 = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "How does Raft consensus work?"},
    ],
)

# Check cache status from the response
usage = response_2.usage
if usage.prompt_tokens_details and usage.prompt_tokens_details.cached_tokens > 0:
    print(f"Cache hit! {usage.prompt_tokens_details.cached_tokens} tokens served from cache")
```

Node.js
```javascript
import Tensoras from "tensoras";

const client = new Tensoras({ apiKey: "tns_your_key_here" });

const systemPrompt =
  "You are a senior software engineer specializing in distributed systems. " +
  "You provide detailed, accurate technical answers with code examples. " +
  "Always consider edge cases, error handling, and performance implications.";

// Request 1
const response1 = await client.chat.completions.create({
  model: "llama-3.3-70b",
  messages: [
    { role: "system", content: systemPrompt },
    { role: "user", content: "Explain consistent hashing." },
  ],
});

// Request 2 -- same system prompt, TTFT is lower due to prefix caching
const response2 = await client.chat.completions.create({
  model: "llama-3.3-70b",
  messages: [
    { role: "system", content: systemPrompt },
    { role: "user", content: "How does Raft consensus work?" },
  ],
});

// Check cache status
const cachedTokens = response2.usage?.prompt_tokens_details?.cached_tokens ?? 0;
if (cachedTokens > 0) {
  console.log(`Cache hit! ${cachedTokens} tokens served from cache`);
}
```

What Gets Cached
The cache works on token prefixes — the longest leading sequence of tokens that matches a previously seen request. This means:
- System prompts are the most common beneficiary. If every request to your application starts with the same system prompt, all requests after the first get a cache hit on that prefix.
- Few-shot examples appended after the system prompt are also cached, as long as they appear in the same order.
- Multi-turn conversations benefit incrementally — each new turn extends the prefix, and the entire conversation history up to that point can be cached.
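The incremental multi-turn behavior can be sketched as follows, using made-up token IDs in place of real tokenizer output. Because each chat request resends the full history, request N's token sequence is a prefix of request N+1's:

```python
turn_1 = [10, 11, 12]                  # system prompt + first user message
turn_2 = turn_1 + [20, 21] + [30, 31]  # + assistant reply + second user message
turn_3 = turn_2 + [40] + [50, 52]      # + next reply + third user message

def cached_portion(prev_seen, request):
    """Length of the longest previously computed prefix (simplified model)."""
    n = 0
    for a, b in zip(prev_seen, request):
        if a != b:
            break
        n += 1
    return n

print(cached_portion(turn_1, turn_2))  # all of turn 1's tokens are reused
print(cached_portion(turn_2, turn_3))  # the cached prefix grows each turn
```

Only the newly appended turn needs fresh computation, which is why long conversations see a steady per-turn TTFT benefit.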
When Caching Helps Most
| Scenario | Benefit |
|---|---|
| Long system prompts reused across many requests | High — large prefix reuse |
| Multi-turn conversations with growing history | Medium — incremental caching per turn |
| Few-shot prompts with the same examples | High — examples become part of the cached prefix |
| Every request has a completely unique prompt | Low — no prefix overlap to cache |
API Response Fields
When a request benefits from prompt caching, you can see the details in two places:
Usage Object
The usage field in the response includes prompt_tokens_details.cached_tokens:
```json
{
  "usage": {
    "prompt_tokens": 1024,
    "completion_tokens": 256,
    "total_tokens": 1280,
    "prompt_tokens_details": {
      "cached_tokens": 800
    }
  }
}
```

In this example, 800 of the 1,024 prompt tokens were served from the prefix cache.
X-Cache-Status Header
The response includes an X-Cache-Status header:
- HIT — at least some prompt tokens were served from the prefix cache.
- MISS — no prefix cache match; the full prompt was computed from scratch.
Billing
Cached prompt tokens are billed at a 90% discount. Only the non-cached portion of the prompt is billed at the full input token rate.
For example, with llama-3.3-70b ($0.20 per million input tokens):
| Tokens | Rate | Cost |
|---|---|---|
| 200 non-cached input tokens | $0.20 / M | $0.00004 |
| 800 cached input tokens | $0.02 / M (90% discount) | $0.000016 |
| 256 output tokens | $0.60 / M | $0.000154 |
| Total | | $0.00021 |
Without caching, the same request would cost $0.000354, so caching saves approximately 40% on this request.
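The arithmetic above can be reproduced with a small helper. This is an illustration of the billing formula only, not an official cost API; the default rates are the llama-3.3-70b example rates:

```python
def request_cost(prompt_tokens, cached_tokens, completion_tokens,
                 input_rate=0.20, output_rate=0.60, cache_discount=0.90):
    """Cost in dollars; rates are dollars per million tokens."""
    per_m = 1_000_000
    non_cached = prompt_tokens - cached_tokens
    cost = non_cached * input_rate / per_m                       # full-rate input
    cost += cached_tokens * input_rate * (1 - cache_discount) / per_m  # discounted input
    cost += completion_tokens * output_rate / per_m              # output
    return cost

with_cache = request_cost(1000, 800, 256)
without_cache = request_cost(1000, 0, 256)
print(round(with_cache, 5))    # ~$0.00021
print(round(without_cache, 6)) # ~$0.000354
```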
Cache Lifetime
The prefix cache is maintained in GPU memory and is managed automatically. Frequently accessed prefixes stay warm; infrequently accessed prefixes are evicted to make room for new ones. There is no manual cache management, TTL configuration, or eviction API.
Admin Cache Statistics
Platform administrators can monitor prompt cache performance using the admin stats endpoint. This requires an API key with the admin scope.
GET /v1/admin/cache/stats
Returns aggregate prompt cache statistics:
```shell
curl https://api.tensoras.ai/v1/admin/cache/stats \
  -H "Authorization: Bearer tns_admin_key_here"
```

Response:
```json
{
  "hit_count": 15234,
  "miss_count": 4812,
  "hit_rate": 0.7600,
  "cached_tokens_total": 28450000,
  "memory_usage_mb": null,
  "entries": 1024,
  "evictions": 87,
  "uptime_seconds": 86400
}
```

| Field | Description |
|---|---|
| hit_count | Total number of requests that matched a cached prefix |
| miss_count | Total number of requests with no prefix match |
| hit_rate | hit_count / (hit_count + miss_count) |
| cached_tokens_total | Total tokens served from the cache across all requests |
| memory_usage_mb | Estimated GPU memory used by the prefix cache (when available) |
| entries | Number of distinct cached prefixes |
| evictions | Number of prefixes evicted from the cache |
| uptime_seconds | Seconds since the cache stats were last reset |
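As a sanity check, hit_rate can always be recomputed from the raw counters, as in this short sketch using the example response above:

```python
stats = {"hit_count": 15234, "miss_count": 4812}

total = stats["hit_count"] + stats["miss_count"]
hit_rate = stats["hit_count"] / total if total else 0.0
print(f"hit_rate: {hit_rate:.4f}")  # matches the hit_rate field in the response
```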
POST /v1/admin/cache/reset
Reset all statistics counters to zero:
```shell
curl -X POST https://api.tensoras.ai/v1/admin/cache/reset \
  -H "Authorization: Bearer tns_admin_key_here"
```

Monitoring Cache Performance
Track these metrics to understand your caching efficiency:
- Hit rate — A hit rate above 70% is typical for applications with stable system prompts. If your hit rate is low, check whether your prompts have dynamic content at the beginning.
- Cached tokens — Shows the total volume of computation saved. Higher is better.
- Evictions — A high eviction count relative to entries may indicate that your working set exceeds available GPU memory. Consider reducing the number of distinct prefixes or upgrading GPU resources.
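These checks can be automated with a small helper over the stats payload. The thresholds below are illustrative starting points, not official guidance:

```python
def cache_health(stats, min_hit_rate=0.70, max_eviction_ratio=0.5):
    """Flag potential problems from /v1/admin/cache/stats output."""
    warnings = []
    total = stats["hit_count"] + stats["miss_count"]
    if total and stats["hit_count"] / total < min_hit_rate:
        warnings.append("low hit rate: check for dynamic content at prompt start")
    if stats["entries"] and stats["evictions"] / stats["entries"] > max_eviction_ratio:
        warnings.append("high eviction pressure: working set may exceed GPU memory")
    return warnings

stats = {"hit_count": 15234, "miss_count": 4812, "entries": 1024, "evictions": 87}
print(cache_health(stats))  # [] -> healthy by these thresholds
```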
Related
- Streaming — prompt caching reduces TTFT for streamed responses too
- Chat Completions API — full endpoint reference
- RAG Overview — RAG queries with system prompts also benefit from caching
- Billing — pricing details and cost management