Rate Limits
Rate limits protect the Tensoras.ai platform and ensure fair usage across all customers. This guide explains the limits for each plan tier, how to detect and handle limit errors, and best practices for staying within your allocation.
Limits by Plan
Rate limits are applied per API key and measured in two dimensions: requests per minute (RPM) and tokens per minute (TPM).
| Plan | RPM | TPM | Knowledge Bases | Storage |
|---|---|---|---|---|
| Free | 30 | 100K | 2 | 1 GB |
| Developer ($29/mo) | 600 | 1M | 5 | 5 GB |
| Pro ($49/mo) | 3,000 | 5M | 10 | 10 GB |
| Enterprise (custom) | 10,000 | Custom | Unlimited | Custom |
Both RPM and TPM limits are enforced over a rolling one-minute window. If either limit is exceeded, the API returns a 429 Too Many Requests error.
Tip: If you consistently hit your plan’s limits, consider upgrading your plan or contacting sales for an Enterprise agreement with custom limits.
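As a quick sanity check before choosing a tier, you can estimate whether a planned workload fits under both limits at once. A minimal sketch, using the RPM/TPM figures from the table above (the function name and the example traffic numbers are illustrative assumptions):

```python
# Plan limits from the table above: (RPM, TPM)
PLAN_LIMITS = {
    "free": (30, 100_000),
    "developer": (600, 1_000_000),
    "pro": (3_000, 5_000_000),
}


def fits_plan(plan, requests_per_minute, avg_tokens_per_request):
    """Return True if the workload stays under both the RPM and TPM limits."""
    rpm_limit, tpm_limit = PLAN_LIMITS[plan]
    tokens_per_minute = requests_per_minute * avg_tokens_per_request
    return requests_per_minute <= rpm_limit and tokens_per_minute <= tpm_limit


# Example: 200 requests/min averaging 2,000 tokens each uses 400K TPM
print(fits_plan("developer", 200, 2_000))  # True
```

Remember that a workload can be well under its RPM limit and still hit 429s because its token volume exceeds TPM, and vice versa.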
Rate Limit Headers
Every API response includes headers that report your current rate limit status:
| Header | Description |
|---|---|
| X-RateLimit-Limit | Maximum requests allowed per minute for this key |
| X-RateLimit-Remaining | Requests remaining in the current one-minute window |
| X-RateLimit-Reset | Unix timestamp (seconds) when the current window resets |
These headers are present on both successful responses and 429 errors, so you can monitor your usage proactively.
```bash
curl -i https://api.tensoras.ai/v1/chat/completions \
  -H "Authorization: Bearer $TENSORAS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```

The response includes the rate limit headers:

```
HTTP/2 200
X-RateLimit-Limit: 600
X-RateLimit-Remaining: 598
X-RateLimit-Reset: 1708200120
```

Handling 429 Errors
When you exceed your rate limit, the API returns a 429 response with a Retry-After header indicating how many seconds to wait:
```json
{
  "error": {
    "message": "Rate limit exceeded. Please retry after 12 seconds.",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}
```

| Header | Description |
|---|---|
| Retry-After | Number of seconds to wait before sending the next request |
Exponential Backoff
If you are writing your own retry logic, use exponential backoff with jitter:
Python

```python
import time
import random

import requests


def call_with_backoff(url, headers, payload, max_retries=5):
    for attempt in range(max_retries):
        response = requests.post(url, headers=headers, json=payload)
        if response.status_code != 429:
            return response
        # Respect the Retry-After header if present, falling back to an
        # exponentially growing wait, plus jitter
        retry_after = int(response.headers.get("Retry-After", 1))
        wait = max(retry_after, 2 ** attempt) + random.uniform(0, 1)
        print(f"Rate limited. Retrying in {wait:.1f}s (attempt {attempt + 1})")
        time.sleep(wait)
    raise Exception("Max retries exceeded")
```

Node.js

```javascript
async function callWithBackoff(url, options, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const response = await fetch(url, options);
    if (response.status !== 429) {
      return response;
    }
    // Respect Retry-After if present, falling back to exponential growth
    const retryAfter = parseInt(response.headers.get("Retry-After") || "1", 10);
    const wait = Math.max(retryAfter, 2 ** attempt) + Math.random();
    console.log(`Rate limited. Retrying in ${wait.toFixed(1)}s (attempt ${attempt + 1})`);
    await new Promise((resolve) => setTimeout(resolve, wait * 1000));
  }
  throw new Error("Max retries exceeded");
}
```

SDK Automatic Retry Behavior
Both the Python and Node.js Tensoras SDKs handle 429 responses automatically. By default, they retry up to 3 times with exponential backoff and respect the Retry-After header.
Python
```python
from tensoras import Tensoras

# Default: max_retries=3
client = Tensoras()

# Customize retry behavior
client = Tensoras(max_retries=5)

# Disable automatic retries
client = Tensoras(max_retries=0)
```

Node.js

```javascript
import Tensoras from "tensoras";

// Default: maxRetries=3
const client = new Tensoras();

// Customize retry behavior
const customClient = new Tensoras({ maxRetries: 5 });

// Disable automatic retries
const noRetryClient = new Tensoras({ maxRetries: 0 });
```

When automatic retries are enabled, the SDK will not throw a rate limit error unless all retry attempts are exhausted.
Best Practices
Batch Requests When Possible
If you need to process many items, send them in larger batches rather than one request per item. Use the Batches API for bulk workloads — it runs at lower priority but is not subject to RPM limits.
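The Batches API has its own guide, but the client-side half of the pattern — grouping work items into chunks instead of issuing one request per item — looks the same either way. A minimal sketch; the chunk size of 100 is an arbitrary assumption, not a documented limit:

```python
def chunk(items, size):
    """Split a list of work items into fixed-size batches."""
    return [items[i:i + size] for i in range(0, len(items), size)]


documents = [f"doc-{n}" for n in range(250)]
batches = chunk(documents, size=100)
print(len(batches))  # 3 batches: 100 + 100 + 50 items
```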
Implement Client-Side Queuing
For high-throughput applications, maintain a local request queue that throttles outgoing requests to stay under your RPM limit:
```python
import time
from collections import deque

from tensoras import Tensoras

client = Tensoras()
RPM_LIMIT = 600
request_timestamps = deque()


def throttled_request(**kwargs):
    now = time.time()
    # Remove timestamps older than 60 seconds
    while request_timestamps and request_timestamps[0] < now - 60:
        request_timestamps.popleft()
    # Wait if we are at the limit
    if len(request_timestamps) >= RPM_LIMIT:
        sleep_time = 60 - (now - request_timestamps[0])
        time.sleep(max(sleep_time, 0))
    request_timestamps.append(time.time())
    return client.chat.completions.create(**kwargs)
```

Use Caching for Repeated Queries
If your application frequently sends the same or similar prompts, cache responses to avoid unnecessary API calls:
```python
from functools import lru_cache

from tensoras import Tensoras

client = Tensoras()


@lru_cache(maxsize=1000)
def cached_completion(model, user_message):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": user_message}],
    )
    return response.choices[0].message.content
```

For production systems, consider using Redis or Memcached with a TTL-based eviction policy.
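Note that lru_cache never expires entries, so stale responses can live for the process lifetime. The TTL idea can be sketched in plain Python to show what the eviction policy does; in production, Redis or Memcached key expiry replaces this class entirely (the class and its API here are illustrative, not a real library):

```python
import time


class TTLCache:
    """Minimal in-process cache with per-entry expiry (illustration only)."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() >= expires_at:
            del self.entries[key]  # expired: evict and report a miss
            return None
        return value

    def set(self, key, value):
        self.entries[key] = (value, time.time() + self.ttl)


cache = TTLCache(ttl_seconds=300)
cache.set(("llama-3.1-8b", "Hello"), "Hi there!")
print(cache.get(("llama-3.1-8b", "Hello")))  # Hi there!
```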
Monitor Usage with Headers
Log the X-RateLimit-Remaining header in your application to track how close you are to your limit. Set up alerts when remaining requests drop below a threshold (e.g., 10% of your limit).
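A threshold check like this is a few lines in any language. A minimal Python sketch using the header names from the table above (the function name and 10% default are illustrative; it accepts any dict-like headers object, such as a requests response's `headers`):

```python
def should_alert(headers, threshold_fraction=0.10):
    """Return True when remaining requests drop below a fraction of the limit."""
    limit = int(headers["X-RateLimit-Limit"])
    remaining = int(headers["X-RateLimit-Remaining"])
    return remaining < limit * threshold_fraction


# Example with the Developer tier limit of 600 RPM
print(should_alert({"X-RateLimit-Limit": "600", "X-RateLimit-Remaining": "45"}))  # True
```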
Spread Requests Evenly
Avoid sending all requests in a burst at the start of each minute. Distribute requests evenly across the window to maximize throughput and avoid hitting the limit.
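One simple way to do this is to pace calls by the per-request interval, 60 / RPM seconds. A minimal synchronous sketch, assuming you supply the make_request callable (for concurrent workloads you would use a shared token bucket instead):

```python
import time


def paced_calls(items, rpm_limit, make_request):
    """Invoke make_request for each item, spaced evenly across the minute."""
    interval = 60.0 / rpm_limit  # e.g. 600 RPM -> one request every 0.1s
    results = []
    for item in items:
        start = time.monotonic()
        results.append(make_request(item))
        # Sleep off whatever is left of this item's time slot
        elapsed = time.monotonic() - start
        time.sleep(max(interval - elapsed, 0))
    return results
```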
Next Steps
- Billing — understand pricing and spending controls
- Authentication — API key management and scopes
- Batches API — bulk processing without RPM limits