
Rate Limits

Rate limits protect the Tensoras.ai platform and ensure fair usage across all customers. This guide explains the limits for each plan tier, how to detect and handle limit errors, and best practices for staying within your allocation.

Limits by Plan

Rate limits are applied per API key and measured in two dimensions: requests per minute (RPM) and tokens per minute (TPM).

| Plan | RPM | TPM | Knowledge Bases | Storage |
|---|---|---|---|---|
| Free | 30 | 100K | 2 | 1 GB |
| Developer ($29/mo) | 600 | 1M | 5 | 5 GB |
| Pro ($49/mo) | 3,000 | 5M | 10 | 10 GB |
| Enterprise (custom) | 10,000 | Custom | Unlimited | Custom |

Both RPM and TPM limits are enforced per rolling one-minute window. If either limit is exceeded, the API returns a 429 Too Many Requests error.

Tip: If you consistently hit your plan’s limits, consider upgrading your plan or contacting sales for an Enterprise agreement with custom limits.
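As a worked example, you can estimate which limit binds first for your workload from your plan's RPM and TPM and your average request size (the helper name `max_sustained_rpm` is illustrative, not part of any SDK):

```python
def max_sustained_rpm(rpm_limit, tpm_limit, avg_tokens_per_request):
    """Effective sustainable requests/minute: the lower of the RPM cap
    and the rate at which the token budget is consumed."""
    token_bound = tpm_limit / avg_tokens_per_request
    return min(rpm_limit, token_bound)

# Developer tier: 600 RPM, 1M TPM. At 2,000 tokens per request,
# TPM binds first: 1,000,000 / 2,000 = 500 requests/minute.
print(max_sustained_rpm(600, 1_000_000, 2_000))  # → 500.0
```

With small requests (say, 500 tokens each) the RPM cap binds instead, so batching more work into each request can raise effective throughput.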

Rate Limit Headers

Every API response includes headers that report your current rate limit status:

| Header | Description |
|---|---|
| X-RateLimit-Limit | Maximum requests allowed per minute for this key |
| X-RateLimit-Remaining | Requests remaining in the current one-minute window |
| X-RateLimit-Reset | Unix timestamp (seconds) when the current window resets |

These headers are present on both successful responses and 429 errors, so you can monitor your usage proactively.

```shell
curl -i https://api.tensoras.ai/v1/chat/completions \
  -H "Authorization: Bearer $TENSORAS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```

Response headers:

```
HTTP/2 200
X-RateLimit-Limit: 600
X-RateLimit-Remaining: 598
X-RateLimit-Reset: 1708200120
```
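A small helper can pull these values out of a response's headers for logging (a sketch using the header names from the table above; `parse_rate_limit_headers` is an illustrative name, not an SDK function):

```python
from datetime import datetime, timezone

def parse_rate_limit_headers(headers):
    """Extract rate-limit status from a response's header mapping."""
    return {
        "limit": int(headers["X-RateLimit-Limit"]),
        "remaining": int(headers["X-RateLimit-Remaining"]),
        # X-RateLimit-Reset is a Unix timestamp in seconds
        "resets_at": datetime.fromtimestamp(
            int(headers["X-RateLimit-Reset"]), tz=timezone.utc
        ),
    }

status = parse_rate_limit_headers({
    "X-RateLimit-Limit": "600",
    "X-RateLimit-Remaining": "598",
    "X-RateLimit-Reset": "1708200120",
})
print(status["remaining"])  # → 598
```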

Handling 429 Errors

When you exceed your rate limit, the API returns a 429 response with a Retry-After header indicating how many seconds to wait:

429 response body:

```json
{
  "error": {
    "message": "Rate limit exceeded. Please retry after 12 seconds.",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}
```

| Header | Description |
|---|---|
| Retry-After | Number of seconds to wait before sending the next request |

Exponential Backoff

If you are writing your own retry logic, use exponential backoff with jitter:

```python
import time
import random

import requests

def call_with_backoff(url, headers, payload, max_retries=5):
    for attempt in range(max_retries):
        response = requests.post(url, headers=headers, json=payload)

        if response.status_code != 429:
            return response

        # Respect the Retry-After header if present;
        # otherwise fall back to exponential backoff (1s, 2s, 4s, ...)
        retry_after = response.headers.get("Retry-After")
        base = int(retry_after) if retry_after else 2 ** attempt
        wait = base + random.uniform(0, 1)  # jitter avoids synchronized retries
        print(f"Rate limited. Retrying in {wait:.1f}s (attempt {attempt + 1})")
        time.sleep(wait)

    raise RuntimeError("Max retries exceeded")
```

The same pattern in JavaScript:

```javascript
async function callWithBackoff(url, options, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const response = await fetch(url, options);

    if (response.status !== 429) {
      return response;
    }

    // Respect the Retry-After header if present;
    // otherwise fall back to exponential backoff (1s, 2s, 4s, ...)
    const retryAfter = response.headers.get("Retry-After");
    const base = retryAfter ? parseInt(retryAfter, 10) : 2 ** attempt;
    const wait = base + Math.random(); // jitter avoids synchronized retries
    console.log(`Rate limited. Retrying in ${wait.toFixed(1)}s (attempt ${attempt + 1})`);
    await new Promise((resolve) => setTimeout(resolve, wait * 1000));
  }

  throw new Error("Max retries exceeded");
}
```

SDK Automatic Retry Behavior

Both the Python and Node.js Tensoras SDKs handle 429 responses automatically. By default, they retry up to 3 times with exponential backoff and respect the Retry-After header.

Python

```python
from tensoras import Tensoras

# Default: max_retries=3
client = Tensoras()

# Customize retry behavior
client = Tensoras(max_retries=5)

# Disable automatic retries
client = Tensoras(max_retries=0)
```

Node.js

```javascript
import Tensoras from "tensoras";

// Default: maxRetries=3
const client = new Tensoras();

// Customize retry behavior
const customClient = new Tensoras({ maxRetries: 5 });

// Disable automatic retries
const noRetryClient = new Tensoras({ maxRetries: 0 });
```

When automatic retries are enabled, the SDK will not throw a rate limit error unless all retry attempts are exhausted.

Best Practices

Batch Requests When Possible

If you need to process many items, send them in larger batches rather than one request per item. Use the Batches API for bulk workloads — it runs at lower priority but is not subject to RPM limits.
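Before handing items to a bulk endpoint, you typically split them into fixed-size groups so each call carries many items. A minimal sketch of that chunking step (the `chunk` helper is illustrative; the Batches API request format itself is not shown here):

```python
def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

batches = chunk(list(range(250)), 100)
print([len(b) for b in batches])  # → [100, 100, 50]
```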

Implement Client-Side Queuing

For high-throughput applications, maintain a local request queue that throttles outgoing requests to stay under your RPM limit:

```python
import time
from collections import deque

from tensoras import Tensoras

client = Tensoras()
RPM_LIMIT = 600
request_timestamps = deque()

def throttled_request(**kwargs):
    # Note: this sketch is single-threaded; guard the deque with a lock
    # if you call it from multiple threads.
    now = time.time()

    # Drop timestamps that have aged out of the one-minute window
    while request_timestamps and request_timestamps[0] < now - 60:
        request_timestamps.popleft()

    # If we are at the limit, wait until the oldest request expires
    if len(request_timestamps) >= RPM_LIMIT:
        sleep_time = 60 - (now - request_timestamps[0])
        time.sleep(max(sleep_time, 0))
        request_timestamps.popleft()  # the oldest entry has now aged out

    request_timestamps.append(time.time())
    return client.chat.completions.create(**kwargs)
```

Use Caching for Repeated Queries

If your application frequently sends the same or similar prompts, cache responses to avoid unnecessary API calls:

```python
from functools import lru_cache

from tensoras import Tensoras

client = Tensoras()

@lru_cache(maxsize=1000)
def cached_completion(model, user_message):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": user_message}],
    )
    return response.choices[0].message.content
```

For production systems, consider using Redis or Memcached with a TTL-based eviction policy.
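The same TTL idea can be sketched in pure Python as a minimal in-process stand-in for Redis or Memcached (the `TTLCache` class is illustrative, not a library API):

```python
import time

class TTLCache:
    """Minimal in-process cache with per-entry expiry."""
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expiry timestamp)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() >= expires_at:
            del self._store[key]  # lazily evict expired entries
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.time() + self.ttl)

cache = TTLCache(ttl_seconds=300)
cache.set("prompt:hello", "Hi there!")
print(cache.get("prompt:hello"))  # → Hi there!
```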

Monitor Usage with Headers

Log the X-RateLimit-Remaining header in your application to track how close you are to your limit. Set up alerts when remaining requests drop below a threshold (e.g., 10% of your limit).
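A threshold check like this can drive that alert (a sketch; `should_alert` is an illustrative name, and the 10% default follows the suggestion above):

```python
def should_alert(remaining, limit, threshold=0.10):
    """True when remaining requests drop below `threshold` of the limit."""
    return remaining < limit * threshold

# Developer tier: limit 600, alert below 60 remaining
print(should_alert(55, 600))  # → True
print(should_alert(60, 600))  # → False
```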

Spread Requests Evenly

Avoid sending all requests in a burst at the start of each minute. Distribute requests evenly across the window to maximize throughput and avoid hitting the limit.
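One simple way to do this client-side (a sketch, not an SDK feature) is to enforce a minimum interval of 60/RPM seconds between sends:

```python
import time

class Pacer:
    """Spaces calls evenly: at most `rpm` sends per minute,
    i.e. one every 60/rpm seconds."""
    def __init__(self, rpm):
        self.interval = 60.0 / rpm
        self._last = 0.0

    def wait(self):
        # Sleep just long enough to keep `interval` between sends
        now = time.monotonic()
        sleep_for = self._last + self.interval - now
        if sleep_for > 0:
            time.sleep(sleep_for)
        self._last = time.monotonic()

pacer = Pacer(rpm=600)  # one request every 0.1 s
print(round(pacer.interval, 3))  # → 0.1
```

Call `pacer.wait()` before each request; bursts are smoothed into an even stream without changing total throughput.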

Next Steps