Introduction
Tensoras.ai is an open-source AI inference-as-a-service platform that gives you fast, reliable access to leading open-weight language models through a single OpenAI-compatible API. Deploy chat completions, embeddings, reranking, and retrieval-augmented generation (RAG) without managing GPU infrastructure, model weights, or serving frameworks.
Tensoras delivers high-throughput inference with automatic load balancing, streaming support, and built-in Knowledge Bases for RAG workflows. Drop in your existing OpenAI SDK code, point it at https://api.tensoras.ai/v1, and start running inference immediately.
Key Features
- OpenAI-compatible API — swap your base URL and keep your existing code
- Chat completions with streaming, tool calling, JSON mode, and structured outputs
- Embeddings and reranking for search and retrieval pipelines
- Knowledge Bases — managed RAG with hybrid search (vector + keyword), citations, and connectors for S3, Confluence, Notion, web crawlers, and file uploads
- Responses API — agentic tool-calling loop that runs multi-turn retrieval and reasoning server-side in a single request
- Leading open-weight models — Llama 3.3 70B, Qwen 3 32B, DeepSeek R1 70B, Codestral 22B, and more
- High-throughput inference — optimized serving for maximum performance
- SDKs for Python and Node.js with full type safety
- Prompt caching for reduced latency on repeated prefixes
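Streamed completions arrive as OpenAI-style server-sent events: each chunk is a `data: {...}` line and the stream ends with `data: [DONE]`. A minimal sketch of assembling the reply text from those lines (the chunk shape follows the OpenAI streaming format; the sample payloads below are illustrative, not captured from a live response):

```python
import json

def collect_stream_text(sse_lines):
    """Assemble the assistant's reply from OpenAI-style SSE chunk lines."""
    parts = []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        parts.append(delta.get("content") or "")
    return "".join(parts)

# Illustrative chunk lines, shaped like an OpenAI streaming response.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": ", world."}}]}',
    "data: [DONE]",
]
print(collect_stream_text(sample))  # -> Hello, world.
```

In practice the SDKs handle this for you by yielding parsed chunks from an iterator; the sketch just shows what is on the wire.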
Quick Example
Get a chat completion in a few lines of Python:
```python
from tensoras import Tensoras

client = Tensoras(api_key="tns_your_key_here")

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain transformers in two sentences."},
    ],
)

print(response.choices[0].message.content)
```

Output:

```
Transformers are a neural network architecture that uses self-attention mechanisms to process input sequences in parallel, enabling efficient capture of long-range dependencies. They form the backbone of modern large language models like GPT and Llama, powering tasks from text generation to translation.
```

Why Tensoras?
Zero infrastructure to manage
No GPUs to provision, no model weights to download, no CUDA drivers to debug. Send an API request and get a response.
OpenAI-compatible from day one
Tensoras implements the OpenAI API spec. If your code works with the OpenAI SDK, it works with Tensoras — just change the base URL and API key.
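Because the wire format is the standard OpenAI chat-completions JSON, there is nothing provider-specific to construct. A sketch using only the Python standard library (the request is built but not sent, since sending requires a live key):

```python
import json
import urllib.request

API_KEY = "tns_your_key_here"  # placeholder key

# The exact payload the OpenAI SDK would serialize for you.
payload = {
    "model": "llama-3.3-70b",
    "messages": [{"role": "user", "content": "Hello!"}],
}

req = urllib.request.Request(
    "https://api.tensoras.ai/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)

# urllib.request.urlopen(req) would send it; the response body comes
# back as standard OpenAI chat-completion JSON.
```

The same swap works in the official OpenAI SDKs by passing the Tensoras base URL and API key when constructing the client.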
Built-in RAG with Knowledge Bases
Upload documents, connect data sources, and query with citations in a single platform. No need to wire together a separate vector database, chunking pipeline, and retrieval service.
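Hybrid search merges a vector (semantic) ranking with a keyword ranking into a single result list. One common way to combine them is reciprocal rank fusion (RRF), shown here purely to illustrate the idea — not as Tensoras's actual scoring formula:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one.

    Each document scores sum(1 / (k + rank)) over the lists it
    appears in; k=60 is the constant from the original RRF paper.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]   # semantic ranking
keyword_hits = ["doc_b", "doc_d", "doc_a"]  # keyword (BM25-style) ranking
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
# -> ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that rank well in both lists (like `doc_b` above) float to the top, which is why hybrid search tends to beat either ranking alone.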
Open-weight models, no vendor lock-in
Run Llama, Qwen, Mistral, DeepSeek, and Codestral. Your prompts and data stay portable across any provider that serves the same models.
High throughput at low cost
Tensoras delivers best-in-class throughput for open-weight models and passes those efficiency gains on to you through competitive pricing.
Next Steps
- Quickstart — go from zero to your first API call in under five minutes
- Authentication — API keys, scopes, and rate limits
- API Reference — full endpoint documentation
- SDKs — Python and Node.js client libraries
- RAG Overview — Knowledge Bases, hybrid search, and connectors
- Integrations — LangChain, LlamaIndex, Vercel AI SDK, and more