Introduction
Tensoras.ai is an open-source AI inference-as-a-service platform that gives you fast, reliable access to leading open-weight language models through a single OpenAI-compatible API. Deploy chat completions, embeddings, reranking, and retrieval-augmented generation (RAG) without managing GPU infrastructure, model weights, or serving frameworks.
Tensoras delivers high-throughput inference with automatic load balancing, streaming support, and built-in Knowledge Bases for RAG workflows. Drop in your existing OpenAI SDK code, point it at https://api.tensoras.ai/v1, and start running inference immediately.
Key Features
- OpenAI-compatible API — swap your base URL and keep your existing code
- Chat completions with streaming, tool calling, JSON mode, and structured outputs
- Embeddings and reranking for search and retrieval pipelines
- Knowledge Bases — managed RAG with hybrid search (vector + keyword), citations, and connectors for S3, Confluence, Notion, web crawlers, and file uploads
- Responses API — agentic tool-calling loop that runs multi-turn retrieval and reasoning server-side in a single request
- Leading open-weight models — Llama 3.3 70B, Qwen 3 32B, DeepSeek R1 70B, Codestral 22B, and more
- High-throughput inference — optimized serving for maximum performance
- SDKs for Python and Node.js with full type safety
- Prompt caching for reduced latency on repeated prefixes
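Streamed completions arrive as OpenAI-style server-sent events: each chunk is a `data: {...}` line and the stream ends with `data: [DONE]`. A minimal sketch of assembling the reply text from those lines (the chunk shape follows the OpenAI streaming format; the sample payloads below are illustrative, not captured from a live response):

```python
import json

def collect_stream_text(sse_lines):
    """Assemble the assistant's reply from OpenAI-style SSE chunk lines."""
    parts = []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        parts.append(delta.get("content") or "")
    return "".join(parts)

# Illustrative chunk lines, shaped like an OpenAI streaming response.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": ", world."}}]}',
    "data: [DONE]",
]
print(collect_stream_text(sample))  # -> Hello, world.
```

In practice the SDKs handle this for you by yielding parsed chunks from an iterator; the sketch just shows what is on the wire.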
Quick Example
Get a chat completion in a few lines of Python:
```python
from tensoras import Tensoras

client = Tensoras(api_key="tns_your_key_here")

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain transformers in two sentences."},
    ],
)

print(response.choices[0].message.content)
```

Output:

```
Transformers are a neural network architecture that uses self-attention mechanisms to process input sequences in parallel, enabling efficient capture of long-range dependencies. They form the backbone of modern large language models like GPT and Llama, powering tasks from text generation to translation.
```

Why Tensoras?
Zero infrastructure to manage
No GPUs to provision, no model weights to download, no CUDA drivers to debug. Send an API request and get a response.
OpenAI-compatible from day one
Tensoras implements the OpenAI API spec. If your code works with the OpenAI SDK, it works with Tensoras — just change the base URL and API key.
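Because the wire format is the standard OpenAI chat-completions JSON, there is nothing provider-specific to construct. A sketch using only the Python standard library (the request is built but not sent, since sending requires a live key):

```python
import json
import urllib.request

API_KEY = "tns_your_key_here"  # placeholder key

# The exact payload the OpenAI SDK would serialize for you.
payload = {
    "model": "llama-3.3-70b",
    "messages": [{"role": "user", "content": "Hello!"}],
}

req = urllib.request.Request(
    "https://api.tensoras.ai/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)

# urllib.request.urlopen(req) would send it; the response body comes
# back as standard OpenAI chat-completion JSON.
```

The same swap works in the official OpenAI SDKs by passing the Tensoras base URL and API key when constructing the client.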
Built-in RAG with Knowledge Bases
Upload documents, connect data sources, and query with citations in a single platform. No need to wire together a separate vector database, chunking pipeline, and retrieval service.
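Hybrid search merges a vector (semantic) ranking with a keyword ranking into a single result list. One common way to combine them is reciprocal rank fusion (RRF), shown here purely to illustrate the idea — not as Tensoras's actual scoring formula:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one.

    Each document scores sum(1 / (k + rank)) over the lists it
    appears in; k=60 is the constant from the original RRF paper.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]   # semantic ranking
keyword_hits = ["doc_b", "doc_d", "doc_a"]  # keyword (BM25-style) ranking
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
# -> ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that rank well in both lists (like `doc_b` above) float to the top, which is why hybrid search tends to beat either ranking alone.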
Open-weight models, no vendor lock-in
Run Llama, Qwen, Mistral, DeepSeek, and Codestral. Your prompts and data stay portable across any provider that serves the same models.
High throughput at low cost
Tensoras delivers best-in-class throughput for open-weight models and passes those efficiency gains on to you through competitive pricing.
Next Steps
- Quickstart — go from zero to your first API call in under five minutes
- Authentication — API keys, scopes, and rate limits
- API Reference — full endpoint documentation
- SDKs — Python and Node.js client libraries
- RAG Overview — Knowledge Bases, hybrid search, and connectors
- Integrations — LangChain, LlamaIndex, Vercel AI SDK, and more