Vision (Multimodal Input)
Tensoras supports image inputs in chat completion messages. Pass images by URL or as base64-encoded data — the model can describe, analyze, compare, and reason about them.
Supported Models
Vision is available on models with vision capability:
- llama-3.2-11b-vision
- llama-3.2-90b-vision
- pixtral-12b
Sending Images
Images can be included in the content field of a user message as an array of content parts.
Image URL
Python
from tensoras import Tensoras

client = Tensoras()

response = client.chat.completions.create(
    model="llama-3.2-11b-vision",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png",
                    },
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
Node.js
import { Tensoras } from "@tensoras/sdk";

const client = new Tensoras();

const response = await client.chat.completions.create({
  model: "llama-3.2-11b-vision",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "What is in this image?" },
        {
          type: "image_url",
          image_url: {
            url: "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png",
          },
        },
      ],
    },
  ],
});

console.log(response.choices[0].message.content);
curl
curl https://api.tensoras.ai/v1/chat/completions \
  -H "Authorization: Bearer $TENSORAS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-11b-vision",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is in this image?"},
          {
            "type": "image_url",
            "image_url": {
              "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png"
            }
          }
        ]
      }
    ]
  }'
Base64 Image
Send locally stored images as base64-encoded data URIs:
import base64
from tensoras import Tensoras

client = Tensoras()

with open("photo.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="llama-3.2-11b-vision",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{image_data}",
                        "detail": "high",
                    },
                },
            ],
        }
    ],
)
Multiple Images
Include multiple images in a single message to compare or analyze them together:
response = client.chat.completions.create(
    model="llama-3.2-11b-vision",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What are the differences between these two images?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/before.jpg"},
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/after.jpg"},
                },
            ],
        }
    ],
)
Detail Level
The detail parameter controls image resolution and token cost:
| Detail | Token Cost | Use When |
|---|---|---|
| "low" | 85 tokens flat | Quick classification, thumbnails |
| "high" | 85 base tokens (charged once per image) + 170 tokens per 512×512 tile | Detailed analysis, text in images |
| "auto" (default) | Adaptive | Let the API decide based on image size |
# Force low detail for cost savings
{
    "type": "image_url",
    "image_url": {
        "url": "https://example.com/photo.jpg",
        "detail": "low"
    }
}
Remote URLs
When providing image URLs, the server fetches the image at inference time. Remote URLs are not size-validated at the API layer. Ensure URLs point to images under 20 MB; larger images may cause errors.
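Since oversized remote images fail only at inference time, you can pre-check a URL yourself with a HEAD request before putting it in a message. A minimal sketch (the helper names are our own, not part of the SDK; some servers omit Content-Length, in which case the check is inconclusive and it is simplest to let the API report any error):

```python
from typing import Optional
import urllib.request

MAX_IMAGE_BYTES = 20 * 1024 * 1024  # the 20 MB limit noted above


def within_size_limit(content_length: Optional[str]) -> bool:
    # A missing Content-Length header is inconclusive; accept the URL
    # and let the API surface the error instead of rejecting it locally.
    return content_length is None or int(content_length) <= MAX_IMAGE_BYTES


def remote_image_looks_ok(url: str) -> bool:
    """HEAD the URL and compare the reported size to the 20 MB limit."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return within_size_limit(resp.headers.get("Content-Length"))
```

This is a best-effort guard, not a guarantee: the image could change between the HEAD request and inference time.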
Global Resolution Override
Set media_resolution on the request to override the detail level for all images at once — useful when you want consistent cost control across many images without setting detail on each one individually.
| Value | Effect |
|---|---|
| "low" | Forces all images to low detail (85 tokens each); lowest cost |
| "auto" | Default adaptive behavior; the API decides based on image dimensions |
| "high" | Forces all images to high detail; highest quality, highest cost |
response = client.chat.completions.create(
    model="llama-3.2-11b-vision",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Classify each of these images:"},
                {"type": "image_url", "image_url": {"url": "https://example.com/img1.jpg"}},
                {"type": "image_url", "image_url": {"url": "https://example.com/img2.jpg"}},
                {"type": "image_url", "image_url": {"url": "https://example.com/img3.jpg"}},
            ],
        }
    ],
    media_resolution="low",  # override all images to low detail
)
Image Token Billing
Image tokens are billed at 1.5× the standard input token rate.
Token costs for detail: "high":
- Images are downscaled to fit within 2048×2048 before tiling
- A base cost of 85 tokens is always added per image
- Each 512×512 tile costs an additional 170 tokens
For example, a 1024×1024 image at high detail: 85 base + 4 tiles × 170 = 765 tokens, billed at 1.5× the input rate.
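The arithmetic above can be folded into a small estimator, a sketch of our own rather than an SDK utility, following the downscale-then-tile rules as stated:

```python
import math


def high_detail_tokens(width: int, height: int) -> int:
    """Estimate the token cost of one image at detail: "high",
    per the rules above: downscale to fit 2048x2048, then
    85 base tokens plus 170 tokens per 512x512 tile."""
    # Downscale to fit within 2048x2048, preserving aspect ratio
    scale = min(1.0, 2048 / max(width, height))
    width = math.ceil(width * scale)
    height = math.ceil(height * scale)
    # 85 base tokens per image, plus 170 tokens per 512x512 tile
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles


print(high_detail_tokens(1024, 1024))  # 765, matching the example above
```

Remember that the result is in image tokens; multiply by 1.5× the standard input rate to estimate the billed cost.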
Supported Formats
JPEG, PNG, GIF, WebP. Maximum size: 20 MB.
For animated GIFs, only the first frame dimensions are used for token calculation.
Unsupported Input
The following inputs return HTTP 400 errors:
| Scenario | Error |
|---|---|
| Sending an image to a non-vision model | 400 with message "does not support vision/image inputs" |
| Sending an image in an unsupported format | 400 with a format-specific error message |
Always check that the model you are using supports vision before sending image content parts. See Supported Models above for the current list.
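One way to fail fast is a local pre-flight check before making the request at all. A sketch, with the model list copied from Supported Models above (the helper names are ours, and the hardcoded set must be kept in sync with the docs):

```python
# Vision-capable models per "Supported Models" above; keep in sync
VISION_MODELS = {"llama-3.2-11b-vision", "llama-3.2-90b-vision", "pixtral-12b"}


def contains_image_parts(messages: list) -> bool:
    """Return True if any message carries an image_url content part."""
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):
            if any(part.get("type") == "image_url" for part in content):
                return True
    return False


def validate_vision_request(model: str, messages: list) -> None:
    """Raise locally instead of waiting for the server's 400."""
    if contains_image_parts(messages) and model not in VISION_MODELS:
        raise ValueError(f"Model {model!r} does not support image inputs")
```

This avoids paying for a round trip that is guaranteed to fail, at the cost of maintaining the model list client-side.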
Best Practices
- Use detail: "low" for quick yes/no classification tasks where fine details do not matter.
- Use detail: "high" for reading text in images, analyzing charts, or inspecting fine-grained visual details.
- Images larger than 2048px on any side are automatically downscaled before tiling — you do not need to resize them yourself.
- Base64 data URIs are best for local files. Public URLs are best for remotely hosted images, as they avoid inflating request payload size.
- Avoid sending the same large image multiple times within one conversation — reference the URL repeatedly instead of re-encoding the bytes.
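The last two points can be folded into one small helper that builds an image content part from either a local path or a remote URL (a sketch of our own; the function name is not part of the SDK):

```python
import base64
import mimetypes
import os


def image_part(source: str, detail: str = "auto") -> dict:
    """Build an image_url content part. Local files become base64
    data URIs; anything else is passed through as a remote URL."""
    if os.path.exists(source):
        # Guess the MIME type from the extension; default to JPEG
        mime = mimetypes.guess_type(source)[0] or "image/jpeg"
        with open(source, "rb") as f:
            data = base64.b64encode(f.read()).decode("utf-8")
        url = f"data:{mime};base64,{data}"
    else:
        url = source
    return {"type": "image_url", "image_url": {"url": url, "detail": detail}}
```

Usage then collapses to `image_part("photo.jpg", detail="high")` or `image_part("https://example.com/photo.jpg", detail="low")` inside the message's content list.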