Vision (Multimodal Input)
Tensoras supports image inputs in chat completion messages. Pass images by URL or as base64-encoded data — the model can describe, analyze, compare, and reason about them.
Supported Models
Vision is available on models with vision capability:
- llama-3.2-11b-vision
- llama-3.2-90b-vision
- pixtral-12b
Sending Images
Images can be included in the content field of a user message as an array of content parts.
Image URL
Python
from tensoras import Tensoras

client = Tensoras()

response = client.chat.completions.create(
    model="llama-3.2-11b-vision",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png",
                    },
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
Node.js
import { Tensoras } from "@tensoras/sdk";

const client = new Tensoras();

const response = await client.chat.completions.create({
  model: "llama-3.2-11b-vision",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "What is in this image?" },
        {
          type: "image_url",
          image_url: {
            url: "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png",
          },
        },
      ],
    },
  ],
});

console.log(response.choices[0].message.content);
curl
curl https://api.tensoras.ai/v1/chat/completions \
  -H "Authorization: Bearer $TENSORAS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-11b-vision",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is in this image?"},
          {
            "type": "image_url",
            "image_url": {
              "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png"
            }
          }
        ]
      }
    ]
  }'
Base64 Image
Send locally stored images as base64-encoded data URIs:
import base64
from tensoras import Tensoras

client = Tensoras()

with open("photo.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="llama-3.2-11b-vision",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{image_data}",
                        "detail": "high",
                    },
                },
            ],
        }
    ],
)
Multiple Images
Include multiple images in a single message to compare or analyze them together:
response = client.chat.completions.create(
    model="llama-3.2-11b-vision",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What are the differences between these two images?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/before.jpg"},
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/after.jpg"},
                },
            ],
        }
    ],
)
Detail Level
The detail parameter controls image resolution and token cost:
| Detail | Token Cost | Use When |
|---|---|---|
| "low" | 85 tokens flat | Quick classification, thumbnails |
| "high" | 85 base tokens (charged once per image) + 170 tokens per 512×512 tile | Detailed analysis, text in images |
| "auto" (default) | Adaptive | Let the API decide based on image size |
# Force low detail for cost savings
{
    "type": "image_url",
    "image_url": {
        "url": "https://example.com/photo.jpg",
        "detail": "low"
    }
}
Remote URLs
When providing image URLs, the server fetches the image at inference time. Remote URLs are not size-validated at the API layer. Ensure URLs point to images under 20 MB; larger images may cause errors.
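Since oversized remote images fail only at inference time, you can pre-check a URL yourself with a HEAD request before putting it in a message. A minimal sketch (the helper names are our own, not part of the SDK; some servers omit Content-Length, in which case the check is inconclusive and it is simplest to let the API report any error):

```python
from typing import Optional
import urllib.request

MAX_IMAGE_BYTES = 20 * 1024 * 1024  # the 20 MB limit noted above


def within_size_limit(content_length: Optional[str]) -> bool:
    # A missing Content-Length header is inconclusive; accept the URL
    # and let the API surface the error instead of rejecting it locally.
    return content_length is None or int(content_length) <= MAX_IMAGE_BYTES


def remote_image_looks_ok(url: str) -> bool:
    """HEAD the URL and compare the reported size to the 20 MB limit."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return within_size_limit(resp.headers.get("Content-Length"))
```

This is a best-effort guard, not a guarantee: the image could change between the HEAD request and inference time.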
Global Resolution Override
Set media_resolution on the request to override the detail level for all images at once — useful when you want consistent cost control across many images without setting detail on each one individually.
| Value | Effect |
|---|---|
| "low" | Forces all images to low detail (85 tokens each); lowest cost |
| "auto" | Default adaptive behavior; the API decides based on image dimensions |
| "high" | Forces all images to high detail; highest quality, highest cost |
response = client.chat.completions.create(
    model="llama-3.2-11b-vision",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Classify each of these images:"},
                {"type": "image_url", "image_url": {"url": "https://example.com/img1.jpg"}},
                {"type": "image_url", "image_url": {"url": "https://example.com/img2.jpg"}},
                {"type": "image_url", "image_url": {"url": "https://example.com/img3.jpg"}},
            ],
        }
    ],
    media_resolution="low",  # override all images to low detail
)
Image Token Billing
Image tokens are billed at 1.5× the standard input token rate.
Token costs for detail: "high":
- Images are downscaled to fit within 2048×2048 before tiling
- A base cost of 85 tokens is always added per image
- Each 512×512 tile costs an additional 170 tokens
For example, a 1024×1024 image at high detail: 85 base + 4 tiles × 170 = 765 tokens, billed at 1.5× the input rate.
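The arithmetic above can be folded into a small estimator, a sketch of our own rather than an SDK utility, following the downscale-then-tile rules as stated:

```python
import math


def high_detail_tokens(width: int, height: int) -> int:
    """Estimate the token cost of one image at detail: "high",
    per the rules above: downscale to fit 2048x2048, then
    85 base tokens plus 170 tokens per 512x512 tile."""
    # Downscale to fit within 2048x2048, preserving aspect ratio
    scale = min(1.0, 2048 / max(width, height))
    width = math.ceil(width * scale)
    height = math.ceil(height * scale)
    # 85 base tokens per image, plus 170 tokens per 512x512 tile
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles


print(high_detail_tokens(1024, 1024))  # 765, matching the example above
```

Remember that the result is in image tokens; multiply by 1.5× the standard input rate to estimate the billed cost.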
Supported Formats
JPEG, PNG, GIF, WebP. Maximum size: 20 MB.
For animated GIFs, only the first frame dimensions are used for token calculation.
Unsupported Input
The following inputs return HTTP 400 errors:
| Scenario | Error |
|---|---|
| Sending an image to a non-vision model | 400 with message "does not support vision/image inputs" |
| Sending an image in an unsupported format | 400 with a format-specific error message |
Always check that the model you are using supports vision before sending image content parts. See Supported Models above for the current list.
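One way to fail fast is a local pre-flight check before making the request at all. A sketch, with the model list copied from Supported Models above (the helper names are ours, and the hardcoded set must be kept in sync with the docs):

```python
# Vision-capable models per "Supported Models" above; keep in sync
VISION_MODELS = {"llama-3.2-11b-vision", "llama-3.2-90b-vision", "pixtral-12b"}


def contains_image_parts(messages: list) -> bool:
    """Return True if any message carries an image_url content part."""
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):
            if any(part.get("type") == "image_url" for part in content):
                return True
    return False


def validate_vision_request(model: str, messages: list) -> None:
    """Raise locally instead of waiting for the server's 400."""
    if contains_image_parts(messages) and model not in VISION_MODELS:
        raise ValueError(f"Model {model!r} does not support image inputs")
```

This avoids paying for a round trip that is guaranteed to fail, at the cost of maintaining the model list client-side.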
Best Practices
- Use detail: "low" for quick yes/no classification tasks where fine details do not matter.
- Use detail: "high" for reading text in images, analyzing charts, or inspecting fine-grained visual details.
- Images larger than 2048px on any side are automatically downscaled before tiling — you do not need to resize them yourself.
- Base64 data URIs are best for local files. Public URLs are best for remotely hosted images, as they avoid inflating request payload size.
- Avoid sending the same large image multiple times within one conversation — reference the URL repeatedly instead of re-encoding the bytes.
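The last two points can be folded into one small helper that builds an image content part from either a local path or a remote URL (a sketch of our own; the function name is not part of the SDK):

```python
import base64
import mimetypes
import os


def image_part(source: str, detail: str = "auto") -> dict:
    """Build an image_url content part. Local files become base64
    data URIs; anything else is passed through as a remote URL."""
    if os.path.exists(source):
        # Guess the MIME type from the extension; default to JPEG
        mime = mimetypes.guess_type(source)[0] or "image/jpeg"
        with open(source, "rb") as f:
            data = base64.b64encode(f.read()).decode("utf-8")
        url = f"data:{mime};base64,{data}"
    else:
        url = source
    return {"type": "image_url", "image_url": {"url": url, "detail": detail}}
```

Usage then collapses to `image_part("photo.jpg", detail="high")` or `image_part("https://example.com/photo.jpg", detail="low")` inside the message's content list.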