
Vision (Multimodal Input)

Tensoras supports image inputs in chat completion messages. Pass images by URL or as base64-encoded data — the model can describe, analyze, compare, and reason about them.

Supported Models

Vision input is available on the following models:

  • llama-3.2-11b-vision
  • llama-3.2-90b-vision
  • pixtral-12b

Sending Images

Images can be included in the content field of a user message as an array of content parts.

Image URL

Python

from tensoras import Tensoras
 
client = Tensoras()
 
response = client.chat.completions.create(
    model="llama-3.2-11b-vision",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png",
                    },
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)

Node.js

import { Tensoras } from "@tensoras/sdk";
 
const client = new Tensoras();
 
const response = await client.chat.completions.create({
  model: "llama-3.2-11b-vision",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "What is in this image?" },
        {
          type: "image_url",
          image_url: {
            url: "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png",
          },
        },
      ],
    },
  ],
});
console.log(response.choices[0].message.content);

curl

curl https://api.tensoras.ai/v1/chat/completions \
  -H "Authorization: Bearer $TENSORAS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-11b-vision",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is in this image?"},
          {
            "type": "image_url",
            "image_url": {
              "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png"
            }
          }
        ]
      }
    ]
  }'

Base64 Image

Send locally stored images as base64-encoded data URIs:

import base64
from tensoras import Tensoras
 
client = Tensoras()
 
with open("photo.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")
 
response = client.chat.completions.create(
    model="llama-3.2-11b-vision",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{image_data}",
                        "detail": "high",
                    },
                },
            ],
        }
    ],
)

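The MIME prefix in the data URI (`data:image/jpeg;base64,…` above) should match the file's actual format. A small helper that builds the data URI from a local path is sketched below; the function name is our own, not part of the SDK, and it assumes the file extension reflects the real format:

```python
import base64
import mimetypes

def image_data_uri(path: str) -> str:
    # Guess the MIME type from the extension (e.g. .png -> image/png).
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image file: {path}")
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"
```

Pass the returned string as the `url` field of an `image_url` content part.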
Multiple Images

Include multiple images in a single message to compare or analyze them together:

response = client.chat.completions.create(
    model="llama-3.2-11b-vision",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What are the differences between these two images?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/before.jpg"},
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/after.jpg"},
                },
            ],
        }
    ],
)

Detail Level

The detail parameter controls image resolution and token cost:

| Detail | Token Cost | Use When |
| --- | --- | --- |
| `"low"` | 85 tokens flat | Quick classification, thumbnails |
| `"high"` | 85 base tokens (charged once per image) + 170 tokens per 512×512 tile | Detailed analysis, text in images |
| `"auto"` (default) | Adaptive | Let the API decide based on image size |

# Force low detail for cost savings
{
    "type": "image_url",
    "image_url": {
        "url": "https://example.com/photo.jpg",
        "detail": "low"
    }
}

Remote URLs

When providing image URLs, the server fetches the image at inference time. Remote URLs are not size-validated at the API layer. Ensure URLs point to images under 20 MB; larger images may cause errors.
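Because the API does not size-check remote URLs, you may want to verify the size client-side before sending. A minimal sketch using a `HEAD` request follows; the helper names are our own, and servers that omit `Content-Length` are conservatively rejected:

```python
import urllib.request

MAX_IMAGE_BYTES = 20 * 1024 * 1024  # the 20 MB limit noted above

def size_ok(content_length) -> bool:
    # A missing Content-Length header means the size can't be verified; reject.
    return content_length is not None and int(content_length) < MAX_IMAGE_BYTES

def remote_image_within_limit(url: str) -> bool:
    # HEAD asks the server for headers only, without downloading the body.
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return size_ok(resp.headers.get("Content-Length"))
```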

Global Resolution Override

Set media_resolution on the request to override the detail level for all images at once — useful when you want consistent cost control across many images without setting detail on each one individually.

| Value | Effect |
| --- | --- |
| `"low"` | Forces all images to low detail (85 tokens each); lowest cost |
| `"auto"` | Default adaptive behavior; the API decides based on image dimensions |
| `"high"` | Forces all images to high detail; highest quality, highest cost |

response = client.chat.completions.create(
    model="llama-3.2-11b-vision",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Classify each of these images:"},
                {"type": "image_url", "image_url": {"url": "https://example.com/img1.jpg"}},
                {"type": "image_url", "image_url": {"url": "https://example.com/img2.jpg"}},
                {"type": "image_url", "image_url": {"url": "https://example.com/img3.jpg"}},
            ],
        }
    ],
    media_resolution="low",  # override all images to low detail
)

Image Token Billing

Image tokens are billed at 1.5× the standard input token rate.

Token costs for detail: "high":

  • Images are downscaled to fit within 2048×2048 before tiling
  • A base cost of 85 tokens is always added per image
  • Each 512×512 tile costs an additional 170 tokens

For example, a 1024×1024 image at high detail: 85 base + 4 tiles × 170 = 765 tokens, billed at 1.5× the input rate.
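That arithmetic can be sketched as a small estimator. It follows the rules stated above (downscale to fit within 2048×2048, 85 base tokens, 170 per 512×512 tile); the proportional-scaling interpretation and the function name are our assumptions:

```python
import math

def high_detail_image_tokens(width: int, height: int) -> int:
    # Downscale proportionally so the longest side fits within 2048.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Count the 512x512 tiles needed to cover the (possibly scaled) image.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

print(high_detail_image_tokens(1024, 1024))  # 765, matching the example above
```

Multiply the result by 1.5× the standard input token rate to get the billed cost.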

Supported Formats

JPEG, PNG, GIF, WebP. Maximum size: 20 MB.

For animated GIFs, only the first frame's dimensions are used for token calculation.

Unsupported Input

The following inputs return HTTP 400 errors:

| Scenario | Error |
| --- | --- |
| Sending an image to a non-vision model | 400 with message "does not support vision/image inputs" |
| Sending an image in an unsupported format | 400 with a format-specific error message |

Always check that the model you are using supports vision before sending image content parts. See Supported Models above for the current list.
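One way to catch this before the request leaves your process is a client-side guard. The helpers below are our own sketch, and the model set simply mirrors the Supported Models list above:

```python
# Mirrors the "Supported Models" list above; update it as models change.
VISION_MODELS = {"llama-3.2-11b-vision", "llama-3.2-90b-vision", "pixtral-12b"}

def has_image_parts(messages) -> bool:
    # True if any message carries an image_url content part.
    return any(
        part.get("type") == "image_url"
        for msg in messages
        if isinstance(msg.get("content"), list)
        for part in msg["content"]
    )

def check_vision_request(model: str, messages) -> None:
    # Raise locally instead of waiting for the API's 400 response.
    if has_image_parts(messages) and model not in VISION_MODELS:
        raise ValueError(f"{model} does not support vision/image inputs")
```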

Best Practices

  • Use detail: "low" for quick yes/no classification tasks where fine details do not matter.
  • Use detail: "high" for reading text in images, analyzing charts, or inspecting fine-grained visual details.
  • Images larger than 2048px on any side are automatically downscaled before tiling — you do not need to resize them yourself.
  • Base64 data URIs are best for local files. Public URLs are best for remotely hosted images, as they avoid inflating request payload size.
  • Avoid sending the same large image multiple times within one conversation — reference the URL repeatedly instead of re-encoding the bytes.