
Content Moderation

Tensoras provides a content moderation API to detect harmful content in text, plus guardrail policies that automatically screen inference requests and responses before they reach the model or your users.

Moderation Endpoint

POST /v1/moderations — check text for harmful content across several categories.

Categories

Category     Description
harassment   Threatening, intimidating, or abusive language directed at individuals
hate         Content expressing hatred toward groups based on protected characteristics
self_harm    Content that promotes, glorifies, or depicts self-harm or suicide
sexual       Explicit or suggestive sexual content
violence     Content that depicts or promotes violence against people or animals

Each category returns both a boolean flagged value and a continuous score from 0 to 1. Higher scores indicate greater confidence that the content belongs to that category.

Python

from tensoras import Tensoras
 
client = Tensoras(api_key="tns_your_key_here")
 
response = client.moderations.create(
    input="I want to hurt someone.",
)
 
result = response.results[0]
print(f"Flagged: {result.flagged}")
print(f"Violence score: {result.category_scores.violence:.4f}")
print(f"Harassment score: {result.category_scores.harassment:.4f}")

Node.js

import Tensoras from "tensoras";
 
const client = new Tensoras({ apiKey: "tns_your_key_here" });
 
const response = await client.moderations.create({
  input: "I want to hurt someone.",
});
 
const result = response.results[0];
console.log(`Flagged: ${result.flagged}`);
console.log(`Violence score: ${result.category_scores.violence.toFixed(4)}`);
console.log(`Harassment score: ${result.category_scores.harassment.toFixed(4)}`);

curl

curl https://api.tensoras.ai/v1/moderations \
  -H "Authorization: Bearer tns_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "I want to hurt someone."
  }'

Response Format

{
  "id": "modr-abc123",
  "model": "text-moderation-latest",
  "results": [
    {
      "flagged": true,
      "categories": {
        "harassment": false,
        "hate": false,
        "self_harm": false,
        "sexual": false,
        "violence": true
      },
      "category_scores": {
        "harassment": 0.0421,
        "hate": 0.0089,
        "self_harm": 0.0031,
        "sexual": 0.0012,
        "violence": 0.9147
      }
    }
  ]
}

The top-level flagged field is true if any category exceeds its threshold. Individual category_scores values are floating-point probabilities between 0 and 1.
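To make the relationship between scores and the top-level flag concrete, here is a minimal sketch of the roll-up logic. The 0.5 cutoff is an assumption for illustration only; the moderation endpoint does not expose its per-category thresholds.

```python
def is_flagged(category_scores: dict, threshold: float = 0.5) -> bool:
    """Flag content if any category score meets or exceeds the cutoff."""
    return any(score >= threshold for score in category_scores.values())

# Scores from the example response above:
scores = {
    "harassment": 0.0421,
    "hate": 0.0089,
    "self_harm": 0.0031,
    "sexual": 0.0012,
    "violence": 0.9147,
}
print(is_flagged(scores))  # True: violence is well above the assumed cutoff
```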

You can also pass a list of strings to check multiple inputs in one request:

response = client.moderations.create(
    input=["Hello, how are you?", "I want to hurt someone."],
)
 
for i, result in enumerate(response.results):
    print(f"Input {i}: flagged={result.flagged}")

Guardrail Policies

Configure per-organization guardrail policies to automatically screen inference requests and responses. Policies are stored and evaluated server-side with minimal latency overhead (keyword and threshold checks add under 10ms).

Create a Policy

from tensoras import Tensoras
 
client = Tensoras(api_key="tns_your_key_here")
 
policy = client.guardrail_policies.create(
    name="Production Safety Policy",
    enabled=True,
    block_on_flag=True,
    thresholds={
        "harassment": 0.7,
        "hate": 0.6,
        "self_harm": 0.5,
        "sexual": 0.8,
        "violence": 0.7,
    },
    topic_deny_list=[
        {"topic": "competitor product", "action": "warn"},
        {"topic": "internal pricing", "action": "block"},
    ],
    apply_to=["input"],
)
 
print(f"Created policy: {policy.id}")

Category Thresholds

Control how sensitive the guardrail is for each moderation category. Scores that meet or exceed the threshold trigger the policy action.

Category     Default Threshold   Description
harassment   0.5                 Threatening or abusive language
hate         0.5                 Content expressing hatred toward groups
self_harm    0.5                 Self-harm or suicide-related content
sexual       0.5                 Explicit or suggestive sexual content
violence     0.5                 Violence-related content

Set a category threshold to null to use the system default (0.5). Set it to 1.0 to effectively disable that category’s check.
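The null and 1.0 semantics above can be pictured as a small resolution step. This is an illustrative sketch of that behavior, not SDK code:

```python
DEFAULT_THRESHOLD = 0.5

def effective_threshold(configured):
    """Resolve a configured threshold: None (JSON null) falls back to the default."""
    return DEFAULT_THRESHOLD if configured is None else configured

def category_triggers(score, configured):
    """A score triggers the policy action when it meets or exceeds the threshold.
    A threshold of 1.0 only triggers on a perfect 1.0 score, which effectively
    disables the check in practice."""
    return score >= effective_threshold(configured)
```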

Topic Deny-List

Block or warn on specific topics or phrases in user input:

policy = client.guardrail_policies.create(
    name="Domain Restriction Policy",
    topic_deny_list=[
        # Whole-word matching for single words
        {"topic": "gambling", "action": "block"},
        # Substring matching for multi-word phrases
        {"topic": "competitor pricing", "action": "warn"},
        # Warn but allow through
        {"topic": "alcohol", "action": "warn"},
    ],
    block_on_flag=True,
    apply_to=["input"],
)

Matching behavior:

  • Single words (no spaces): matched as whole words using word-boundary rules. The topic hate will match "I hate this" but not "my hatred of injustice".
  • Multi-word phrases (contains spaces): matched as case-insensitive substrings. The topic competitor pricing matches anywhere in the text.
  • All matching is case-insensitive.
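The matching rules above can be sketched in a few lines. This illustrates the documented behavior, not the server's actual implementation:

```python
import re

def topic_matches(topic: str, text: str) -> bool:
    """Approximate the deny-list matching rules described above."""
    if " " in topic:
        # Multi-word phrases: case-insensitive substring match.
        return topic.lower() in text.lower()
    # Single words: whole-word match using word-boundary rules.
    return re.search(rf"\b{re.escape(topic)}\b", text, re.IGNORECASE) is not None
```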

Actions:

Action   Behavior
block    If block_on_flag=True, the request is rejected with HTTP 400
warn     Always passes through but adds a moderation flag to the response

Auto-Moderation

When a guardrail policy is active, all inference requests for the organization are automatically screened. No changes to your inference code are required.

Input moderation — user messages are checked before reaching the model:

  • If flagged and block_on_flag=True: returns HTTP 400 with a content_policy_violation error.
  • If flagged and block_on_flag=False: returns the model response with an X-Moderation-Warning: flagged response header.

Output moderation — model responses are checked before returning to the caller:

  • Applies only to non-streaming responses (streaming output moderation is not supported).
  • If flagged and block_on_flag=True: replaces the response content with a filtered message and sets finish_reason: "content_filter".
  • If flagged and block_on_flag=False: returns the response with an X-Moderation-Warning: output_flagged header.
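Client-side, a response that was filtered in blocking mode can be detected by its finish_reason. A minimal check over the response in dict form, assuming a chat-completions-style choices array (the exact response shape is an assumption here):

```python
def was_output_filtered(response: dict) -> bool:
    """True if any choice was replaced by the guardrail in blocking mode."""
    return any(
        choice.get("finish_reason") == "content_filter"
        for choice in response.get("choices", [])
    )
```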

The apply_to field controls which direction is moderated:

Value         Description
["input"]     Check user messages only (default)
["output"]    Check model responses only
["both"]      Check both input and output

Blocked input example response:

{
  "error": {
    "message": "Content policy violation: input blocked by guardrail policy.",
    "type": "content_policy_violation",
    "param": "messages",
    "code": "content_filtered",
    "reasons": ["Topic 'gambling' matched (blocked)"],
    "policy_id": "gp_abc123"
  }
}
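When handling this error in application code, the structured fields let you distinguish guardrail blocks from other 400s and surface the reasons to users or logs. A sketch over the parsed JSON body (SDK exception types are not shown, so this works on the raw dict):

```python
def guardrail_block_reasons(error_body: dict):
    """Return the policy-block reasons if this is a guardrail rejection, else None."""
    err = error_body.get("error", {})
    if err.get("type") == "content_policy_violation":
        return err.get("reasons", [])
    return None

# The error body from the example above:
body = {
    "error": {
        "message": "Content policy violation: input blocked by guardrail policy.",
        "type": "content_policy_violation",
        "param": "messages",
        "code": "content_filtered",
        "reasons": ["Topic 'gambling' matched (blocked)"],
        "policy_id": "gp_abc123",
    }
}
print(guardrail_block_reasons(body))  # ["Topic 'gambling' matched (blocked)"]
```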

Managing Policies via API

List all policies:

policies = client.guardrail_policies.list()
for policy in policies.data:
    print(f"{policy.id}: {policy.name} (enabled={policy.enabled})")

curl https://api.tensoras.ai/v1/guardrail-policies \
  -H "Authorization: Bearer tns_your_key_here"

Update a policy:

updated = client.guardrail_policies.update(
    "gp_abc123",
    enabled=False,
    block_on_flag=False,
)

curl -X PUT https://api.tensoras.ai/v1/guardrail-policies/gp_abc123 \
  -H "Authorization: Bearer tns_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{"enabled": false}'

Delete a policy:

client.guardrail_policies.delete("gp_abc123")

curl -X DELETE https://api.tensoras.ai/v1/guardrail-policies/gp_abc123 \
  -H "Authorization: Bearer tns_your_key_here"

Note: Only the first enabled policy for an organization is used for auto-moderation. Policies are evaluated in descending order of creation time.
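The selection rule in the note above can be expressed directly. This sketch works on plain policy dicts and assumes a created_at field for ordering (the field name is an assumption, not confirmed by this page):

```python
def active_policy(policies):
    """Pick the policy used for auto-moderation: the first enabled policy,
    evaluating newest-first (descending creation time)."""
    ordered = sorted(policies, key=lambda p: p["created_at"], reverse=True)
    return next((p for p in ordered if p["enabled"]), None)
```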

Best Practices

  • Start with warn mode — set block_on_flag=False initially to observe what gets flagged before enabling blocking. Monitor the X-Moderation-Warning headers to calibrate your thresholds.
  • Tune thresholds per category — the default threshold of 0.5 works for most use cases, but consider raising it for categories where your application has legitimate use of borderline content (e.g., a medical app may raise self_harm thresholds).
  • Use topic deny-lists for domain restrictions — keyword matching is faster and more predictable than score thresholds for enforcing specific business rules like competitor mentions or confidential topics.
  • Apply to both directions for sensitive applications — set apply_to: ["both"] to screen both user inputs and model outputs when working with sensitive data.
  • Monitor moderation hits in your audit logs — review which inputs and outputs are flagged regularly to detect emerging misuse patterns and adjust your policy accordingly.