
Content Moderation

Tensoras provides a content moderation API to detect harmful content in text, plus guardrail policies that automatically screen inference requests and responses before they reach the model or your users.

Moderation Endpoint

POST /v1/moderations — check text for harmful content across several categories.

Categories

Category     Description
harassment   Threatening, intimidating, or abusive language directed at individuals
hate         Content expressing hatred toward groups based on protected characteristics
self_harm    Content that promotes, glorifies, or depicts self-harm or suicide
sexual       Explicit or suggestive sexual content
violence     Content that depicts or promotes violence against people or animals

Each category returns both a boolean flagged value and a continuous score from 0 to 1. Higher scores indicate greater confidence that the content belongs to that category.

Python

from tensoras import Tensoras
 
client = Tensoras(api_key="tns_your_key_here")
 
response = client.moderations.create(
    input="I want to hurt someone.",
)
 
result = response.results[0]
print(f"Flagged: {result.flagged}")
print(f"Violence score: {result.category_scores.violence:.4f}")
print(f"Harassment score: {result.category_scores.harassment:.4f}")

Node.js

import Tensoras from "tensoras";
 
const client = new Tensoras({ apiKey: "tns_your_key_here" });
 
const response = await client.moderations.create({
  input: "I want to hurt someone.",
});
 
const result = response.results[0];
console.log(`Flagged: ${result.flagged}`);
console.log(`Violence score: ${result.category_scores.violence.toFixed(4)}`);
console.log(`Harassment score: ${result.category_scores.harassment.toFixed(4)}`);

curl

curl https://api.tensoras.ai/v1/moderations \
  -H "Authorization: Bearer tns_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "I want to hurt someone."
  }'

Response Format

{
  "id": "modr-abc123",
  "model": "text-moderation-latest",
  "results": [
    {
      "flagged": true,
      "categories": {
        "harassment": false,
        "hate": false,
        "self_harm": false,
        "sexual": false,
        "violence": true
      },
      "category_scores": {
        "harassment": 0.0421,
        "hate": 0.0089,
        "self_harm": 0.0031,
        "sexual": 0.0012,
        "violence": 0.9147
      }
    }
  ]
}

The top-level flagged field is true if any category exceeds its threshold. Individual category_scores values are floating-point probabilities between 0 and 1.
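To make the relationship between scores and the top-level flag concrete, here is a minimal sketch of the roll-up logic. The 0.5 cutoff is an assumption for illustration only; the moderation endpoint does not expose its per-category thresholds.

```python
def is_flagged(category_scores: dict, threshold: float = 0.5) -> bool:
    """Flag content if any category score meets or exceeds the cutoff."""
    return any(score >= threshold for score in category_scores.values())

# Scores from the example response above:
scores = {
    "harassment": 0.0421,
    "hate": 0.0089,
    "self_harm": 0.0031,
    "sexual": 0.0012,
    "violence": 0.9147,
}
print(is_flagged(scores))  # True: violence is well above the assumed cutoff
```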

You can also pass a list of strings to check multiple inputs in one request:

response = client.moderations.create(
    input=["Hello, how are you?", "I want to hurt someone."],
)
 
for i, result in enumerate(response.results):
    print(f"Input {i}: flagged={result.flagged}")

Guardrail Policies

Configure per-organization guardrail policies to automatically screen inference requests and responses. Policies are stored and evaluated server-side with minimal latency overhead (keyword and threshold checks add under 10ms).

Create a Policy

from tensoras import Tensoras
 
client = Tensoras(api_key="tns_your_key_here")
 
policy = client.guardrail_policies.create(
    name="Production Safety Policy",
    enabled=True,
    block_on_flag=True,
    thresholds={
        "harassment": 0.7,
        "hate": 0.6,
        "self_harm": 0.5,
        "sexual": 0.8,
        "violence": 0.7,
    },
    topic_deny_list=[
        {"topic": "competitor product", "action": "warn"},
        {"topic": "internal pricing", "action": "block"},
    ],
    apply_to=["input"],
)
 
print(f"Created policy: {policy.id}")

Category Thresholds

Control how sensitive the guardrail is for each moderation category. Scores that meet or exceed the threshold trigger the policy action.

Category     Default Threshold   Description
harassment   0.5                 Threatening or abusive language
hate         0.5                 Content expressing hatred toward groups
self_harm    0.5                 Self-harm or suicide-related content
sexual       0.5                 Explicit or suggestive sexual content
violence     0.5                 Violence-related content

Set a category threshold to null to use the system default (0.5). Set it to 1.0 to effectively disable that category’s check.
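The null and 1.0 semantics above can be pictured as a small resolution step. This is an illustrative sketch of that behavior, not SDK code:

```python
DEFAULT_THRESHOLD = 0.5

def effective_threshold(configured):
    """Resolve a configured threshold: None (JSON null) falls back to the default."""
    return DEFAULT_THRESHOLD if configured is None else configured

def category_triggers(score, configured):
    """A score triggers the policy action when it meets or exceeds the threshold.
    A threshold of 1.0 only triggers on a perfect 1.0 score, which effectively
    disables the check in practice."""
    return score >= effective_threshold(configured)
```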

Topic Deny-List

Block or warn on specific topics or phrases in user input:

policy = client.guardrail_policies.create(
    name="Domain Restriction Policy",
    topic_deny_list=[
        # Whole-word matching for single words
        {"topic": "gambling", "action": "block"},
        # Substring matching for multi-word phrases
        {"topic": "competitor pricing", "action": "warn"},
        # Warn but allow through
        {"topic": "alcohol", "action": "warn"},
    ],
    block_on_flag=True,
    apply_to=["input"],
)

Matching behavior:

  • Single words (no spaces): matched as whole words using word-boundary rules. The topic hate will match "I hate this" but not "my hatred of injustice".
  • Multi-word phrases (contains spaces): matched as case-insensitive substrings. The topic competitor pricing matches anywhere in the text.
  • All matching is case-insensitive.
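The matching rules above can be sketched in a few lines. This illustrates the documented behavior, not the server's actual implementation:

```python
import re

def topic_matches(topic: str, text: str) -> bool:
    """Approximate the deny-list matching rules described above."""
    if " " in topic:
        # Multi-word phrases: case-insensitive substring match.
        return topic.lower() in text.lower()
    # Single words: whole-word match using word-boundary rules.
    return re.search(rf"\b{re.escape(topic)}\b", text, re.IGNORECASE) is not None
```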

Actions:

Action   Behavior
block    If block_on_flag=True, the request is rejected with HTTP 400
warn     Always passes through but adds a moderation flag to the response

Auto-Moderation

When a guardrail policy is active, all inference requests for the organization are automatically screened. No changes to your inference code are required.

Input moderation — user messages are checked before reaching the model:

  • If flagged and block_on_flag=True: returns HTTP 400 with a content_policy_violation error.
  • If flagged and block_on_flag=False: returns the model response with an X-Moderation-Warning: flagged response header.

Output moderation — model responses are checked before returning to the caller:

  • Applies only to non-streaming responses (streaming output moderation is not supported).
  • If flagged and block_on_flag=True: replaces the response content with a filtered message and sets finish_reason: "content_filter".
  • If flagged and block_on_flag=False: returns the response with an X-Moderation-Warning: output_flagged header.
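Client-side, a response that was filtered in blocking mode can be detected by its finish_reason. A minimal check over the response in dict form, assuming a chat-completions-style choices array (the exact response shape is an assumption here):

```python
def was_output_filtered(response: dict) -> bool:
    """True if any choice was replaced by the guardrail in blocking mode."""
    return any(
        choice.get("finish_reason") == "content_filter"
        for choice in response.get("choices", [])
    )
```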

The apply_to field controls which direction is moderated:

Value         Description
["input"]     Check user messages only (default)
["output"]    Check model responses only
["both"]      Check both input and output

Blocked input example response:

{
  "error": {
    "message": "Content policy violation: input blocked by guardrail policy.",
    "type": "content_policy_violation",
    "param": "messages",
    "code": "content_filtered",
    "reasons": ["Topic 'gambling' matched (blocked)"],
    "policy_id": "gp_abc123"
  }
}
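When handling this error in application code, the structured fields let you distinguish guardrail blocks from other 400s and surface the reasons to users or logs. A sketch over the parsed JSON body (SDK exception types are not shown, so this works on the raw dict):

```python
def guardrail_block_reasons(error_body: dict):
    """Return the policy-block reasons if this is a guardrail rejection, else None."""
    err = error_body.get("error", {})
    if err.get("type") == "content_policy_violation":
        return err.get("reasons", [])
    return None

# The error body from the example above:
body = {
    "error": {
        "message": "Content policy violation: input blocked by guardrail policy.",
        "type": "content_policy_violation",
        "param": "messages",
        "code": "content_filtered",
        "reasons": ["Topic 'gambling' matched (blocked)"],
        "policy_id": "gp_abc123",
    }
}
print(guardrail_block_reasons(body))  # ["Topic 'gambling' matched (blocked)"]
```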

Managing Policies via API

List all policies:

policies = client.guardrail_policies.list()
for policy in policies.data:
    print(f"{policy.id}: {policy.name} (enabled={policy.enabled})")

curl https://api.tensoras.ai/v1/guardrail-policies \
  -H "Authorization: Bearer tns_your_key_here"

Update a policy:

updated = client.guardrail_policies.update(
    "gp_abc123",
    enabled=False,
    block_on_flag=False,
)

curl -X PUT https://api.tensoras.ai/v1/guardrail-policies/gp_abc123 \
  -H "Authorization: Bearer tns_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{"enabled": false}'

Delete a policy:

client.guardrail_policies.delete("gp_abc123")

curl -X DELETE https://api.tensoras.ai/v1/guardrail-policies/gp_abc123 \
  -H "Authorization: Bearer tns_your_key_here"

Note: Only the first enabled policy for an organization is used for auto-moderation. Policies are evaluated in descending order of creation time.
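The selection rule in the note above can be expressed directly. This sketch works on plain policy dicts and assumes a created_at field for ordering (the field name is an assumption, not confirmed by this page):

```python
def active_policy(policies):
    """Pick the policy used for auto-moderation: the first enabled policy,
    evaluating newest-first (descending creation time)."""
    ordered = sorted(policies, key=lambda p: p["created_at"], reverse=True)
    return next((p for p in ordered if p["enabled"]), None)
```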

Best Practices

  • Start with warn mode — set block_on_flag=False initially to observe what gets flagged before enabling blocking. Monitor the X-Moderation-Warning headers to calibrate your thresholds.
  • Tune thresholds per category — the default threshold of 0.5 works for most use cases, but consider raising it for categories where your application has legitimate use of borderline content (e.g., a medical app may raise self_harm thresholds).
  • Use topic deny-lists for domain restrictions — keyword matching is faster and more predictable than score thresholds for enforcing specific business rules like competitor mentions or confidential topics.
  • Apply to both directions for sensitive applications — set apply_to: ["both"] to screen both user inputs and model outputs when working with sensitive data.
  • Monitor moderation hits in your audit logs — review which inputs and outputs are flagged regularly to detect emerging misuse patterns and adjust your policy accordingly.