# Content Moderation
Tensoras provides a content moderation API to detect harmful content in text, plus guardrail policies that automatically screen inference requests and responses before they reach the model or your users.
## Moderation Endpoint
`POST /v1/moderations` — check text for harmful content across five categories.
### Categories
| Category | Description |
|---|---|
| `harassment` | Threatening, intimidating, or abusive language directed at individuals |
| `hate` | Content expressing hatred toward groups based on protected characteristics |
| `self_harm` | Content that promotes, glorifies, or depicts self-harm or suicide |
| `sexual` | Explicit or suggestive sexual content |
| `violence` | Content that depicts or promotes violence against people or animals |
Each category returns both a boolean `flagged` value and a continuous score from 0 to 1. Higher scores indicate greater confidence that the content belongs to that category.
#### Python

```python
from tensoras import Tensoras

client = Tensoras(api_key="tns_your_key_here")

response = client.moderations.create(
    input="I want to hurt someone.",
)

result = response.results[0]
print(f"Flagged: {result.flagged}")
print(f"Violence score: {result.category_scores.violence:.4f}")
print(f"Harassment score: {result.category_scores.harassment:.4f}")
```

#### Node.js
```javascript
import Tensoras from "tensoras";

const client = new Tensoras({ apiKey: "tns_your_key_here" });

const response = await client.moderations.create({
  input: "I want to hurt someone.",
});

const result = response.results[0];
console.log(`Flagged: ${result.flagged}`);
console.log(`Violence score: ${result.category_scores.violence.toFixed(4)}`);
console.log(`Harassment score: ${result.category_scores.harassment.toFixed(4)}`);
```

#### curl
```bash
curl https://api.tensoras.ai/v1/moderations \
  -H "Authorization: Bearer tns_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "I want to hurt someone."
  }'
```

### Response Format
```json
{
  "id": "modr-abc123",
  "model": "text-moderation-latest",
  "results": [
    {
      "flagged": true,
      "categories": {
        "harassment": false,
        "hate": false,
        "self_harm": false,
        "sexual": false,
        "violence": true
      },
      "category_scores": {
        "harassment": 0.0421,
        "hate": 0.0089,
        "self_harm": 0.0031,
        "sexual": 0.0012,
        "violence": 0.9147
      }
    }
  ]
}
```

The top-level `flagged` field is `true` if any category exceeds its threshold. Individual `category_scores` values are floating-point probabilities between 0 and 1.
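As a sketch, the scores in a response like the one above can be compared against per-category thresholds client-side. The `exceeded_categories` helper and the hard-coded 0.5 default below are illustrative assumptions, not part of the SDK:

```python
# Abridged sample response body from the moderation endpoint.
sample = {
    "flagged": True,
    "category_scores": {
        "harassment": 0.0421,
        "hate": 0.0089,
        "self_harm": 0.0031,
        "sexual": 0.0012,
        "violence": 0.9147,
    },
}

DEFAULT_THRESHOLD = 0.5  # assumed system default


def exceeded_categories(result, thresholds=None):
    """Return categories whose scores meet or exceed their threshold."""
    thresholds = thresholds or {}
    return sorted(
        cat
        for cat, score in result["category_scores"].items()
        if score >= thresholds.get(cat, DEFAULT_THRESHOLD)
    )


print(exceeded_categories(sample))                        # ['violence']
print(exceeded_categories(sample, {"harassment": 0.04}))  # ['harassment', 'violence']
```

Passing a threshold of `1.0` for a category effectively ignores it, mirroring the guardrail-policy behavior described later in this page.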
You can also pass a list of strings to check multiple inputs in one request:
```python
response = client.moderations.create(
    input=["Hello, how are you?", "I want to hurt someone."],
)

for i, result in enumerate(response.results):
    print(f"Input {i}: flagged={result.flagged}")
```

## Guardrail Policies
Configure per-organization guardrail policies to automatically screen inference requests and responses. Policies are stored and evaluated server-side with minimal latency overhead (keyword and threshold checks add under 10ms).
### Create a Policy
```python
from tensoras import Tensoras

client = Tensoras(api_key="tns_your_key_here")

policy = client.guardrail_policies.create(
    name="Production Safety Policy",
    enabled=True,
    block_on_flag=True,
    thresholds={
        "harassment": 0.7,
        "hate": 0.6,
        "self_harm": 0.5,
        "sexual": 0.8,
        "violence": 0.7,
    },
    topic_deny_list=[
        {"topic": "competitor product", "action": "warn"},
        {"topic": "internal pricing", "action": "block"},
    ],
    apply_to=["input"],
)
print(f"Created policy: {policy.id}")
```

### Category Thresholds
Control how sensitive the guardrail is for each moderation category. Scores that meet or exceed the threshold trigger the policy action.
| Category | Default Threshold | Description |
|---|---|---|
| `harassment` | 0.5 | Threatening or abusive language |
| `hate` | 0.5 | Content expressing hatred toward groups |
| `self_harm` | 0.5 | Self-harm or suicide-related content |
| `sexual` | 0.5 | Explicit or suggestive sexual content |
| `violence` | 0.5 | Violence-related content |
Set a category threshold to `null` to use the system default (0.5). Set it to `1.0` to effectively disable that category's check.
### Topic Deny-List
Block or warn on specific topics or phrases in user input:
```python
policy = client.guardrail_policies.create(
    name="Domain Restriction Policy",
    topic_deny_list=[
        # Whole-word matching for single words
        {"topic": "gambling", "action": "block"},
        # Substring matching for multi-word phrases
        {"topic": "competitor pricing", "action": "warn"},
        # Warn but allow through
        {"topic": "alcohol", "action": "warn"},
    ],
    block_on_flag=True,
    apply_to=["input"],
)
```

Matching behavior:
- Single words (no spaces): matched as whole words using word-boundary rules. The topic `hate` will match `"I hate this"` but not `"my hatred of injustice"`.
- Multi-word phrases (contains spaces): matched as case-insensitive substrings. The topic `competitor pricing` matches anywhere in the text.
- All matching is case-insensitive.
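The matching rules above can be sketched in a few lines of Python. This mirrors the documented behavior; it is not the server-side implementation:

```python
import re


def topic_matches(topic: str, text: str) -> bool:
    """Deny-list matching: whole words for single tokens, substrings for phrases."""
    if " " in topic:
        # Multi-word phrase: case-insensitive substring match.
        return topic.lower() in text.lower()
    # Single word: whole-word match using word-boundary rules.
    return re.search(rf"\b{re.escape(topic)}\b", text, re.IGNORECASE) is not None


print(topic_matches("hate", "I hate this"))             # True
print(topic_matches("hate", "my hatred of injustice"))  # False
print(topic_matches("competitor pricing", "COMPETITOR PRICING details"))  # True
```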
Actions:
| Action | Behavior |
|---|---|
| `block` | If `block_on_flag=True`, the request is rejected with HTTP 400 |
| `warn` | Always passes through but adds a moderation flag to the response |
### Auto-Moderation
When a guardrail policy is active, all inference requests for the organization are automatically screened. No changes to your inference code are required.
**Input moderation** — user messages are checked before reaching the model:
- If flagged and `block_on_flag=True`: returns HTTP 400 with a `content_policy_violation` error.
- If flagged and `block_on_flag=False`: returns the model response with an `X-Moderation-Warning: flagged` response header.
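When blocking is enabled, a client can inspect the error body to surface why a request was rejected. This is a sketch against the documented error shape; the HTTP handling and exact SDK exception class are up to your client:

```python
def blocked_reasons(error_body: dict) -> list:
    """Extract guardrail reasons from a content_policy_violation error body."""
    err = error_body.get("error", {})
    if err.get("type") == "content_policy_violation":
        return err.get("reasons", [])
    return []


body = {
    "error": {
        "type": "content_policy_violation",
        "code": "content_filtered",
        "reasons": ["Topic 'gambling' matched (blocked)"],
    }
}
print(blocked_reasons(body))  # ["Topic 'gambling' matched (blocked)"]
```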
**Output moderation** — model responses are checked before returning to the caller:
- Applies only to non-streaming responses (streaming output moderation is not supported).
- If flagged and `block_on_flag=True`: replaces the response content with a filtered message and sets `finish_reason: "content_filter"`.
- If flagged and `block_on_flag=False`: returns the response with an `X-Moderation-Warning: output_flagged` header.
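A caller can detect filtered output by checking `finish_reason`. The sketch below assumes an OpenAI-style `choices` array in the completion payload; only the `finish_reason: "content_filter"` value is taken from this documentation:

```python
def was_filtered(completion: dict) -> bool:
    """True if any choice was replaced by output moderation."""
    return any(
        choice.get("finish_reason") == "content_filter"
        for choice in completion.get("choices", [])
    )


completion = {
    "choices": [
        {
            "message": {"content": "[content removed by guardrail policy]"},
            "finish_reason": "content_filter",
        }
    ]
}
print(was_filtered(completion))  # True
```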
The `apply_to` field controls which direction is moderated:
| Value | Description |
|---|---|
| `["input"]` | Check user messages only (default) |
| `["output"]` | Check model responses only |
| `["both"]` | Check both input and output |
Blocked input example response:
```json
{
  "error": {
    "message": "Content policy violation: input blocked by guardrail policy.",
    "type": "content_policy_violation",
    "param": "messages",
    "code": "content_filtered",
    "reasons": ["Topic 'gambling' matched (blocked)"],
    "policy_id": "gp_abc123"
  }
}
```

### Managing Policies via API
List all policies:
```python
policies = client.guardrail_policies.list()
for policy in policies.data:
    print(f"{policy.id}: {policy.name} (enabled={policy.enabled})")
```

```bash
curl https://api.tensoras.ai/v1/guardrail-policies \
  -H "Authorization: Bearer tns_your_key_here"
```

Update a policy:
```python
updated = client.guardrail_policies.update(
    "gp_abc123",
    enabled=False,
    block_on_flag=False,
)
```

```bash
curl -X PUT https://api.tensoras.ai/v1/guardrail-policies/gp_abc123 \
  -H "Authorization: Bearer tns_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{"enabled": false}'
```

Delete a policy:
```python
client.guardrail_policies.delete("gp_abc123")
```

```bash
curl -X DELETE https://api.tensoras.ai/v1/guardrail-policies/gp_abc123 \
  -H "Authorization: Bearer tns_your_key_here"
```

Note: Only the first enabled policy for an organization is used for auto-moderation. Policies are evaluated in descending order of creation time.
## Best Practices
- **Start with warn mode** — set `block_on_flag=False` initially to observe what gets flagged before enabling blocking. Monitor the `X-Moderation-Warning` headers to calibrate your thresholds.
- **Tune thresholds per category** — the default threshold of 0.5 works for most use cases, but consider raising it for categories where your application has legitimate use of borderline content (e.g., a medical app may raise the `self_harm` threshold).
- **Use topic deny-lists for domain restrictions** — keyword matching is faster and more predictable than score thresholds for enforcing specific business rules like competitor mentions or confidential topics.
- **Apply to both directions for sensitive applications** — set `apply_to=["both"]` to screen both user inputs and model outputs when working with sensitive data.
- **Monitor moderation hits in your audit logs** — regularly review which inputs and outputs are flagged to detect emerging misuse patterns and adjust your policy accordingly.
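To support the warn-mode calibration suggested above, a client can log every response that carries the moderation header. The `headers` dict below stands in for whatever your HTTP client exposes (HTTP header names are case-insensitive, so normalize before looking up):

```python
def moderation_warning(headers: dict):
    """Return the X-Moderation-Warning value, if present (case-insensitive)."""
    normalized = {k.lower(): v for k, v in headers.items()}
    return normalized.get("x-moderation-warning")


headers = {
    "Content-Type": "application/json",
    "X-Moderation-Warning": "flagged",
}
print(moderation_warning(headers))  # flagged
```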