# S3 Connector
The S3 connector lets you use an Amazon S3 bucket as a data source for a Tensoras Knowledge Base. Documents in the bucket are automatically ingested, chunked, embedded, and indexed so you can query them via RAG.
## Overview
The S3 connector supports:
- Automatic ingestion of files from any S3 bucket or prefix
- Incremental sync — only new or modified files are re-ingested on subsequent syncs
- Scheduled sync — set a recurring schedule to keep your Knowledge Base up to date
- Multiple file types — PDF, TXT, MD, DOCX, HTML, CSV, and JSON
## Supported File Types
| Extension | Format |
|---|---|
| .pdf | PDF documents |
| .txt | Plain text |
| .md | Markdown |
| .docx | Microsoft Word |
| .html | HTML pages |
| .csv | Comma-separated values |
| .json | JSON documents |
Files with unsupported extensions are skipped during ingestion.
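If you want to preview which objects a sync would pick up, the extension check is easy to reproduce locally. This is an illustrative sketch (not part of the Tensoras SDK; `is_ingestible` is a hypothetical helper):

```python
from os.path import splitext

# Mirrors the supported-extension table above.
SUPPORTED_EXTENSIONS = {".pdf", ".txt", ".md", ".docx", ".html", ".csv", ".json"}

def is_ingestible(key: str) -> bool:
    """Return True if an S3 object key has a supported file extension."""
    ext = splitext(key)[1].lower()
    return ext in SUPPORTED_EXTENSIONS

keys = ["docs/policy.pdf", "docs/readme.md", "assets/logo.png", "data/export.CSV"]
ingestible = [k for k in keys if is_ingestible(k)]
print(ingestible)  # ['docs/policy.pdf', 'docs/readme.md', 'data/export.CSV']
```

Note that the comparison is case-insensitive, so `.CSV` and `.csv` are treated the same.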
## Setup
### Step 1: Create a Knowledge Base
If you do not already have a Knowledge Base, create one first:
```python
from tensoras import Tensoras

client = Tensoras()

kb = client.knowledge_bases.create(
    name="company-docs",
    description="Internal company documentation from S3",
)
print(kb.id)  # e.g. "kb_a1b2c3d4"
```

```typescript
import Tensoras from "tensoras";

const client = new Tensoras();

const kb = await client.knowledgeBases.create({
  name: "company-docs",
  description: "Internal company documentation from S3",
});
console.log(kb.id); // e.g. "kb_a1b2c3d4"
```

### Step 2: Add the S3 Data Source
Connect your S3 bucket by providing the bucket name, an optional prefix (folder path), and AWS credentials:
```python
data_source = client.knowledge_bases.data_sources.create(
    knowledge_base_id=kb.id,
    type="s3",
    config={
        "bucket": "my-company-docs",
        "prefix": "knowledge-base/",  # optional, limits to a folder
        "region": "us-east-1",
        "access_key_id": "AKIA...",
        "secret_access_key": "wJalr...",
    },
)
print(data_source.id)  # e.g. "ds_x1y2z3"
```

```typescript
const dataSource = await client.knowledgeBases.dataSources.create({
  knowledgeBaseId: kb.id,
  type: "s3",
  config: {
    bucket: "my-company-docs",
    prefix: "knowledge-base/", // optional, limits to a folder
    region: "us-east-1",
    accessKeyId: "AKIA...",
    secretAccessKey: "wJalr...",
  },
});
console.log(dataSource.id); // e.g. "ds_x1y2z3"
```

```shell
curl -X POST https://api.tensoras.ai/v1/knowledge_bases/kb_a1b2c3d4/data_sources \
  -H "Authorization: Bearer $TENSORAS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "s3",
    "config": {
      "bucket": "my-company-docs",
      "prefix": "knowledge-base/",
      "region": "us-east-1",
      "access_key_id": "AKIA...",
      "secret_access_key": "wJalr..."
    }
  }'
```

Security: AWS credentials are encrypted at rest. For improved security, use an IAM user or role with minimal permissions (see IAM Permissions below) and rotate credentials regularly.
### Step 3: Trigger a Sync
After creating the data source, trigger the initial sync to start ingesting files:
```python
job = client.knowledge_bases.data_sources.sync(
    knowledge_base_id=kb.id,
    data_source_id=data_source.id,
)
print(job.id)      # e.g. "job_m1n2o3"
print(job.status)  # "processing"
```

```typescript
const job = await client.knowledgeBases.dataSources.sync({
  knowledgeBaseId: kb.id,
  dataSourceId: dataSource.id,
});
console.log(job.id);     // e.g. "job_m1n2o3"
console.log(job.status); // "processing"
```

### Step 4: Monitor the Ingestion Job
Poll the ingestion job status or check it in the Console:
```python
import time

while True:
    job = client.ingestion_jobs.retrieve(job_id=job.id)
    print(f"Status: {job.status}, Documents: {job.documents_processed}/{job.documents_total}")
    if job.status in ("completed", "failed"):
        break
    time.sleep(5)
```

```typescript
let status = "processing";
while (status === "processing") {
  const updatedJob = await client.ingestionJobs.retrieve({ jobId: job.id });
  console.log(`Status: ${updatedJob.status}, Documents: ${updatedJob.documentsProcessed}/${updatedJob.documentsTotal}`);
  status = updatedJob.status;
  if (status === "processing") {
    await new Promise((resolve) => setTimeout(resolve, 5000));
  }
}
```

You can also monitor ingestion jobs in Console > Knowledge Bases > [Your KB] > Ingestion Jobs.
### Step 5: Query with RAG
Once ingestion is complete, query your Knowledge Base in a chat completion:
```python
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "user", "content": "What is our vacation policy?"},
    ],
    knowledge_bases=[kb.id],
)
print(response.choices[0].message.content)
```

## Incremental Sync
After the initial full sync, subsequent syncs are incremental:
- New files in the bucket/prefix are ingested and added to the index.
- Modified files (detected by S3 ETag or LastModified timestamp) are re-ingested and their old chunks are replaced.
- Deleted files are removed from the index.
This means syncing a bucket with 10,000 files where only 5 have changed will process only those 5 files.
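One way to picture the incremental logic is as a diff between two snapshots of the bucket. The sketch below is illustrative only, not the connector's actual implementation (which also considers `LastModified`); it compares the ETags recorded at the previous sync against the current listing:

```python
def plan_sync(previous: dict[str, str], current: dict[str, str]):
    """Diff two {key: etag} snapshots into (new, modified, deleted) key lists.

    Illustrative sketch of incremental sync, not the real connector logic.
    """
    new = [k for k in current if k not in previous]
    modified = [k for k in current if k in previous and current[k] != previous[k]]
    deleted = [k for k in previous if k not in current]
    return new, modified, deleted

previous = {"a.pdf": "etag-1", "b.md": "etag-2", "c.txt": "etag-3"}
current = {"a.pdf": "etag-1", "b.md": "etag-9", "d.csv": "etag-4"}
new, modified, deleted = plan_sync(previous, current)
print(new, modified, deleted)  # ['d.csv'] ['b.md'] ['c.txt']
```

Only `d.csv` (new) and `b.md` (changed ETag) would be processed; `c.txt` would be removed from the index, and the unchanged `a.pdf` is skipped entirely.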
## Scheduled Sync
Set up automatic periodic syncs to keep your Knowledge Base current:
```python
client.knowledge_bases.data_sources.update(
    knowledge_base_id=kb.id,
    data_source_id=data_source.id,
    sync_schedule="0 2 * * *",  # cron expression: daily at 2:00 AM UTC
)
```

```typescript
await client.knowledgeBases.dataSources.update({
  knowledgeBaseId: kb.id,
  dataSourceId: dataSource.id,
  syncSchedule: "0 2 * * *", // cron expression: daily at 2:00 AM UTC
});
```

Common cron schedules:
| Schedule | Cron Expression |
|---|---|
| Every hour | `0 * * * *` |
| Every 6 hours | `0 */6 * * *` |
| Daily at midnight UTC | `0 0 * * *` |
| Weekly on Sunday at 3 AM UTC | `0 3 * * 0` |
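To make the field syntax concrete: a cron expression has five fields (minute, hour, day-of-month, month, day-of-week), where `*` means "every value" and `*/n` means "every n-th value". This sketch expands a single field for the small subset of the syntax used in the table above (a hypothetical helper, not part of the SDK):

```python
def expand_field(field: str, lo: int, hi: int) -> list[int]:
    """Expand one cron field into its matching values.

    Handles only the subset used above: '*', '*/n', and a plain number.
    """
    if field == "*":
        return list(range(lo, hi + 1))
    if field.startswith("*/"):
        step = int(field[2:])
        return list(range(lo, hi + 1, step))
    return [int(field)]

# "0 */6 * * *" fires at minute 0 of hours 0, 6, 12, 18
print(expand_field("*/6", 0, 23))  # [0, 6, 12, 18]
print(expand_field("0", 0, 59))    # [0]
```

So "Every 6 hours" (`0 */6 * * *`) runs at 00:00, 06:00, 12:00, and 18:00 UTC.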
## IAM Permissions
Create an IAM user or role with the minimum permissions required for the S3 connector:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::my-company-docs",
        "arn:aws:s3:::my-company-docs/*"
      ]
    }
  ]
}
```

| Permission | Purpose |
|---|---|
| `s3:ListBucket` | List objects in the bucket to discover files |
| `s3:GetObject` | Download file contents for ingestion |
| `s3:GetBucketLocation` | Determine the bucket's region |

Best practice: Scope the `Resource` to the specific bucket and prefix you are syncing. Avoid using `s3:*` or `Resource: "*"`.
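If you sync only a prefix, the policy can be tightened further. The following variant is an illustrative sketch for the `my-company-docs` bucket and `knowledge-base/` prefix used earlier: it restricts `s3:ListBucket` with an `s3:prefix` condition and limits `s3:GetObject` to objects under that prefix:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::my-company-docs",
      "Condition": {
        "StringLike": { "s3:prefix": ["knowledge-base/*"] }
      }
    },
    {
      "Effect": "Allow",
      "Action": "s3:GetBucketLocation",
      "Resource": "arn:aws:s3:::my-company-docs"
    },
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-company-docs/knowledge-base/*"
    }
  ]
}
```

`s3:GetBucketLocation` is kept in its own statement because the `s3:prefix` condition key applies only to `ListBucket` requests.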
## Troubleshooting
| Problem | Solution |
|---|---|
| "Access Denied" during sync | Verify the IAM credentials have `s3:GetObject` and `s3:ListBucket` on the correct bucket ARN |
| No files found | Check that the prefix matches the folder structure in your bucket (include trailing /) |
| Files skipped | Ensure files have a supported extension (.pdf, .txt, .md, .docx, .html, .csv, .json) |
| Sync stuck in “processing” | Large buckets may take time; check the job’s documents_processed count to confirm progress |
## Next Steps
- Database Connector — connect PostgreSQL or MySQL as a data source
- RAG Best Practices — optimize chunking, search, and retrieval
- Knowledge Bases API — full API reference
- Ingestion Jobs API — monitor and manage ingestion