
S3 Connector

The S3 connector lets you use an Amazon S3 bucket as a data source for a Tensoras Knowledge Base. Documents in the bucket are automatically ingested, chunked, embedded, and indexed so you can query them via RAG.

Overview

The S3 connector supports:

  • Automatic ingestion of files from any S3 bucket or prefix
  • Incremental sync — subsequent syncs re-ingest only new or modified files and remove deleted ones
  • Scheduled sync — set a recurring schedule to keep your Knowledge Base up to date
  • Multiple file types — PDF, TXT, MD, DOCX, HTML, CSV, and JSON

Supported File Types

| Extension | Format                 |
| --------- | ---------------------- |
| `.pdf`    | PDF documents          |
| `.txt`    | Plain text             |
| `.md`     | Markdown               |
| `.docx`   | Microsoft Word         |
| `.html`   | HTML pages             |
| `.csv`    | Comma-separated values |
| `.json`   | JSON documents         |

Files with unsupported extensions are skipped during ingestion.
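The skip rule amounts to a simple extension check. As a minimal illustration (`SUPPORTED_EXTENSIONS` and `is_supported` are illustrative names, not part of the Tensoras SDK):

```python
import os

# Extensions the connector ingests; everything else is skipped.
SUPPORTED_EXTENSIONS = {".pdf", ".txt", ".md", ".docx", ".html", ".csv", ".json"}

def is_supported(key: str) -> bool:
    """Return True if an S3 object key ends in a supported extension."""
    _, ext = os.path.splitext(key)
    return ext.lower() in SUPPORTED_EXTENSIONS
```

Note that the comparison is case-insensitive, so `report.PDF` is ingested just like `report.pdf`.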

Setup

Step 1: Create a Knowledge Base

If you do not already have a Knowledge Base, create one first:

Python:

```python
from tensoras import Tensoras

client = Tensoras()

kb = client.knowledge_bases.create(
    name="company-docs",
    description="Internal company documentation from S3",
)

print(kb.id)  # e.g. "kb_a1b2c3d4"
```

TypeScript:

```typescript
import Tensoras from "tensoras";

const client = new Tensoras();

const kb = await client.knowledgeBases.create({
  name: "company-docs",
  description: "Internal company documentation from S3",
});

console.log(kb.id); // e.g. "kb_a1b2c3d4"
```

Step 2: Add the S3 Data Source

Connect your S3 bucket by providing the bucket name, an optional prefix (folder path), and AWS credentials:

Python:

```python
data_source = client.knowledge_bases.data_sources.create(
    knowledge_base_id=kb.id,
    type="s3",
    config={
        "bucket": "my-company-docs",
        "prefix": "knowledge-base/",        # optional, limits to a folder
        "region": "us-east-1",
        "access_key_id": "AKIA...",
        "secret_access_key": "wJalr...",
    },
)

print(data_source.id)  # e.g. "ds_x1y2z3"
```

TypeScript:

```typescript
const dataSource = await client.knowledgeBases.dataSources.create({
  knowledgeBaseId: kb.id,
  type: "s3",
  config: {
    bucket: "my-company-docs",
    prefix: "knowledge-base/",        // optional, limits to a folder
    region: "us-east-1",
    accessKeyId: "AKIA...",
    secretAccessKey: "wJalr...",
  },
});

console.log(dataSource.id); // e.g. "ds_x1y2z3"
```

cURL:

```shell
curl -X POST https://api.tensoras.ai/v1/knowledge_bases/kb_a1b2c3d4/data_sources \
  -H "Authorization: Bearer $TENSORAS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "s3",
    "config": {
      "bucket": "my-company-docs",
      "prefix": "knowledge-base/",
      "region": "us-east-1",
      "access_key_id": "AKIA...",
      "secret_access_key": "wJalr..."
    }
  }'
```

Security: AWS credentials are encrypted at rest. For improved security, use an IAM role with minimal permissions (see IAM Permissions below) and rotate credentials regularly.

Step 3: Trigger a Sync

After creating the data source, trigger the initial sync to start ingesting files:

Python:

```python
job = client.knowledge_bases.data_sources.sync(
    knowledge_base_id=kb.id,
    data_source_id=data_source.id,
)

print(job.id)      # e.g. "job_m1n2o3"
print(job.status)  # "processing"
```

TypeScript:

```typescript
const job = await client.knowledgeBases.dataSources.sync({
  knowledgeBaseId: kb.id,
  dataSourceId: dataSource.id,
});

console.log(job.id);     // e.g. "job_m1n2o3"
console.log(job.status); // "processing"
```

Step 4: Monitor the Ingestion Job

Poll the ingestion job status or check it in the Console:

Python:

```python
import time

while True:
    job = client.ingestion_jobs.retrieve(job_id=job.id)
    print(f"Status: {job.status}, Documents: {job.documents_processed}/{job.documents_total}")

    if job.status in ("completed", "failed"):
        break

    time.sleep(5)
```

TypeScript:

```typescript
let status = "processing";
while (status === "processing") {
  const updatedJob = await client.ingestionJobs.retrieve({ jobId: job.id });
  console.log(`Status: ${updatedJob.status}, Documents: ${updatedJob.documentsProcessed}/${updatedJob.documentsTotal}`);
  status = updatedJob.status;

  if (status === "processing") {
    await new Promise((resolve) => setTimeout(resolve, 5000));
  }
}
```
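If several places in your pipeline poll jobs this way, the loop generalizes to a small helper with a timeout guard. A sketch — `poll_until_terminal` is not part of the SDK; pass it any zero-argument callable that fetches the current status:

```python
import time

def poll_until_terminal(fetch_status, interval=5.0, timeout=600.0):
    """Poll fetch_status() until it returns a terminal status or the timeout expires.

    fetch_status: zero-argument callable returning the job's status string,
                  e.g. lambda: client.ingestion_jobs.retrieve(job_id=job_id).status
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in ("completed", "failed"):
            return status
        time.sleep(interval)
    raise TimeoutError("ingestion job did not reach a terminal status in time")
```

The timeout keeps a stalled job from blocking your pipeline indefinitely; tune `interval` and `timeout` to your bucket size.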

You can also monitor ingestion jobs in Console > Knowledge Bases > [Your KB] > Ingestion Jobs.

Step 5: Query with RAG

Once ingestion is complete, query your Knowledge Base in a chat completion:

Python:

```python
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "user", "content": "What is our vacation policy?"},
    ],
    knowledge_bases=[kb.id],
)

print(response.choices[0].message.content)
```

Incremental Sync

After the initial full sync, subsequent syncs are incremental:

  • New files in the bucket/prefix are ingested and added to the index.
  • Modified files (detected by S3 ETag or LastModified timestamp) are re-ingested and their old chunks are replaced.
  • Deleted files are removed from the index.

This means syncing a bucket with 10,000 files where only 5 have changed will process only those 5 files.
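The change detection can be pictured as a diff between two manifests, one recorded after the last sync and one built from listing the bucket now. A sketch (the manifest-diff framing is illustrative; `diff_manifest` is not part of the SDK):

```python
def diff_manifest(previous, current):
    """Classify changes between two {object_key: etag} manifests.

    previous: manifest recorded after the last sync
    current:  manifest from listing the bucket/prefix now
    """
    added = sorted(k for k in current if k not in previous)
    modified = sorted(k for k in current if k in previous and current[k] != previous[k])
    deleted = sorted(k for k in previous if k not in current)
    return added, modified, deleted
```

Only the `added` and `modified` keys need to be downloaded and re-ingested; `deleted` keys only require removing their chunks from the index.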

Scheduled Sync

Set up automatic periodic syncs to keep your Knowledge Base current:

Python:

```python
client.knowledge_bases.data_sources.update(
    knowledge_base_id=kb.id,
    data_source_id=data_source.id,
    sync_schedule="0 2 * * *",  # cron expression: daily at 2:00 AM UTC
)
```

TypeScript:

```typescript
await client.knowledgeBases.dataSources.update({
  knowledgeBaseId: kb.id,
  dataSourceId: dataSource.id,
  syncSchedule: "0 2 * * *", // cron expression: daily at 2:00 AM UTC
});
```

Common cron schedules:

| Schedule                     | Cron Expression |
| ---------------------------- | --------------- |
| Every hour                   | `0 * * * *`     |
| Every 6 hours                | `0 */6 * * *`   |
| Daily at midnight UTC        | `0 0 * * *`     |
| Weekly on Sunday at 3 AM UTC | `0 3 * * 0`     |
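To build an intuition for how these five fields (minute, hour, day, month, weekday) match a point in time, here is a minimal matcher for the subset of cron syntax used above — plain numbers, `*`, and `*/n` steps. Purely illustrative; `cron_matches` is a hypothetical helper, not part of the SDK, and Tensoras presumably validates schedules server-side:

```python
from datetime import datetime

def cron_matches(expr: str, dt: datetime) -> bool:
    """Check whether a UTC datetime matches a 5-field cron expression.

    Supports plain numbers, '*', and '*/n' steps (a subset of cron syntax).
    Weekday uses the cron convention: Sunday = 0.
    """
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError("expected 5 cron fields: minute hour day month weekday")
    values = [dt.minute, dt.hour, dt.day, dt.month, dt.isoweekday() % 7]
    for field, value in zip(fields, values):
        if field == "*":
            continue
        if field.startswith("*/"):
            if value % int(field[2:]) != 0:
                return False
        elif int(field) != value:
            return False
    return True
```

For instance, `"0 2 * * *"` matches any datetime whose minute is 0 and hour is 2, regardless of day, month, or weekday.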

IAM Permissions

Create an IAM user or role with the minimum permissions required for the S3 connector:

IAM Policy:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::my-company-docs",
        "arn:aws:s3:::my-company-docs/*"
      ]
    }
  ]
}
```
| Permission             | Purpose                                      |
| ---------------------- | -------------------------------------------- |
| `s3:ListBucket`        | List objects in the bucket to discover files |
| `s3:GetObject`         | Download file contents for ingestion         |
| `s3:GetBucketLocation` | Determine the bucket’s region                |

Best practice: Scope the Resource to the specific bucket and prefix you are syncing. Avoid using s3:* or Resource: "*".
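For example, a policy scoped to the `knowledge-base/` prefix used earlier might look like the following sketch (substitute your own bucket name and prefix; note that the `s3:prefix` condition key applies only to `ListBucket`, while `GetObject` is scoped via the object ARN):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::my-company-docs",
      "Condition": {
        "StringLike": { "s3:prefix": ["knowledge-base/*"] }
      }
    },
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-company-docs/knowledge-base/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:GetBucketLocation",
      "Resource": "arn:aws:s3:::my-company-docs"
    }
  ]
}
```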

Troubleshooting

| Problem                      | Solution                                                                                          |
| ---------------------------- | ------------------------------------------------------------------------------------------------- |
| "Access Denied" during sync  | Verify the IAM credentials have `s3:GetObject` and `s3:ListBucket` on the correct bucket ARN       |
| No files found               | Check that the `prefix` matches the folder structure in your bucket (include the trailing `/`)     |
| Files skipped                | Ensure files have a supported extension (`.pdf`, `.txt`, `.md`, `.docx`, `.html`, `.csv`, `.json`) |
| Sync stuck in "processing"   | Large buckets may take time; check the job's `documents_processed` count to confirm progress       |

Next Steps