Connectors
Connectors let you bring data from external systems into your Tensoras Knowledge Bases. Each connector type pulls documents from a specific source, handles format conversion, and feeds them into the ingestion pipeline for chunking, embedding, and indexing.
Available Connectors
| Connector | Type Value | Description |
|---|---|---|
| File Upload | file_upload | Upload files directly via the Files API |
| Web Crawl | web_crawl | Crawl a website starting from a seed URL |
| Amazon S3 | s3 | Sync files from an S3 bucket |
| Google Cloud Storage | gcs | Sync files from a GCS bucket |
| Confluence | confluence | Sync pages from Atlassian Confluence |
| Notion | notion | Sync pages from a Notion workspace |
| Google Drive | google_drive | Sync files from Google Drive |
File Upload
The simplest connector. Upload files using the Files API, then reference the file IDs when creating a data source.
Supported formats: PDF, DOCX, TXT, Markdown, HTML, CSV, JSON, PPTX, XLSX
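If you batch-upload from a local directory, it can help to screen files against this list first. A minimal sketch, assuming the extension set above; the helper itself is not part of the Tensoras SDK:

```python
from pathlib import Path

# Extensions for the supported formats listed above (assumed mapping:
# Markdown -> .md). This helper is illustrative, not an SDK feature.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".txt", ".md", ".html", ".csv", ".json", ".pptx", ".xlsx",
}

def is_supported(path: str) -> bool:
    """Return True if the file extension is in the supported set."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```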
```python
from tensoras import Tensoras

client = Tensoras(api_key="tns_your_key_here")

# Upload a file
file = client.files.create(
    file=open("product-manual.pdf", "rb"),
    purpose="knowledge_base",
)

# Add it to a Knowledge Base
data_source = client.knowledge_bases.data_sources.create(
    knowledge_base_id="kb_abc123",
    type="file_upload",
    file_ids=[file.id],
)
```

Web Crawl
Crawl a website starting from a seed URL. Tensoras follows links up to a configurable depth and ingests the page content.
```python
data_source = client.knowledge_bases.data_sources.create(
    knowledge_base_id="kb_abc123",
    type="web_crawl",
    config={
        "seed_url": "https://docs.example.com",
        "max_depth": 3,
        "max_pages": 500,
        "include_patterns": ["https://docs.example.com/*"],
        "exclude_patterns": ["*/changelog*", "*/archive*"],
    },
)
```

| Parameter | Default | Description |
|---|---|---|
| seed_url | Required | The starting URL for the crawl |
| max_depth | 3 | Maximum link depth from the seed URL |
| max_pages | 500 | Maximum number of pages to crawl |
| include_patterns | [] | Glob patterns; only crawl URLs matching these patterns |
| exclude_patterns | [] | Glob patterns; skip URLs matching these patterns |
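To see how the two pattern lists interact, here is a conceptual sketch of glob filtering using Python's standard `fnmatch`. The function name and exact semantics are assumptions for illustration, not the Tensoras crawler's implementation:

```python
from fnmatch import fnmatch

def should_crawl(url, include_patterns, exclude_patterns):
    """Return True if a URL passes the include/exclude glob filters."""
    # An empty include list means every discovered URL is eligible.
    if include_patterns and not any(fnmatch(url, p) for p in include_patterns):
        return False
    # Any matching exclude pattern vetoes the URL.
    return not any(fnmatch(url, p) for p in exclude_patterns)
```

With the config above, `https://docs.example.com/guides/intro` passes, while `https://docs.example.com/changelog/2024` is excluded.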
Amazon S3
Connect an S3 bucket to sync documents. You provide AWS credentials and optionally filter by prefix.
```python
data_source = client.knowledge_bases.data_sources.create(
    knowledge_base_id="kb_abc123",
    type="s3",
    config={
        "bucket": "my-company-docs",
        "prefix": "engineering/",
        "region": "us-east-1",
        "aws_access_key_id": "AKIA...",
        "aws_secret_access_key": "...",
    },
)
```

| Parameter | Default | Description |
|---|---|---|
| bucket | Required | S3 bucket name |
| prefix | "" | Only sync objects with this key prefix |
| region | "us-east-1" | AWS region |
| aws_access_key_id | Required | AWS access key |
| aws_secret_access_key | Required | AWS secret key |
See the S3 Connector Guide for detailed setup instructions including IAM policies.
Google Cloud Storage
Connect a GCS bucket to sync documents.
```python
data_source = client.knowledge_bases.data_sources.create(
    knowledge_base_id="kb_abc123",
    type="gcs",
    config={
        "bucket": "my-company-docs",
        "prefix": "engineering/",
        "service_account_json": "{...}",
    },
)
```

| Parameter | Default | Description |
|---|---|---|
| bucket | Required | GCS bucket name |
| prefix | "" | Only sync objects with this key prefix |
| service_account_json | Required | GCP service account credentials as a JSON string |
Confluence
Sync pages from Atlassian Confluence. You can filter by space key to sync specific spaces.
```python
data_source = client.knowledge_bases.data_sources.create(
    knowledge_base_id="kb_abc123",
    type="confluence",
    config={
        "url": "https://mycompany.atlassian.net",
        "username": "user@company.com",
        "api_token": "...",
        "space_keys": ["ENG", "PRODUCT"],
    },
)
```

| Parameter | Default | Description |
|---|---|---|
| url | Required | Confluence instance URL |
| username | Required | Confluence username (email) |
| api_token | Required | Confluence API token |
| space_keys | [] (all spaces) | List of space keys to sync |
Notion
Sync pages from a Notion workspace. Requires a Notion integration token.
```python
data_source = client.knowledge_bases.data_sources.create(
    knowledge_base_id="kb_abc123",
    type="notion",
    config={
        "api_token": "secret_...",
        "root_page_ids": ["page-id-1", "page-id-2"],
    },
)
```

| Parameter | Default | Description |
|---|---|---|
| api_token | Required | Notion integration token |
| root_page_ids | [] (all accessible pages) | List of root page IDs to sync (includes child pages) |
Google Drive
Sync files from Google Drive folders.
```python
data_source = client.knowledge_bases.data_sources.create(
    knowledge_base_id="kb_abc123",
    type="google_drive",
    config={
        "service_account_json": "{...}",
        "folder_ids": ["folder-id-1", "folder-id-2"],
    },
)
```

| Parameter | Default | Description |
|---|---|---|
| service_account_json | Required | GCP service account credentials as a JSON string |
| folder_ids | [] (all accessible files) | List of Google Drive folder IDs to sync |
Node.js Examples
All connectors work identically in Node.js. Here is the web crawl connector as an example:
```javascript
import Tensoras from "tensoras";

const client = new Tensoras({ apiKey: "tns_your_key_here" });

const dataSource = await client.knowledgeBases.dataSources.create({
  knowledgeBaseId: "kb_abc123",
  type: "web_crawl",
  config: {
    seedUrl: "https://docs.example.com",
    maxDepth: 3,
    maxPages: 500,
    includePatterns: ["https://docs.example.com/*"],
    excludePatterns: ["*/changelog*"],
  },
});
```

Sync Schedules
Connectors that pull from external systems (S3, GCS, Confluence, Notion, Google Drive, web crawl) support automatic sync schedules. When configured, Tensoras periodically re-syncs the data source, picking up new and updated documents.
```python
data_source = client.knowledge_bases.data_sources.create(
    knowledge_base_id="kb_abc123",
    type="s3",
    config={
        "bucket": "my-company-docs",
        "prefix": "engineering/",
        "aws_access_key_id": "AKIA...",
        "aws_secret_access_key": "...",
    },
    sync_schedule={
        "frequency": "daily",  # "hourly", "daily", or "weekly"
        "time": "02:00",       # UTC time for daily/weekly syncs
    },
)
```

You can also trigger a manual sync at any time:
```python
client.knowledge_bases.data_sources.sync(
    knowledge_base_id="kb_abc123",
    data_source_id=data_source.id,
)
```

Incremental Updates
Syncs are incremental by default. Tensoras tracks which documents have been ingested and only processes new or modified documents on subsequent syncs. Deleted documents are removed from the index.
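One common way to implement this kind of change detection is to keep a content hash per document and diff it between syncs. The sketch below is purely conceptual, under the assumption of hash-based tracking; it is not the actual Tensoras implementation:

```python
import hashlib

def diff_documents(previous_hashes, current_docs):
    """Classify documents as new, modified, or deleted between two syncs.

    previous_hashes: dict of doc_id -> content hash from the last sync.
    current_docs:    dict of doc_id -> raw content from the current sync.
    """
    current_hashes = {
        doc_id: hashlib.sha256(content.encode()).hexdigest()
        for doc_id, content in current_docs.items()
    }
    new = [d for d in current_hashes if d not in previous_hashes]
    modified = [
        d for d in current_hashes
        if d in previous_hashes and current_hashes[d] != previous_hashes[d]
    ]
    deleted = [d for d in previous_hashes if d not in current_hashes]
    return new, modified, deleted
```

Only the `new` and `modified` sets would need re-chunking and re-embedding; `deleted` entries would be dropped from the index.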
Status Tracking
Every sync creates an ingestion job that you can monitor:
```python
jobs = client.ingestion_jobs.list(
    knowledge_base_id="kb_abc123",
)

for job in jobs.data:
    print(f"Job: {job.id}")
    print(f"Status: {job.status}")  # queued, processing, completed, failed
    print(f"Documents: {job.documents_processed}/{job.documents_total}")
    print()
```

See the Ingestion Jobs API for full details on tracking ingestion progress.
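If you need to block until a sync finishes, a simple polling loop over the status field works. This is a hedged sketch: the `ingestion_jobs.retrieve` call is an assumption modeled on the list call above, so check the Ingestion Jobs API for the actual method name:

```python
import time

def wait_for_job(client, job_id, poll_interval=5.0, timeout=600.0):
    """Poll an ingestion job until it reaches a terminal status.

    Assumes a `client.ingestion_jobs.retrieve(job_id)` method returning an
    object with a `status` attribute; terminal statuses are "completed"
    and "failed" per the status values shown above.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = client.ingestion_jobs.retrieve(job_id)
        if job.status in ("completed", "failed"):
            return job
        time.sleep(poll_interval)
    raise TimeoutError(f"Job {job_id} did not finish within {timeout}s")
```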
Related
- RAG Overview — end-to-end RAG pipeline
- Chunking Strategies — how documents are split after ingestion
- Knowledge Bases API — create and manage Knowledge Bases
- Data Sources API — full data source endpoint reference
- Ingestion Jobs API — track ingestion progress
- Files API — upload files for file_upload connector
- S3 Connector Guide — detailed S3 setup instructions