Connectors
Connectors let you bring data from external systems into your Tensoras Knowledge Bases. Each connector type pulls documents from a specific source, handles format conversion, and feeds them into the ingestion pipeline for chunking, embedding, and indexing.
Available Connectors
| Connector | Type Value | Description |
|---|---|---|
| File Upload | file_upload | Upload files directly via the Files API |
| Web Crawl | web_crawl | Crawl a website starting from a seed URL |
| Amazon S3 | s3 | Sync files from an S3 bucket |
| Google Cloud Storage | gcs | Sync files from a GCS bucket |
| Confluence | confluence | Sync pages from Atlassian Confluence |
| Notion | notion | Sync pages from a Notion workspace |
| Google Drive | google_drive | Sync files from Google Drive |
File Upload
The simplest connector. Upload files using the Files API, then reference the file IDs when creating a data source.
Supported formats: PDF, DOCX, TXT, Markdown, HTML, CSV, JSON, PPTX, XLSX
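If you batch-upload from a local directory, it can help to screen files against this list first. A minimal sketch, assuming the extension set above; the helper itself is not part of the Tensoras SDK:

```python
from pathlib import Path

# Extensions for the supported formats listed above (assumed mapping:
# Markdown -> .md). This helper is illustrative, not an SDK feature.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".txt", ".md", ".html", ".csv", ".json", ".pptx", ".xlsx",
}

def is_supported(path: str) -> bool:
    """Return True if the file extension is in the supported set."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```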
```python
from tensoras import Tensoras

client = Tensoras(api_key="tns_your_key_here")

# Upload a file
file = client.files.create(
    file=open("product-manual.pdf", "rb"),
    purpose="knowledge_base",
)

# Add it to a Knowledge Base
data_source = client.knowledge_bases.data_sources.create(
    knowledge_base_id="kb_abc123",
    type="file_upload",
    file_ids=[file.id],
)
```

Web Crawl
Crawl a website starting from a seed URL. Tensoras follows links up to a configurable depth and ingests the page content.
```python
data_source = client.knowledge_bases.data_sources.create(
    knowledge_base_id="kb_abc123",
    type="web_crawl",
    config={
        "seed_url": "https://docs.example.com",
        "max_depth": 3,
        "max_pages": 500,
        "include_patterns": ["https://docs.example.com/*"],
        "exclude_patterns": ["*/changelog*", "*/archive*"],
    },
)
```

| Parameter | Default | Description |
|---|---|---|
| seed_url | Required | The starting URL for the crawl |
| max_depth | 3 | Maximum link depth from the seed URL |
| max_pages | 500 | Maximum number of pages to crawl |
| include_patterns | [] | Glob patterns; only crawl URLs matching these patterns |
| exclude_patterns | [] | Glob patterns; skip URLs matching these patterns |
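To see how the two pattern lists interact, here is a conceptual sketch of glob filtering using Python's standard `fnmatch`. The function name and exact semantics are assumptions for illustration, not the Tensoras crawler's implementation:

```python
from fnmatch import fnmatch

def should_crawl(url, include_patterns, exclude_patterns):
    """Return True if a URL passes the include/exclude glob filters."""
    # An empty include list means every discovered URL is eligible.
    if include_patterns and not any(fnmatch(url, p) for p in include_patterns):
        return False
    # Any matching exclude pattern vetoes the URL.
    return not any(fnmatch(url, p) for p in exclude_patterns)
```

With the config above, `https://docs.example.com/guides/intro` passes, while `https://docs.example.com/changelog/2024` is excluded.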
Amazon S3
Connect an S3 bucket to sync documents. You provide AWS credentials and optionally filter by prefix.
```python
data_source = client.knowledge_bases.data_sources.create(
    knowledge_base_id="kb_abc123",
    type="s3",
    config={
        "bucket": "my-company-docs",
        "prefix": "engineering/",
        "region": "us-east-1",
        "aws_access_key_id": "AKIA...",
        "aws_secret_access_key": "...",
    },
)
```

| Parameter | Default | Description |
|---|---|---|
| bucket | Required | S3 bucket name |
| prefix | "" | Only sync objects with this key prefix |
| region | "us-east-1" | AWS region |
| aws_access_key_id | Required | AWS access key |
| aws_secret_access_key | Required | AWS secret key |
See the S3 Connector Guide for detailed setup instructions including IAM policies.
Google Cloud Storage
Connect a GCS bucket to sync documents.
```python
data_source = client.knowledge_bases.data_sources.create(
    knowledge_base_id="kb_abc123",
    type="gcs",
    config={
        "bucket": "my-company-docs",
        "prefix": "engineering/",
        "service_account_json": "{...}",
    },
)
```

| Parameter | Default | Description |
|---|---|---|
| bucket | Required | GCS bucket name |
| prefix | "" | Only sync objects with this key prefix |
| service_account_json | Required | GCP service account credentials as a JSON string |
Confluence
Sync pages from Atlassian Confluence. You can filter by space key to sync specific spaces.
```python
data_source = client.knowledge_bases.data_sources.create(
    knowledge_base_id="kb_abc123",
    type="confluence",
    config={
        "url": "https://mycompany.atlassian.net",
        "username": "user@company.com",
        "api_token": "...",
        "space_keys": ["ENG", "PRODUCT"],
    },
)
```

| Parameter | Default | Description |
|---|---|---|
| url | Required | Confluence instance URL |
| username | Required | Confluence username (email) |
| api_token | Required | Confluence API token |
| space_keys | [] (all spaces) | List of space keys to sync |
Notion
Sync pages from a Notion workspace. Requires a Notion integration token.
```python
data_source = client.knowledge_bases.data_sources.create(
    knowledge_base_id="kb_abc123",
    type="notion",
    config={
        "api_token": "secret_...",
        "root_page_ids": ["page-id-1", "page-id-2"],
    },
)
```

| Parameter | Default | Description |
|---|---|---|
| api_token | Required | Notion integration token |
| root_page_ids | [] (all accessible pages) | List of root page IDs to sync (includes child pages) |
Google Drive
Sync files from Google Drive folders.
```python
data_source = client.knowledge_bases.data_sources.create(
    knowledge_base_id="kb_abc123",
    type="google_drive",
    config={
        "service_account_json": "{...}",
        "folder_ids": ["folder-id-1", "folder-id-2"],
    },
)
```

| Parameter | Default | Description |
|---|---|---|
| service_account_json | Required | GCP service account credentials as a JSON string |
| folder_ids | [] (all accessible files) | List of Google Drive folder IDs to sync |
Node.js Examples
All connectors work identically in Node.js. Here is the web crawl connector as an example:
```javascript
import Tensoras from "tensoras";

const client = new Tensoras({ apiKey: "tns_your_key_here" });

const dataSource = await client.knowledgeBases.dataSources.create({
  knowledgeBaseId: "kb_abc123",
  type: "web_crawl",
  config: {
    seedUrl: "https://docs.example.com",
    maxDepth: 3,
    maxPages: 500,
    includePatterns: ["https://docs.example.com/*"],
    excludePatterns: ["*/changelog*"],
  },
});
```

Sync Schedules
Connectors that pull from external systems (S3, GCS, Confluence, Notion, Google Drive, web crawl) support automatic sync schedules. When configured, Tensoras periodically re-syncs the data source, picking up new and updated documents.
```python
data_source = client.knowledge_bases.data_sources.create(
    knowledge_base_id="kb_abc123",
    type="s3",
    config={
        "bucket": "my-company-docs",
        "prefix": "engineering/",
        "aws_access_key_id": "AKIA...",
        "aws_secret_access_key": "...",
    },
    sync_schedule={
        "frequency": "daily",  # "hourly", "daily", or "weekly"
        "time": "02:00",       # UTC time for daily/weekly syncs
    },
)
```

You can also trigger a manual sync at any time:
```python
client.knowledge_bases.data_sources.sync(
    knowledge_base_id="kb_abc123",
    data_source_id=data_source.id,
)
```

Incremental Updates
Syncs are incremental by default. Tensoras tracks which documents have been ingested and only processes new or modified documents on subsequent syncs. Deleted documents are removed from the index.
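One common way to implement this kind of change detection is to keep a content hash per document and diff it between syncs. The sketch below is purely conceptual, under the assumption of hash-based tracking; it is not the actual Tensoras implementation:

```python
import hashlib

def diff_documents(previous_hashes, current_docs):
    """Classify documents as new, modified, or deleted between two syncs.

    previous_hashes: dict of doc_id -> content hash from the last sync.
    current_docs:    dict of doc_id -> raw content from the current sync.
    """
    current_hashes = {
        doc_id: hashlib.sha256(content.encode()).hexdigest()
        for doc_id, content in current_docs.items()
    }
    new = [d for d in current_hashes if d not in previous_hashes]
    modified = [
        d for d in current_hashes
        if d in previous_hashes and current_hashes[d] != previous_hashes[d]
    ]
    deleted = [d for d in previous_hashes if d not in current_hashes]
    return new, modified, deleted
```

Only the `new` and `modified` sets would need re-chunking and re-embedding; `deleted` entries would be dropped from the index.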
Status Tracking
Every sync creates an ingestion job that you can monitor:
```python
jobs = client.ingestion_jobs.list(
    knowledge_base_id="kb_abc123",
)

for job in jobs.data:
    print(f"Job: {job.id}")
    print(f"Status: {job.status}")  # queued, processing, completed, failed
    print(f"Documents: {job.documents_processed}/{job.documents_total}")
    print()
```

See the Ingestion Jobs API for full details on tracking ingestion progress.
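If you need to block until a sync finishes, a simple polling loop over the status field works. This is a hedged sketch: the `ingestion_jobs.retrieve` call is an assumption modeled on the list call above, so check the Ingestion Jobs API for the actual method name:

```python
import time

def wait_for_job(client, job_id, poll_interval=5.0, timeout=600.0):
    """Poll an ingestion job until it reaches a terminal status.

    Assumes a `client.ingestion_jobs.retrieve(job_id)` method returning an
    object with a `status` attribute; terminal statuses are "completed"
    and "failed" per the status values shown above.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = client.ingestion_jobs.retrieve(job_id)
        if job.status in ("completed", "failed"):
            return job
        time.sleep(poll_interval)
    raise TimeoutError(f"Job {job_id} did not finish within {timeout}s")
```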
Related
- RAG Overview — end-to-end RAG pipeline
- Chunking Strategies — how documents are split after ingestion
- Knowledge Bases API — create and manage Knowledge Bases
- Data Sources API — full data source endpoint reference
- Ingestion Jobs API — track ingestion progress
- Files API — upload files for file_upload connector
- S3 Connector Guide — detailed S3 setup instructions