
Connectors

Connectors let you bring data from external systems into your Tensoras Knowledge Bases. Each connector type pulls documents from a specific source, handles format conversion, and feeds them into the ingestion pipeline for chunking, embedding, and indexing.

Available Connectors

Connector               Type value      Description
File Upload             file_upload     Upload files directly via the Files API
Web Crawl               web_crawl       Crawl a website starting from a seed URL
Amazon S3               s3              Sync files from an S3 bucket
Google Cloud Storage    gcs             Sync files from a GCS bucket
Confluence              confluence      Sync pages from Atlassian Confluence
Notion                  notion          Sync pages from a Notion workspace
Google Drive            google_drive    Sync files from Google Drive

File Upload

The simplest connector. Upload files using the Files API, then reference the file IDs when creating a data source.

Supported formats: PDF, DOCX, TXT, Markdown, HTML, CSV, JSON, PPTX, XLSX

from tensoras import Tensoras
 
client = Tensoras(api_key="tns_your_key_here")
 
# Upload a file
file = client.files.create(
    file=open("product-manual.pdf", "rb"),
    purpose="knowledge_base",
)
 
# Add it to a Knowledge Base
data_source = client.knowledge_bases.data_sources.create(
    knowledge_base_id="kb_abc123",
    type="file_upload",
    file_ids=[file.id],
)

Web Crawl

Crawl a website starting from a seed URL. Tensoras follows links up to a configurable depth and ingests the page content.

data_source = client.knowledge_bases.data_sources.create(
    knowledge_base_id="kb_abc123",
    type="web_crawl",
    config={
        "seed_url": "https://docs.example.com",
        "max_depth": 3,
        "max_pages": 500,
        "include_patterns": ["https://docs.example.com/*"],
        "exclude_patterns": ["*/changelog*", "*/archive*"],
    },
)

Parameter           Default     Description
seed_url            Required    The starting URL for the crawl
max_depth           3           Maximum link depth from the seed URL
max_pages           500         Maximum number of pages to crawl
include_patterns    []          Glob patterns; only crawl URLs matching these patterns
exclude_patterns    []          Glob patterns; skip URLs matching these patterns
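The include/exclude patterns are shell-style globs. As a rough sketch of the filtering logic (using Python's fnmatch here; the exact matching rules Tensoras applies may differ), a URL is crawled only if it matches at least one include pattern (or the include list is empty) and matches no exclude pattern:

```python
from fnmatch import fnmatch

def url_allowed(url, include_patterns, exclude_patterns):
    """Return True if a URL passes the include/exclude glob filters."""
    if include_patterns and not any(fnmatch(url, p) for p in include_patterns):
        return False
    return not any(fnmatch(url, p) for p in exclude_patterns)

include = ["https://docs.example.com/*"]
exclude = ["*/changelog*", "*/archive*"]

url_allowed("https://docs.example.com/guides/setup", include, exclude)   # True
url_allowed("https://docs.example.com/changelog/2024", include, exclude) # False
url_allowed("https://blog.example.com/post", include, exclude)           # False
```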

Amazon S3

Connect an S3 bucket to sync documents. You provide AWS credentials and optionally filter by prefix.

data_source = client.knowledge_bases.data_sources.create(
    knowledge_base_id="kb_abc123",
    type="s3",
    config={
        "bucket": "my-company-docs",
        "prefix": "engineering/",
        "region": "us-east-1",
        "aws_access_key_id": "AKIA...",
        "aws_secret_access_key": "...",
    },
)

Parameter               Default         Description
bucket                  Required        S3 bucket name
prefix                  ""              Only sync objects with this key prefix
region                  "us-east-1"     AWS region
aws_access_key_id       Required        AWS access key
aws_secret_access_key   Required        AWS secret key

See the S3 Connector Guide for detailed setup instructions including IAM policies.
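The connector only needs read access to the bucket. A minimal IAM policy along these lines is typically sufficient (the bucket name is illustrative; confirm the exact required actions in the S3 Connector Guide):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-company-docs"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::my-company-docs/*"
    }
  ]
}
```

Note that ListBucket applies to the bucket ARN while GetObject applies to the object ARNs (the `/*` suffix); both are needed for a sync to enumerate and download documents.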

Google Cloud Storage

Connect a GCS bucket to sync documents.

data_source = client.knowledge_bases.data_sources.create(
    knowledge_base_id="kb_abc123",
    type="gcs",
    config={
        "bucket": "my-company-docs",
        "prefix": "engineering/",
        "service_account_json": "{...}",
    },
)

Parameter              Default     Description
bucket                 Required    GCS bucket name
prefix                 ""          Only sync objects with this key prefix
service_account_json   Required    GCP service account credentials as a JSON string
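Because service_account_json takes a JSON string rather than a file path, one way to supply it (a sketch; the key file name is illustrative) is to read the downloaded key file and validate it before passing it to config:

```python
import json

def load_service_account(path):
    """Read a service-account key file and return it as a compact JSON string."""
    with open(path) as f:
        creds = json.load(f)  # fails fast if the file is not valid JSON
    return json.dumps(creds)

# service_account_json = load_service_account("service-account.json")
```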

Confluence

Sync pages from Atlassian Confluence. You can filter by space key to sync specific spaces.

data_source = client.knowledge_bases.data_sources.create(
    knowledge_base_id="kb_abc123",
    type="confluence",
    config={
        "url": "https://mycompany.atlassian.net",
        "username": "user@company.com",
        "api_token": "...",
        "space_keys": ["ENG", "PRODUCT"],
    },
)

Parameter    Default           Description
url          Required          Confluence instance URL
username     Required          Confluence username (email)
api_token    Required          Confluence API token
space_keys   [] (all spaces)   List of space keys to sync

Notion

Sync pages from a Notion workspace. Requires a Notion integration token.

data_source = client.knowledge_bases.data_sources.create(
    knowledge_base_id="kb_abc123",
    type="notion",
    config={
        "api_token": "secret_...",
        "root_page_ids": ["page-id-1", "page-id-2"],
    },
)

Parameter       Default                     Description
api_token       Required                    Notion integration token
root_page_ids   [] (all accessible pages)   List of root page IDs to sync (child pages are included)

Google Drive

Sync files from Google Drive folders.

data_source = client.knowledge_bases.data_sources.create(
    knowledge_base_id="kb_abc123",
    type="google_drive",
    config={
        "service_account_json": "{...}",
        "folder_ids": ["folder-id-1", "folder-id-2"],
    },
)

Parameter              Default                     Description
service_account_json   Required                    GCP service account credentials as a JSON string
folder_ids             [] (all accessible files)   List of Google Drive folder IDs to sync

Node.js Examples

All connectors work identically in Node.js; note that config keys use camelCase rather than snake_case. Here is the web crawl connector as an example:

import Tensoras from "tensoras";
 
const client = new Tensoras({ apiKey: "tns_your_key_here" });
 
const dataSource = await client.knowledgeBases.dataSources.create({
  knowledgeBaseId: "kb_abc123",
  type: "web_crawl",
  config: {
    seedUrl: "https://docs.example.com",
    maxDepth: 3,
    maxPages: 500,
    includePatterns: ["https://docs.example.com/*"],
    excludePatterns: ["*/changelog*"],
  },
});

Sync Schedules

Connectors that pull from external systems (S3, GCS, Confluence, Notion, Google Drive, web crawl) support automatic sync schedules. When configured, Tensoras periodically re-syncs the data source, picking up new and updated documents.

data_source = client.knowledge_bases.data_sources.create(
    knowledge_base_id="kb_abc123",
    type="s3",
    config={
        "bucket": "my-company-docs",
        "prefix": "engineering/",
        "aws_access_key_id": "AKIA...",
        "aws_secret_access_key": "...",
    },
    sync_schedule={
        "frequency": "daily",    # "hourly", "daily", "weekly"
        "time": "02:00",         # UTC time for daily/weekly syncs
    },
)
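A schedule like {"frequency": "daily", "time": "02:00"} resolves to the next occurrence of that UTC time. A minimal sketch of that computation (daily only; this is not Tensoras code, and the weekly case would additionally need a day-of-week, which the schedule object above does not show):

```python
from datetime import datetime, timedelta, timezone

def next_daily_sync(now: datetime, time_str: str) -> datetime:
    """Next UTC datetime matching the HH:MM of a daily sync schedule."""
    hour, minute = map(int, time_str.split(":"))
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)  # today's slot already passed
    return candidate

now = datetime(2025, 3, 1, 14, 30, tzinfo=timezone.utc)
next_daily_sync(now, "02:00")  # 2025-03-02 02:00 UTC
```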

You can also trigger a manual sync at any time:

client.knowledge_bases.data_sources.sync(
    knowledge_base_id="kb_abc123",
    data_source_id=data_source.id,
)

Incremental Updates

Syncs are incremental by default. Tensoras tracks which documents have been ingested and only processes new or modified documents on subsequent syncs. Deleted documents are removed from the index.

Status Tracking

Every sync creates an ingestion job that you can monitor:

jobs = client.ingestion_jobs.list(
    knowledge_base_id="kb_abc123",
)
 
for job in jobs.data:
    print(f"Job: {job.id}")
    print(f"Status: {job.status}")         # queued, processing, completed, failed
    print(f"Documents: {job.documents_processed}/{job.documents_total}")
    print()

See the Ingestion Jobs API for full details on tracking ingestion progress.
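If you need to block until a sync finishes, a polling loop over the statuses above works. A sketch (the helper takes any callable returning job objects with a status field, so it is not tied to the SDK; the poll interval and timeout are illustrative):

```python
import time

TERMINAL_STATUSES = {"completed", "failed"}

def wait_for_jobs(list_jobs, interval=5.0, timeout=600.0):
    """Poll until every ingestion job reaches a terminal status, then return them."""
    deadline = time.monotonic() + timeout
    while True:
        jobs = list_jobs()
        if all(job.status in TERMINAL_STATUSES for job in jobs):
            return jobs
        if time.monotonic() > deadline:
            raise TimeoutError("ingestion jobs did not finish in time")
        time.sleep(interval)

# Usage with the client shown above:
# jobs = wait_for_jobs(lambda: client.ingestion_jobs.list(knowledge_base_id="kb_abc123").data)
```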