tractorbeamai/snowflake-reducto

Reducto Document Parsing Pipeline

A Snowflake-native pipeline that automatically ingests documents from an external stage, parses them with Reducto, and surfaces the raw chunks plus a full-document markdown view.

Everything runs inside Snowflake — no external orchestrator, no dbt, no Python environment on your machine.

What this project does

  1. Detects new PDF uploads on a Snowflake external stage via a directory table stream.
  2. Sends each file to Reducto for async document parsing (via Python UDFs with external network access).
  3. Polls Reducto for completed results on a 5-minute schedule.
  4. Stores raw parse results (chunks, metadata, usage) in collected_results.
  5. Provides a document_markdown_v view that assembles chunks into full-document markdown text.
  6. Alerts on stale jobs and logs warnings to a pipeline_warnings table.

How it works

S3 bucket
  └── External stage (directory table, auto-refresh)
        └── Stream (new_documents_stream)
              └── Task: submit_task (triggered by stream)
                    └── submitted_jobs table

submitted_jobs
  └── Task: collect_task (every 5 min, polls Reducto API)
        └── collected_results table
              └── View: document_markdown_v (full-doc markdown)

submitted_jobs + collected_results
  └── Alert: stale_pending_jobs_alert (hourly)
        └── pipeline_warnings table

Prerequisites

  • A Snowflake account with ACCOUNTADMIN or equivalent privileges (needed for creating network rules, secrets, integrations, UDFs, and tasks).
  • A Reducto API key — sign up at reducto.ai and generate one.
  • An S3 bucket (or other cloud storage) that Snowflake can access as an external stage, where you will upload PDFs.
  • An existing external stage in Snowflake pointing to that bucket.

Step 1 — Configure setup.sql

Open setup.sql in a Snowflake worksheet (or your SQL editor of choice).

Search for >>> CONFIGURE comments and replace the placeholder values:

| Placeholder | What to set |
| --- | --- |
| Database name | The database to create objects in (default: raw) |
| Schema name | The schema to use (default: documents) |
| SECRET_STRING | Your Reducto API key (or update it after running; see Step 3) |
| Stage name | Your external stage name if different from raw.documents.document_stage |

Step 2 — Run the setup script

Select all and run setup.sql in a Snowflake worksheet. It creates everything in order:

  1. Network rule, secret, and external access integration for Reducto API calls
  2. Directory table auto-refresh and stream on the external stage
  3. Python UDFs (reducto_submit, reducto_collect)
  4. Tables (submitted_jobs, collected_results, pipeline_warnings)
  5. Stored procedures (submit_new_documents, collect_results)
  6. A view (document_markdown_v) assembling chunks into full-document markdown
  7. Two serverless tasks (submit_task, collect_task), resumed automatically
  8. One data quality alert — resumed automatically

Step 3 — Set your Reducto API key

If you didn't set it in Step 1, update the secret now:

ALTER SECRET raw.documents.reducto_api_key
    SET SECRET_STRING = 'your-actual-reducto-api-key';

Step 4 — Configure auto-refresh (S3)

So that new files are detected automatically:

  1. Run DESC STAGE raw.documents.document_stage; and copy the SQS queue ARN from directory_notification_channel.
  2. In AWS, configure an event notification on your S3 bucket for s3:ObjectCreated:* events that delivers to that SQS queue, either directly or through an SNS topic the queue is subscribed to.

See Snowflake docs on auto-refreshing directory tables for details.

For Azure or GCP storage, follow the equivalent Snowflake guide for your cloud provider.

Step 5 — Upload documents

Upload PDFs to your S3 bucket (or whatever storage backs the stage). The pipeline picks them up automatically:

  1. ~30 seconds: stream detects new files, submit_task fires, files are submitted to Reducto
  2. ~1-2 minutes: Reducto finishes parsing
  3. Next collect cycle (up to 5 min): collect_task picks up completed results

Total end-to-end latency: ~2-7 minutes per document.
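
As a rough check on that estimate, the three stage durations above can be summed into best-case and worst-case bounds. This is only a back-of-the-envelope sketch: the numbers are the approximate figures quoted in the list, and the STAGES structure and latency_bounds helper are illustrative names, not part of the pipeline.

```python
# Stage durations in minutes, taken from the approximate figures above.
# Each entry is (stage name, best case, worst case).
STAGES = [
    ("stream detection + submit_task", 0.5, 0.5),  # ~30 seconds
    ("Reducto parsing", 1.0, 2.0),                 # ~1-2 minutes
    ("wait for next collect_task run", 0.0, 5.0),  # up to 5 minutes
]


def latency_bounds(stages):
    """Return (best, worst) end-to-end latency in minutes."""
    best = sum(lo for _, lo, _ in stages)
    worst = sum(hi for _, _, hi in stages)
    return best, worst


best, worst = latency_bounds(STAGES)
print(f"end-to-end latency: {best:.1f} to {worst:.1f} minutes")
```

The bounds come out at 1.5 and 7.5 minutes, which matches the rounded ~2-7 minute figure. The collect interval dominates the worst case, which is why shortening it helps most during a backfill.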

Querying the output

Full-document markdown:

SELECT source_path, num_pages, full_text
FROM raw.documents.document_markdown_v
ORDER BY collected_at DESC;

Raw chunks (for fine-grained access):

SELECT
    c.source_path,
    f.index AS chunk_index,
    f.value:content::string AS chunk_text
FROM raw.documents.collected_results c,
    LATERAL FLATTEN(input => c.parse_result:chunks) f
ORDER BY c.collected_at DESC, f.index;
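
To make the relationship between the raw chunks and the full-document view concrete, here is a minimal Python sketch of the same idea: take the chunks array, read each chunk's content field, and join the pieces in array order. Only the chunks array and the content field come from the query above. The sample data, the blank-line separator, and the choice to skip empty chunks are assumptions, and the real logic of document_markdown_v is whatever setup.sql defines.

```python
def assemble_markdown(parse_result):
    """Join chunk contents, in array order, into one markdown string.

    Assumes each chunk is a dict with a "content" key, mirroring the
    f.value:content path used in the SQL query.
    """
    chunks = parse_result.get("chunks") or []
    parts = []
    for chunk in chunks:
        content = chunk.get("content")
        if content and content.strip():  # skip missing or blank chunks
            parts.append(content.strip())
    # Assumption: a blank line between chunks keeps markdown blocks separate.
    return "\n\n".join(parts)


# Hypothetical parse result, for illustration only.
sample = {
    "chunks": [
        {"content": "# Quarterly Report"},
        {"content": "Revenue grew in Q3.\n"},
        {"content": None},
        {"content": "## Outlook"},
    ]
}

full_text = assemble_markdown(sample)
print(full_text)
```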

Check data quality warnings:

SELECT * FROM raw.documents.pipeline_warnings
ORDER BY detected_at DESC;

Monitoring

Task execution history

SELECT task_name, state, scheduled_time, completed_time, error_message
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(
    RESULT_LIMIT => 50
))
ORDER BY scheduled_time DESC;

Pipeline throughput

SELECT
    DATE_TRUNC('hour', collected_at)    AS hour,
    COUNT(*)                            AS documents_processed,
    SUM(credits_used)                   AS total_reducto_credits
FROM raw.documents.collected_results
GROUP BY 1
ORDER BY 1 DESC;

Pending jobs (not yet collected)

SELECT s.job_id, s.source_path, s.submitted_at,
       DATEDIFF('minute', s.submitted_at, CURRENT_TIMESTAMP()) AS minutes_pending
FROM raw.documents.submitted_jobs s
LEFT JOIN raw.documents.collected_results c ON s.job_id = c.job_id
WHERE c.job_id IS NULL
ORDER BY s.submitted_at DESC;
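
The query above is an anti-join: it keeps submitted jobs that have no matching row in collected_results. The sketch below shows the same logic over in-memory rows and adds a staleness cutoff of the kind the hourly alert might apply. The 60-minute threshold, the sample rows, and the helper name are assumptions for illustration. The alert's actual condition is defined in setup.sql.

```python
from datetime import datetime, timedelta

STALE_AFTER_MINUTES = 60  # assumed threshold, not taken from setup.sql


def pending_jobs(submitted, collected_ids, now):
    """Return submitted jobs with no collected result, newest first."""
    rows = []
    for job in submitted:
        if job["job_id"] not in collected_ids:
            # Floor of elapsed minutes. Snowflake's DATEDIFF counts minute
            # boundaries crossed, so the two can differ by one.
            minutes = int((now - job["submitted_at"]).total_seconds() // 60)
            rows.append({**job, "minutes_pending": minutes})
    rows.sort(key=lambda r: r["submitted_at"], reverse=True)
    return rows


# Hypothetical rows standing in for submitted_jobs and collected_results.
now = datetime(2024, 1, 1, 12, 0)
submitted = [
    {"job_id": "a", "source_path": "reports/a.pdf", "submitted_at": now - timedelta(minutes=3)},
    {"job_id": "b", "source_path": "reports/b.pdf", "submitted_at": now - timedelta(minutes=90)},
    {"job_id": "c", "source_path": "reports/c.pdf", "submitted_at": now - timedelta(minutes=10)},
]
collected_ids = {"c"}

pending = pending_jobs(submitted, collected_ids, now)
stale = [r["job_id"] for r in pending if r["minutes_pending"] >= STALE_AFTER_MINUTES]
print(pending)
print(stale)
```

A job that has been pending for only a few minutes is normal, since collection runs on a schedule. A job that stays pending well past the collect interval is the case worth investigating.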

Backfill (processing many documents at once)

For the initial load of many documents, the pipeline processes them through the normal stream + task flow. To speed things up:

Shorten the collect interval temporarily:

ALTER TASK collect_task SUSPEND;
ALTER TASK collect_task SET SCHEDULE = '1 MINUTE';
ALTER TASK collect_task RESUME;

After the backfill completes, restore the default:

ALTER TASK collect_task SUSPEND;
ALTER TASK collect_task SET SCHEDULE = '5 MINUTE';
ALTER TASK collect_task RESUME;

Or run the procedures manually in a loop:

CALL submit_new_documents();
-- wait a few minutes for Reducto to process
CALL collect_results();
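
If you drive the manual approach from a script, the loop might look like the sketch below. It is deliberately client-agnostic: collect and count_pending are hypothetical callables that you would wire to your own Snowflake client (for example, one that runs CALL collect_results() and one that counts rows from the pending-jobs query). The demo uses an in-memory stand-in so the control flow can be read and run without a Snowflake connection.

```python
import time


def drain(collect, count_pending, wait_seconds=60, max_rounds=30, sleep=time.sleep):
    """Call collect() until nothing is pending or max_rounds is reached.

    Returns the number of collect rounds performed.
    """
    rounds = 0
    while count_pending() > 0 and rounds < max_rounds:
        sleep(wait_seconds)  # give Reducto time to finish more jobs
        collect()
        rounds += 1
    return rounds


# In-memory stand-in for the pipeline: each collect picks up two jobs.
backlog = {"pending": 5}


def fake_collect():
    backlog["pending"] = max(0, backlog["pending"] - 2)


rounds = drain(fake_collect, lambda: backlog["pending"], sleep=lambda seconds: None)
print(rounds, backlog["pending"])
```

The max_rounds cap is there so a job that never completes cannot keep the loop running indefinitely.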

Troubleshooting

Tasks are not running:

  • Check that tasks are resumed: SHOW TASKS; and look at the state column.
  • Check task history for errors: SELECT * FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY()) ORDER BY SCHEDULED_TIME DESC LIMIT 10;
  • Tasks auto-suspend after 10 consecutive failures. Resume with ALTER TASK <name> RESUME;

submit_task never fires:

  • The stream may be empty. Upload a file, then manually refresh: ALTER STAGE raw.documents.document_stage REFRESH;
  • Verify the stream has data: SELECT SYSTEM$STREAM_HAS_DATA('raw.documents.new_documents_stream');

collect_results always returns 0 jobs:

  • Reducto jobs may still be processing. Wait a few minutes.
  • Check that the API key secret exists: SHOW SECRETS IN SCHEMA raw.documents; (a secret's value cannot be read back with a SELECT; if in doubt, set it again as shown in Step 3).

Data quality alert is not firing:

  • Check alert state: SHOW ALERTS;
  • Check alert history: SELECT * FROM TABLE(INFORMATION_SCHEMA.ALERT_HISTORY(RESULT_LIMIT => 20)) ORDER BY SCHEDULED_TIME DESC;

Teardown

To remove all pipeline objects (preserves the external stage):

-- Run teardown.sql in a Snowflake worksheet

See teardown.sql for the full script.

Project structure

├── setup.sql       # Creates all Snowflake objects (run once)
├── teardown.sql    # Drops all pipeline objects (for reset/cleanup)
└── README.md

Objects created

| Type | Name | Purpose |
| --- | --- | --- |
| Network Rule | reducto_egress | Allows egress to platform.reducto.ai |
| Secret | reducto_api_key | Stores the Reducto API key |
| Integration | reducto_access | Wires the network rule and secret for UDFs |
| Stream | new_documents_stream | Detects new files on the stage |
| Function | reducto_submit(url) | Submits a document to Reducto for async parsing |
| Function | reducto_collect(job_id) | Polls Reducto for a completed parse result |
| Table | submitted_jobs | Tracks files submitted to Reducto |
| Table | collected_results | Stores raw Reducto parse results (chunks + metadata) |
| Table | pipeline_warnings | Data quality warnings from alerts |
| Procedure | submit_new_documents() | Reads stream, submits new files |
| Procedure | collect_results() | Polls for completed Reducto jobs |
| View | document_markdown_v | Full-document markdown assembled from chunks |
| Task | submit_task | Stream-triggered, calls submit_new_documents() |
| Task | collect_task | Scheduled every 5 min, calls collect_results() |
| Alert | stale_pending_jobs_alert | Hourly check for stuck Reducto jobs |
