Smoothcomp Ingestion Adapter

This repository is an internal Smoothcomp ingestion adapter implemented in Go. It acts as an Anti-Corruption Layer between Smoothcomp and the future system of record, and owns only provider-facing ingestion concerns:

external fetching
raw snapshot capture
HTML/JSON parsing
technical normalization
source-level change detection
snapshot and publication deduplication
publication decisioning
ingestion job execution
internal operational APIs

It is not a public scraper API and it is not the business-domain backend.

Ownership is intentionally split:

Go owns source-level change detection, snapshot dedupe, normalized-result dedupe, publication dedupe, parser and normalization versioning, and deciding whether a new envelope should be published
Nest.js owns contract validation on receipt, import-run lifecycle, idempotent application to the domain model, canonical persistence, multitenancy, security, and audit of imported business state

Production Execution Model

The supported runtime is split into:

cmd/api: internal control plane, health/readiness, enqueue endpoints, publication lookup
cmd/worker: scheduler plus multi-worker-safe execution loop
cmd/server: local convenience binary only; not the recommended production shape

The job lifecycle is:

API or scheduler enqueues a durable job in ingestion_jobs
A worker claims one available job with a lease
The worker renews the lease while processing
Raw provider responses are stored in raw_snapshots
Technical normalization is stored in normalized_results
Publication decision metadata is stored with the normalized result for every execution
Published importable output is stored in published_results only when Go decides a new effective publication is warranted
The job either completes, gets rescheduled for retry, or reaches a terminal state

Architecture

Canonical layout:

cmd/api
cmd/worker
cmd/server
internal/core
internal/application
internal/adapters/smoothcomp
internal/adapters/storage
internal/adapters/transport
internal/platform/config
internal/platform/bootstrap
migrations
testdata

Dependency direction points inward:

internal/core: contracts, job models, error taxonomy, repository and pipeline ports
internal/application: enqueue, worker lifecycle, retry policy, scheduler orchestration
internal/adapters/*: Smoothcomp provider implementation, storage implementation, internal HTTP transport
internal/platform/*: config loading, runtime wiring, correlation helpers

Persistence Model

Production persistence uses Postgres. SQLite remains available as a local development fallback only.

Operational tables:

ingestion_jobs: durable queue record, lease state, retry schedule, error state, versions, counters
job_attempts: one row per execution attempt
job_state_transitions: audit trail of lifecycle transitions
raw_snapshots: append-only per job attempt with hash-based idempotency
normalized_results: one canonical normalized output per job, updated idempotently by job_id, including scope_key, source_snapshot_hash, normalized_hash, and publication decision audit fields
published_results: one published envelope per effective publication, including scope_key, source_snapshot_hash, normalized_hash, envelope_hash, publication decision fields, and supersession lineage
schedule_configs_v2: internal scheduler configuration

Locking and Leasing Strategy

Production worker claiming uses a lease model:

Postgres claim path uses SELECT ... FOR UPDATE SKIP LOCKED
a claimed job is marked running
claimed_by, claimed_at, lease_until, and last_heartbeat_at are persisted
a worker renews the lease periodically while processing
a job becomes claimable again when:
- it is pending and next_retry_at <= now()
- or it is running but lease_until < now()

This gives:

safe concurrent multi-worker claiming
crash recovery for stuck jobs
no concurrent double-processing of the same active lease

SQLite uses the same repository contract but is intended only for local development, not for multi-worker production.
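
The claimability rule above can be written as a small predicate. This is a minimal sketch rather than the repository's code: the Job struct, its field names, and the claimable function are hypothetical, and a zero NextRetryAt is assumed to mean "due immediately". In the Postgres path the same condition would presumably sit in the WHERE clause of the SELECT ... FOR UPDATE SKIP LOCKED claim query.

```go
package main

import (
	"fmt"
	"time"
)

// Job holds only the fields the claim rule needs (hypothetical shape).
type Job struct {
	Status      string    // pending, running, succeeded, failed, exhausted
	NextRetryAt time.Time // zero value is treated as "due now" (assumption)
	LeaseUntil  time.Time
}

// base is a fixed reference instant so the example is deterministic.
var base = time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)

// claimable reports whether a worker may claim the job at instant now.
func claimable(j Job, now time.Time) bool {
	switch j.Status {
	case "pending":
		return !j.NextRetryAt.After(now) // next_retry_at <= now
	case "running":
		return j.LeaseUntil.Before(now) // lease_until < now: lease expired
	default:
		return false // terminal states are never claimable
	}
}

func main() {
	live := Job{Status: "running", LeaseUntil: base.Add(30 * time.Second)}
	stale := Job{Status: "running", LeaseUntil: base.Add(-time.Second)}
	fmt.Println("live lease claimable:", claimable(live, base))
	fmt.Println("expired lease claimable:", claimable(stale, base))
}
```

A running job whose lease is still in the future is skipped, which is what prevents two workers from processing the same job at once; once the lease lapses without a heartbeat, the job becomes visible to other workers again.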

Retry Model

Retries are durable and explicit.

attempt_count is incremented on every claim
max_attempts is stored on the job
retryable failures move the job back to pending
next_retry_at is computed with capped exponential backoff and persisted
non-retryable failures end in failed
retryable failures that exhaust attempts end in exhausted
failure category, code, message, and retryability are persisted on the job and attempt records

Current states:

pending
running
succeeded
failed
exhausted
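
To make the retry rules concrete, here is a hedged sketch of the delay calculation and the post-failure state choice. The README only says "exponential backoff with cap", so the doubling factor, the absence of jitter, and the function names retryDelay and statusAfterFailure are assumptions; the constants mirror the example values of WORKER_BASE_RETRY_DELAY_SEC and WORKER_MAX_RETRY_DELAY_SEC.

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseDelay = 15 * time.Second  // WORKER_BASE_RETRY_DELAY_SEC example value
	maxDelay  = 300 * time.Second // WORKER_MAX_RETRY_DELAY_SEC example value
)

// retryDelay doubles the base delay for each prior attempt and clamps the
// result to maxD. attempt is 1-based: the first failed attempt waits base.
func retryDelay(attempt int, base, maxD time.Duration) time.Duration {
	d := base
	for i := 1; i < attempt; i++ {
		d *= 2
		if d >= maxD {
			return maxD // stop early; also avoids overflow for large attempts
		}
	}
	if d > maxD {
		return maxD
	}
	return d
}

// statusAfterFailure maps a failed attempt to the next job state.
// attemptCount is the value after the claim incremented it.
func statusAfterFailure(retryable bool, attemptCount, maxAttempts int) string {
	switch {
	case !retryable:
		return "failed"
	case attemptCount >= maxAttempts:
		return "exhausted"
	default:
		return "pending"
	}
}

func main() {
	for attempt := 1; attempt <= 6; attempt++ {
		fmt.Println(attempt, retryDelay(attempt, baseDelay, maxDelay),
			statusAfterFailure(true, attempt, 5))
	}
}
```

Under these assumptions the delays grow 15s, 30s, 60s, 120s, 240s and then stay at the 300s cap.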

Idempotency and Consistency

Consistency rules in the current implementation:

snapshots are append-only per job attempt, with uniqueness on:
- (job_id, attempt_number, resource_type, resource_key, sha256)
normalized results are idempotent by job_id
published results are idempotent by job_id
latest effective publication is looked up by (pipeline, scope_key) ordered by published_at DESC
normalized results store a canonical normalized_hash
published results store a canonical envelope_hash
every normalized execution stores publication_decision, publication_reason, and change_classification
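
The snapshot uniqueness rule can be illustrated with an in-memory stand-in for raw_snapshots. This is only a sketch: snapshotKey, snapshotLog, and Append are hypothetical names, and the real constraint is enforced by the database rather than by a Go map.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// snapshotKey mirrors the uniqueness tuple
// (job_id, attempt_number, resource_type, resource_key, sha256).
type snapshotKey struct {
	JobID         string
	AttemptNumber int
	ResourceType  string
	ResourceKey   string
	SHA256        string
}

// snapshotLog is an in-memory stand-in for the raw_snapshots table.
type snapshotLog struct {
	seen map[snapshotKey]struct{}
}

func newSnapshotLog() *snapshotLog {
	return &snapshotLog{seen: make(map[snapshotKey]struct{})}
}

// Append records a snapshot and reports whether a new row was written.
// Re-storing identical bytes for the same attempt and resource is a no-op.
func (l *snapshotLog) Append(jobID string, attempt int, resType, resKey string, body []byte) bool {
	sum := sha256.Sum256(body)
	k := snapshotKey{jobID, attempt, resType, resKey, hex.EncodeToString(sum[:])}
	if _, dup := l.seen[k]; dup {
		return false
	}
	l.seen[k] = struct{}{}
	return true
}

func main() {
	log := newSnapshotLog()
	body := []byte("<html>event page</html>")
	fmt.Println(log.Append("job-1", 1, "event_page", "123", body)) // new row
	fmt.Println(log.Append("job-1", 1, "event_page", "123", body)) // duplicate
	fmt.Println(log.Append("job-1", 2, "event_page", "123", body)) // new attempt
}
```

Because attempt_number is part of the key, a retry that fetches identical bytes still gets its own row, which keeps attempt history intact while making a repeated write within one attempt harmless.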

Publication decisioning is explicit:

NO_CHANGE -> SKIP_NO_CHANGE
CONTENT_CHANGED -> PUBLISH_CHANGED
NORMALIZATION_CHANGED -> PUBLISH_CHANGED
REPUBLISH_FORCED -> PUBLISH_FORCED
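
The mapping above is small enough to express as a single switch. The sketch below uses the classification and decision names from this README; the Go type names and the decide and publishes helpers are illustrative, not taken from the codebase.

```go
package main

import "fmt"

type ChangeClassification string
type PublicationDecision string

const (
	NoChange             ChangeClassification = "NO_CHANGE"
	ContentChanged       ChangeClassification = "CONTENT_CHANGED"
	NormalizationChanged ChangeClassification = "NORMALIZATION_CHANGED"
	RepublishForced      ChangeClassification = "REPUBLISH_FORCED"

	SkipNoChange   PublicationDecision = "SKIP_NO_CHANGE"
	PublishChanged PublicationDecision = "PUBLISH_CHANGED"
	PublishForced  PublicationDecision = "PUBLISH_FORCED"
)

// decide maps a change classification to a publication decision.
// Unknown classifications are rejected instead of silently skipped.
func decide(c ChangeClassification) (PublicationDecision, error) {
	switch c {
	case NoChange:
		return SkipNoChange, nil
	case ContentChanged, NormalizationChanged:
		return PublishChanged, nil
	case RepublishForced:
		return PublishForced, nil
	default:
		return "", fmt.Errorf("unknown change classification %q", c)
	}
}

// publishes reports whether a decision results in a new published_results row.
func publishes(d PublicationDecision) bool {
	return d == PublishChanged || d == PublishForced
}

func main() {
	for _, c := range []ChangeClassification{NoChange, ContentChanged, NormalizationChanged, RepublishForced} {
		d, err := decide(c)
		fmt.Println(c, "->", d, publishes(d), err)
	}
}
```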

The adapter computes:

source_snapshot_hash: stable hash of the fetched provider snapshot set for a single scope
normalized_hash: stable hash of normalized semantic content after removing execution-local volatility such as snapshot ids, job ids, correlation ids, and timestamps
envelope_hash: stable hash of the published envelope after publication-lineage metadata is added, excluding volatile delivery metadata such as published_at

This means a repeated execution of the same job can safely overwrite the canonical normalized record for that job while preserving attempt history, and Go can avoid publishing a redundant envelope when the latest effective publication for the same scope has not materially changed.
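
One way to obtain a hash that ignores execution-local fields is to strip them before serializing. The sketch below is an assumption about how such a hash could be computed, not the adapter's actual algorithm: the JSON payload shape, the volatile key names (snapshot_id, job_id, correlation_id, fetched_at), and the use of SHA-256 over encoding/json output are all illustrative. It relies on the documented behavior that json.Marshal writes map keys in sorted order.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// volatileKeys lists fields excluded from the hash (hypothetical names).
var volatileKeys = map[string]bool{
	"snapshot_id":    true,
	"job_id":         true,
	"correlation_id": true,
	"fetched_at":     true,
}

// stripVolatile returns a copy of v with volatile keys removed at any depth.
func stripVolatile(v any) any {
	switch t := v.(type) {
	case map[string]any:
		out := make(map[string]any, len(t))
		for k, val := range t {
			if volatileKeys[k] {
				continue
			}
			out[k] = stripVolatile(val)
		}
		return out
	case []any:
		out := make([]any, len(t))
		for i, val := range t {
			out[i] = stripVolatile(val)
		}
		return out
	default:
		return v
	}
}

// normalizedHash hashes the JSON document after removing volatile fields.
// Re-marshaling a map sorts its keys, so input key order does not matter.
func normalizedHash(raw []byte) (string, error) {
	var doc any
	if err := json.Unmarshal(raw, &doc); err != nil {
		return "", err
	}
	canonical, err := json.Marshal(stripVolatile(doc))
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(canonical)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	first, _ := normalizedHash([]byte(`{"event_id":42,"name":"Open","job_id":"j1"}`))
	second, _ := normalizedHash([]byte(`{"name":"Open","job_id":"j2","event_id":42}`))
	fmt.Println("same content, different job:", first == second)
}
```

Decoding into any turns JSON numbers into float64, so very large integer ids could lose precision; a production implementation would likely need a stricter canonical form.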

Internal API

Active internal endpoints:

GET /internal/v1/health/live
GET /internal/v1/health/ready
POST /internal/v1/jobs
GET /internal/v1/jobs
GET /internal/v1/jobs/{id}
GET /internal/v1/publications/latest?pipeline=...

GET /internal/v1/publications/latest also accepts these optional scope filters:

country
event_type
event_id
profile_id

When scope filters are present, the lookup resolves the latest effective publication for that exact provider scope instead of the last publication for the whole pipeline.

The API requires an internal token unless ALLOW_INSECURE_INTERNAL_AUTH=true is explicitly set. CORS is not enabled by default.
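
As an illustration of a scoped lookup, the sketch below builds the request URL with the standard net/url package. The base address, the example filter values, and the latestPublicationURL helper are assumptions. How the internal token is transmitted is not documented here, so the sketch stops at URL construction.

```go
package main

import (
	"fmt"
	"net/url"
)

// allowedScope lists the optional scope filters named in this README.
var allowedScope = map[string]bool{
	"country":    true,
	"event_type": true,
	"event_id":   true,
	"profile_id": true,
}

// latestPublicationURL builds the lookup URL for a pipeline and optional scope.
func latestPublicationURL(base, pipeline string, scope map[string]string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/internal/v1/publications/latest"
	q := url.Values{}
	q.Set("pipeline", pipeline)
	for k, v := range scope {
		if !allowedScope[k] {
			return "", fmt.Errorf("unsupported scope filter %q", k)
		}
		if v != "" {
			q.Set(k, v)
		}
	}
	u.RawQuery = q.Encode() // Encode sorts parameters by key
	return u.String(), nil
}

func main() {
	// Host, port, and filter values are placeholders.
	got, err := latestPublicationURL("http://localhost:8080", "smoothcomp.event_detail",
		map[string]string{"event_id": "123"})
	fmt.Println(got, err)
}
```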

Configuration

Important environment variables:

DATABASE_DRIVER=postgres
DATABASE_DSN=postgres://user:password@localhost:5432/smoothcomp_adapter?sslmode=disable
DATABASE_RUN_MIGRATIONS=true
DATABASE_MAX_OPEN_CONNS=10
DATABASE_MAX_IDLE_CONNS=5
DATABASE_CONN_MAX_LIFETIME_SEC=300

WORKER_POLL_INTERVAL_SEC=5
WORKER_LEASE_DURATION_SEC=60
WORKER_HEARTBEAT_INTERVAL_SEC=20
WORKER_MAX_ATTEMPTS=5
WORKER_BASE_RETRY_DELAY_SEC=15
WORKER_MAX_RETRY_DELAY_SEC=300

INTERNAL_AUTH_TOKEN=replace-me

Validation currently enforces:

internal auth token unless insecure mode is explicitly enabled
supported DB driver
worker heartbeat interval shorter than the lease duration
positive retry and connection-pool settings
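
The validation rules above could look roughly like the following. This is a sketch under assumptions: the Config struct and its field names are hypothetical, the supported driver set is inferred from the Postgres and SQLite modes described in this README, and errors.Join requires Go 1.20 or later.

```go
package main

import (
	"errors"
	"fmt"
)

// Config is a hypothetical subset of the adapter's settings.
type Config struct {
	DatabaseDriver       string
	MaxOpenConns         int
	MaxIdleConns         int
	LeaseDurationSec     int
	HeartbeatIntervalSec int
	MaxAttempts          int
	BaseRetryDelaySec    int
	MaxRetryDelaySec     int
	InternalAuthToken    string
	AllowInsecureAuth    bool
}

// defaults returns the example values listed in this README.
func defaults() Config {
	return Config{
		DatabaseDriver: "postgres", MaxOpenConns: 10, MaxIdleConns: 5,
		LeaseDurationSec: 60, HeartbeatIntervalSec: 20, MaxAttempts: 5,
		BaseRetryDelaySec: 15, MaxRetryDelaySec: 300,
		InternalAuthToken: "replace-me",
	}
}

// Validate collects every violated rule instead of stopping at the first.
func (c Config) Validate() error {
	var errs []error
	if c.InternalAuthToken == "" && !c.AllowInsecureAuth {
		errs = append(errs, errors.New("INTERNAL_AUTH_TOKEN is required unless insecure mode is enabled"))
	}
	if c.DatabaseDriver != "postgres" && c.DatabaseDriver != "sqlite" {
		errs = append(errs, fmt.Errorf("unsupported DATABASE_DRIVER %q", c.DatabaseDriver))
	}
	if c.HeartbeatIntervalSec >= c.LeaseDurationSec {
		errs = append(errs, errors.New("heartbeat interval must be shorter than lease duration"))
	}
	positives := []struct {
		name  string
		value int
	}{
		{"DATABASE_MAX_OPEN_CONNS", c.MaxOpenConns},
		{"DATABASE_MAX_IDLE_CONNS", c.MaxIdleConns},
		{"WORKER_MAX_ATTEMPTS", c.MaxAttempts},
		{"WORKER_BASE_RETRY_DELAY_SEC", c.BaseRetryDelaySec},
		{"WORKER_MAX_RETRY_DELAY_SEC", c.MaxRetryDelaySec},
	}
	for _, p := range positives {
		if p.value <= 0 {
			errs = append(errs, fmt.Errorf("%s must be positive, got %d", p.name, p.value))
		}
	}
	return errors.Join(errs...) // nil when no rule was violated
}

func main() {
	cfg := defaults()
	cfg.HeartbeatIntervalSec = 90 // longer than the 60s lease
	fmt.Println(cfg.Validate())
}
```

The heartbeat check matters because a worker that renews less often than the lease lasts would let its own lease expire mid-job, making the job claimable by another worker.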

Migrations

Migrations live under:

migrations/postgres
migrations/sqlite

They are executed automatically on startup when DATABASE_RUN_MIGRATIONS=true.

Current migration flow:

open DB
ensure schema_migrations
run unapplied SQL files in lexical order
record applied versions
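
The "unapplied files in lexical order" step can be sketched as a pure function. The file names, the pendingMigrations helper, and the convention that a version is the file name without its .sql suffix are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// pendingMigrations returns the .sql files not yet recorded as applied,
// sorted lexically. applied is keyed by version (file name minus ".sql").
func pendingMigrations(files []string, applied map[string]bool) []string {
	var pending []string
	for _, f := range files {
		if !strings.HasSuffix(f, ".sql") {
			continue // ignore non-migration files
		}
		version := strings.TrimSuffix(f, ".sql")
		if !applied[version] {
			pending = append(pending, f)
		}
	}
	sort.Strings(pending)
	return pending
}

func main() {
	files := []string{"0003_publications.sql", "0001_jobs.sql", "0002_snapshots.sql"}
	applied := map[string]bool{"0001_jobs": true}
	fmt.Println(pendingMigrations(files, applied))
}
```

Lexical ordering compares strings, not numbers, so a file named 10_x.sql would sort before 2_x.sql; zero-padded prefixes avoid that surprise.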

Local Development

Local fallback mode:

DATABASE_DRIVER=sqlite
DATABASE_DSN=./storage/adapter.db
ALLOW_INSECURE_INTERNAL_AUTH=true

Typical local commands:

go run ./cmd/api
go run ./cmd/worker

cmd/server still exists, but it should be treated as local convenience only.

Supported Pipelines Today

smoothcomp.event_catalog
smoothcomp.event_participants
smoothcomp.event_detail
smoothcomp.athlete_profile_enrichment
smoothcomp.academy_catalog

Fixture-based parser tests live under:

testdata/smoothcomp/events
testdata/smoothcomp/participants
testdata/smoothcomp/event_detail
testdata/smoothcomp/athletes
testdata/smoothcomp/academies
testdata/smoothcomp/audit

Storage lifecycle tests live in:

internal/adapters/storage/gormstore

Extraction Audit

This repository includes a deterministic extraction audit runner for evidence-based verification of parser and normalization quality.

Artifacts:

dataset: testdata/smoothcomp/audit/dataset.json
raw audit fixtures: testdata/smoothcomp/audit/fixtures
audit runner: cmd/audit
current human-readable report: docs/smoothcomp-extraction-audit.md
match extraction design note: docs/smoothcomp-match-extraction.md

Run it locally:

go run ./cmd/audit
go run ./cmd/audit -format json

The audit compares:

raw provider snapshots
normalized adapter output
manually curated truth assertions with mismatch classification

Mismatch classes currently used:

SOURCE_NOT_VISIBLE
PARTIAL_SOURCE_DATA
PARSER_DRIFT
NORMALIZATION_BUG
ID_RESOLUTION_BUG
SUBDOMAIN_VARIANT
EXPECTATION_WAS_WRONG
UNSUPPORTED_VARIANT

Migration Status

The following legacy extraction capabilities have now been migrated into first-class provider pipelines:

event catalog
event participants
event detail
athlete profile enrichment
academy catalog

The adapter also publishes athlete-centric match history and derived win/loss summaries from profile history where the provider visibly exposes them. Event-centric bracket/results reconstruction has not yet been frozen as a supported audited contract.

Legacy packages remain only as temporary reference for non-runtime helpers and for future cleanup:

internal/scraper
internal/api
internal/scheduler
internal/config

These packages are no longer the supported runtime path for day-to-day operation.

Known Limitations After This Step

Postgres is now wired, but production rollout still needs environment-specific operational packaging and deployment manifests
metrics export is still hook-oriented/log-oriented; there is not yet a Prometheus/OpenTelemetry adapter
the adapter now decides publication at source level, but delivery orchestration into Nest is still a separate next step
provider coverage is broader now, but some Smoothcomp page variants may still require additional fixtures as new markup variations are discovered
the extraction audit is deterministic and reproducible, but it is still based on a curated corpus rather than large-scale live production sampling
athlete competitive records are currently strongest from profile-history sources; event-centric bracket/results pages still need archived fixture coverage before that path should be consumed downstream
cmd/server still exists for convenience and should not be the default production topology
concurrent jobs for the exact same scope can still race at publication time because there is not yet a dedicated per-scope publication lock or delivery outbox
