This repository is an internal Smoothcomp ingestion adapter implemented in Go. It acts as an Anti-Corruption Layer between Smoothcomp and the future system of record, and owns only provider-facing ingestion concerns:
- external fetching
- raw snapshot capture
- HTML/JSON parsing
- technical normalization
- source-level change detection
- snapshot and publication deduplication
- publication decisioning
- ingestion job execution
- internal operational APIs
It is not a public scraper API and it is not the business-domain backend.
Ownership is intentionally split:
- Go owns source-level change detection, snapshot dedupe, normalized-result dedupe, publication dedupe, parser and normalization versioning, and deciding whether a new envelope should be published
- Nest.js owns contract validation on receipt, import-run lifecycle, idempotent application to the domain model, canonical persistence, multitenancy, security, and audit of imported business state
The supported runtime is split into:
- cmd/api: internal control plane, health/readiness, enqueue endpoints, publication lookup
- cmd/worker: scheduler plus multi-worker-safe execution loop
- cmd/server: local convenience binary only; not the recommended production shape
The job lifecycle is:
- API or scheduler enqueues a durable job in ingestion_jobs
- A worker claims one available job with a lease
- The worker renews the lease while processing
- Raw provider responses are stored in raw_snapshots
- Technical normalization is stored in normalized_results
- Publication decision metadata is stored with the normalized result for every execution
- Published importable output is stored in published_results only when Go decides a new effective publication is warranted
- The job either completes, gets rescheduled for retry, or reaches a terminal state
Canonical layout:
cmd/api
cmd/worker
cmd/server
internal/core
internal/application
internal/adapters/smoothcomp
internal/adapters/storage
internal/adapters/transport
internal/platform/config
internal/platform/bootstrap
migrations
testdata
Dependency direction points inward:
- internal/core: contracts, job models, error taxonomy, repository and pipeline ports
- internal/application: enqueue, worker lifecycle, retry policy, scheduler orchestration
- internal/adapters/*: Smoothcomp provider implementation, storage implementation, internal HTTP transport
- internal/platform/*: config loading, runtime wiring, correlation helpers
Production persistence uses Postgres. SQLite remains available as a local development fallback only.
Operational tables:
- ingestion_jobs: durable queue record, lease state, retry schedule, error state, versions, counters
- job_attempts: one row per execution attempt
- job_state_transitions: audit trail of lifecycle transitions
- raw_snapshots: append-only per job attempt with hash-based idempotency
- normalized_results: one canonical normalized output per job, updated idempotently by job_id, including scope_key, source_snapshot_hash, normalized_hash, and publication decision audit fields
- published_results: one published envelope per effective publication, including scope_key, source_snapshot_hash, normalized_hash, envelope_hash, publication decision fields, and supersession lineage
- schedule_configs_v2: internal scheduler configuration
Production worker claiming uses a lease model:
- Postgres claim path uses SELECT ... FOR UPDATE SKIP LOCKED
- a claimed job is marked running
- claimed_by, claimed_at, lease_until, and last_heartbeat_at are persisted
- a worker renews the lease periodically while processing
- a job becomes claimable again when:
  - it is pending and next_retry_at <= now()
  - or it is running but lease_until < now()
This gives:
- safe concurrent multi-worker claiming
- crash recovery for stuck jobs
- no concurrent double-processing of the same active lease
SQLite uses the same repository contract but is intended only for local development, not for multi-worker production.
Retries are durable and explicit.
- attempt_count is incremented on every claim
- max_attempts is stored on the job
- retryable failures move the job back to pending
- next_retry_at is persisted using exponential backoff with cap
- non-retryable failures end in failed
- retryable failures that exhaust attempts end in exhausted
- failure category, code, message, and retryability are persisted on the job and attempt records
Current states:
- pending
- running
- succeeded
- failed
- exhausted
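One way to picture how these states relate is a transition table. The sketch below is inferred from the lifecycle, lease, and retry descriptions in this README; the table, CanTransition, and IsTerminal are hypothetical names, and the actual implementation may enforce transitions differently (for example in SQL).

```go
package main

// allowed lists the transitions implied by the lifecycle description:
// a claim moves pending to running; a running job can be re-claimed after
// lease expiry, rescheduled for retry, or finished in a terminal state.
var allowed = map[string][]string{
	"pending": {"running"},
	"running": {"running", "pending", "succeeded", "failed", "exhausted"},
}

// CanTransition reports whether moving from one state to another is allowed.
func CanTransition(from, to string) bool {
	for _, s := range allowed[from] {
		if s == to {
			return true
		}
	}
	return false
}

// IsTerminal reports whether a state ends the job lifecycle.
func IsTerminal(state string) bool {
	switch state {
	case "succeeded", "failed", "exhausted":
		return true
	default:
		return false
	}
}
```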
Consistency rules in the current implementation:
- snapshots are append-only per job attempt, with uniqueness on (job_id, attempt_number, resource_type, resource_key, sha256)
- normalized results are idempotent by job_id
- published results are idempotent by job_id
- latest effective publication is looked up by (pipeline, scope_key) ordered by published_at DESC
- normalized results store a canonical normalized_hash
- published results store a canonical envelope_hash
- every normalized execution stores publication_decision, publication_reason, and change_classification
Publication decisioning is explicit:
- NO_CHANGE -> SKIP_NO_CHANGE
- CONTENT_CHANGED -> PUBLISH_CHANGED
- NORMALIZATION_CHANGED -> PUBLISH_CHANGED
- REPUBLISH_FORCED -> PUBLISH_FORCED
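The mapping from change classification to publication decision is a straightforward lookup. The sketch below uses the classification and decision values listed above as plain strings; the function names DecidePublication and ShouldPublish are hypothetical, and how the classification itself is derived is not shown here.

```go
package main

import "fmt"

// DecidePublication maps a change classification to a publication decision.
func DecidePublication(classification string) (string, error) {
	switch classification {
	case "NO_CHANGE":
		return "SKIP_NO_CHANGE", nil
	case "CONTENT_CHANGED", "NORMALIZATION_CHANGED":
		return "PUBLISH_CHANGED", nil
	case "REPUBLISH_FORCED":
		return "PUBLISH_FORCED", nil
	default:
		return "", fmt.Errorf("unknown change classification %q", classification)
	}
}

// ShouldPublish reports whether a decision results in a new published envelope.
func ShouldPublish(decision string) bool {
	return decision == "PUBLISH_CHANGED" || decision == "PUBLISH_FORCED"
}
```

Rejecting unknown classifications explicitly keeps a new classification value from silently defaulting to either publishing or skipping.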
The adapter computes:
- source_snapshot_hash: stable hash of the fetched provider snapshot set for a single scope
- normalized_hash: stable hash of normalized semantic content after removing execution-local volatility such as snapshot ids, job ids, correlation ids, and timestamps
- envelope_hash: stable hash of the published envelope after publication-lineage metadata is added, excluding volatile delivery metadata such as published_at
This means a repeated execution of the same job can safely overwrite the canonical normalized record for that job while preserving attempt history, and Go can avoid publishing a redundant envelope when the latest effective publication for the same scope has not materially changed.
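To illustrate what "stable hash after removing execution-local volatility" can look like, here is a minimal sketch. It is not the adapter's hashing code: the field names in volatileKeys are hypothetical stand-ins for the volatile values, SHA-256 over canonical JSON is an assumption for normalized_hash, and StableHash is an illustrative name. It relies on encoding/json emitting map keys in sorted order, which makes the output independent of the input key order.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
)

// volatileKeys are hypothetical field names for execution-local values.
var volatileKeys = map[string]bool{
	"snapshot_id":    true,
	"job_id":         true,
	"correlation_id": true,
	"fetched_at":     true,
}

// stripVolatile returns a copy of v with volatile keys removed at every level.
func stripVolatile(v any) any {
	switch t := v.(type) {
	case map[string]any:
		out := make(map[string]any, len(t))
		for k, val := range t {
			if !volatileKeys[k] {
				out[k] = stripVolatile(val)
			}
		}
		return out
	case []any:
		out := make([]any, len(t))
		for i, val := range t {
			out[i] = stripVolatile(val)
		}
		return out
	default:
		return v
	}
}

// StableHash hashes a JSON document after removing volatile keys.
func StableHash(doc []byte) (string, error) {
	var v any
	if err := json.Unmarshal(doc, &v); err != nil {
		return "", err
	}
	// json.Marshal writes map keys in sorted order, giving a canonical form.
	canonical, err := json.Marshal(stripVolatile(v))
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(canonical)
	return hex.EncodeToString(sum[:]), nil
}
```

Two executions that differ only in job id, snapshot id, or key order produce the same hash in this sketch, while any change to the remaining content produces a different one. Array order is preserved, so reordered lists count as a change.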
Active internal endpoints:
- GET /internal/v1/health/live
- GET /internal/v1/health/ready
- POST /internal/v1/jobs
- GET /internal/v1/jobs
- GET /internal/v1/jobs/{id}
- GET /internal/v1/publications/latest?pipeline=...
GET /internal/v1/publications/latest now accepts optional scope filters:
- country
- event_type
- event_id
- profile_id
When scope filters are present, the lookup resolves the latest effective publication for that exact provider scope instead of the last publication for the whole pipeline.
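A caller can assemble the lookup URL with the standard net/url package. This is a sketch for illustration only: LatestPublicationURL is a hypothetical helper, the base URL passed in is whatever host the API is deployed on, and which scope filters apply to which pipeline is not specified here.

```go
package main

import "net/url"

// LatestPublicationURL builds the latest-publication lookup URL.
// Empty scope values are omitted, so the lookup falls back to the whole pipeline.
func LatestPublicationURL(baseURL, pipeline string, scope map[string]string) (string, error) {
	u, err := url.Parse(baseURL)
	if err != nil {
		return "", err
	}
	u.Path = "/internal/v1/publications/latest"

	q := url.Values{}
	q.Set("pipeline", pipeline)
	// Only the documented optional scope filters are forwarded.
	for _, key := range []string{"country", "event_type", "event_id", "profile_id"} {
		if v := scope[key]; v != "" {
			q.Set(key, v)
		}
	}
	u.RawQuery = q.Encode() // Encode sorts parameters by key
	return u.String(), nil
}
```

For example, calling it with a base of http://localhost:8080 (an assumed local address), the pipeline smoothcomp.event_participants, and event_id=123 yields a URL ending in /internal/v1/publications/latest?event_id=123&pipeline=smoothcomp.event_participants.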
The API requires an internal token unless ALLOW_INSECURE_INTERNAL_AUTH=true is explicitly set. CORS is not enabled by default.
Important environment variables:
DATABASE_DRIVER=postgres
DATABASE_DSN=postgres://user:password@localhost:5432/smoothcomp_adapter?sslmode=disable
DATABASE_RUN_MIGRATIONS=true
DATABASE_MAX_OPEN_CONNS=10
DATABASE_MAX_IDLE_CONNS=5
DATABASE_CONN_MAX_LIFETIME_SEC=300
WORKER_POLL_INTERVAL_SEC=5
WORKER_LEASE_DURATION_SEC=60
WORKER_HEARTBEAT_INTERVAL_SEC=20
WORKER_MAX_ATTEMPTS=5
WORKER_BASE_RETRY_DELAY_SEC=15
WORKER_MAX_RETRY_DELAY_SEC=300
INTERNAL_AUTH_TOKEN=replace-me
Validation currently enforces:
- internal auth token unless insecure mode is explicitly enabled
- supported DB driver
- worker heartbeat lower than lease duration
- positive retry and connection-pool settings
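The validation rules above could be sketched as follows. The Config struct, its field names, and the error messages are hypothetical; only the rules themselves and the environment variable names come from this README, and "supported DB driver" is assumed to mean postgres or sqlite.

```go
package main

import "fmt"

// Config holds the settings covered by validation. Field names are illustrative.
type Config struct {
	Driver             string
	AuthToken          string
	AllowInsecureAuth  bool
	LeaseSec           int
	HeartbeatSec       int
	MaxAttempts        int
	BaseRetryDelaySec  int
	MaxRetryDelaySec   int
	MaxOpenConns       int
	MaxIdleConns       int
	ConnMaxLifetimeSec int
}

// Validate returns the first rule violation it finds, or nil.
func (c Config) Validate() error {
	if c.AuthToken == "" && !c.AllowInsecureAuth {
		return fmt.Errorf("INTERNAL_AUTH_TOKEN is required unless ALLOW_INSECURE_INTERNAL_AUTH=true")
	}
	if c.Driver != "postgres" && c.Driver != "sqlite" {
		return fmt.Errorf("unsupported DATABASE_DRIVER %q", c.Driver)
	}
	// The heartbeat must fire before the lease expires, or live jobs get re-claimed.
	if c.HeartbeatSec >= c.LeaseSec {
		return fmt.Errorf("heartbeat interval (%ds) must be lower than lease duration (%ds)", c.HeartbeatSec, c.LeaseSec)
	}
	positives := []struct {
		name  string
		value int
	}{
		{"WORKER_HEARTBEAT_INTERVAL_SEC", c.HeartbeatSec},
		{"WORKER_MAX_ATTEMPTS", c.MaxAttempts},
		{"WORKER_BASE_RETRY_DELAY_SEC", c.BaseRetryDelaySec},
		{"WORKER_MAX_RETRY_DELAY_SEC", c.MaxRetryDelaySec},
		{"DATABASE_MAX_OPEN_CONNS", c.MaxOpenConns},
		{"DATABASE_MAX_IDLE_CONNS", c.MaxIdleConns},
		{"DATABASE_CONN_MAX_LIFETIME_SEC", c.ConnMaxLifetimeSec},
	}
	for _, p := range positives {
		if p.value <= 0 {
			return fmt.Errorf("%s must be positive, got %d", p.name, p.value)
		}
	}
	return nil
}
```

The example values listed earlier (lease 60, heartbeat 20, and so on) pass these checks.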
Migrations live under:
- migrations/postgres
- migrations/sqlite
They are executed automatically on startup when DATABASE_RUN_MIGRATIONS=true.
Current migration flow:
- open DB
- ensure schema_migrations
- run unapplied SQL files in lexical order
- record applied versions
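The "run unapplied SQL files in lexical order" step amounts to a filter and a sort. The sketch below shows only that selection logic; PendingMigrations is a hypothetical name, the file names in the usage note are made up, and treating the file name as the recorded version is an assumption.

```go
package main

import "sort"

// PendingMigrations returns the files that are not yet recorded as applied,
// sorted lexically so they run in a deterministic order.
// Assumption: the applied set is keyed by file name.
func PendingMigrations(files []string, applied map[string]bool) []string {
	var pending []string
	for _, f := range files {
		if !applied[f] {
			pending = append(pending, f)
		}
	}
	sort.Strings(pending)
	return pending
}
```

Given hypothetical files 0003_c.sql, 0001_a.sql, and 0002_b.sql with 0001_a.sql already applied, the result is 0002_b.sql followed by 0003_c.sql. Lexical ordering only matches numeric ordering when the numeric prefixes are zero-padded to the same width.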
Local fallback mode:
DATABASE_DRIVER=sqlite
DATABASE_DSN=./storage/adapter.db
ALLOW_INSECURE_INTERNAL_AUTH=true
Typical local commands:
go run ./cmd/api
go run ./cmd/worker
cmd/server still exists, but it should be treated as local convenience only.
- smoothcomp.event_catalog
- smoothcomp.event_participants
- smoothcomp.event_detail
- smoothcomp.athlete_profile_enrichment
- smoothcomp.academy_catalog
Fixture-based parser tests live under:
- testdata/smoothcomp/events
- testdata/smoothcomp/participants
- testdata/smoothcomp/event_detail
- testdata/smoothcomp/athletes
- testdata/smoothcomp/academies
- testdata/smoothcomp/audit
Storage lifecycle tests live in:
internal/adapters/storage/gormstore
This repository now includes a deterministic extraction audit runner for evidence-based verification of parser and normalization quality.
Artifacts:
- dataset: testdata/smoothcomp/audit/dataset.json
- raw audit fixtures: testdata/smoothcomp/audit/fixtures
- audit runner: cmd/audit
- current human-readable report: docs/smoothcomp-extraction-audit.md
- match extraction design note: docs/smoothcomp-match-extraction.md
Run it locally:
go run ./cmd/audit
go run ./cmd/audit -format json
The audit compares:
- raw provider snapshots
- normalized adapter output
- manually curated truth assertions with mismatch classification
Mismatch classes currently used:
- SOURCE_NOT_VISIBLE
- PARTIAL_SOURCE_DATA
- PARSER_DRIFT
- NORMALIZATION_BUG
- ID_RESOLUTION_BUG
- SUBDOMAIN_VARIANT
- EXPECTATION_WAS_WRONG
- UNSUPPORTED_VARIANT
The following legacy extraction capabilities have now been migrated into first-class provider pipelines:
- event catalog
- event participants
- event detail
- athlete profile enrichment
- academy catalog
The adapter also publishes athlete-centric match history and derived win/loss summaries from profile history where the provider visibly exposes them. Event-centric bracket/results reconstruction has not yet been frozen as a supported audited contract.
Legacy packages remain only as temporary reference for non-runtime helpers and for future cleanup:
- internal/scraper
- internal/api
- internal/scheduler
- internal/config
These packages are no longer the supported runtime path for day-to-day operation.
- Postgres is now wired, but production rollout still needs environment-specific operational packaging and deployment manifests
- metrics export is still hook-oriented/log-oriented; there is not yet a Prometheus/OpenTelemetry adapter
- the adapter now decides publication at source level, but delivery orchestration into Nest is still a separate next step
- provider coverage is broader now, but some Smoothcomp page variants may still require additional fixtures as new markup variations are discovered
- the extraction audit is deterministic and reproducible, but it is still based on a curated corpus rather than large-scale live production sampling
- athlete competitive records are currently strongest from profile-history sources; event-centric bracket/results pages still need archived fixture coverage before that path should be consumed downstream
- cmd/server still exists for convenience and should not be the default production topology
- concurrent jobs for the exact same scope can still race at publication time because there is not yet a dedicated per-scope publication lock or delivery outbox