Automate GitHub contribution stats updates #16

Open
0xbrayo wants to merge 4 commits into ActivityWatch:master from 0xbrayo:automated-github-stats

@0xbrayo 0xbrayo commented Jun 12, 2026

Member

Summary

The GitHub stats table on the website's contributors page has been updated manually — last on 2024-07-11 — because github_stats.py refetched all history every run and crashed on the first RateLimitExceededException, making it impossible to run under the Actions GITHUB_TOKEN budget (1,000 requests/hour). This PR makes the data collection automatic and incremental, and commits only the data (not the rendered HTML) so the consumer can render it.

Four commits:

  1. ci: a scheduled workflow runs the sync every 6 hours (00:00 UTC start, so the website's 00:30 UTC build sees data at most ~30 min stale) and commits the updated github-stats-state.json when it changes. Also bumps the Build workflow to Python 3.12 and passes GITHUB_TOKEN to the test step so the API tests stop hitting anonymous rate limits.
  2. feat: the sync is now incremental and rate-limit-resilient:
    • cumulative per-repo/per-user stats persist in github-stats-state.json, with a per-stat last-synced timestamp, so runs only fetch the delta since the previous sync;
    • on a rate limit it checkpoints state (committed before sleeping) and sleeps until the window resets instead of crashing, so interrupted runs resume without losing or double-counting work;
    • removed the per-comment get_reactions() calls (one API request per comment; the result was unused), fetches with per_page=100, and derives active days from objects the fetchers already iterate — a full-history sync now needs ~350–550 requests and incremental runs a few dozen;
    • bots are excluded from collected stats (same endswith("[bot]") filter the rendered table uses), and issues -= prs clips instead of asserting (a PR created mid-run can briefly exceed the issue count; the next incremental run self-corrects);
    • rendering is split from syncing: github_stats.py --render-only (and the make render target) rebuilds the HTML table from the committed state with no API calls and no token, so the website can generate the table itself;
    • numpy/pandas/joblib/mypy bumped for Python 3.12.
  3. chore: the github-stats-state.json from a full sync, so the scheduled runs start incremental immediately. The rendered HTML is intentionally not committed — it's a build artifact the consumer produces.
  4. ci: a monthly full resync (1st, 01:00 UTC) and a manual workflow_dispatch toggle discard the saved state and rebuild from scratch, re-attributing renamed/deleted users (incremental syncs never re-read history, so a rename otherwise leaves a frozen row under the old login).
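The incremental bullets in commit 2 can be illustrated with a minimal sketch. The state layout, the function names `merge_stat`, `is_bot`, and `issues_excluding_prs`, and the sample logins are assumptions made for illustration, not the code in this PR; only the `endswith("[bot]")` filter, the per-stat timestamp, the set-union of active days, and the clipping behaviour come from the description above.

```python
from datetime import datetime, timezone


def is_bot(login: str) -> bool:
    """Suffix filter named in the PR description."""
    return login.endswith("[bot]")


def merge_stat(state: dict, stat: str, delta: dict, synced_at: datetime) -> None:
    """Fold one stat's delta into the cumulative state (hypothetical layout).

    state: {"users": {login: {stat: {"count": int, "days": [iso dates]}}},
            "last_synced": {stat: iso timestamp}}
    delta: {login: {"count": int, "days": set of iso dates}}
    """
    users = state.setdefault("users", {})
    for login, d in delta.items():
        if is_bot(login):
            continue  # bots are excluded from collected stats
        entry = users.setdefault(login, {}).setdefault(stat, {"count": 0, "days": []})
        entry["count"] += d["count"]
        # Set-union keeps active days idempotent if a day is seen twice.
        entry["days"] = sorted(set(entry["days"]) | set(d["days"]))
    # The next run would fetch only items newer than this timestamp.
    state.setdefault("last_synced", {})[stat] = synced_at.isoformat()


def issues_excluding_prs(issues: int, prs: int) -> int:
    """Clip at zero instead of asserting issues >= prs."""
    return max(issues - prs, 0)


state: dict = {}
now = datetime(2026, 6, 12, tzinfo=timezone.utc)
merge_stat(state, "commits", {"alice": {"count": 2, "days": {"2026-06-11"}}}, now)
merge_stat(state, "commits", {"alice": {"count": 1, "days": {"2026-06-11", "2026-06-12"}}}, now)
```

Note that only the active days are idempotent here; the counts rely on the per-stat timestamp so that the same items are never fetched twice.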

Testing

  • Verified end-to-end on my fork: a full-history sync (24 repos, ~21 min, 1,785 contributors) and an incremental run (~2 min, no rate-limit sleeps).
  • --render-only reproduces the previously-committed table byte-for-byte from the committed state, with no token.
  • make test (4 live-API tests) and make typecheck pass.

Follow-up

A companion PR to activitywatch.github.io (ActivityWatch/activitywatch.github.io#55) makes the website render this table from the committed state during its daily build, replacing the manually-pasted _includes/tables/github-stats.html. That PR depends on this one.

greptile-apps Bot commented Jun 12, 2026


Greptile Summary

This PR automates the previously manual GitHub contributor stats update by introducing an incremental sync engine and a scheduled Actions workflow. The core change is github_stats.py, which now persists cumulative per-repo/per-user stats in github-stats-state.json with per-stat timestamps, fetches only the delta since the last sync, and handles rate-limit exceptions by checkpointing state and sleeping until the window resets rather than crashing.

  • Incremental state engine: cumulative stats are loaded from github-stats-state.json, per-stat last_synced timestamps drive since parameters, and _merge_stat set-unions active days so re-processing the same day is harmless.
  • Rate-limit resilience: RateLimitExceededException is caught in a retry loop; before sleeping, _commit_state() commits and pushes a checkpoint so progress survives if the Actions runner is cancelled or times out.
  • Scheduled workflow: update-contributions.yml runs every 6 hours and commits both the HTML table and state file when they change; it currently lacks a concurrency group, which can lead to push conflicts when a workflow_dispatch overlaps a scheduled run.
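The checkpoint-then-sleep pattern in the rate-limit bullet might look roughly like the sketch below. `RateLimited` is a stand-in for PyGithub's `RateLimitExceededException`, and `fetch_with_checkpoint`, its parameters, and the one-second margin are assumptions for illustration; the PR's own `_commit_state()` and `_sleep_until_rate_limit_reset` helpers are not reproduced here.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Stand-in for the real rate-limit exception (assumption)."""

    def __init__(self, reset_at: float) -> None:
        super().__init__("rate limit exceeded")
        self.reset_at = reset_at  # epoch seconds when the window resets


def fetch_with_checkpoint(
    fetch: Callable[[], T],
    checkpoint: Callable[[], None],
    sleep: Callable[[float], None] = time.sleep,
    now: Callable[[], float] = time.time,
    max_attempts: int = 5,
) -> T:
    """Retry `fetch`, saving progress before each rate-limit sleep."""
    for _ in range(max_attempts):
        try:
            return fetch()
        except RateLimited as exc:
            # Persist progress first, so a cancelled or timed-out runner
            # can resume from the checkpoint instead of starting over.
            checkpoint()
            # Sleep until the window resets, plus a 1 s safety margin.
            sleep(max(exc.reset_at - now(), 0) + 1)
    raise RuntimeError("still rate limited after max_attempts")
```

Injecting `sleep` and `now` keeps the loop testable without waiting for a real rate-limit window.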

Confidence Score: 4/5

The Python-side incremental sync is well-structured and the rate-limit checkpoint mechanism is sound; the main gap is in the workflow configuration, where concurrent runs can produce a rejected push and leave the HTML table one cycle stale.

The incremental state engine, merge logic, and rate-limit recovery path are all carefully designed and tested end-to-end. The one real operational gap is the missing concurrency key in update-contributions.yml: a manual workflow_dispatch triggered while a scheduled run is in progress — or a rate-limit sleep extending a run past the 6-hour mark — will result in a non-fast-forward push rejection on the final commit step, leaving the HTML table update silently skipped for that cycle. Adding two lines to the workflow file closes the gap entirely.

update-contributions.yml needs a concurrency block before this is fully production-ready; all other files look correct.

Important Files Changed

| Filename | Overview |
| --- | --- |
| .github/workflows/update-contributions.yml | New scheduled workflow runs stats sync every 6 hours; missing concurrency guard means overlapping runs (e.g., workflow_dispatch + scheduled, or rate-limit sleep > 6 h) will cause rejected push failures on the final commit step. |
| src/contributor_stats/github_stats.py | Large rewrite adding incremental state, per-stat timestamps, rate-limit sleeping, and mid-run checkpoint commits. Logic for merging deltas and persisting state is sound; active-days tracking correctly unified via set-union across all stat types. |
| .github/workflows/build.yml | Bumped matrix Python to 3.12 and added GITHUB_TOKEN env var to the test step to avoid anonymous rate limits during API tests. |
| tests/test_github.py | Assertions updated to match new CountWithDays/CommentStats TypedDict return shapes; changes are correct. |
| pyproject.toml | Pinned numpy, pandas, joblib, and mypy for Python 3.12 compatibility; migrated dev deps to poetry groups format. |
| src/contributor_stats/main.py | Single-line mypy suppression added for a dict-item type error in merge_tables; no functional change. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Workflow trigger\nschedule / workflow_dispatch] --> B[actions/checkout]
    B --> C[poetry install]
    C --> D[github_stats.py main]
    D --> E[_load_state\ngithub-stats-state.json]
    E --> F[Get whitelisted repos\nfrom ActivityWatch org]
    F --> G{RateLimitExceededException?}
    G -- Yes --> H[_sleep_until_rate_limit_reset]
    H --> F
    G -- No --> I[for each repo: _sync_repo]
    I --> J[for each stat fetcher]
    J --> K{Fetch delta since\nlast_synced timestamp}
    K -- RateLimitExceededException --> L[_save_state\n_commit_state checkpoint push\nsleep until reset]
    L --> K
    K -- Success --> M[_merge_stat\naccumulate into state]
    M --> N[Update last_synced for this stat]
    N --> J
    J -- All stats done --> O[_save_state full repo state]
    O --> I
    I -- All repos done --> P[Build RepoStats DataFrames]
    P --> Q[Write github-activity-table.html]
    Q --> R[git add html + state.json]
    R --> S{Changes staged?}
    S -- No --> T[No commit needed]
    S -- Yes --> U[git commit + git push]
    U --> V[Done]

Reviews (1): Last reviewed commit: "chore: add generated contributions table..."

Comment thread .github/workflows/update-contributions.yml
Comment thread .github/workflows/update-contributions.yml Outdated
Comment thread github-activity-table.html Outdated

Member

This should probably not be committed, right? Just commit the JSON and generate the HTML from it?

@0xbrayo 0xbrayo Jun 12, 2026

Member Author

This is the table with the GitHub stats generated by the CI run. I can omit it from the initial commit, but it will be generated and committed to the repo when CI runs. I was considering using CI artifacts instead of committing it directly, but that seemed like too much hassle.

Member

But why does it need to be committed and generated in CI?

Consumers like the contributors page should generate it from the JSON (which is committed to enable the incremental collection iiuc).

Member Author

That's a better design, will do.

Member Author

Done — the HTML is no longer committed here. github_stats.py --render-only (and a make render target) now regenerates the table from github-stats-state.json with no API calls, and the contributors-page build renders it from the committed JSON (companion PR ActivityWatch/activitywatch.github.io#55). Only the state JSON is committed by the workflow now.

@0xbrayo 0xbrayo force-pushed the automated-github-stats branch from 5889330 to 620b72f Compare June 13, 2026 05:54
0xbrayo added 4 commits June 13, 2026 09:10
- New 'Update GitHub contributions table' workflow runs github_stats.py
  every 6 hours (starting 00:00 UTC, so the website's 00:30 UTC build
  sees data at most ~30 minutes stale) and commits the regenerated sync
  state when it changes. The HTML table is rendered by the consumer (the
  website) from that state, so only the data is committed here.
- A concurrency group queues overlapping runs (manual dispatch during a
  scheduled run, or a rate-limit sleep crossing the next tick) so the
  final push can't be rejected as non-fast-forward.
- Build workflow: bump Python to 3.12 and pass GITHUB_TOKEN to the test
  step so the API tests don't hit anonymous rate limits.

The stats script refetched all history every run and crashed on the
first RateLimitExceededException, which made it impossible to run under
the Actions GITHUB_TOKEN budget of 1,000 requests/hour.

- Persist cumulative per-repo/per-user stats in github-stats-state.json;
  each stat keeps its own last-synced timestamp so runs only fetch the
  delta since the previous sync.
- Sleep until the rate-limit window resets instead of crashing, and
  commit a state checkpoint before sleeping so an interrupted run
  resumes without losing or double-counting work.
- Drop the per-comment get_reactions() calls (one API request per
  comment, result unused downstream) and fetch with per_page=100;
  derive active days from objects the fetchers already iterate instead
  of refetching everything a second time. A full-history sync now needs
  ~350-550 requests; incremental runs need a few dozen.
- Exclude bots from collected stats (same endswith('[bot]') filter the
  rendered table uses) and clip issues-minus-prs instead of asserting,
  since a PR created mid-run can briefly exceed the issue count.
- Split rendering from syncing: 'github_stats.py --render-only' (and the
  'make render' target) rebuilds the HTML table from the committed state
  with no API calls or token, so the website can generate it itself.
- Bump numpy/pandas/joblib/mypy for Python 3.12 compatibility.

github-stats-state.json is the cumulative GitHub stats sync state that
the scheduled workflow updates incrementally each run. Committing a full
sync's state lets the first scheduled run start incremental immediately.
The HTML table is not committed; the website renders it from this file.

Incremental syncs never re-read history, so a user who renames or
deletes their account keeps a frozen row under the old login while new
activity accrues under the new one. A monthly run (1st, 01:00 UTC) and
a manual workflow_dispatch toggle discard the saved state so the sync
rebuilds from scratch (~20 min, ~400-550 requests) and everything is
re-attributed to current logins.
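The full-resync toggle amounts to ignoring the saved state when loading it. A minimal sketch, assuming a hypothetical `load_state` helper and `full_resync` parameter (neither name is taken from the PR):

```python
import json
from pathlib import Path


def load_state(path: Path, full_resync: bool = False) -> dict:
    """Return saved state, or an empty state to force a rebuild from scratch."""
    if full_resync or not path.exists():
        # Empty state means no last-synced timestamps, so every stat is
        # refetched and attributed to whatever logins exist today.
        return {}
    return json.loads(path.read_text(encoding="utf-8"))
```

With an empty state, rows frozen under an old login simply never get recreated.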
@0xbrayo 0xbrayo force-pushed the automated-github-stats branch from 620b72f to c79b4e7 Compare June 13, 2026 06:10