Automate GitHub contribution stats updates #16
Greptile Summary
This PR automates the previously manual GitHub contributor stats update by introducing an incremental sync engine and a scheduled Actions workflow. The core change is
Confidence Score: 4/5
The Python-side incremental sync is well-structured and the rate-limit checkpoint mechanism is sound; the main gap is in the workflow configuration, where concurrent runs can produce a rejected push and leave the HTML table one cycle stale. The incremental state engine, merge logic, and rate-limit recovery path are all carefully designed and tested end-to-end. The one real operational gap is the missing
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Workflow trigger\nschedule / workflow_dispatch] --> B[actions/checkout]
B --> C[poetry install]
C --> D[github_stats.py main]
D --> E[_load_state\ngithub-stats-state.json]
E --> F[Get whitelisted repos\nfrom ActivityWatch org]
F --> G{RateLimitExceededException?}
G -- Yes --> H[_sleep_until_rate_limit_reset]
H --> F
G -- No --> I[for each repo: _sync_repo]
I --> J[for each stat fetcher]
J --> K{Fetch delta since\nlast_synced timestamp}
K -- RateLimitExceededException --> L[_save_state\n_commit_state checkpoint push\nsleep until reset]
L --> K
K -- Success --> M[_merge_stat\naccumulate into state]
M --> N[Update last_synced for this stat]
N --> J
J -- All stats done --> O[_save_state full repo state]
O --> I
I -- All repos done --> P[Build RepoStats DataFrames]
P --> Q[Write github-activity-table.html]
Q --> R[git add html + state.json]
R --> S{Changes staged?}
S -- No --> T[No commit needed]
S -- Yes --> U[git commit + git push]
U --> V[Done]
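The rate-limit branch of the flowchart (save state, checkpoint, sleep until reset, retry the same fetch) can be sketched roughly as below. This is an illustration under stated assumptions, not the code in github_stats.py: RateLimitHit, fetch_with_checkpoint, and the injected save_state, sleep, and now callables are hypothetical stand-ins for PyGithub's RateLimitExceededException and the script's _save_state and _sleep_until_rate_limit_reset helpers.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitHit(Exception):
    """Hypothetical stand-in for PyGithub's RateLimitExceededException."""

    def __init__(self, reset_epoch: float) -> None:
        super().__init__("rate limit exceeded")
        self.reset_epoch = reset_epoch  # epoch seconds when the window resets


def fetch_with_checkpoint(
    fetch: Callable[[], T],
    save_state: Callable[[], None],
    sleep: Callable[[float], None] = time.sleep,
    now: Callable[[], float] = time.time,
    max_retries: int = 5,
) -> T:
    """Run `fetch`; on a rate limit, checkpoint, wait for the reset, retry."""
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except RateLimitHit as exc:
            if attempt == max_retries:
                raise  # give up instead of looping forever
            # Persist progress first, so an interrupted sleep loses no work.
            save_state()
            # One-second buffer so we do not wake just before the reset.
            sleep(max(0.0, exc.reset_epoch - now()) + 1.0)
    raise AssertionError("unreachable")
```

Injecting sleep and now keeps the sketch testable without waiting on a real clock; the real script may structure this differently.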
Reviews (1): Last reviewed commit: "chore: add generated contributions table..."
This should probably not be committed, right? Just the JSON and the HTML generated?
This is the table with the GitHub stats generated by the CI run. I can omit it from the initial commit, but it will be generated and committed to the repo when CI runs. I was considering using CI artifacts instead of committing it directly, but that seemed like too much hassle.
But why does it need to be committed and generated in CI?
Consumers like the contributors page should generate it from the JSON (which is committed to enable the incremental collection iiuc).
that's a better design, will do
Done — the HTML is no longer committed here. github_stats.py --render-only (and a make render target) now regenerates the table from github-stats-state.json with no API calls, and the contributors-page build renders it from the committed JSON (companion PR ActivityWatch/activitywatch.github.io#55). Only the state JSON is committed by the workflow now.
5889330 to 620b72f
- New 'Update GitHub contributions table' workflow runs github_stats.py every 6 hours (starting 00:00 UTC, so the website's 00:30 UTC build sees data at most ~30 minutes stale) and commits the regenerated sync state when it changes. The HTML table is rendered by the consumer (the website) from that state, so only the data is committed here.
- A concurrency group queues overlapping runs (manual dispatch during a scheduled run, or a rate-limit sleep crossing the next tick) so the final push can't be rejected as non-fast-forward.
- Build workflow: bump Python to 3.12 and pass GITHUB_TOKEN to the test step so the API tests don't hit anonymous rate limits.
The stats script refetched all history every run and crashed on the
first RateLimitExceededException, which made it impossible to run under
the Actions GITHUB_TOKEN budget of 1,000 requests/hour.
- Persist cumulative per-repo/per-user stats in github-stats-state.json;
each stat keeps its own last-synced timestamp so runs only fetch the
delta since the previous sync.
- Sleep until the rate-limit window resets instead of crashing, and
commit a state checkpoint before sleeping so an interrupted run
resumes without losing or double-counting work.
- Drop the per-comment get_reactions() calls (one API request per
comment, result unused downstream) and fetch with per_page=100;
derive active days from objects the fetchers already iterate instead
of refetching everything a second time. A full-history sync now needs
~350-550 requests; incremental runs need a few dozen.
- Exclude bots from collected stats (same endswith('[bot]') filter the
rendered table uses) and clip issues-minus-prs instead of asserting,
since a PR created mid-run can briefly exceed the issue count.
- Split rendering from syncing: 'github_stats.py --render-only' (and the
'make render' target) rebuilds the HTML table from the committed state
with no API calls or token, so the website can generate it itself.
- Bump numpy/pandas/joblib/mypy for Python 3.12 compatibility.
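
The accumulation step described in the list above might look something like the following sketch. It assumes each stat is stored as {"counts": {login: n}, "last_synced": timestamp}; that layout and the names merge_stat and issues_minus_prs are illustrative assumptions, not the actual schema or helpers in github_stats.py.

```python
from collections import Counter
from typing import Any


def is_bot(login: str) -> bool:
    # Same suffix test the commit message describes for the rendered table.
    return login.endswith("[bot]")


def merge_stat(
    stat_state: dict[str, Any], delta: dict[str, int], synced_at: str
) -> dict[str, Any]:
    """Add a per-user delta to a stat's totals and advance its timestamp."""
    counts = Counter(stat_state.get("counts", {}))
    for login, n in delta.items():
        if not is_bot(login):
            counts[login] += n
    # Return a new dict, so a failed fetch never leaves half-merged state.
    return {"counts": dict(counts), "last_synced": synced_at}


def issues_minus_prs(issues: int, prs: int) -> int:
    # Clip at zero instead of asserting: a PR opened mid-run can briefly
    # make the PR count exceed the issue count.
    return max(issues - prs, 0)
```

Because the timestamp only advances together with the merged totals, a run that stops between stats should resume from the last completed stat without double-counting, which is the property the checkpoint commit relies on.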
github-stats-state.json is the cumulative GitHub stats sync state that the scheduled workflow updates incrementally each run. Committing a full sync's state lets the first scheduled run start incremental immediately. The HTML table is not committed; the website renders it from this file.
Incremental syncs never re-read history, so a user who renames or deletes their account keeps a frozen row under the old login while new activity accrues under the new one. A monthly run (1st, 01:00 UTC) and a manual workflow_dispatch toggle discard the saved state so the sync rebuilds from scratch (~20 min, ~400-550 requests) and everything is re-attributed to current logins.
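
The effect of discarding the state can be sketched as follows, assuming the state is a plain JSON file. The load_state function and its full_resync flag are hypothetical names used for illustration; the script's real interface for the toggle may differ.

```python
import json
from pathlib import Path
from typing import Any


def load_state(path: Path, full_resync: bool = False) -> dict[str, Any]:
    """Return the saved sync state, or an empty one to force a rebuild."""
    if full_resync or not path.exists():
        # With no last-synced timestamps left, every stat is refetched from
        # the beginning and attributed to each account's current login.
        return {}
    return json.loads(path.read_text(encoding="utf-8"))
```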
620b72f to c79b4e7
Summary
The GitHub stats table on the website's contributors page has been updated manually (last on 2024-07-11) because github_stats.py refetched all history every run and crashed on the first RateLimitExceededException, making it impossible to run under the Actions GITHUB_TOKEN budget (1,000 requests/hour). This PR makes the data collection automatic and incremental, and commits only the data (not the rendered HTML) so the consumer can render it.

Four commits:
- New 'Update GitHub contributions table' workflow runs github_stats.py every 6 hours and commits github-stats-state.json when it changes. Also bumps the Build workflow to Python 3.12 and passes GITHUB_TOKEN to the test step so the API tests stop hitting anonymous rate limits.
- Makes github_stats.py incremental:
  - persists cumulative per-repo/per-user stats in github-stats-state.json, with a per-stat last-synced timestamp, so runs only fetch the delta since the previous sync;
  - drops the per-comment get_reactions() calls (one API request per comment; the result was unused), fetches with per_page=100, and derives active days from objects the fetchers already iterate; a full-history sync now needs ~350–550 requests and incremental runs a few dozen;
  - excludes bots from collected stats (same endswith("[bot]") filter the rendered table uses), and issues -= prs clips instead of asserting (a PR created mid-run can briefly exceed the issue count; the next incremental run self-corrects);
  - github_stats.py --render-only (and the make render target) rebuilds the HTML table from the committed state with no API calls and no token, so the website can generate the table itself.
- Commits github-stats-state.json from a full sync, so the scheduled runs start incremental immediately. The rendered HTML is intentionally not committed: it's a build artifact the consumer produces.
- A monthly run and a manual workflow_dispatch toggle discard the saved state and rebuild from scratch, re-attributing renamed/deleted users (incremental syncs never re-read history, so a rename otherwise leaves a frozen row under the old login).

Testing
- --render-only reproduces the previously-committed table byte-for-byte from the committed state, with no token.
- make test (4 live-API tests) and make typecheck pass.

Follow-up
A companion PR to activitywatch.github.io (ActivityWatch/activitywatch.github.io#55) makes the website render this table from the committed state during its daily build, replacing the manually-pasted
_includes/tables/github-stats.html. That PR depends on this one.
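
As a rough illustration of rendering from the committed state with no API calls, the sketch below builds an HTML table from a state dictionary using only the standard library. The state layout ({login: {stat: count}}), the column names, and render_table are assumptions made for this example; the real script builds pandas DataFrames (per the flowchart) and its markup will differ.

```python
import html


def render_table(state: dict[str, dict[str, int]], columns: list[str]) -> str:
    """Render per-user stats as an HTML table with a stable row order."""
    header = "".join(f"<th>{html.escape(c)}</th>" for c in ["user", *columns])
    rows = []
    # Sort by total (descending), then login, so the same state always
    # produces the same bytes regardless of key order in the JSON file.
    for login in sorted(state, key=lambda u: (-sum(state[u].values()), u)):
        if login.endswith("[bot]"):
            continue  # bots are left out of the table
        cells = "".join(f"<td>{state[login].get(c, 0)}</td>" for c in columns)
        rows.append(f"<tr><td>{html.escape(login)}</td>{cells}</tr>")
    return f"<table><tr>{header}</tr>{''.join(rows)}</table>"
```

Deterministic ordering is what makes a byte-for-byte comparison against a previously rendered table meaningful; any renderer with an unstable row order would fail that check even when the data is unchanged.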