Intelligent File Grouping: Progress and Challenges with LLM-Driven Approach #196
lizhengfeng101 announced in Announcements
Background
When processing large PRs, OCR needs to intelligently group changed files so that functionally related files can be reviewed together within a single LLM context window. Our internal version achieves this through an embedding model for semantic clustering, which delivers strong results.
To bring this capability to open-source users, we developed an alternative approach that does not depend on an embedding model. Instead, it uses the LLM the user has already configured to drive file grouping. The implementation is available on the feat/grouping branch.
How It Works
Why We're Not Releasing It Yet
We evaluated this approach on 200 real PRs and found that the LLM-driven grouping shows a noticeable drop in F1 score compared to our internal embedding-based approach. The main reason is that the LLM relies solely on file paths and change statistics when making grouping decisions, so it lacks the deeper semantic understanding of file contents that embeddings provide. As a result, grouping accuracy falls short in complex scenarios.
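For readers who want to run a similar comparison on their own PRs: this post does not say how F1 is computed for a grouping, so the sketch below assumes one common choice, pairwise F1. Two files count as a positive pair when they land in the same group, and the predicted pairs are scored against the pairs from a reference grouping. The function names (`co_grouped_pairs`, `pairwise_f1`) and the file paths are hypothetical and are not taken from the feat/grouping branch.

```python
from itertools import combinations


def co_grouped_pairs(groups):
    """Return the set of unordered file pairs that share a group."""
    pairs = set()
    for group in groups:
        # Sort so each pair has one canonical (a, b) ordering.
        for a, b in combinations(sorted(set(group)), 2):
            pairs.add((a, b))
    return pairs


def pairwise_f1(predicted, reference):
    """Pairwise F1 between two groupings (an assumed metric, see above)."""
    pred = co_grouped_pairs(predicted)
    ref = co_grouped_pairs(reference)
    if not pred and not ref:
        # Both groupings put every file in its own group: treat as a match.
        return 1.0
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the reference keeps the docs file separate,
# while the predicted grouping merges all three files into one group.
reference = [["api/user.py", "api/user_test.py"], ["docs/readme.md"]]
predicted = [["api/user.py", "api/user_test.py", "docs/readme.md"]]
score = pairwise_f1(predicted, reference)
```

In this example the predicted grouping recovers the one correct pair but adds two spurious ones, so precision is 1/3, recall is 1, and `score` is 0.5. Over-merging unrelated files is penalized through precision, and splitting related files is penalized through recall.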
Next Steps
We will continue exploring better alternatives for the open-source version, with the goal of approaching the internal solution's grouping quality without requiring an additional embedding model dependency. If the community has ideas or suggestions, we'd love to hear them in this discussion!