-
-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Feature/tokenless diff provider #2467
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
vectorkovacspeter
wants to merge
15
commits into
The-PR-Agent:main
Choose a base branch
from
vectorkovacspeter:feature/tokenless-diff-provider
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
15 commits
Select commit
Hold shift + click to select a range
7f63d4e
feat: add unified-diff parser for tokenless diff provider (#2457)
vectorkovacspeter 5f0c079
feat: reconstruct base file by reverse-applying diff (#2457)
vectorkovacspeter 2fe6773
refactor: harden diff base reconstruction (CRLF, newline, tests) (#2457)
vectorkovacspeter 3a7f74e
feat: add DiffGitProvider with stdout/file output (#2457)
vectorkovacspeter eb2bf46
fix: guard malformed diff parsing in DiffGitProvider (#2457)
vectorkovacspeter e4868f8
feat: CLI --stdin/--diff-file/--output for tokenless diff mode (#2457)
vectorkovacspeter 4416ce2
test: end-to-end coverage for tokenless diff provider (#2457)
vectorkovacspeter 8306df7
test: strengthen diff provider e2e (base reconstruction + mocked-LLM)…
vectorkovacspeter 5b83516
docs: document tokenless local diff mode (#2457)
vectorkovacspeter daeb714
fix: harden DiffGitProvider capability gating and file reads (#2457)
vectorkovacspeter 566cbaf
test: make path-traversal guard test a real sentinel (#2457)
vectorkovacspeter 304384d
fix(diff-provider): address Qodo review findings on tokenless diff mode
vectorkovacspeter dfb19b2
fix(diff-provider): resolve Qodo follow-up findings on tokenless diff…
vectorkovacspeter a05bce3
fix(diff-provider): resolve Qodo third-round findings on tokenless di…
vectorkovacspeter ae473c4
fix(diff-provider): hunk-only patches + isort/test-isolation (Qodo ro…
vectorkovacspeter File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,77 @@ | ||
| # Tokenless local diff mode | ||
|
|
||
| Run PR-Agent against a raw unified diff with no platform API token and no PR URL. | ||
| Results are printed to stdout (and optionally saved to a file). This suits security-first | ||
| or air-gapped environments where HTTP access tokens are avoided, and enables pre-push | ||
| hooks or CI pipelines that operate on a diff artifact rather than a live pull request. | ||
|
|
||
| ## Usage | ||
|
|
||
| Pipe a diff directly from stdin: | ||
|
|
||
| ```bash | ||
| git diff main...feature-branch | python -m pr_agent.cli --stdin review | ||
| ``` | ||
|
|
||
| Or pass a diff file and save the output alongside stdout: | ||
|
|
||
| ```bash | ||
| git diff main...feature-branch > changes.diff | ||
| python -m pr_agent.cli --diff-file changes.diff --output review.md review | ||
| ``` | ||
|
|
||
| ### Flags | ||
|
|
||
| | Flag | Description | | ||
| |---|---| | ||
| | `--stdin` | Read a unified diff from stdin | | ||
| | `--diff-file <path>` | Read a unified diff from a file | | ||
| | `--output <path>` | Write the result to a file in addition to stdout | | ||
|
|
||
| `--stdin` and `--diff-file` are mutually exclusive. At least one must be provided to | ||
| enter tokenless diff mode; omitting both falls back to the normal `--pr_url` flow. | ||
|
|
||
| ### Supported commands | ||
|
|
||
| `review`, `improve`, `describe`, and `ask` are supported. Because there is no hosting | ||
| platform to push to, `improve` renders its code suggestions as a single markdown | ||
| document to stdout (and to `--output`, if given) instead of as committable inline | ||
| suggestions. Commands that require live platform interaction (such as | ||
| `update_changelog` or `similar_issue`) are not meaningful in this mode. | ||
|
|
||
| ## How it works | ||
|
|
||
| 1. **Diff parsing** — the unified diff is parsed locally into per-file patch objects. | ||
| Binary files are skipped automatically. | ||
|
|
||
| 2. **Working-tree enrichment** — when PR-Agent is run inside the repository working tree, | ||
| it reads each changed file from disk and reverse-applies the diff to reconstruct the | ||
| base (pre-change) file content. Having both the base and head versions available gives | ||
| the LLM full file context, which produces higher-quality analysis. | ||
|
|
||
| 3. **Patch-only fallback** — if a changed file cannot be found on disk (e.g. the diff was | ||
| generated elsewhere, or the file was deleted), PR-Agent falls back to patch-only mode | ||
| for that file. The review still runs; it simply has less context. | ||
|
|
||
| 4. **Output** — the result is written to stdout. If `--output <path>` is given, it is | ||
| also written to that file (UTF-8, overwritten on each run). | ||
|
|
||
| No platform token, no PR URL, and no internet access are required for the diff processing | ||
| step itself. An LLM API key is still needed unless you configure a local model. | ||
|
|
||
| ## Difference from the `local` git provider | ||
|
|
||
| The existing `git_provider = "local"` mode (invoked with `--pr_url`) computes a diff | ||
| by comparing branches in a local Git repository and requires a clean working tree. | ||
| The diff mode is different in the following ways: | ||
|
|
||
| | | `local` provider | `diff` provider (this page) | | ||
| |---|---|---| | ||
| | Input | Branch names in a local repo | A unified diff supplied via stdin or file | | ||
| | Working tree required | Yes (clean) | No | | ||
| | Platform token required | No | No | | ||
| | Output | GitHub-style comment published locally | stdout (+ optional file) | | ||
| | Inline comments | Not supported | Not supported | | ||
|
|
||
| Use the `diff` provider when you already have a diff artifact (e.g. from a CI step or | ||
| `git format-patch`) and want a zero-configuration, token-free review. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,114 @@ | ||
| from unidiff import PatchSet | ||
| from unidiff.errors import UnidiffParseError | ||
|
|
||
| from pr_agent.algo.types import EDIT_TYPE, FilePatchInfo | ||
| from pr_agent.log import get_logger | ||
|
|
||
|
|
||
| def _strip_prefix(path: str | None) -> str | None: | ||
| if path is None: | ||
| return None | ||
| if path.startswith(("a/", "b/")): | ||
| return path[2:] | ||
| return path | ||
|
|
||
|
|
||
| def to_hunk_only_patch(patch_str: str) -> str: | ||
| """Drop file-header lines ('diff --git', 'index', '---', '+++') that precede | ||
| the first '@@' hunk header. | ||
|
|
||
| Platform providers (GitHub, GitLab, ...) store hunk-only patches, and the | ||
| shared hunk/line-number converter treats any '+'/'-' line as content. Left | ||
| in, the '---'/'+++' headers would be emitted as a bogus leading hunk with | ||
| invalid line numbers. Returns "" when there is no hunk (e.g. rename-only).""" | ||
| lines = patch_str.splitlines(keepends=True) | ||
| for i, line in enumerate(lines): | ||
| if line.startswith("@@"): | ||
| return "".join(lines[i:]) | ||
| return "" | ||
|
|
||
|
|
||
| def parse_unified_diff(diff_text: str) -> list[FilePatchInfo]: | ||
| """Parse a unified diff into FilePatchInfo objects (patch + metadata only). | ||
|
|
||
| base_file / head_file are left empty here; the provider fills them from the | ||
| working tree and by reverse-applying the patch. Binary files are skipped. | ||
| """ | ||
| patch_set = PatchSet(diff_text) | ||
| files: list[FilePatchInfo] = [] | ||
| for pf in patch_set: | ||
| if pf.is_binary_file: | ||
| get_logger().info(f"Skipping binary file in diff: {pf.path}") | ||
| continue | ||
| if pf.is_added_file: | ||
| edit_type = EDIT_TYPE.ADDED | ||
| elif pf.is_removed_file: | ||
| edit_type = EDIT_TYPE.DELETED | ||
| elif pf.is_rename: | ||
| edit_type = EDIT_TYPE.RENAMED | ||
| else: | ||
| edit_type = EDIT_TYPE.MODIFIED | ||
|
|
||
| if pf.is_removed_file: # target is /dev/null: use source path as the name | ||
| filename = _strip_prefix(pf.source_file) | ||
| else: | ||
| filename = _strip_prefix(pf.target_file) | ||
| old_filename = _strip_prefix(pf.source_file) if pf.is_rename else None | ||
|
|
||
| files.append( | ||
| FilePatchInfo( | ||
| base_file="", | ||
| head_file="", | ||
| patch=str(pf), | ||
| filename=filename, | ||
| edit_type=edit_type, | ||
| old_filename=old_filename, | ||
| ) | ||
|
qodo-free-for-open-source-projects[bot] marked this conversation as resolved.
|
||
| ) | ||
| return files | ||
|
|
||
|
|
||
| def reconstruct_base_file(head_file_str: str, patch_str: str) -> str: | ||
| """Reverse-apply a single-file unified diff to head (new) content to recover | ||
| base (original) content. Returns "" if the patch does not cleanly apply.""" | ||
| try: | ||
| patch_set = PatchSet(patch_str) | ||
| except UnidiffParseError as e: | ||
| get_logger().info(f"Could not parse patch for base reconstruction: {e}") | ||
| return "" | ||
|
qodo-free-for-open-source-projects[bot] marked this conversation as resolved.
|
||
| if len(patch_set) != 1: | ||
| return "" | ||
|
|
||
| head_lines = head_file_str.splitlines() | ||
| base_lines: list[str] = [] | ||
| head_idx = 0 # 0-based cursor into head_lines | ||
|
|
||
| for hunk in patch_set[0]: | ||
| hunk_head_start = hunk.target_start - 1 # 1-based -> 0-based | ||
| if hunk_head_start < head_idx or hunk_head_start > len(head_lines): | ||
| return "" # out-of-order / out-of-bounds hunk | ||
| base_lines.extend(head_lines[head_idx:hunk_head_start]) | ||
| head_idx = hunk_head_start | ||
|
|
||
| for line in hunk: | ||
| value = line.value.rstrip("\r\n") | ||
| if line.is_context: | ||
| if head_idx >= len(head_lines) or head_lines[head_idx] != value: | ||
| return "" | ||
| base_lines.append(head_lines[head_idx]) | ||
| head_idx += 1 | ||
| elif line.is_added: # present only in head: verify + skip | ||
| if head_idx >= len(head_lines) or head_lines[head_idx] != value: | ||
| return "" | ||
| head_idx += 1 | ||
| elif line.is_removed: # present only in base: emit, don't consume head | ||
| base_lines.append(value) | ||
|
|
||
| base_lines.extend(head_lines[head_idx:]) | ||
| result = "\n".join(base_lines) | ||
| # Preserve a trailing newline only when the base actually has content; an | ||
| # empty base (e.g. reversing an add-file patch) must stay "" so downstream | ||
| # extend_patch() correctly treats it as having no original file. | ||
| if base_lines and head_file_str.endswith("\n"): | ||
| result += "\n" | ||
|
qodo-free-for-open-source-projects[bot] marked this conversation as resolved.
|
||
| return result | ||
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.