Summary
When a scanned skill's content (or LLM output quoting it) contains ANSI escape
sequences or control bytes (e.g. NUL \x00, ESC \x1b), those bytes flow
verbatim into finding text and then into the generated report. The result is a
report file that tools treat as binary:
- GitLab, GitHub, and editors detect the `.md` file as binary and offer
  "download" instead of rendering it (`file report.md` → `data`, not text).
- Terminal output is garbled by stray escape sequences.
Repro
Scan any skill whose content includes a terminal-colored snippet or a stray
control byte (real-world: build/log-analysis skills that embed ANSI codes), then
write a Markdown report:
    skillspector scan ./skill/ --format markdown --output report.md
    file report.md   # -> "data" (binary), should be UTF-8 text
The report won't render inline in GitLab/GitHub; you only get a download link.
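To see exactly which bytes are responsible, independent of how any particular tool classifies the file, a small check along these lines can count them. This is only a sketch: `find_control_bytes` is a hypothetical helper, not part of skillspector, and the allowed set (tab, LF, CR) is an assumption for illustration.

```python
from collections import Counter

# Assumption: tab, LF and CR are the only C0 control bytes a text report
# should contain. Everything else below 0x20, plus DEL, is reported.
ALLOWED = {0x09, 0x0A, 0x0D}


def find_control_bytes(data: bytes) -> Counter:
    """Count disallowed control bytes in raw report content."""
    return Counter(
        b for b in data if (b < 0x20 and b not in ALLOWED) or b == 0x7F
    )


# Stand-in for the bytes of a generated report.md: a colored log line
# followed by a stray NUL.
sample = "## Finding\n\x1b[31mERROR\x1b[0m build failed\x00\n".encode("utf-8")
hits = find_control_bytes(sample)
for byte, count in sorted(hits.items()):
    print(f"0x{byte:02x}: {count}")
```

Against a real report, the same function can be fed `Path("report.md").read_bytes()`; an empty result means no disallowed control bytes were found.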
Expected
Reports should always be clean UTF-8 text, free of ANSI escape sequences and
disallowed control characters, in every output format. This is distinct from
#144 (skipping binary input files): here the problem is control bytes in the
output report, which can originate from any quoted snippet, even in a text file.
Proposed fix
Strip ANSI escape sequences and disallowed control chars (keeping tab/newline and
multibyte UTF-8) from finding text in the report node, so every format
(terminal/JSON/Markdown/SARIF) stays clean. PR incoming.
## 2. PR is ready to open (branch pushed: `assinchu:feature/report-sanitizer`)
Once the issue has a number (say **#NNN**), I'll open the PR with this body:
```markdown
## Summary
Fixes #NNN. Scanned content (and LLM output quoting it) can carry ANSI escape
sequences and control bytes (NUL, ESC, ...) into finding text. Emitted verbatim,
these make a report register as **binary** — GitLab/editors offer "download"
instead of rendering the Markdown, and terminals print garbled output.
## Change
Sanitize every finding's free-text fields once in the **report node** (the single
scoring/formatting point), so terminal, JSON, Markdown, and SARIF output all stay
clean UTF-8. Tabs, newlines, and multibyte UTF-8 (e.g. emoji severity markers)
are preserved. Non-text fields and counts are untouched.
- `_clean_text` / `_sanitize_finding` + `_ANSI_RE` / `_CONTROL_RE` in `report.py`
- Applied to `filtered_findings` at the top of `report()` before scoring/format
## Tests
New `tests/nodes/test_report_sanitizer.py`: unit tests for `_clean_text` /
`_sanitize_finding`, plus a parametrized check that no `\x00`/`\x1b` leaks into
any of the four output formats while readable content survives. Full suite green;
`ruff check` and `ruff format` are clean.
This is distinct from #144 (binary *input* files); it sanitizes the *output*.