[BUG] Reports containing scanned content with ANSI/control bytes are emitted as binary (break Markdown rendering) #186

Description

@assinchu

Summary

When a scanned skill's content (or LLM output quoting it) contains ANSI escape
sequences or control bytes (e.g. NUL \x00, ESC \x1b), those bytes flow
verbatim into finding text and then into the generated report. The result is a
report file that tools treat as binary:

  • GitLab / GitHub / editors detect the .md as binary and offer
    "download" instead of rendering it (file report.md prints "data", not text).
  • Terminal output is garbled by stray escape sequences.

Repro

Scan any skill whose content includes a terminal-colored snippet or a stray
control byte (real-world: build/log-analysis skills that embed ANSI codes), then
write a Markdown report:

skillspector scan ./skill/ --format markdown --output report.md
file report.md      # -> "data" (binary), should be UTF-8 text

The report won't render inline in GitLab/GitHub; you only get a download link.

Expected

Reports should always be clean UTF-8. Note this is distinct from #144 (skipping
binary input files) — here the problem is control bytes in the output report,
which can originate from any quoted snippet even in a text file.
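One way to make "clean UTF-8" checkable is a small helper that mirrors what `file` is reacting to. This is only a sketch: `is_clean_utf8` is a hypothetical name, not part of skillspector, and treating tab, LF, and CR as the only allowed control characters is an assumption.

```python
def is_clean_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 and has no stray control chars.

    Assumption: tab, LF and CR are the only control characters a text
    report may legitimately contain.
    """
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    for ch in text:
        code = ord(ch)
        # C0 controls (below 0x20) other than \t, \n, \r, plus DEL (0x7f).
        if (code < 0x20 and ch not in "\t\n\r") or code == 0x7F:
            return False
    return True
```

Running it over the bytes of a generated report.md would flag the NUL and ESC cases described above, while multibyte characters such as emoji pass.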

Proposed fix

Strip ANSI escape sequences and disallowed control chars (keeping tab/newline and
multibyte UTF-8) from finding text in the report node, so every format
(terminal/JSON/Markdown/SARIF) stays clean. PR incoming.

## 2. PR is ready to open (branch pushed: `assinchu:feature/report-sanitizer`)

Once the issue has a number (say **#NNN**), I'll open the PR with this body:

```markdown
## Summary

Fixes #NNN. Scanned content (and LLM output quoting it) can carry ANSI escape
sequences and control bytes (NUL, ESC, ...) into finding text. Emitted verbatim,
these make a report register as **binary** — GitLab/editors offer "download"
instead of rendering the Markdown, and terminals print garbled output.

## Change

Sanitize every finding's free-text fields once in the **report node** (the single
scoring/formatting point), so terminal, JSON, Markdown, and SARIF output all stay
clean UTF-8. Tabs, newlines, and multibyte UTF-8 (e.g. emoji severity markers)
are preserved. Non-text fields and counts are untouched.

- `_clean_text` / `_sanitize_finding` + `_ANSI_RE` / `_CONTROL_RE` in `report.py`
- Applied to `filtered_findings` at the top of `report()` before scoring/format

## Tests

New `tests/nodes/test_report_sanitizer.py`: unit tests for `_clean_text` /
`_sanitize_finding`, plus a parametrized check that no `\x00`/`\x1b` leaks into
any of the four output formats while readable content survives. Full suite green;
`ruff check`/`format` clean.

This is distinct from #144 (binary *input* files); it sanitizes the *output*.
```