Feat/add validations #106

Merged
antonio-olleros merged 80 commits into main from feat/add-validations
Mar 18, 2026

Conversation

@antonio-olleros

Collaborator

Description

Add a comprehensive validation engine to xbridge that checks XBRL instance files (both XML and CSV formats) against 90+ structural and EBA regulatory rules. This includes a standalone
validation API, CLI integration, a validate-convert-validate pipeline, and full test coverage. The version is bumped to 2.0.0.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Dependency update
  • Other (please describe):

Related Issues

Closes #
Related #

Changes Made

  • Validation engine: New xbridge.validation module with registry, context, engine, and models — rule-based architecture that discovers and executes validation functions by code.
  • XML validation rules (30+): Well-formedness (XML-001..003), schemaRef (XML-010/012), filing indicators (XML-020..026), context structure (XML-030..035), fact structure
    (XML-040..043), unit UTR reference (XML-050), document-level checks (XML-060..069), and taxonomy conformance (XML-070..072).
  • CSV validation rules (30+): Report package structure (CSV-001..006), report.json metadata (CSV-010..016), parameters.csv (CSV-020..026), FilingIndicators.csv (CSV-030..035),
    data table checks (CSV-040..049), fact-level checks (CSV-050..052), and taxonomy conformance (CSV-060..062).
  • EBA-specific rules (25+): Entity identifier (EBA-ENTITY), currency (EBA-CUR), units (EBA-UNIT), decimals accuracy (EBA-DEC), guidance compliance (EBA-GUIDE), file naming
    conventions (EBA-NAME), and supplementary regulatory checks (EBA-2.x).
  • CLI integration: New validate subcommand with --eba, --post-conversion, and --json flags. New --validate and --eba flags on the convert command for the
    validate-convert-validate pipeline.
  • Converter fixes: Updated reportPackage.json and report.json to use final XBRL specification URLs (passes CSV-003 and CSV-011).
  • EBA Taxonomy 4.2.1: Added finrep9dp module support.
  • Documentation: New docs/validation.rst and docs/validation_rules.rst with API reference, usage examples, and complete rule catalog aligned to EBA Filing Rules v5.8.

Testing

Tests Added

  • Unit tests
  • Integration tests
  • Test coverage maintained or improved

Testing Performed

Tests cover all validation rule modules, the engine, registry, models, context, API, and the validate-convert-validate pipeline.

pytest tests/                                             

Test results:                                                                                                                                                                          
- All existing tests pass
- New tests pass                                                                                                                                                                       
- Manual testing performed                                

Documentation                                                                                                                                                                          

- Updated docstrings                                                                                                                                                                   
- Updated README.md                                       
- Updated documentation in docs/                                                                                                                                                       
- Updated CHANGELOG.md (added entry under "Unreleased")                                                                                                                                
- No documentation needed for this change                                                                                                                                              
                                                                                                                                                                                       
Code Quality                                                                                                                                                                           
                                                                                                                                                                                       
- Code follows the project's style guidelines (Ruff)                                                                                                                                   
- Ran ruff check and ruff format
- Ran mypy type checking                                                                                                                                                               
- Self-review of code completed                                                                                                                                                        
- Comments added for complex/non-obvious code                                                                                                                                          
- No new warnings generated                                                                                                                                                            
                                                                                                                                                                                       
Breaking Changes                                                                                                                                                                       
                                                                                                                                                                                       
Impact:                                                                                                                                                                                
- Converter output URLs updated to final XBRL spec URLs — regenerated CSV packages will differ from previous output.                                                                              
                                                                                                                                                                                       
Migration guide:                                                                                                                                                                       
- Re-run conversions if downstream tooling compares output ZIPs byte-for-byte.                                                                                                         
                                                                              
Screenshots (if applicable)                                                                                                                                                            
                                                                                                                                                                                       
N/A                                                                                                                                                                                    
                                                                                                                                                                                       
Checklist                                                 

- My code follows the project's code style                                                                                                                                             
- I have performed a self-review of my code
- I have commented my code, particularly in hard-to-understand areas                                                                                                                   
- I have made corresponding changes to the documentation  
- My changes generate no new warnings                                                                                                                                                  
- I have added tests that prove my fix is effective or that my feature works
- New and existing unit tests pass locally with my changes                                                                                                                             
- Any dependent changes have been merged and published
- I have updated the CHANGELOG.md                                                                                                                                                      
                                                                                                                                                                                       
Additional Notes                                                                                                                                                                       
                                                                                                                                                                                       
This PR contains 78 commits spanning the full validation feature build-out, from initial architecture through all rule implementations, performance optimizations, and release         
candidates up to v2.0.0. Rules are aligned with EBA Filing Rules v5.8.
                                                                                                                                                                                       
Reviewer Notes                                                                                                                                                                         

Areas to focus on:                                                                                                                                                                     
- Validation engine architecture (src/xbridge/validation/_engine.py, _registry.py, _models.py)                                                                                                                                        
- CSV and XML rule correctness, especially taxonomy conformance checks
                                                                                                                                                                                       
Questions for reviewers:                                                                                                                                                               

antonio-olleros and others added 30 commits February 5, 2026 20:45
- Add explicit CSV guard to the post_conversion filter in §2.1 decision
  diagram, making it consistent with the §1.2 prose that states
  post_conversion has no effect for .xbrl files.
- Replace imprecise claim in §2.2 that "sections 2.8–2.14 are executed"
  with accurate description: only rules marked Post-conv. = Yes survive
  (sections 2.11–2.14), while entity (§2.8), decimals (§2.9), and
  currency (§2.10) are also skipped.

Closes #62, Closes #63

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Define the technical architecture for the xbridge validation module:
- Rule registry schema (registry.json) with format-specific overrides
- Module structure with 22 rule implementation files
- Core components: models, registry, engine, context
- Rule selection logic matching the specification decision diagram
- Public API (validate function) and integration points
- Full rule coverage summary (98 unique rules)

Companion to validation_specification.md and validations_enumeration.md.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Revised version and date to reflect the latest draft.
- Enhanced rule attributes section to clarify execution conditions.
- Organized rules into XML Instance and CSV Report Package categories.
- Updated descriptions and added EBA references for various rules.
- Improved clarity and consistency in rule formatting and structure.
Implement the core data classes for the validation module:
- Severity enum (ERROR, WARNING, INFO)
- RuleDefinition with format-specific severity/message overrides
- ValidationResult matching specification §1.5

Includes 17 passing tests, ruff clean, mypy strict clean.

Closes #65

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implement the registry module linking JSON rule definitions to Python
implementation functions via a decorator pattern:
- load_registry() reads and parses registry.json
- @rule_impl decorator registers implementation functions
- get_rule_impl() resolves implementations with format-specific priority
- Initial registry.json with XML-001 entry

Includes 12 passing tests, ruff clean, mypy strict clean.

Closes #66

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
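A minimal sketch of the decorator-based registration described in the commit above. The names `rule_impl` and `get_rule_impl` come from the commit message; their signatures, the `_IMPLS` mapping, the `fmt` parameter, and the `DEMO-001` rule code are assumptions made for illustration and may differ from the actual xbridge code.

```python
from typing import Callable, Dict, Optional, Tuple

RuleFn = Callable[[object], None]

# Hypothetical store keyed by (rule code, format); format None means "any format".
_IMPLS: Dict[Tuple[str, Optional[str]], RuleFn] = {}


def rule_impl(code: str, fmt: Optional[str] = None) -> Callable[[RuleFn], RuleFn]:
    """Register the decorated function as the implementation of ``code``."""

    def decorator(fn: RuleFn) -> RuleFn:
        key = (code, fmt)
        if key in _IMPLS:
            raise ValueError(f"duplicate implementation for {key}")
        _IMPLS[key] = fn
        return fn

    return decorator


def get_rule_impl(code: str, fmt: str) -> Optional[RuleFn]:
    """Prefer a format-specific implementation, then fall back to the generic one."""
    return _IMPLS.get((code, fmt)) or _IMPLS.get((code, None))


@rule_impl("DEMO-001")
def check_generic(ctx: object) -> None:
    """Placeholder body; a real rule would inspect ctx and record findings."""


@rule_impl("DEMO-001", fmt="csv")
def check_csv(ctx: object) -> None:
    """Placeholder format-specific override."""
```

Raising on a duplicate key is one way to surface the double-registration problem that a later commit fixes in the test setup.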
Implement the context object passed to every rule implementation:
- Carries rule_set, rule_definition, file_path, raw_bytes, and
  parsed instances (xml_instance, csv_instance, module)
- add_finding() renders message templates with format_map and
  gracefully handles missing placeholders
- Respects format-specific severity/message overrides from registry

Includes 10 passing tests, ruff clean, mypy strict clean.

Closes #67

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
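One common way to render a template with `format_map` while tolerating missing placeholders is a `dict` subclass that defines `__missing__`. The sketch below shows that technique only; `_SafeDict` and `render_message` are hypothetical names, and the real `add_finding()` may handle missing keys differently.

```python
class _SafeDict(dict):
    """Leave unknown placeholders in place instead of raising KeyError."""

    def __missing__(self, key: str) -> str:
        return "{" + key + "}"


def render_message(template: str, **values: object) -> str:
    """Fill the placeholders that have values and keep the rest verbatim."""
    return template.format_map(_SafeDict(values))


print(render_message("Fact {fact} uses unit {unit}", fact="f1"))
```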
Implement the orchestration layer tying registry, context, and rules:
- select_rules() filters by format, EBA flag, and post-conversion
- run_validation() detects format, loads registry, parses input,
  resolves taxonomy module, and executes rule implementations
- Graceful handling of parse failures and missing implementations

Includes 16 passing tests, ruff clean, mypy strict clean.

Closes #68

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
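The filtering performed by `select_rules()` could look roughly like the sketch below. The `RuleDef` fields and the function signature are assumptions; the only behaviour taken from this PR is that rules are filtered by format, EBA flag, and post-conversion, and that post-conversion has no effect for XML input.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class RuleDef:
    """Hypothetical, reduced view of a registry entry."""

    code: str
    formats: FrozenSet[str]   # formats the rule applies to, e.g. {"xml", "csv"}
    eba: bool                 # True if the rule only runs with the EBA flag
    post_conversion: bool     # True if the rule survives post-conversion mode


def select_rules(
    rules: Iterable[RuleDef], fmt: str, eba: bool = False, post_conversion: bool = False
) -> List[RuleDef]:
    selected = []
    for rule in rules:
        if fmt not in rule.formats:
            continue
        if rule.eba and not eba:
            continue
        # Post-conversion filtering is guarded so it only applies to CSV input.
        if post_conversion and fmt == "csv" and not rule.post_conversion:
            continue
        selected.append(rule)
    return selected
```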
- Add XML-001 well-formedness check (xml_wellformedness.py)
- Add `xbridge validate` CLI subcommand with --eba, --post-conversion, --json flags
- Update CLI documentation with validate command usage and examples
- Fix test setup_method pattern to avoid double-registration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add XML-002 rule that verifies the XML declaration encoding is UTF-8
(case-insensitive). Files without an explicit encoding attribute pass
since UTF-8 is the XML default. This is an EBA-only rule (ref §1.4).

Closes #70

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
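The XML-002 behaviour described above (case-insensitive UTF-8 match, pass when no encoding is declared) can be illustrated with a small check on the raw bytes. This is a sketch only: the regular expression and the function names are assumptions, not the code in the rule module.

```python
import re
from typing import Optional

# Matches the encoding pseudo-attribute inside a leading XML declaration only.
_DECL_RE = re.compile(rb'^\s*<\?xml\s+[^>]*?encoding\s*=\s*["\']([^"\']+)["\']')
_UTF8_BOM = b"\xef\xbb\xbf"


def declared_encoding(raw: bytes) -> Optional[str]:
    """Return the encoding named in the XML declaration, or None if absent."""
    if raw.startswith(_UTF8_BOM):
        raw = raw[len(_UTF8_BOM):]
    match = _DECL_RE.match(raw)
    return match.group(1).decode("ascii", "replace") if match else None


def encoding_is_utf8(raw: bytes) -> bool:
    """Pass when the encoding is UTF-8 (any case) or not declared at all."""
    encoding = declared_encoding(raw)
    return encoding is None or encoding.lower() == "utf-8"
```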
Add XML-003 rule that verifies the root element is
{http://www.xbrl.org/2003/instance}xbrl. Skips silently on malformed
XML (XML-001 handles that). Non-EBA rule, always runs.

Closes #71

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
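A rough illustration of the root element check follows. It uses the standard library `xml.etree.ElementTree` to stay self-contained, whereas the PR works with lxml; `root_is_xbrl` is a hypothetical helper, and returning `None` stands in for "skip silently on malformed XML".

```python
import xml.etree.ElementTree as ET
from typing import Optional

XBRL_ROOT = "{http://www.xbrl.org/2003/instance}xbrl"


def root_is_xbrl(raw: bytes) -> Optional[bool]:
    """True/False for a parsable document, None when the XML is malformed."""
    try:
        root = ET.fromstring(raw)
    except ET.ParseError:
        return None  # well-formedness is reported by a separate rule
    return root.tag == XBRL_ROOT
```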
XML-010: Exactly one link:schemaRef element MUST be present.
XML-012: The schemaRef MUST resolve to a known entry point URL.

Closes #72

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ng indicator structural checks

XML-020: At least one find:fIndicators element MUST be present.
XML-021: At least one filing indicator MUST exist.
XML-025: No duplicate filing indicators.
XML-026: Filing indicator contexts MUST NOT contain segment or scenario.

Closes #73

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Validates that filing indicator codes match known table codes from the
module JSON. Uses ctx.module.tables to build the set of valid codes,
avoiding direct taxonomy access.

Closes #74

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add six EBA context validation rules:
- XML-030: period dates must be xs:date (no dateTime/timezone)
- XML-031: all periods must be instants (not durations)
- XML-032: all periods must share the same reference date
- XML-033: all entity identifiers must be identical across contexts
- XML-034: xbrli:segment must not be used
- XML-035: xbrli:scenario children must be dimension members only

Performance: reuses already-parsed lxml tree from ctx.xml_instance.root
when available, avoiding redundant XML parsing across all six rules.

Closes #75

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move XML parsing into the engine: xml_root is computed once and passed
to every ValidationContext via a new xml_root attribute. All rule modules
(xml_wellformedness, xml_root_element, xml_schema_ref,
xml_filing_indicators, xml_context) now use ctx.xml_root instead of
calling etree.fromstring() independently.

Before: XML parsed N times (once per rule that needs the tree).
After:  XML parsed exactly once in run_validation(), shared across all rules.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add four fact validation rules:
- XML-040: @precision must not be used (use @decimals) [EBA]
- XML-041: @decimals value must be valid integer or "INF" [non-EBA]
- XML-042: @xsi:nil must not be used on facts [EBA]
- XML-043: string-type facts must not be empty [EBA]

Facts are identified as direct root children not in infrastructure
namespaces (xbrli, link, find) using a frozenset for O(1) lookup.
All rules use ctx.xml_root (single-parse architecture).

Closes #76

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
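The fact identification described above (direct root children outside the infrastructure namespaces) can be sketched as follows. The sketch uses the standard library parser instead of lxml; the namespace URI given for `find` and the helper names are assumptions, and the decimals check accepts an optionally signed integer or the literal "INF".

```python
import re
import xml.etree.ElementTree as ET
from typing import Iterator

# Infrastructure namespaces; the "find" URI is assumed, not taken from the PR.
_INFRA_NS = frozenset({
    "http://www.xbrl.org/2003/instance",                      # xbrli
    "http://www.xbrl.org/2003/linkbase",                      # link
    "http://www.eurofiling.info/xbrl/ext/filing-indicators",  # find
})


def namespace_of(tag: str) -> str:
    """Extract the namespace URI from an ElementTree '{uri}local' tag."""
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""


def iter_facts(root: ET.Element) -> Iterator[ET.Element]:
    """Yield direct children of the root that are not infrastructure elements."""
    for child in root:
        if namespace_of(child.tag) not in _INFRA_NS:
            yield child


def decimals_is_valid(value: str) -> bool:
    """Accept 'INF' or an (optionally signed) integer."""
    return value == "INF" or re.fullmatch(r"[+-]?\d+", value) is not None
```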
Single-pass scan architecture: one iteration over every element
collects prohibited elements/attributes (XML-060..064), contextRef/
unitRef inventories (XML-066, 068), and context/unit elements for
duplicate detection (XML-067, 069). Result is cached per root so
all 10 rules reuse the same scan.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
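A simplified picture of the single-pass scan: one walk over the tree fills every inventory that later rules need. The `ScanResult` fields and the `scan` function are hypothetical, the standard library parser replaces lxml, and counting ids is only a stand-in for whatever duplicate detection the real rules perform.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from dataclasses import dataclass, field
from typing import Set

XBRLI = "{http://www.xbrl.org/2003/instance}"


@dataclass
class ScanResult:
    context_ids: Counter = field(default_factory=Counter)
    unit_ids: Counter = field(default_factory=Counter)
    context_refs: Set[str] = field(default_factory=set)
    unit_refs: Set[str] = field(default_factory=set)


def scan(root: ET.Element) -> ScanResult:
    """Collect context/unit definitions and references in one iteration."""
    result = ScanResult()
    for el in root.iter():
        if el.tag == XBRLI + "context":
            result.context_ids[el.get("id", "")] += 1
        elif el.tag == XBRLI + "unit":
            result.unit_ids[el.get("id", "")] += 1
        if "contextRef" in el.attrib:
            result.context_refs.add(el.attrib["contextRef"])
        if "unitRef" in el.attrib:
            result.unit_refs.add(el.attrib["unitRef"])
    return result
```

With such a result cached per root, a rule about unreferenced contexts reduces to a set difference between the defined ids and `context_refs`.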
…hecks

XML-070: fact concepts must be defined in the module taxonomy
XML-071: explicit dimension QNames must be defined in the taxonomy
XML-072: dimension member values must be valid for their dimension

Uses cached taxonomy extraction from Module and single-pass XML scan.
Open key dimensions skip member validation. Supports both datapoints
and headers architecture.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Variable.from_dict strips namespace prefixes from dimension keys
(e.g., "eba_dim:BAS" → "BAS"). The taxonomy extraction now handles
both prefixed and bare localname keys, matching dimensions by
localname only. Member values retain their prefix and are still
resolved via namespace URI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
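Matching dimensions by localname, as described above, comes down to normalising both prefixed and bare keys to the same form. A minimal sketch, with `local_name` as a hypothetical helper rather than the function used in the taxonomy extraction:

```python
def local_name(key: str) -> str:
    """Return the part after the last colon, or the key itself if unprefixed."""
    return key.rsplit(":", 1)[-1]


# Prefixed, bare, and versioned-prefix keys all normalise to the same localname.
for key in ("eba_dim:BAS", "BAS", "eba_dim_4.0:BAS"):
    print(key, "->", local_name(key))
```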
Regression tests that construct variables through Variable.from_dict
(the real production deserialization path) to ensure dimension key
prefix stripping does not break taxonomy validation.

Covers: concept resolution, dimension matching with stripped prefixes,
member validation, and versioned prefixes (e.g., eba_dim_4.0:BAS).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add *.skill to .gitignore
- Remove unused import and sort imports in test_validation_engine.py
- Add submission package naming rules (section 3) to validations spec

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…dentifier checks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…urrency checks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… unit checks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… additional checks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
antonio-olleros and others added 21 commits March 2, 2026 12:35
Add EBA-NAME-071: the root folder inside a CSV report package ZIP must
match the ZIP filename stem. Also fixes _FRAMEWORK_VERSION_RE regex to
accept PILLAR3-style framework codes, and adds smoke tests proving
EBA-NAME-001..060 already work for CSV via format-agnostic dispatch.

Closes #102

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
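The EBA-NAME-071 condition (root folder inside the ZIP equals the ZIP filename stem) can be illustrated with the standard `zipfile` module. The function names and the example package name are assumptions, and the real rule may treat edge cases such as stray top-level files differently.

```python
import io
import zipfile
from pathlib import Path
from typing import Set


def top_level_names(zf: zipfile.ZipFile) -> Set[str]:
    """First path component of every archive member."""
    return {name.split("/", 1)[0] for name in zf.namelist()}


def root_matches_stem(zip_name: str, zf: zipfile.ZipFile) -> bool:
    """True when the archive has a single root folder named like the ZIP stem."""
    return top_level_names(zf) == {Path(zip_name).stem}


# Build a tiny in-memory package to exercise the check.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf_out:
    zf_out.writestr("example_package/META-INF/reportPackage.json", "{}")
    zf_out.writestr("example_package/reports/report.json", "{}")

with zipfile.ZipFile(buffer) as zf_in:
    print(root_matches_stem("example_package.zip", zf_in))
```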
Introduce a shared_cache dict on ValidationContext, created once per
run_validation() call and reused across all rules.  This eliminates
~100 redundant ZIP opens per file by caching parsed data (namelist,
report.json, parameters.csv, FilingIndicators.csv, data tables,
variable lookup, namespace map, zip root prefix).

Additional optimisations:
- Skip reading CSV ZIP bytes into memory (unused by CSV rules)
- Cache module index at module level
- Centralise duplicated _build_variable_lookup into _helpers.py
- Remove redundant second ZipFile.extractall in CsvInstance.parse()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
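The shared-cache idea can be reduced to a get-or-load helper over a plain dict that lives for one validation run. This sketch assumes hypothetical names (`cached`, `open_count`, the `"namelist"` key) and is not the code in `ValidationContext`.

```python
from typing import Any, Callable, Dict


def cached(cache: Dict[str, Any], key: str, loader: Callable[[], Any]) -> Any:
    """Return cache[key], calling loader() only the first time the key is seen."""
    if key not in cache:
        cache[key] = loader()
    return cache[key]


# One dict per run; every rule receives the same object.
shared_cache: Dict[str, Any] = {}
open_count = 0


def load_namelist() -> list:
    """Stand-in for an expensive operation such as opening the ZIP."""
    global open_count
    open_count += 1
    return ["reports/report.json", "reports/parameters.csv"]


for _ in range(100):  # e.g. 100 rules asking for the same data
    names = cached(shared_cache, "namelist", load_namelist)

print(open_count)
```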
Bump version to 2.0.0rc2. Adds full CSV-side validation (structural
rules CSV-001..CSV-062, EBA rules for entity, decimals, units,
currency, guidance, and naming). Includes shared-cache performance
optimisation eliminating ~100 redundant ZIP opens per file.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Non-monetary facts (pure unit, no unit) in a denomination context
are valid non-currency metrics (e.g. percentages, counts) and
should not be flagged by the currency-of-denomination rule.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add validation sections to README.rst covering CLI usage (xbridge validate
subcommand with --eba, --post-conversion, --json options) and Python API
(validate() function with dict-based return format examples).

Fix docs/validation.rst to accurately reflect the dict-based return format
of validate() instead of the outdated ValidationResult object-style API.

Update docs/index.rst What's New section from 1.5.x to 2.0.0rc2/rc1.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a combined pipeline that validates an XBRL-XML file before
converting, then validates the resulting CSV post-conversion.
Available as both CLI flags (--validate, --eba) and Python API
parameters (validate=, eba=), defaulting to off.

- Add ValidationError exception with results/path attributes
- Add validate/eba parameters to convert_instance()
- Add --validate and --eba flags to convert CLI command
- Update README.rst and docs/cli.rst with new options
- Add pipeline tests (9 tests covering all branches)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…'false', '1', or '0', correct CSV-025 to ensure that only reported datapoints are considered.
…R-era values in reportPackage.json and report.json, so converted files now pass CSV-003 and CSV-011 validation.
  Scenario.parse() crashed with IndexError on dimension attributes lacking
  a colon (e.g. dimension="qCAA"), which silently prevented XmlInstance
  from loading and caused all taxonomy-based validation rules (XML-070/
  071/072) to be skipped.

  - Fix split logic in Scenario.parse() to handle unprefixed dimensions
  - Add fallback module_ref extraction in the validation engine so
    taxonomy rules can still run when XmlInstance parsing fails
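The failure mode described above is typical of indexing the result of `split(":")` on a value with no colon. A sketch of a tolerant split follows; `split_qname` is a hypothetical helper and not necessarily how `Scenario.parse()` was fixed.

```python
from typing import Optional, Tuple


def split_qname(value: str) -> Tuple[Optional[str], str]:
    """Split 'prefix:local' into (prefix, local); prefix is None when absent."""
    prefix, sep, local = value.rpartition(":")
    return (prefix if sep else None), local


# The naive form fails on unprefixed values such as dimension="qCAA".
try:
    "qCAA".split(":")[1]
except IndexError:
    print("naive split raises IndexError")

print(split_qname("qCAA"))
print(split_qname("eba_dim:qCAA"))
```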
This commit introduces a new documentation file detailing the validation rules supported by xbridge. The rules are categorized into XML Instance Rules, CSV Report Package Rules, and Submission Package Naming Rules, each with unique identifiers, severity levels, and descriptions. The documentation also includes attributes controlling rule execution, input format detection, and a summary of rule coverage across formats. This enhancement aims to provide clear guidance for users on compliance and validation requirements.
Comment thread tests/test_eba_entity.py Fixed

@javihern98 javihern98 left a comment

Contributor

Looks good, thanks! 😊

@antonio-olleros antonio-olleros merged commit 539bfbd into main Mar 18, 2026
16 checks passed
@javihern98 javihern98 deleted the feat/add-validations branch June 1, 2026 14:13
3 participants