Skip to content

Addressed review suggestions in issue #1178#1266

Merged
VisLab merged 5 commits into
hed-standard:mainfrom
VisLab:fix_extras
Mar 13, 2026
Merged

Addressed review suggestions in issue #1178#1266
VisLab merged 5 commits into
hed-standard:mainfrom
VisLab:fix_extras

Conversation

@VisLab

@VisLab VisLab commented Mar 13, 2026

Copy link
Copy Markdown
Member

@copilot @claude Evaluate whether this PR addresses issues raised in #1178 and whether other improvements should be made.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses issue #1178 by tightening validation and handling of hedId values across schema validation and TSV/dataframe IO, and adding/adjusting regression tests around these behaviors.

Changes:

  • Added a new schema compliance check to detect duplicate hedId values across all schema sections and extended the existing SCHEMA_HED_ID_INVALID message to report duplicates.
  • Fixed multiple hedId-related edge cases in dataframe/schema synchronization (skip extras correctly, guard out-of-range checks when the valid-range set is empty, and ensure ID assignment mutates the original DataFrame).
  • Updated and expanded unit tests to cover invalid hedId formats, empty/unknown library ranges, dataframe mutation, and mismatch detection.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
tests/schema/test_schema_validator_hed_id.py Strengthens HedIDValidator.verify_tag_id assertions and adds invalid-format coverage.
tests/schema/test_schema_compliance.py Updates expected compliance check list to include duplicate_hed_ids.
tests/schema/test_hed_id_util.py Adds regression tests for empty-range handling, dataframe mutation, and mismatch detection.
hed/schema/schema_validation/hed_id_validator.py Tweaks “nothing to verify” condition for hedId validation.
hed/schema/schema_validation/compliance.py Adds check_duplicate_hed_ids() and wires it into compliance flow.
hed/schema/schema_io/hed_id_util.py Fixes extras skipping, guards range checks on empty sets, and corrects dataframe mutation during ID assignment.
hed/schema/schema_io/df_util.py Stops silently swallowing OS errors when loading TSVs; only ignores missing files.
hed/errors/schema_error_messages.py Extends SCHEMA_HED_ID_INVALID message formatting to support duplicate-id reporting.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread hed/schema/schema_validation/hed_id_validator.py Outdated
Comment thread hed/schema/schema_validation/compliance.py
Comment thread tests/schema/test_schema_compliance.py Outdated
Comment thread tests/schema/test_hed_id_util.py Outdated
@github-actions

Copy link
Copy Markdown

PR Review — fix_extras (#1266)

This PR addresses three pre-existing bugs in schema hedId handling and adds a new duplicate-ID compliance check. Overall the changes are correct and well-tested.


Critical bug fixed: DF_SUFFIXESDF_EXTRAS (hed_id_util.py:93)

The most important change. DF_SUFFIXES contains all known section keys, so the original guard if df_key in constants.DF_SUFFIXES: continue caused update_dataframes_from_schema to skip every content section — making the entire hedId verification loop a no-op. No mismatch was ever detected, and the existing test masked this by asserting on a dead except branch. Changing the guard to DF_EXTRAS (which holds only Prefixes/ExternalAnnotations/Sources) correctly limits the skip to extras sections only.

pandas mutation bug fixed: df.at[_row_number, ...] (hed_id_util.py:216)

df.iterrows() yields copies of rows; assigning row[col] = value mutates the copy, not the DataFrame. Using df.at[_row_number, col] writes directly to the DataFrame. test_assign_actually_mutates_df verifies the fix.

Potential TypeError fixed: not new_id guard (hed_id_validator.py:99)

When new_id = "" (entry has no hedId) and the old code reached new_id < starting_id, Python 3 would raise TypeError: '<' not supported between instances of 'str' and 'int'. The not new_id guard returns early before the range check is reached.

New check: check_duplicate_hed_ids (compliance.py)

Correctly traverses all sections and accumulates seen IDs in a dict. One minor note: seen_ids maps hedId → entry.name without recording the section, so the error message for a cross-section collision (e.g. a Tag and a Unit sharing an ID) will show only the tag name, not its section. See inline comment for detail.

Test fixes

The original test_verify_tag_id used self.assertTrue("It has changed", issues[0]["message"])assertTrue(expr, msg) with a non-empty string as the first arg is always truthy; the test was not actually checking message content. The split into proper assertIn assertions is a genuine correctness fix.


Summary: The PR is ready to merge. One suggestion on the cross-section error message, otherwise no blocking issues.

Comment thread hed/schema/schema_io/hed_id_util.py
Comment thread hed/schema/schema_validation/compliance.py Outdated
Comment thread hed/schema/schema_validation/hed_id_validator.py
@github-actions

Copy link
Copy Markdown

PR Review: Fix extras and HED ID validation improvements

This PR fixes three real bugs and adds a new duplicate-hedId compliance check. Overall the changes are correct and well-structured. One important gap and one minor issue are noted below.

What this PR does well:

  • DF_SUFFIXES to DF_EXTRAS (hed_id_util.py line 93): Most impactful fix. The original loop used DF_SUFFIXES (all section keys), so validation was silently a no-op. Changing to DF_EXTRAS makes validation run on TAG, UNIT, etc. sections.
  • df.at in assign_hed_ids_section: Correct fix. Writing to the row copy from iterrows() never persisted back; df.at[_row_number, ...] does.
  • Empty-range guard in _verify_hedid_matches: The guard correctly avoids the min()/max() crash on an empty set.
  • _get_old_id_int refactor: Clean extraction that eliminates duplicated error-handling logic and also fixes the silent old_id type-mutation bug in the original code.
  • verify_tag_id fix: Changing 'if new_id is None' to 'if not new_id_str' correctly handles the empty-string case (the original guard was dead code since new_id started as empty string, never None).
  • df_util.py OSError split: Distinguishing FileNotFoundError (expected, silent) from other OSError (unexpected, raise) is correct.

Important:
check_duplicate_hed_ids has no positive test. The only new test for this compliance check verifies it appears in the summary list (test_has_all_checks). There is no test that creates a schema with deliberately duplicated hedId values and asserts the check fires with the correct error. Without a positive test, a regression here would be silent.

Suggestion:
In check_duplicate_hed_ids, str(section_key) on a plain Enum member renders as 'HedSectionKey.Tags', not 'Tags'. The error message would read: 'in the HedSectionKey.Tags section'. Using section_key.value (produces 'tags') or section_key.name (produces 'Tags') would be more readable in user-facing output.

Comment thread tests/schema/test_schema_compliance.py
Comment thread hed/schema/schema_validation/compliance.py Outdated
@github-actions

Copy link
Copy Markdown

PR Review summary for #1266

Overall: the PR is well-reasoned and addresses the issues from #1178 with no blocking problems. All five bug fixes are correct and the new check_duplicate_hed_ids compliance check is well-implemented.

Bug fixes verified:

  1. Dead-code verification loop (important). update_dataframes_from_schema guarded with DF_SUFFIXES, which contains every known df key, so _verify_hedid_matches was never called. Switching to DF_EXTRAS (the three extras-only keys) makes hedId mismatch detection actually execute. The previous test silently passed because no HedFileError was raised; the replacement test_conflict_detected using assertRaises correctly catches this.

  2. DataFrame write-back in assign_hed_ids_section (important). row[constants.hed_id] modified a pandas row copy rather than the original DataFrame. Using df.at[_row_number, ...] writes through to the original. The new test_assign_actually_mutates_df verifies this directly.

  3. Empty-set crash in _verify_hedid_matches (correctness). min(unused_tag_ids) and max(unused_tag_ids) raised ValueError on an empty set for unknown libraries. The added guard fixes this cleanly.

  4. Latent dead early-exit in verify_tag_id (correctness). new_id started as an empty string, so the condition new_id is None was always False and the early return never fired. The refactored new_id_str / new_id split makes the guard correct.

  5. FileNotFoundError vs OSError in df_util.py (correctness). Missing files are valid for partial schemas and are now handled with pass; other OS errors are properly re-raised as HedFileError.

Minor issue: In schema_error_messages.py the new duplicate-id error branch has a double space before Each hedId must be unique. The other messages in the same function use a single space after the period. Not a blocker but worth a one-character fix.

if duplicate_tag:
section_info = f" in the '{duplicate_tag_section}' section" if duplicate_tag_section else ""
return (
f"Tag '{tag}' has hedId '{new_id}' which is already used by '{duplicate_tag}'{section_info}. "

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: double space before "Each hedId" — should be a single space to match the style of the other messages in this function.

Suggested change
f"Tag '{tag}' has hedId '{new_id}' which is already used by '{duplicate_tag}'{section_info}. "
f"Tag '{tag}' has hedId '{new_id}' which is already used by '{duplicate_tag}'{section_info}. "

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix on next PR.

@VisLab VisLab merged commit 363b544 into hed-standard:main Mar 13, 2026
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants