Addressed review suggestions in issue #1178 by VisLab · Pull Request #1266 · hed-standard/hed-python

VisLab · 2026-03-13T15:32:47Z

@copilot @claude Evaluate whether this PR addresses issues raised in #1178 and whether other improvements should be made.

Copilot

Pull request overview

This PR addresses issue #1178 by tightening validation and handling of hedId values across schema validation and TSV/dataframe IO, and adding/adjusting regression tests around these behaviors.

Changes:

Added a new schema compliance check to detect duplicate hedId values across all schema sections and extended the existing SCHEMA_HED_ID_INVALID message to report duplicates.
Fixed multiple hedId-related edge cases in dataframe/schema synchronization (skip extras correctly, guard out-of-range checks when the valid-range set is empty, and ensure ID assignment mutates the original DataFrame).
Updated and expanded unit tests to cover invalid hedId formats, empty/unknown library ranges, dataframe mutation, and mismatch detection.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
tests/schema/test_schema_validator_hed_id.py	Strengthens `HedIDValidator.verify_tag_id` assertions and adds invalid-format coverage.
tests/schema/test_schema_compliance.py	Updates expected compliance check list to include `duplicate_hed_ids`.
tests/schema/test_hed_id_util.py	Adds regression tests for empty-range handling, dataframe mutation, and mismatch detection.
hed/schema/schema_validation/hed_id_validator.py	Tweaks “nothing to verify” condition for `hedId` validation.
hed/schema/schema_validation/compliance.py	Adds `check_duplicate_hed_ids()` and wires it into compliance flow.
hed/schema/schema_io/hed_id_util.py	Fixes extras skipping, guards range checks on empty sets, and corrects dataframe mutation during ID assignment.
hed/schema/schema_io/df_util.py	Stops silently swallowing OS errors when loading TSVs; only ignores missing files.
hed/errors/schema_error_messages.py	Extends `SCHEMA_HED_ID_INVALID` message formatting to support duplicate-id reporting.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

github-actions · 2026-03-13T15:37:20Z

PR Review — fix_extras (#1266)

This PR addresses three pre-existing bugs in schema hedId handling and adds a new duplicate-ID compliance check. Overall the changes are correct and well-tested.

Critical bug fixed: `DF_SUFFIXES` → `DF_EXTRAS` (`hed_id_util.py:93`)

The most important change. DF_SUFFIXES contains all known section keys, so the original guard if df_key in constants.DF_SUFFIXES: continue caused update_dataframes_from_schema to skip every content section — making the entire hedId verification loop a no-op. No mismatch was ever detected, and the existing test masked this by asserting on a dead except branch. Changing the guard to DF_EXTRAS (which holds only Prefixes/ExternalAnnotations/Sources) correctly limits the skip to extras sections only.

pandas mutation bug fixed: `df.at[_row_number, ...]` (`hed_id_util.py:216`)

df.iterrows() yields copies of rows; assigning row[col] = value mutates the copy, not the DataFrame. Using df.at[_row_number, col] writes directly to the DataFrame. test_assign_actually_mutates_df verifies the fix.

Potential TypeError fixed: `not new_id` guard (`hed_id_validator.py:99`)

When new_id = "" (entry has no hedId) and the old code reached new_id < starting_id, Python 3 would raise TypeError: '<' not supported between instances of 'str' and 'int'. The not new_id guard returns early before the range check is reached.

New check: `check_duplicate_hed_ids` (`compliance.py`)

Correctly traverses all sections and accumulates seen IDs in a dict. One minor note: seen_ids maps hedId → entry.name without recording the section, so the error message for a cross-section collision (e.g. a Tag and a Unit sharing an ID) will show only the tag name, not its section. See inline comment for detail.

Test fixes

The original test_verify_tag_id used self.assertTrue("It has changed", issues[0]["message"]) — assertTrue(expr, msg) with a non-empty string as the first arg is always truthy; the test was not actually checking message content. The split into proper assertIn assertions is a genuine correctness fix.

Summary: The PR is ready to merge. One suggestion on the cross-section error message, otherwise no blocking issues.

github-actions · 2026-03-13T16:28:34Z

PR Review: Fix extras and HED ID validation improvements

This PR fixes three real bugs and adds a new duplicate-hedId compliance check. Overall the changes are correct and well-structured. One important gap and one minor issue are noted below.

What this PR does well:

DF_SUFFIXES to DF_EXTRAS (hed_id_util.py line 93): Most impactful fix. The original loop used DF_SUFFIXES (all section keys), so validation was silently a no-op. Changing to DF_EXTRAS makes validation run on TAG, UNIT, etc. sections.
df.at in assign_hed_ids_section: Correct fix. Writing to the row copy from iterrows() never persisted back; df.at[_row_number, ...] does.
Empty-range guard in _verify_hedid_matches: The guard correctly avoids the min()/max() crash on an empty set.
_get_old_id_int refactor: Clean extraction that eliminates duplicated error-handling logic and also fixes the silent old_id type-mutation bug in the original code.
verify_tag_id fix: Changing 'if new_id is None' to 'if not new_id_str' correctly handles the empty-string case (the original guard was dead code since new_id started as empty string, never None).
df_util.py OSError split: Distinguishing FileNotFoundError (expected, silent) from other OSError (unexpected, raise) is correct.

Important:
check_duplicate_hed_ids has no positive test. The only new test for this compliance check verifies it appears in the summary list (test_has_all_checks). There is no test that creates a schema with deliberately duplicated hedId values and asserts the check fires with the correct error. Without a positive test, a regression here would be silent.

Suggestion:
In check_duplicate_hed_ids, str(section_key) on a plain Enum member renders as 'HedSectionKey.Tags', not 'Tags'. The error message would read: 'in the HedSectionKey.Tags section'. Using section_key.value (produces 'tags') or section_key.name (produces 'Tags') would be more readable in user-facing output.

github-actions · 2026-03-13T16:53:18Z

PR Review summary for #1266

Overall: the PR is well-reasoned and addresses the issues from #1178 with no blocking problems. All five bug fixes are correct and the new check_duplicate_hed_ids compliance check is well-implemented.

Bug fixes verified:

Dead-code verification loop (important). update_dataframes_from_schema guarded with DF_SUFFIXES, which contains every known df key, so _verify_hedid_matches was never called. Switching to DF_EXTRAS (the three extras-only keys) makes hedId mismatch detection actually execute. The previous test silently passed because no HedFileError was raised; the replacement test_conflict_detected using assertRaises correctly catches this.
DataFrame write-back in assign_hed_ids_section (important). row[constants.hed_id] modified a pandas row copy rather than the original DataFrame. Using df.at[_row_number, ...] writes through to the original. The new test_assign_actually_mutates_df verifies this directly.
Empty-set crash in _verify_hedid_matches (correctness). min(unused_tag_ids) and max(unused_tag_ids) raised ValueError on an empty set for unknown libraries. The added guard fixes this cleanly.
Latent dead early-exit in verify_tag_id (correctness). new_id started as an empty string, so the condition new_id is None was always False and the early return never fired. The refactored new_id_str / new_id split makes the guard correct.
FileNotFoundError vs OSError in df_util.py (correctness). Missing files are valid for partial schemas and are now handled with pass; other OS errors are properly re-raised as HedFileError.

Minor issue: In schema_error_messages.py the new duplicate-id error branch has a double space before Each hedId must be unique. The other messages in the same function use a single space after the period. Not a blocker but worth a one-character fix.

github-actions · 2026-03-13T16:53:23Z

+    if duplicate_tag:
+        section_info = f" in the '{duplicate_tag_section}' section" if duplicate_tag_section else ""
+        return (
+            f"Tag '{tag}' has hedId '{new_id}' which is already used by '{duplicate_tag}'{section_info}.  "


Suggestion: double space before "Each hedId" — should be a single space to match the style of the other messages in this function.

Suggested change

f"Tag '{tag}' has hedId '{new_id}' which is already used by '{duplicate_tag}'{section_info}. "

f"Tag '{tag}' has hedId '{new_id}' which is already used by '{duplicate_tag}'{section_info}. "

Will fix on next PR.

Addressed review suggestions in issue hed-standard#1178

6ae48c4

VisLab requested a review from Copilot March 13, 2026 15:32

Copilot started reviewing on behalf of VisLab March 13, 2026 15:33 View session

Copilot AI reviewed Mar 13, 2026

View reviewed changes

Comment thread hed/schema/schema_validation/hed_id_validator.py Outdated

Comment thread hed/schema/schema_validation/compliance.py

Comment thread tests/schema/test_schema_compliance.py Outdated

Comment thread tests/schema/test_hed_id_util.py Outdated