Skip to content

Update custom_data_collation.ipynb to show how to use field-level collation#920

Open
SamSandwich07 wants to merge 2 commits into
mainfrom
field-level-collation-update-custom_data_collation.ipynb
Open

Update custom_data_collation.ipynb to show how to use field-level collation#920
SamSandwich07 wants to merge 2 commits into
mainfrom
field-level-collation-update-custom_data_collation.ipynb

Conversation

@SamSandwich07
Copy link
Copy Markdown
Collaborator

Update the custom_data_collation.ipynb notebook. One step in addressing #919.

@review-notebook-app
Copy link
Copy Markdown

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@SamSandwich07 SamSandwich07 self-assigned this May 12, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 67.64%. Comparing base (b4cc80f) to head (d22a78e).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #920   +/-   ##
=======================================
  Coverage   67.64%   67.64%           
=======================================
  Files          69       69           
  Lines        6802     6802           
=======================================
  Hits         4601     4601           
  Misses       2201     2201           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Collaborator

@drewoldag drewoldag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a couple notes, otherwise looking pretty good. Once these are resolved, I'm happy to approve. Thanks for putting this together!

Comment on lines +183 to +189
"Hyrax defines default behavior for fields which do not have a collate function defined.\n",
"The default behavior simply attempts to stack the given data into a single `numpy` array,\n",
"without adding padding or masking. While this works well for scalar data -- such as the\n",
"`object_id` and `label` fields, which do not have a custom `collate_*` function defined\n",
"in the example below for precisely this reason -- or data where a uniform shape is already\n",
"guaranteed, desired results may not be achieved in other scenarios; thus it is highly\n",
"recommended to define a collation function for nontrivial fields.\n",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section feels a bit wordy. The 3rd sentence should probably be broken up or simplified.

I would also recommend that the last section "thus it is highly ... for nontrival fields" could be reworded as something like "Hyrax exposes the collate_* API explicitly to handle complex or non-uniform data fields." (Or something like that)

"guaranteed, desired results may not be achieved in other scenarios; thus it is highly\n",
"recommended to define a collation function for nontrivial fields.\n",
"\n",
"As with the dataset-level `collate` function, in production the field-level methods would live in the dataset class file."
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor point: I would recommend using a term other than "live in".

Suggested change
"As with the dataset-level `collate` function, in production the field-level methods would live in the dataset class file."
"As with the dataset-level `collate` function, in production the field-level methods would included in the dataset class file."

Comment on lines +247 to +285
@@ -188,6 +261,28 @@
"train_dataset = dataset[\"train\"]"
]
},
{
"cell_type": "markdown",
"id": "ec48546b",
"metadata": {},
"source": [
"We fix this by deleting the dataset-level collate function."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "eab6baf2",
"metadata": {},
"outputs": [],
"source": [
"delattr(HyraxRandomDataset, \"collate\")\n",
"dataset = h.prepare()\n",
"\n",
"# Access the \"train\" data group for clarity in the next steps\n",
"train_dataset = dataset[\"train\"]"
]
},
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Combine this cell into the one just before. i.e. call delattr(...) first. Explain in the note above the cell that you're removing collate because otherwise Hyrax would raise an error.

This should simplify the notebook slightly and remove the issue with the expected error causing doc builds to break.

@SamSandwich07
Copy link
Copy Markdown
Collaborator Author

#921 includes the changes in this PR so once that one is approved I will close this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants