Implement molecular connectivity checks by jagritisahoo · Pull Request #2002 · facebookresearch/fairchem

jagritisahoo · 2026-05-16T12:51:46Z

Implemented molecular connectivity checks at two stages of the FastCSP workflow.

Post-Genarris stage

Check that the requested Z value is preserved; adds the column validity.crystal_generated.correct_z to the parquet.
Check that the reference molecule matches each disconnected graph in the generated crystal; adds the column validity.crystal_generated.molecule_matches_reference to the parquet.
Implement remove_problematic for the post-Genarris step, giving the option of tagging and removing problematic structures.

Post-relaxation stage

Check that Z is preserved, i.e. the final atoms have the same Z value (Z is read from structure_df); stored in the column validity.crystal_relaxed.correct_z.
Check that the reference molecule matches each disconnected graph in the relaxed crystal; adds the column validity.crystal_relaxed.molecule_matches_reference to the parquet.
Check that the JmolNN bond matrix is preserved between initial_atoms and relaxed_atoms; stored in the column validity.crystal_relaxed.connectivity_unchanged.

New helpers in core/utils/structure.py:
- reference_graph_from_atoms builds an nx.Graph (atomic_num node attr, no bond order) from a reference conformer .xyz/.sdf/.mol.
- check_molecule_matches_reference does per-fragment graph isomorphism vs the reference graph via nx.is_isomorphic with categorical_node_match on atomic_num.

process_generated.py:
- Build the reference graph once per process_genarris_outputs_single (one call per (mol_id, conf_id)) by looking for <conf_id>.{xyz,sdf,mol} inside the conformer directory.
- Thread the reference graph into structure_to_row; emit a new column validity.crystal_generated.molecule_matches_reference alongside the existing validity.crystal_generated.correct_z.

relax.py / filter.py:
- Rename validity.connectivity_unchanged to validity.crystal_relaxed.connectivity_unchanged for namespace consistency with validity.crystal_relaxed.z_unchanged.
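The per-fragment isomorphism check described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the helper names graph_from_adjacency and fragments_match_reference are hypothetical, and the adjacency matrices are hand-written toy inputs rather than JmolNN output.

```python
import networkx as nx
import numpy as np
from networkx.algorithms.isomorphism import categorical_node_match


def graph_from_adjacency(adjacency, atomic_numbers):
    """Build an undirected graph with an atomic_num attribute on every node."""
    graph = nx.from_numpy_array(np.asarray(adjacency))
    nx.set_node_attributes(graph, dict(enumerate(atomic_numbers)), "atomic_num")
    return graph


def fragments_match_reference(crystal_graph, reference_graph):
    """Return True if every connected fragment is isomorphic to the reference."""
    node_match = categorical_node_match("atomic_num", None)
    return all(
        nx.is_isomorphic(
            crystal_graph.subgraph(nodes), reference_graph, node_match=node_match
        )
        for nodes in nx.connected_components(crystal_graph)
    )


# Reference molecule: water (O bonded to two H).
water = graph_from_adjacency([[0, 1, 1], [1, 0, 0], [1, 0, 0]], [8, 1, 1])

# Toy "crystal" with two waters whose atoms are listed in different orders.
crystal_adj = np.zeros((6, 6), dtype=int)
for i, j in [(0, 1), (0, 2), (5, 3), (5, 4)]:
    crystal_adj[i, j] = crystal_adj[j, i] = 1
crystal = graph_from_adjacency(crystal_adj, [8, 1, 1, 1, 1, 8])
matches = fragments_match_reference(crystal, water)

# Break one O-H bond: the fragments become OH, a lone H, and one intact water.
broken_adj = crystal_adj.copy()
broken_adj[0, 2] = broken_adj[2, 0] = 0
broken = graph_from_adjacency(broken_adj, [8, 1, 1, 1, 1, 8])
broken_matches = fragments_match_reference(broken, water)
```

Because the comparison is an isomorphism test per connected component, it does not depend on the order in which atoms are listed within each molecule.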

Drop structures whose generation-time validity flags (validity.crystal_generated.correct_z, validity.crystal_generated.molecule_matches_reference) are False before deduplicating and writing the parquets to the raw structures directory. The structures are split into structures_df_filtered / problematic_structures_df, deduplicate_structures runs only on the valid subset, and problematic rows are marked with group_index=-1 and optionally reintegrated. The default is False (preserve), matching get_post_relax_config.
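A rough pandas sketch of this split-and-tag step is given below. The function split_and_tag and its placeholder grouping are hypothetical stand-ins (the real deduplicate_structures is not reproduced here); the string "-1" follows the commit titled "write group_index flag as string -1 not int".

```python
import pandas as pd

FLAG_COLUMNS = [
    "validity.crystal_generated.correct_z",
    "validity.crystal_generated.molecule_matches_reference",
]


def split_and_tag(structures_df, remove_problematic=False):
    """Split on the validity flags, tag problematic rows, optionally drop them."""
    valid_mask = structures_df[FLAG_COLUMNS].all(axis=1)
    structures_df_filtered = structures_df[valid_mask].copy()
    problematic_structures_df = structures_df[~valid_mask].copy()

    # Placeholder for deduplicate_structures: each valid row gets its own group.
    structures_df_filtered["group_index"] = [
        str(i) for i in range(len(structures_df_filtered))
    ]
    # Problematic rows never enter deduplication and carry a sentinel group.
    problematic_structures_df["group_index"] = "-1"

    if remove_problematic:
        return structures_df_filtered
    # Preserve mode: reintegrate the tagged rows in their original order.
    return pd.concat([structures_df_filtered, problematic_structures_df]).sort_index()


structures_df = pd.DataFrame(
    {
        "structure_id": ["a", "b", "c"],
        FLAG_COLUMNS[0]: [True, False, True],
        FLAG_COLUMNS[1]: [True, True, False],
    }
)
kept = split_and_tag(structures_df)
dropped = split_and_tag(structures_df, remove_problematic=True)
```

With the preserve default, downstream steps can still see the problematic rows and filter on group_index later if needed.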

lbluque

thx @jagritisahoo - these checks look much cleaner and more robust. I just left some small comments.

lbluque · 2026-05-22T18:32:08Z

+        # XYZ-loaded molecules have no unit cell (cell rank < 3), which makes
+        # AseAtomsAdaptor.get_structure raise LinAlgError on the singular
+        # lattice. Pad with a generously large cubic box so the molecule sits
+        # well inside and pymatgen can build a periodic Structure for JmolNN.
+        if np.linalg.matrix_rank(np.array(reference_atoms.cell)) < 3:
+            reference_atoms = reference_atoms.copy()
+            reference_atoms.cell = np.eye(3) * 30.0
+            reference_atoms.center()
+            reference_atoms.pbc = True


Not necessary to change this, but pmg has a Molecule object as well, and likely JMolNN or equivalent that works on those. We might be able to just do that directly to avoid this.

I think so

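from pymatgen.analysis.graphs import MoleculeGraph
from pymatgen.analysis.local_env import JmolNN

# my_molecule is a pymatgen Molecule; site_index is the index of the site of interest.
# from_local_env_strategy is named with_local_env_strategy in older pymatgen releases.
molecule_graph = MoleculeGraph.from_local_env_strategy(my_molecule, JmolNN())
neighbors_list = molecule_graph.get_connected_sites(site_index)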

lbluque · 2026-05-22T18:36:48Z

+        # Build the nx.Graph (atomic_num node attr; undirected edges)
+        graph = nx.Graph()
+        for i in range(n):
+            graph.add_node(i, atomic_num=structure[i].specie.number)
+        for i in range(n):
+            for entry in nn_info[i]:
+                j = entry["site_index"]
+                if i < j:
+                    graph.add_edge(i, j)
+        return graph


Does this method work here to clean things up? https://networkx.org/documentation/stable/reference/generated/networkx.convert_matrix.from_numpy_array.html

This is implemented now. Could you check the latest implementation @lbluque ?
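For reference, a small sketch of what the suggested replacement amounts to, using a hand-written toy adjacency matrix (not taken from the PR). It compares an explicit loop with nx.from_numpy_array on the same symmetric 0/1 matrix.

```python
import networkx as nx
import numpy as np

# Toy symmetric 0/1 adjacency matrix for a 4-atom chain: 0-1-2-3.
nn_matrix = np.array(
    [
        [0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0],
    ]
)

# Explicit construction, in the spirit of the loop in the hunk above.
looped = nx.Graph()
looped.add_nodes_from(range(len(nn_matrix)))
for i, j in zip(*np.nonzero(nn_matrix)):
    if i < j:
        looped.add_edge(int(i), int(j))

# One-call construction from the same matrix.
direct = nx.from_numpy_array(nn_matrix)

same_nodes = set(looped.nodes) == set(direct.nodes)
same_edges = {frozenset(e) for e in looped.edges} == {
    frozenset(e) for e in direct.edges
}
```

One difference to keep in mind: nx.from_numpy_array also stores each matrix entry as a weight edge attribute, which is harmless for an isomorphism check that only matches on node attributes.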

lbluque · 2026-05-22T18:40:43Z

-    # Any difference indicates bond formation/breaking during relaxation
-    return np.array_equal(initial_nn_matrix, final_nn_matrix)
+
+def check_connectivity_changes(


Do we still need this if we are using the method above, check_molecule_matches_reference? That one looks a lot more robust, since ordering is not an issue. I suggest just removing this one altogether, since it could lead to valid structures being flagged and dropped (false negatives) purely because of site permutations.

@lbluque yeah I think this is taken care of by check_molecule_matches_reference and we probably do not need this. However, we can still keep check_correct_z as a pre-filter post relaxation.
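The site-permutation concern can be illustrated with a toy case (hand-written matrices, hypothetical helper to_graph): the same water molecule with its sites reordered fails a strict matrix comparison but passes the graph-isomorphism check.

```python
import networkx as nx
import numpy as np
from networkx.algorithms.isomorphism import categorical_node_match

# Water with sites ordered O, H, H.
numbers_initial = [8, 1, 1]
adj_initial = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]])

# The same molecule with sites reordered to H, O, H.
perm = [1, 0, 2]
numbers_final = [numbers_initial[p] for p in perm]
adj_final = adj_initial[np.ix_(perm, perm)]


def to_graph(adjacency, atomic_numbers):
    """Undirected graph with atomic_num stored on each node."""
    graph = nx.from_numpy_array(adjacency)
    nx.set_node_attributes(graph, dict(enumerate(atomic_numbers)), "atomic_num")
    return graph


# Strict, site-ordered comparison: sensitive to the permutation.
strict_equal = bool(np.array_equal(adj_initial, adj_final))

# Isomorphism with element matching: insensitive to the permutation.
isomorphic = nx.is_isomorphic(
    to_graph(adj_initial, numbers_initial),
    to_graph(adj_final, numbers_final),
    node_match=categorical_node_match("atomic_num", None),
)
```

Here no bond has formed or broken, yet the strict comparison reports a change, which is the false-negative scenario raised in the review.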

lbluque · 2026-05-22T18:42:16Z

+        if check_molecule_count:
+            initial_molecule_count = int(
+                csgraph.connected_components(initial_nn_matrix)[0]
+            )
+            final_molecule_count = int(csgraph.connected_components(final_nn_matrix)[0])
+            result["initial_molecule_count"] = initial_molecule_count
+            result["final_molecule_count"] = final_molecule_count
+            molecule_count_preserved = initial_molecule_count == final_molecule_count
+            result["molecule_count_preserved"] = molecule_count_preserved
+            if not molecule_count_preserved:
+                result["no_changes"] = False


This part isn't affected by site re-orderings, but is it different from checking z_changes with the function above? If not, I suggest just keeping one of them.

@lbluque It is slightly different: check_correct_z compares the number of molecules in the initial structure with the requested Z that was given to Genarris as input. This check instead compares the number of molecules before relaxation with the number after relaxation, essentially capturing bond breaking or fusing of molecules. Some of the code, such as building the nn_matrix, can certainly be abstracted out.
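To make the distinction concrete, here is a minimal sketch with hand-written adjacency matrices (not JmolNN output). The helper count_molecules and the variable requested_z are illustrative names only.

```python
import numpy as np
from scipy.sparse import csgraph


def count_molecules(nn_matrix):
    """Number of connected components in a symmetric bond matrix."""
    n_components, _labels = csgraph.connected_components(nn_matrix, directed=False)
    return int(n_components)


# Before relaxation: two diatomic molecules (bonds 0-1 and 2-3).
initial_nn_matrix = np.array(
    [
        [0, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 1],
        [0, 0, 1, 0],
    ]
)

# After relaxation: an extra 1-2 bond fuses them into one fragment.
final_nn_matrix = initial_nn_matrix.copy()
final_nn_matrix[1, 2] = final_nn_matrix[2, 1] = 1

requested_z = 2  # illustrative value of the Z requested at generation time

initial_count = count_molecules(initial_nn_matrix)
final_count = count_molecules(final_nn_matrix)

# Reference-anchored check: does the generated structure have the requested Z?
correct_z_generated = initial_count == requested_z
# Before/after check: did relaxation change the number of molecules?
molecule_count_preserved = initial_count == final_count
```

In this toy case the generated structure passes the requested-Z check, while the before/after comparison catches the fusion that happened during relaxation.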

levineds · 2026-05-22T19:43:41Z

So this doesn't include the code changes I made in https://github.com/fairinternal/generative_chemistry/pull/42/changes#diff-2e55eeb5b4c279f4e75e69b5ada39350bbf5541c0ff692de7c05bb2b72aad3fe which add stereochemistry checking. I think we really need these because we had a number of cases where the mispredictions happened because the "lowest" energy structure was a different diastereomer, and so it threw out all of the structures with the correct stereoisomer. Note that I don't believe I had a flag in my code for noting racemates (i.e. crystals in which both enantiomers are present). We could easily just not worry about enantiomers and only check diastereomers, because if we had the full enantiomer of a crystal then the energy is the same, and I don't think genarris+UMA can produce non-racemic enantiomeric mixtures.

…ges with reference-anchored correct_z + molecule_matches_reference
- structure.py: refactor reference_graph_from_atoms and check_molecule_matches_reference to use nx.from_numpy_array; delete check_connectivity_changes + check_no_changes_in_covalent_matrix + check_no_changes_in_Z
- relax.py: drop atoms_list_original snapshot; write validity.crystal_relaxed.correct_z (replaces .z_unchanged and .connectivity_unchanged)
- filter.py: rewrite problematic mask to use correct_z + molecule_matches_reference; keep root_unrelaxed as opt-in toggle to recompute post-relax flags on the relaxed CIF when relax did not write them; add generated_structures_dir param for the reference graph
- main.py: pass root_unrelaxed and generated_structures_dir to filter so the recompute runs by default

# Conflicts: # src/fairchem/applications/fastcsp/core/workflow/filter.py

- add jmolnn_adjacency(structure_or_atoms) helper
- refactor check_correct_z, reference_graph_from_atoms, and check_molecule_matches_reference to call it
- add check_connectivity_unchanged(initial, final) for the strict, site-ordered init->final JmolNN bond-matrix comparison

lbluque · 2026-05-26T22:34:20Z

+        # Build undirected nx.Graph from the adjacency matrix and attach
+        # atomic_num as a per-node attribute (used by the categorical node
+        # match in check_molecule_matches_reference).
+        graph = nx.from_numpy_array(nn_matrix)


lg! I think you can pass the nodes directly to this function as well, instead of looping like below.
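One loop-free way to attach the per-node attribute, sketched under the assumption that atomic numbers are available as a plain list in site order, is nx.set_node_attributes with a node-to-value dict. This is an alternative to the suggestion above, not necessarily what the reviewer had in mind.

```python
import networkx as nx
import numpy as np

# Toy inputs: water with sites ordered O, H, H.
nn_matrix = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]])
atomic_numbers = [8, 1, 1]

graph = nx.from_numpy_array(nn_matrix)

# Map node index -> atomic number and set it on all nodes in one call.
nx.set_node_attributes(graph, dict(enumerate(atomic_numbers)), name="atomic_num")

atomic_num_by_node = nx.get_node_attributes(graph, "atomic_num")
```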

gvahe and others added 30 commits October 2, 2025 20:01

deal with duplicating options

bdd956f

add nsteps

9cf0ccf

fix output path for eval

e3f7ad5

Merge remote-tracking branch 'origin' into fastcsp_updates

3218657

add debug to deduplicate

6be99a0

move utils to logging

f31d92f

move utils to logging

357582f

add option of reading structures generated by other methods

95b61df

cleanup

04b627f

update config to skip deduplication if requested

6131f60

add lic back

5b3ab89

add option to save problematic structures

b238076

minor

f9e3f69

add more config parameters to refine design space

7719133

csd match with subprocess patch

14ce78b

Merge branch 'main' into fastcsp_updates

7685e3e

minor fix

b04cec4

add array-parallelism

8ef9dc2

turn off stdout optimizer writing by default

e134b6a

update readmes with pr updates

03fe545

Merge branch 'main' into fastcsp_updates

1302bd9

Fixed CSD Evaluation

39387b6

Fixed CSD Evaluation

60804b8

Merge branch 'main' into fastcsp_updates

c6a206b

Merge remote-tracking branch 'origin/main' into fastcsp_updates

4442616

minor updates

f4526fe

fix deduplicate_structures call

13b7366

Merge branch 'main' into fastcsp_updates

f7e4aaf

Merge branch 'main' into fastcsp_updates

cb0652f

bump gnrs

e5cc597

gvahe added the enhancement New feature or request label May 20, 2026

jagritisahoo user and others added 10 commits May 20, 2026 22:14

put the reference atoms in a box to enable pymatgen to build the NN m…

f939087

…atrix

write group_index flag as string -1 not int

bf0c2e7

Set checkpoint to None for UMA 1p2 tasks

7bedb82

address Luis's comments

0cd2402

bind config to runner in ray entrypoint (#1995)

5f4ea37

Merge branch 'main' into fastcsp_updates

f7d1348

Implement connectivity check to compare conformers with disconnected …

cc660d2

…graphs in the relaxed polymorph - add validity.crystal_relaxed.molecule_matches_reference

adding conformer generation to fastcsp

267ba3a

gvahe requested a review from levineds May 22, 2026 18:44

lbluque requested changes May 22, 2026

gvahe and others added 11 commits May 22, 2026 21:50

address Daniel's comments

ebdbf3c

Merge branch 'fastcsp_updates' into fastcsp_correct_z_check

5ad5ce6

# Conflicts: # src/fairchem/applications/fastcsp/core/workflow/filter.py

add inspiration code link

6fa1a14

remove swifter dep

5c617c8

allow num_structures to differ

f12ff7c

default to pyarrow install

2b0686e

add starting spg to structure_id

ef07f69

log cif parse errors + remove ccdc-eval timeout by default

60c118d

Merge branch 'fastcsp_updates' into fastcsp_correct_z_check

c8cae06

lbluque reviewed May 26, 2026

gvahe changed the base branch from fastcsp_updates to main June 17, 2026 08:39

Merge remote-tracking branch 'origin/main' into fastcsp_correct_z_check

23368b2

jagritisahoo requested a review from lbluque June 18, 2026 15:45

Merge branch 'main' into fastcsp_correct_z_check

3076dfb

jagritisahoo wants to merge 68 commits into main from fastcsp_correct_z_check

5 participants
