compat megatron dev#87
Code Review
This pull request updates the configuration parser to keep num_query_groups, modifies _apply_rotary_pos_emb_bshd in GPTModel to support dynamic multi_latent_attention configuration, and adds logic to TransformerLayer for initializing offloading modules. A high-severity issue was identified in the GPTModel changes: the monkey-patched function captures the state of the first model instance, which will cause configuration conflicts in environments running multiple models.
    if multi_latent_attention is None:
        multi_latent_attention = self.config.multi_latent_attention
The monkey-patched _apply_rotary_pos_emb_bshd function captures the self instance of the first GPTModel that triggers the patch. Since the patch is only applied once (due to the check at line 144), all subsequent GPTModel instances will use the multi_latent_attention configuration from the first instance, regardless of their own configuration. This will cause incorrect behavior in multi-model environments (e.g., different models in the same process or complex pipeline parallel setups).
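To make the concern concrete, here is a minimal, self-contained sketch of the failure mode and one possible way around it. It does not use Megatron: `rope_utils`, `_original_apply`, `BuggyModel`, and `FixedModel` are hypothetical stand-ins, and the "rotary" function only echoes the flag it received. The suggested fix (passing each instance's config value explicitly instead of reading it from a captured `self`) is an assumption about what would suit this PR, not the repository's actual code.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the library module that owns the patched function.
rope_utils = SimpleNamespace()


def _original_apply(t, freqs, multi_latent_attention=False):
    # Stand-in for the real rotary embedding: it only reports the flag it saw.
    return multi_latent_attention


rope_utils._apply_rotary_pos_emb_bshd = _original_apply


class BuggyModel:
    """Patches the module-level function once, from inside __init__."""

    _patched = False

    def __init__(self, mla):
        self.config = SimpleNamespace(multi_latent_attention=mla)
        if not BuggyModel._patched:  # patch is applied only once
            original = rope_utils._apply_rotary_pos_emb_bshd

            def patched(t, freqs, multi_latent_attention=None):
                if multi_latent_attention is None:
                    # `self` is the closure variable of the FIRST instance,
                    # so every later model reads that instance's config.
                    multi_latent_attention = self.config.multi_latent_attention
                return original(
                    t, freqs, multi_latent_attention=multi_latent_attention
                )

            rope_utils._apply_rotary_pos_emb_bshd = patched
            BuggyModel._patched = True

    def forward(self):
        return rope_utils._apply_rotary_pos_emb_bshd(None, None)


class FixedModel:
    """Passes its own config value on every call; no captured instance."""

    def __init__(self, mla):
        self.config = SimpleNamespace(multi_latent_attention=mla)

    def forward(self):
        return _original_apply(
            None, None, multi_latent_attention=self.config.multi_latent_attention
        )


first = BuggyModel(mla=True)    # triggers the patch and is captured by it
second = BuggyModel(mla=False)  # has its own config, but the patch ignores it
buggy_result = second.forward()

fixed_result = FixedModel(mla=False).forward()
```

In this sketch `second.forward()` sees the first instance's setting, while `FixedModel` sees its own. If the patch has to stay, resolving the flag at the call site and passing it as an explicit argument avoids depending on whichever model happened to be constructed first.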
No description provided.