Changelog

NVIDIA Megatron Core 0.12.0

  • Add FP8 recipe selection to arguments (--fp8-recipe, --first-last-layers-bf16, --num-layers-at-start-in-bf16, --num-layers-at-end-in-bf16)
  • Context parallel: fix loss scaling when calculate_per_token_loss=True
  • Make the number of data parallel communication buckets configurable (--ddp-num-buckets, --ddp-pad-buckets-for-high-nccl-busbw)
  • Inference
    • Support in-flight batching and chunked KV cache
    • Reduce memory usage:
      • by not materializing full attention mask
      • by only materializing logits for the last token during decode
      • by removing an obsolete tensor reference
  • Hybrid Model
    • Inference
      • Add CUDA graph support
      • Change tools/run_mamba_text_generation_server.py to use megatron.core.inference
      • Fix a shape issue when materializing logits for Mamba model
    • Improve initialization of Mamba layers
    • Add configuration switches (--mamba-state-dim, --mamba-head-dim, --mamba-num-groups, --is-hybrid-model)
    • Make num_floating_point_operations work with hybrid model
    • Make hybrid_conversion.py work with mixer that uses TE linear
    • Add FP8 support
    • Fix Mamba dt_bias tensor parallelism
    • Support multimodal tokenizer
    • Improve data parallelism scaling
  • MoE
    • Features:
      • DeepEP support, compatible with all the parallelisms and token drop / dropless
      • Important precision improvement: Enable FP32/FP64 routing and unpermutation using --moe-router-dtype. FP32 is recommended for all fine-grained MoE training
      • CUDA Graph support for MoE
      • Multi-Token Prediction (MTP) Support
      • Fused indices_to_multihot kernel for DeepEP dispatcher
    • Bug fixes:
      • Fix Hang Issue with MoE+Dense Hybrid models
      • Update theoretical memory and tflops estimation for MoE and MLA
      • Fix MoE Aux loss scaling for per token loss
      • Fixes for group-limited routing and expert bias, verified through DeepSeek-V3 end-to-end runs
    • Known issues:
      • Checkpoints trained with Custom FSDP for MoE may not be compatible with 3D parallel training.
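
As a rough illustration of the inference item above (materializing logits only for the last token during decode), the arithmetic can be sketched as follows. The function name, the shapes, and the 2-byte element size are illustrative assumptions, not Megatron Core APIs or measured values.

```python
def logits_bytes(batch_size, seq_len, vocab_size,
                 bytes_per_elem=2, last_token_only=False):
    """Size in bytes of a [batch, tokens, vocab] logits tensor.

    Hypothetical helper for illustration only; bytes_per_elem=2
    assumes a 16-bit dtype such as BF16.
    """
    tokens = 1 if last_token_only else seq_len
    return batch_size * tokens * vocab_size * bytes_per_elem


# Assumed example shapes: batch 8, sequence length 4096, vocabulary 128,000.
full = logits_bytes(8, 4096, 128_000)
last = logits_bytes(8, 4096, 128_000, last_token_only=True)

# The saving factor equals the sequence length.
print(f"full: {full} bytes, last token only: {last} bytes, ratio: {full // last}")
```

Under these assumptions the logits tensor shrinks by a factor equal to the sequence length, since only one position per sequence is needed to sample the next token.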

NVIDIA Megatron Core 0.11.0

  • Add multi-datacenter training support through N/S connection
  • MoE
    • Features
      • Support DeepSeek-V3 fine-tuning
        • Aux-loss-free load balancing strategy
        • Node-limited routing and Device-limited routing support.
        • Tensor Parallelism support for MLA and Sequence Auxiliary Loss
        • MTP (with TP and PP support) is coming soon.
      • Permutation / Unpermutation fusion kernel from TransformerEngine.
      • Uneven virtual pipeline parallel split support in first and last PP stage.
    • Bug fixes:
      • Fix the grad scale when TP != expert-TP and average_in_collective is enabled in DDP.
      • Fix TEGroupedMLP distckpt compatibility issue with FP8 padding/unpadding.
    • Known Issues:
      • When training the Dense+MoE hybrid model, the process will hang if any PP rank does not have expert params.
  • Add MX-FP16 support for optimizer and master weights
  • CUDA Graph memory optimizations
  • Enable UCC backend for PP communication
  • Optimizer CPU offload support for memory savings
  • Models
    • Initial RADIO/CRADIO implementation
    • llama3.2 support
  • Hybrid Model
    • Support quantization via TensorRT Model Optimizer

NVIDIA Megatron Core 0.10.0

  • Add MLA to MCore
  • Enable FP8 for GroupedMLP
  • MoE Parallel Folding
  • Enhance MoE Architecture: Support MoE Layer Frequency Patterns and Configurable MoE FFN Hidden Size
  • Multimodal: NVLM training and evaluation support in MCore
  • Mamba Hybrid
    • Increase performance and reduce memory footprint of Triton language/compiler distributed caching
    • Add more unit testing and fix bugs

NVIDIA Megatron Core 0.9.0

  • Uneven pipeline parallelism
    • Enable pipeline parallelism where first and last ranks have fewer transformer layers than the intermediate ranks
  • Per layer CUDAGraph support for GPT training with Transformer Engine modules
  • Enable different TP sizes for the vision encoder
  • Enable pipeline parallelism for T5 & Llava models
  • Support multi-tile multi-image input in Llava models
  • MoE
    • FP8 support
    • Runtime upcycling support
    • Dispatcher implementation optimizations
    • Shared expert support with overlapping optimizations
      • Qwen Model support
  • Known Issues
    • When using sequence parallel, during the transformer block forward pass, dropout is not using the appropriate rng context.
  • NVRx / Fault tolerance
    • Fault and hang detection in addition to existing straggler detection
    • Graceful exit and auto restart
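
The uneven pipeline parallelism item in this release describes first and last ranks holding fewer transformer layers than the intermediate ranks. A minimal sketch of such a split is shown below; `split_layers` and its parameters are hypothetical names chosen for illustration and do not correspond to Megatron Core functions or flags.

```python
def split_layers(num_layers, pp_size, first_stage_layers, last_stage_layers):
    """Return the number of transformer layers assigned to each pipeline stage.

    Hypothetical helper: the first and last stages get explicit layer
    counts, and the remaining layers are divided evenly among the
    intermediate stages.
    """
    if pp_size < 3:
        raise ValueError("need at least one intermediate stage")
    middle_stages = pp_size - 2
    remaining = num_layers - first_stage_layers - last_stage_layers
    if remaining <= 0 or remaining % middle_stages != 0:
        raise ValueError("remaining layers must divide evenly across middle stages")
    per_middle = remaining // middle_stages
    return [first_stage_layers] + [per_middle] * middle_stages + [last_stage_layers]


# 30 layers over 4 stages, with 7 layers on the first and last stage.
print(split_layers(30, 4, 7, 7))
```

Giving the end stages fewer layers is one way to offset the extra work they typically carry (embedding on the first stage, output layer and loss on the last).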

NVIDIA Megatron Core 0.8.0

  • Multimodal
    • Added initial support for training vision language models using the LLaVA architecture
    • Added initial support for inference with multimodal inputs
    • End-to-end multimodal example from data collection to training to evaluation is provided in examples/multimodal
  • MoE
    • Context Parallel support.
    • Distributed checkpoint support for grouped GEMM.
  • Mamba

NVIDIA Megatron Core 0.7.0

  • MoE
    • Token drop support
    • Several efficiency optimizations
    • Improved model parallelism
    • Memory optimizations
  • Distributed checkpointing
    • Enabled for Retro
    • Asynchronous checkpoint saving
  • Several minor bug fixes, speed improvements, and memory optimizations

NVIDIA Megatron Core 0.6.0

  • MoE (Mixture of Experts)
    • Performance optimization
      • Communication optimization for multi-GPU and single-GPU
      • 23% improvement (323 TFLOPS/GPU) over MCore 0.5.0 on Mixtral with Hopper BF16
      • GroupedMLP enhancement for Hopper
      • DP Overlapping. Support overlapping computation with gradient reduction and parameter gathering.
    • All-to-All based Token Dispatcher
    • Layer-wise logging for load balancing loss.
    • Improved expert parallel support including distributed optimizer.
  • Distributed optimizer
  • RETRO
    • Data processing
  • BERT
    • Distributed checkpointing
  • Dist checkpointing
    • PyTorch native distributed backend
    • Improved saving/loading speed
  • TensorRT-LLM Export
    • Integration with TensorRT Model Optimizer Post-training quantization (PTQ)
    • Text generation driver to perform PTQ in Megatron-LM
    • Llama2 and Nemotron3-8b examples to use TensorRT-LLM unified build API to build engine after training.
  • Several minor enhancements, bug fixes, and documentation updates

NVIDIA Megatron Core 0.5.0

Key Features and Enhancements

Megatron Core documentation is now live!

Model Features

  • MoE (Mixture of Experts)
    • Support for Z-loss, Load balancing and Sinkhorn
    • Layer and communications refactor
    • Richer parallelism mappings: EP can be combined with other model parallel techniques for larger MoE variants, e.g. EP + TP + DP + SP + PP
    • Token dropless architecture with Top-K routing
    • Performance optimization with GroupedGEMM when number of local experts is > 1
    • Distributed checkpointing
  • Interleaved rotary embedding
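
For readers unfamiliar with the Top-K routing mentioned in the MoE items above, the idea can be sketched for a single token in plain Python. This is not the Megatron Core router: `top_k_route` is a hypothetical name, and renormalizing the selected weights to sum to 1 is one common convention assumed here.

```python
import math


def top_k_route(logits, k):
    """Pick the k highest-scoring experts for one token.

    Returns (expert_indices, weights). Weights are softmax
    probabilities renormalized over the selected experts
    (an assumed convention, for illustration).
    """
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Indices of the k largest probabilities, highest first.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return top, [probs[i] / norm for i in top]


# Router scores for one token over four experts (made-up values).
experts, weights = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)
print(experts, weights)
```

In a dropless setup every token is processed by all k of its selected experts, regardless of how many tokens each expert receives.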

Datasets

  • Masked WordPiece datasets for BERT and T5
  • Raw and mock datasets

Parallelism

Performance

  • Activation offloading to CPU
  • Rope and Swiglu fusion
  • Sliding window attention (via Transformer Engine)

General Improvements

  • Timers

NVIDIA Megatron Core 0.4.0

Key Features and Enhancements

Models

  • BERT
  • RETRO
  • T5

Parallelism

  • Mixture of Experts support for GPT
  • Model parallel efficient Distributed Data Parallel (DDP)
  • Context Parallel (2D Tensor Parallel) support

Datasets

  • GPT Dataset
  • Blended Dataset