TMA benchmarks will be running without grid constant TMA descriptor.
(APIServer pid=7998) INFO 12-13 00:56:05 [api_server.py:1351] vLLM API server version 0.13.0rc2.dev105+gfdc135d76
(APIServer pid=7998) INFO 12-13 00:56:05 [utils.py:253] non-default args: {'model_tag': 'pytorch/gemma-3-12b-it-FP8', 'model': 'pytorch/gemma-3-12b-it-FP8'}
(APIServer pid=7998) INFO 12-13 00:56:05 [model.py:514] Resolved architecture: Gemma3ForConditionalGeneration
(APIServer pid=7998) INFO 12-13 00:56:05 [model.py:1636] Using max model len 131072
(APIServer pid=7998) INFO 12-13 00:56:05 [scheduler.py:228] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=7998) WARNING 12-13 00:56:05 [cuda.py:244] Forcing --disable_chunked_mm_input for models with multimodal-bidirectional attention.
TMA benchmarks will be running without grid constant TMA descriptor.
(EngineCore_DP0 pid=8153) INFO 12-13 00:56:13 [core.py:93] Initializing a V1 LLM engine (v0.13.0rc2.dev105+gfdc135d76) with config: model='pytorch/gemma-3-12b-it-FP8', speculative_config=None, tokenizer='pytorch/gemma-3-12b-it-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=torchao, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False), seed=0, served_model_name=pytorch/gemma-3-12b-it-FP8, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': 
True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False}, 'local_cache_dir': None}
(EngineCore_DP0 pid=8153) INFO 12-13 00:56:15 [parallel_state.py:1203] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.0.31.154:56647 backend=nccl
(EngineCore_DP0 pid=8153) INFO 12-13 00:56:15 [parallel_state.py:1411] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=8153) Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
(EngineCore_DP0 pid=8153) INFO 12-13 00:56:22 [gpu_model_runner.py:3562] Starting to load model pytorch/gemma-3-12b-it-FP8...
(EngineCore_DP0 pid=8153) /usr/local/lib/python3.12/dist-packages/torch/__init__.py:1617: UserWarning: Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:80.)
(EngineCore_DP0 pid=8153) _C._set_float32_matmul_precision(precision)
(EngineCore_DP0 pid=8153) INFO 12-13 00:56:24 [layer.py:537] Using AttentionBackendEnum.FLASH_ATTN for MultiHeadAttention in multimodal encoder.
(EngineCore_DP0 pid=8153) INFO 12-13 00:56:26 [cuda.py:412] Using FLEX_ATTENTION attention backend out of potential backends: ['FLEX_ATTENTION']
Loading pt checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
Loading pt checkpoint shards: 33% Completed | 1/3 [00:02<00:05, 2.80s/it]
Loading pt checkpoint shards: 67% Completed | 2/3 [00:05<00:02, 2.73s/it]
Loading pt checkpoint shards: 100% Completed | 3/3 [00:07<00:00, 2.25s/it]
Loading pt checkpoint shards: 100% Completed | 3/3 [00:07<00:00, 2.39s/it]
(EngineCore_DP0 pid=8153)
(EngineCore_DP0 pid=8153) INFO 12-13 00:56:34 [default_loader.py:308] Loading weights took 7.16 seconds
(EngineCore_DP0 pid=8153) INFO 12-13 00:56:34 [gpu_model_runner.py:3659] Model loading took 12.8828 GiB memory and 11.215734 seconds
(EngineCore_DP0 pid=8153) INFO 12-13 00:56:34 [gpu_model_runner.py:4446] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 31 image items of the maximum feature size.
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] EngineCore failed to start.
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] Traceback (most recent call last):
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 857, in run_engine_core
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 637, in __init__
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] super().__init__(
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 109, in __init__
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 240, in _initialize_kv_caches
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 126, in determine_available_memory
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return self.collective_rpc("determine_available_memory")
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 75, in collective_rpc
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 461, in run_method
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return func(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return func(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 340, in determine_available_memory
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] self.model_runner.profile_run()
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4462, in profile_run
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] dummy_encoder_outputs = self.model.embed_multimodal(
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3_mm.py", line 603, in embed_multimodal
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return self._process_image_input(image_input)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3_mm.py", line 587, in _process_image_input
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] image_features = self._image_pixels_to_features(
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma3_mm.py", line 576, in _image_pixels_to_features
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return vision_tower(pixel_values)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/siglip.py", line 856, in forward
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return self.vision_model(
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/siglip.py", line 754, in forward
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] encoder_outputs = self.encoder(
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/siglip.py", line 562, in forward
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] hidden_states, _ = encoder_layer(hidden_states)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/siglip.py", line 511, in forward
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] hidden_states, _ = self.self_attn(hidden_states=hidden_states)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/siglip.py", line 429, in forward
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] qkv_states, _ = self.qkv_proj(hidden_states)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 565, in forward
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] output_parallel = self.quant_method.apply(self, input_, bias)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/torchao.py", line 348, in apply
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return F.linear(x, layer.weight, bias)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/torchao/utils.py", line 655, in _dispatch__torch_function__
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return cls._ATEN_OP_OR_TORCH_FN_TABLE[cls][func](func, types, args, kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/torchao/utils.py", line 491, in wrapper
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return func(f, types, args, kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/torchao/quantization/quantize_/workflows/float8/float8_tensor.py", line 300, in _
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] res = torch.ops.fbgemm.f8f8bf16_rowwise(
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1255, in __call__
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] return self._op(*args, **kwargs)
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=8153) ERROR 12-13 00:56:35 [core.py:866] RuntimeError: cutlass cannot initialize
When running the vLLM benchmark for `pytorch/gemma-3-12b-it-FP8` on B200, I noticed that the benchmark started failing with `fbgemm-gpu-genai=1.4.2`: https://github.com/pytorch/pytorch-integration-testing/actions/runs/20177479851/job/57929181213#step:19:2183. The error when running `vllm serve pytorch/gemma-3-12b-it-FP8` is the traceback shown above, which ends in `RuntimeError: cutlass cannot initialize` raised from `torch.ops.fbgemm.f8f8bf16_rowwise`.
The same command, `vllm serve pytorch/gemma-3-12b-it-FP8`, works fine with the previous version, `fbgemm-gpu-genai=1.4.1`. The same error also happens with `pytorch/gemma-3-27b-it-fp8`.
Here is the full list of pip packages:
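For anyone trying to reproduce this, a minimal sketch for capturing the versions most relevant to the traceback. The distribution names (`torch`, `torchao`, `vllm`, `fbgemm-gpu-genai`) are assumptions based on the packages named in the log, and `collect_versions` and `is_failing_fbgemm` are hypothetical helpers, not part of any of these libraries.

```python
from importlib import metadata

# Assumed distribution names, taken from the packages that appear in the traceback.
PACKAGES = ("torch", "torchao", "vllm", "fbgemm-gpu-genai")
# Version reported as failing in this issue (1.4.1 is reported as working).
FAILING_FBGEMM = "1.4.2"


def collect_versions(packages=PACKAGES):
    """Return {distribution name: installed version, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def is_failing_fbgemm(versions):
    """True if fbgemm-gpu-genai is installed at the version reported as failing."""
    version = versions.get("fbgemm-gpu-genai")
    # Ignore any local build suffix such as "+cu128" when comparing.
    return version is not None and version.split("+")[0] == FAILING_FBGEMM


if __name__ == "__main__":
    found = collect_versions()
    for name, version in found.items():
        print(f"{name}=={version}" if version else f"{name}: not installed")
    if is_failing_fbgemm(found):
        print("fbgemm-gpu-genai matches the version reported as failing here")
```

Running this in both the failing and the working environment should make it easy to confirm that `fbgemm-gpu-genai` is the only package that differs.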
cc @jainapurva