🐛 [Bug] Torch-TRT does not translate softmax quantizer generated by modelopt fp8 mha quantization #4200

@nvyihengz

Bug Description

modelopt inserts QDQ pairs around BMM1/softmax/BMM2 to match TRT's FP8 MHA pattern. However, Torch-TRT appears to omit the softmax quantizer during conversion, so TRT does not pick up the fused MHA kernel.

To Reproduce

Steps to reproduce the behavior:

  1. Launch the nvidia pytorch container: nvcr.io/nvidia/pytorch:26.03-py3
  2. Install transformers: pip install transformers
  3. Install modelopt nightly: pip install --upgrade "git+https://github.com/NVIDIA/Model-Optimizer.git@main"
  4. Run the attached scripts

vit_fp8_mha_qdq_inspect.log
vit_fp8_mha_qdq_inspect.py

Expected behavior

The softmax quantizers are converted to TRT Q/DQ nodes along with the other BMM quantizers.

Environment

See steps to reproduce

Additional context

Filed per discussion with @narendasan.
