Bug Description
ModelOpt inserts QDQ pairs around BMM1/softmax/BMM2 to match TRT's fp8 MHA pattern; however, Torch-TensorRT appears to omit the softmax quantizer, so TRT does not pick up a fused MHA.
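For context, a minimal NumPy sketch of where the three QDQ pairs sit in the MHA pattern (this is an illustration, not the attached vit_fp8_mha_qdq_inspect.py; `qdq` with `np.round` stands in for the FP8 E4M3 cast, and the amax value is a placeholder):

```python
import numpy as np

F8_MAX = 448.0  # finite max of FP8 E4M3

def qdq(x, amax=1.0):
    """Simulated quantize/dequantize pair (placeholder for a real FP8 cast)."""
    scale = amax / F8_MAX
    return np.clip(np.round(x / scale), -F8_MAX, F8_MAX) * scale

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fp8_mha(q, k, v):
    # BMM1: QDQ on both inputs of Q @ K^T
    s = qdq(q) @ qdq(k).swapaxes(-1, -2)
    p = softmax(s / np.sqrt(q.shape[-1]))
    # Softmax-output QDQ -- the quantizer Torch-TensorRT reportedly omits.
    # Without it, the graph no longer matches TRT's fp8 fused-MHA pattern.
    p = qdq(p)
    # BMM2: QDQ on the V input of P @ V
    return p @ qdq(v)
```

All three QDQ sites (BMM1 inputs, softmax output, BMM2 input) have to survive lowering for the fusion to apply; dropping only the softmax one is enough to break the pattern match.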
To Reproduce
Steps to reproduce the behavior:
- Launch the NVIDIA PyTorch container:
nvcr.io/nvidia/pytorch:26.03-py3
- Install transformers:
pip install transformers
- Install modelopt nightly:
pip install --upgrade "git+https://github.com/NVIDIA/Model-Optimizer.git@main"
- Run the attached script (the log from the run is also attached):
vit_fp8_mha_qdq_inspect.log
vit_fp8_mha_qdq_inspect.py
Expected behavior
The softmax quantizers should be converted to TRT Q/DQ nodes along with the other BMM quantizers.
Environment
See steps to reproduce
Additional context
@narendasan filed per discussion