Flash Attention Benchmarks - Aggregated Results

This document aggregates benchmark results from multiple attention implementations, combined from the individual per-implementation result files.
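As a quick way to read the aggregated results, the P50 column can be reduced to relative speedups against a baseline. A small pure-Python sketch, using P50 values copied (rounded) from the flux_L512 rows of the table below:

```python
# P50 latency (ms) at the flux_L512 workload, copied from the table below.
p50_l512 = {
    "torch_flash_ma": 0.7372,
    "torch_mem_eff": 0.9518,
    "xformers_meff": 0.6497,
    "torch_flash_compiled_default": 0.7741,
    "torch_flash_compiled_max_autotune": 0.9214,
    "hf_kernels_flash_attn": 0.5574,
    "hf_kernels_flash_attn3": 0.5698,
}

# Speedup of each implementation relative to plain PyTorch SDPA flash.
baseline = p50_l512["torch_flash_ma"]
speedups = {impl: baseline / ms for impl, ms in p50_l512.items()}

# Lowest P50 latency wins.
fastest = min(p50_l512, key=p50_l512.get)
```

At this workload the HF Kernels flash-attn implementation is the fastest, roughly 1.3x the baseline's throughput.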

Combined Summary and Visualization

[Figure: "Attention Implementation Latency" (matplotlib, 2025-10-02) — P50 latency (ms) per workload (flux_L128 through flux_L512) for torch_flash_ma, torch_mem_eff, xformers_meff, torch_flash_compiled_default, torch_flash_compiled_max_autotune, hf_kernels_flash_attn, and hf_kernels_flash_attn3.]
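Each row in the table below summarizes 5 timed repetitions as mean/P10/P50/P90. A minimal sketch of how such percentile summaries can be computed from raw per-rep timings (the sample values here are synthetic, for illustration only):

```python
import statistics

def latency_percentiles(samples_ms):
    """Summarize repeated timing samples as mean/P10/P50/P90,
    matching the latency columns reported in the table."""
    # quantiles(n=10) returns the 9 decile cut points; indices 0, 4, 8
    # correspond to the 10th, 50th, and 90th percentiles.
    q = statistics.quantiles(samples_ms, n=10, method="inclusive")
    return {
        "mean": statistics.fmean(samples_ms),
        "p10": q[0],
        "p50": q[4],
        "p90": q[8],
    }

# Synthetic per-rep timings (ms) for illustration only.
samples = [0.488, 0.492, 0.494, 0.494, 0.495]
stats = latency_percentiles(samples)
```

With only 5 reps per configuration, the P10/P90 values are interpolated between samples, so they should be read as rough spread indicators rather than precise tail latencies.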
All configurations use batch size 1, 24 heads, head dim 128, bfloat16, and 5 reps per measurement; latencies are in ms, rounded to 4 decimal places.

| Implementation | Impl ID | Workload | Seq Len | Mean (ms) | P10 (ms) | P50 (ms) | P90 (ms) | Peak Mem (MB) | Backend | Family |
|---|---|---|---|---|---|---|---|---|---|---|
| Flash (PyTorch SDPA) | torch_flash_ma | flux_L128 | 1152 | 0.4941 | 0.4884 | 0.4936 | 0.4945 | 83.38 | FLASH | torch-sdpa |
| Flash (PyTorch SDPA) | torch_flash_ma | flux_L256 | 1280 | 0.5234 | 0.5224 | 0.5235 | 0.5236 | 90.62 | FLASH | torch-sdpa |
| Flash (PyTorch SDPA) | torch_flash_ma | flux_L320 | 1344 | 0.6527 | 0.6503 | 0.6525 | 0.6546 | 95.06 | FLASH | torch-sdpa |
| Flash (PyTorch SDPA) | torch_flash_ma | flux_L384 | 1408 | 0.6828 | 0.6806 | 0.6829 | 0.6833 | 99.88 | FLASH | torch-sdpa |
| Flash (PyTorch SDPA) | torch_flash_ma | flux_L448 | 1472 | 0.7075 | 0.7058 | 0.7063 | 0.7071 | 103.81 | FLASH | torch-sdpa |
| Flash (PyTorch SDPA) | torch_flash_ma | flux_L512 | 1536 | 0.7380 | 0.7369 | 0.7372 | 0.7391 | 109.12 | FLASH | torch-sdpa |
| MemEff (PyTorch SDPA) | torch_mem_eff | flux_L128 | 1152 | 0.5874 | 0.5862 | 0.5873 | 0.5877 | 83.38 | EFFICIENT | torch-sdpa |
| MemEff (PyTorch SDPA) | torch_mem_eff | flux_L256 | 1280 | 0.6503 | 0.6490 | 0.6492 | 0.6518 | 90.62 | EFFICIENT | torch-sdpa |
| MemEff (PyTorch SDPA) | torch_mem_eff | flux_L320 | 1344 | 0.7812 | 0.7762 | 0.7803 | 0.7853 | 95.94 | EFFICIENT | torch-sdpa |
| MemEff (PyTorch SDPA) | torch_mem_eff | flux_L384 | 1408 | 0.7948 | 0.7912 | 0.7935 | 0.7948 | 100.00 | EFFICIENT | torch-sdpa |
| MemEff (PyTorch SDPA) | torch_mem_eff | flux_L448 | 1472 | 0.8463 | 0.8450 | 0.8460 | 0.8461 | 103.81 | EFFICIENT | torch-sdpa |
| MemEff (PyTorch SDPA) | torch_mem_eff | flux_L512 | 1536 | 0.9539 | 0.9493 | 0.9518 | 0.9582 | 109.12 | EFFICIENT | torch-sdpa |
| xFormers | xformers_meff | flux_L128 | 1152 | 0.4515 | 0.4436 | 0.4525 | 0.4557 | 83.38 | memory_efficient | xformers |
| xFormers | xformers_meff | flux_L256 | 1280 | 0.4679 | 0.4649 | 0.4684 | 0.4691 | 90.62 | memory_efficient | xformers |
| xFormers | xformers_meff | flux_L320 | 1344 | 0.6001 | 0.5970 | 0.5985 | 0.6017 | 95.06 | memory_efficient | xformers |
| xFormers | xformers_meff | flux_L384 | 1408 | 0.6023 | 0.5997 | 0.6031 | 0.6033 | 99.88 | memory_efficient | xformers |
| xFormers | xformers_meff | flux_L448 | 1472 | 0.6411 | 0.6382 | 0.6415 | 0.6421 | 103.81 | memory_efficient | xformers |
| xFormers | xformers_meff | flux_L512 | 1536 | 0.6595 | 0.6441 | 0.6497 | 0.6528 | 109.12 | memory_efficient | xformers |
| Compiled (default) | torch_flash_compiled_default | flux_L128 | 1152 | 0.5181 | 0.5142 | 0.5176 | 0.5198 | 83.38 | FLASH | torch-sdpa |
| Compiled (default) | torch_flash_compiled_default | flux_L256 | 1280 | 0.5580 | 0.5549 | 0.5583 | 0.5598 | 90.62 | FLASH | torch-sdpa |
| Compiled (default) | torch_flash_compiled_default | flux_L320 | 1344 | 0.6873 | 0.6853 | 0.6874 | 0.6884 | 95.25 | FLASH | torch-sdpa |
| Compiled (default) | torch_flash_compiled_default | flux_L384 | 1408 | 0.7162 | 0.7129 | 0.7161 | 0.7168 | 99.88 | FLASH | torch-sdpa |
| Compiled (default) | torch_flash_compiled_default | flux_L448 | 1472 | 0.7418 | 0.7387 | 0.7401 | 0.7415 | 103.81 | FLASH | torch-sdpa |
| Compiled (default) | torch_flash_compiled_default | flux_L512 | 1536 | 0.7745 | 0.7708 | 0.7741 | 0.7754 | 109.12 | FLASH | torch-sdpa |
| Compiled (max-autotune) | torch_flash_compiled_max_autotune | flux_L128 | 1152 | 0.6468 | 0.6144 | 0.6246 | 0.6483 | 67.50 | FLASH | torch-sdpa |
| Compiled (max-autotune) | torch_flash_compiled_max_autotune | flux_L256 | 1280 | 0.7060 | 0.6689 | 0.6851 | 0.7185 | 75.00 | FLASH | torch-sdpa |
| Compiled (max-autotune) | torch_flash_compiled_max_autotune | flux_L320 | 1344 | 0.8333 | 0.7954 | 0.8156 | 0.8404 | 80.38 | FLASH | torch-sdpa |
| Compiled (max-autotune) | torch_flash_compiled_max_autotune | flux_L384 | 1408 | 0.8719 | 0.8471 | 0.8497 | 0.8745 | 82.50 | FLASH | torch-sdpa |
| Compiled (max-autotune) | torch_flash_compiled_max_autotune | flux_L448 | 1472 | 0.9034 | 0.8677 | 0.8836 | 0.9034 | 86.25 | FLASH | torch-sdpa |
| Compiled (max-autotune) | torch_flash_compiled_max_autotune | flux_L512 | 1536 | 0.9388 | 0.9154 | 0.9214 | 0.9360 | 90.00 | FLASH | torch-sdpa |
| HF Kernels Flash Attn | hf_kernels_flash_attn | flux_L128 | 1152 | 0.3455 | 0.3436 | 0.3456 | 0.3464 | 83.38 | flash-attn | hf-kernels |
| HF Kernels Flash Attn | hf_kernels_flash_attn | flux_L256 | 1280 | 0.3756 | 0.3741 | 0.3752 | 0.3771 | 90.62 | flash-attn | hf-kernels |
| HF Kernels Flash Attn | hf_kernels_flash_attn | flux_L320 | 1344 | 0.4953 | 0.4932 | 0.4943 | 0.4966 | 95.06 | flash-attn | hf-kernels |
| HF Kernels Flash Attn | hf_kernels_flash_attn | flux_L384 | 1408 | 0.5157 | 0.5143 | 0.5163 | 0.5165 | 99.88 | flash-attn | hf-kernels |
| HF Kernels Flash Attn | hf_kernels_flash_attn | flux_L448 | 1472 | 0.5357 | 0.5347 | 0.5358 | 0.5362 | 103.81 | flash-attn | hf-kernels |
| HF Kernels Flash Attn | hf_kernels_flash_attn | flux_L512 | 1536 | 0.5587 | 0.5558 | 0.5574 | 0.5581 | 109.12 | flash-attn | hf-kernels |
| HF Kernels Flash Attn3 | hf_kernels_flash_attn3 | flux_L128 | 1152 | 0.3620 | 0.3604 | 0.3620 | 0.3625 | 83.38 | flash-attn3 | hf-kernels |
| HF Kernels Flash Attn3 | hf_kernels_flash_attn3 | flux_L256 | 1280 | 0.3912 | 0.3893 | 0.3910 | 0.3923 | 90.62 | flash-attn3 | hf-kernels |
| HF Kernels Flash Attn3 | hf_kernels_flash_attn3 | flux_L320 | 1344 | 0.5258 | 0.5241 | 0.5249 | 0.5249 | 95.06 | flash-attn3 | hf-kernels |
| HF Kernels Flash Attn3 | hf_kernels_flash_attn3 | flux_L384 | 1408 | 0.5276 | 0.5266 | 0.5278 | 0.5283 | 99.88 | flash-attn3 | hf-kernels |
| HF Kernels Flash Attn3 | hf_kernels_flash_attn3 | flux_L448 | 1472 | 0.5656 | 0.5639 | 0.5658 | 0.5668 | 103.81 | flash-attn3 | hf-kernels |
| HF Kernels Flash Attn3 | hf_kernels_flash_attn3 | flux_L512 | 1536 | 0.5790 | 0.5690 | 0.5698 | 0.5714 | 109.12 | flash-attn3 | hf-kernels |