Qwen3-8B, N=4096 K=4096, HAD=32, BF16 vs MXFP4 GEMMs TFLOP/s:
BF16 vs MXFP4 GEMMs:
    batch_size  torch-bf16 (TFLOP/s (larger is better))  mxfp4-cutlass-had (TFLOP/s (larger is better))  mxfp4-cutlass-wush (TFLOP/s (larger is better))  mxfp4-cutlass-noquant (TFLOP/s (larger is better))
0          1.0                                 2.324625                                        1.730227                                         1.798765                                            2.089546
1          4.0                                 8.030077                                        6.895498                                         7.175147                                            8.375149
2          8.0                                15.589988                                       13.867586                                        14.273973                                           16.865368
3         16.0                                30.218626                                       27.782878                                        28.170183                                           33.656660
4         32.0                                59.561135                                       55.449069                                        54.657812                                           67.525509
5         64.0                                87.719220                                      110.547301                                       109.337135                                          136.180223
6        128.0                               170.279967                                      217.425190                                       217.710078                                          266.649425
7        256.0                               161.822375                                      416.498609                                       418.298910                                          536.144515
8        512.0                               179.147654                                      499.588484                                       488.854895                                          602.733117
9       1024.0                               179.959586                                      882.886404                                       870.881035                                         1115.360499
10      2048.0                               181.249230                                      934.331332                                       921.877779                                         1133.241380
11      4096.0                               182.348358                                      957.115584                                       947.749584                                         1142.627374
12      8192.0                               206.923472                                      995.696476                                       962.085285                                         1221.595995
Qwen3-8B, N=4096 K=4096, HAD=32, BF16 vs MXFP4 GEMMs TFLOP/s:
BF16 vs MXFP4 GEMMs:
    batch_size  torch-bf16 (TFLOP/s (larger is better))  mxfp4-cutlass-had (TFLOP/s (larger is better))  mxfp4-cutlass-wush (TFLOP/s (larger is better))  mxfp4-cutlass-noquant (TFLOP/s (larger is better))
0          1.0                                 2.305664                                        1.726579                                         1.794089                                            2.093761
1          4.0                                 8.087505                                        6.871028                                         7.148120                                            8.347734
2          8.0                                15.668138                                       13.815297                                        14.215787                                           16.691926
3         16.0                                29.763776                                       27.671342                                        28.053908                                           33.687176
4         32.0                                59.377819                                       55.248513                                        54.454372                                           67.166973
5         64.0                                87.148653                                      110.138741                                       108.914794                                          134.592000
6        128.0                               168.899236                                      217.359231                                       216.820179                                          268.572721
7        256.0                               161.750399                                      416.469636                                       418.224550                                          536.088921
8        512.0                               179.326320                                      497.433612                                       487.340018                                          599.647260
9       1024.0                               178.251464                                      886.663150                                       867.288187                                         1109.070999
10      2048.0                               182.912843                                      933.553015                                       917.881958                                         1125.129862
11      4096.0                               182.192185                                      957.974207                                       941.356034                                         1146.365805
12      8192.0                               207.065759                                      989.076384                                       955.587139                                         1220.552697
Qwen3-8B, N=24576 K=4096, HAD=32, BF16 vs MXFP4 GEMMs TFLOP/s:
BF16 vs MXFP4 GEMMs:
    batch_size  torch-bf16 (TFLOP/s (larger is better))  mxfp4-cutlass-had (TFLOP/s (larger is better))  mxfp4-cutlass-wush (TFLOP/s (larger is better))  mxfp4-cutlass-noquant (TFLOP/s (larger is better))
0          1.0                                 1.681650                                        2.813583                                         2.843756                                            3.148304
1          4.0                                 6.656080                                       11.236252                                        11.369306                                           11.771388
2          8.0                                13.281620                                       23.381004                                        23.551528                                           24.197322
3         16.0                                26.369821                                       48.086547                                        48.285698                                           50.124557
4         32.0                                31.977394                                      101.887583                                       101.530258                                          107.513267
5         64.0                                76.494202                                      240.563881                                       239.595401                                          255.120444
6        128.0                                89.972687                                      669.234898                                       665.608008                                          745.426609
7        256.0                               135.030536                                      956.902657                                       957.019916                                         1017.845426
8        512.0                               180.764393                                     1052.447066                                      1042.017961                                         1124.448531
9       1024.0                               214.667612                                     1111.815447                                      1108.952279                                         1172.207362
10      2048.0                               216.716290                                     1132.907496                                      1130.637164                                         1164.890893
11      4096.0                               226.678535                                     1166.471065                                      1156.916294                                         1208.688514
12      8192.0                               231.326786                                     1186.440820                                      1180.051067                                         1236.869563
Qwen3-8B, N=4096 K=12288, HAD=32, BF16 vs MXFP4 GEMMs TFLOP/s:
BF16 vs MXFP4 GEMMs:
    batch_size  torch-bf16 (TFLOP/s (larger is better))  mxfp4-cutlass-had (TFLOP/s (larger is better))  mxfp4-cutlass-wush (TFLOP/s (larger is better))  mxfp4-cutlass-noquant (TFLOP/s (larger is better))
0          1.0                                 1.827324                                        2.211460                                         2.226344                                            2.392789
1          4.0                                 6.270835                                        8.881795                                         8.948081                                            9.614883
2          8.0                                12.189128                                       17.790583                                        17.922165                                           19.254882
3         16.0                                23.741338                                       35.804636                                        35.675982                                           38.553915
4         32.0                                70.921626                                       70.797263                                        70.465845                                           77.143581
5         64.0                                72.849721                                      139.558963                                       140.582673                                          152.903971
6        128.0                               170.169116                                      270.807394                                       276.093428                                          305.940184
7        256.0                               190.710788                                      520.721935                                       517.422923                                          606.769202
8        512.0                               208.032905                                      580.593817                                       578.596629                                          657.293206
9       1024.0                               181.100187                                      999.733355                                       977.355937                                         1200.362717
10      2048.0                               183.269416                                      966.570583                                       939.413148                                         1209.105589
11      4096.0                               184.468257                                      967.556584                                       923.055175                                         1206.591591
12      8192.0                               209.069212                                     1012.834168                                       975.005415                                         1245.066660
Qwen3-14B, N=5120 K=5120, HAD=32, BF16 vs MXFP4 GEMMs TFLOP/s:
BF16 vs MXFP4 GEMMs:
    batch_size  torch-bf16 (TFLOP/s (larger is better))  mxfp4-cutlass-had (TFLOP/s (larger is better))  mxfp4-cutlass-wush (TFLOP/s (larger is better))  mxfp4-cutlass-noquant (TFLOP/s (larger is better))
0          1.0                                 3.082365                                        2.279921                                         2.353641                                            2.670685
1          4.0                                 9.769869                                        9.302223                                         9.433663                                           10.725696
2          8.0                                18.516772                                       18.319217                                        18.775297                                           21.647518
3         16.0                                31.731458                                       36.708455                                        37.153319                                           43.194659
4         32.0                                59.212286                                       73.328476                                        72.548670                                           86.794413
5         64.0                                83.330225                                      146.539994                                       145.653382                                          174.564760
6        128.0                               187.443575                                      288.713267                                       286.945324                                          348.033701
7        256.0                               213.659868                                      549.966448                                       549.021527                                          682.289190
8        512.0                               219.140022                                      646.336512                                       636.451169                                          764.312356
9       1024.0                               218.305768                                     1064.138365                                      1042.880822                                         1280.179777
10      2048.0                               226.024298                                     1122.695501                                      1109.545487                                         1299.617245
11      4096.0                               224.639991                                     1059.300904                                      1035.392625                                         1263.916890
12      8192.0                               222.818501                                     1066.753977                                      1037.026110                                         1235.099617
Qwen3-14B, N=5120 K=5120, HAD=32, BF16 vs MXFP4 GEMMs TFLOP/s:
BF16 vs MXFP4 GEMMs:
    batch_size  torch-bf16 (TFLOP/s (larger is better))  mxfp4-cutlass-had (TFLOP/s (larger is better))  mxfp4-cutlass-wush (TFLOP/s (larger is better))  mxfp4-cutlass-noquant (TFLOP/s (larger is better))
0          1.0                                 3.086230                                        2.282214                                         2.356742                                            2.669181
1          4.0                                 9.855876                                        9.166545                                         9.438511                                           10.691307
2          8.0                                18.917385                                       18.320448                                        18.779652                                           21.494623
3         16.0                                32.032264                                       36.677808                                        37.110022                                           43.161456
4         32.0                                59.757453                                       73.302817                                        72.525295                                           86.848110
5         64.0                                83.603996                                      146.482262                                       145.589446                                          174.609933
6        128.0                               189.120309                                      289.312951                                       287.765135                                          348.569547
7        256.0                               212.402294                                      550.307646                                       550.018019                                          689.643538
8        512.0                               219.634120                                      645.278342                                       635.572630                                          766.210722
9       1024.0                               218.638870                                     1056.930316                                      1041.911171                                         1283.073761
10      2048.0                               226.130443                                     1123.278604                                      1110.555590                                         1298.541780
11      4096.0                               224.423111                                     1060.252436                                      1036.582555                                         1268.362368
12      8192.0                               222.690081                                     1067.311387                                      1037.898190                                         1235.338826
Qwen3-14B, N=34816 K=5120, HAD=32, BF16 vs MXFP4 GEMMs TFLOP/s:
BF16 vs MXFP4 GEMMs:
    batch_size  torch-bf16 (TFLOP/s (larger is better))  mxfp4-cutlass-had (TFLOP/s (larger is better))  mxfp4-cutlass-wush (TFLOP/s (larger is better))  mxfp4-cutlass-noquant (TFLOP/s (larger is better))
0          1.0                                 1.681770                                        3.246184                                         3.263700                                            3.377475
1          4.0                                 6.565408                                       14.116375                                        14.326203                                           15.462654
2          8.0                                13.111643                                       26.933887                                        27.034329                                           28.176132
3         16.0                                26.108874                                       56.837179                                        56.445237                                           58.955298
4         32.0                                43.374828                                      124.499566                                       122.670549                                          128.087693
5         64.0                                48.276973                                      302.831657                                       302.053037                                          323.638243
6        128.0                                95.696093                                      828.515837                                       825.485382                                          870.059941
7        256.0                               192.079550                                      863.884847                                       862.219660                                          894.836819
8        512.0                               192.695023                                     1037.803325                                      1032.605599                                         1070.027846
9       1024.0                               216.884377                                     1113.555379                                      1111.804407                                         1143.936367
10      2048.0                               231.914578                                     1177.797546                                      1175.535195                                         1214.256392
11      4096.0                               232.241921                                     1201.911404                                      1198.941905                                         1240.068142
12      8192.0                               231.105443                                     1215.626307                                      1210.199570                                         1258.024344
Qwen3-14B, N=5120 K=17408, HAD=32, BF16 vs MXFP4 GEMMs TFLOP/s:
BF16 vs MXFP4 GEMMs:
    batch_size  torch-bf16 (TFLOP/s (larger is better))  mxfp4-cutlass-had (TFLOP/s (larger is better))  mxfp4-cutlass-wush (TFLOP/s (larger is better))  mxfp4-cutlass-noquant (TFLOP/s (larger is better))
0          1.0                                 1.714712                                        2.887900                                         2.850709                                            3.048977
1          4.0                                 6.350039                                       11.507300                                        11.406290                                           12.213049
2          8.0                                11.585574                                       22.998834                                        22.811093                                           24.426116
3         16.0                                22.930437                                       45.935283                                        45.233184                                           49.138344
4         32.0                                49.613177                                       91.440590                                        89.116060                                           98.461384
5         64.0                                79.030336                                      181.045484                                       176.844944                                          195.112771
6        128.0                               175.021173                                      351.561573                                       343.924429                                          388.716915
7        256.0                               208.528124                                      670.005078                                       652.750371                                          774.111736
8        512.0                               215.798308                                      742.787261                                       724.588012                                          833.618106
9       1024.0                               221.585420                                     1086.935883                                      1053.959309                                         1336.501744
10      2048.0                               225.581883                                     1077.472883                                      1056.925512                                         1335.377433
11      4096.0                               224.562121                                     1092.402038                                      1057.857327                                         1290.917102
12      8192.0                               224.698633                                     1059.824378                                      1036.392680                                         1271.737031
Llama-3.1-70B, N=8192 K=8192, HAD=32, BF16 vs MXFP4 GEMMs TFLOP/s:
BF16 vs MXFP4 GEMMs:
    batch_size  torch-bf16 (TFLOP/s (larger is better))  mxfp4-cutlass-had (TFLOP/s (larger is better))  mxfp4-cutlass-wush (TFLOP/s (larger is better))  mxfp4-cutlass-noquant (TFLOP/s (larger is better))
0          1.0                                 1.698562                                        3.031826                                         3.047319                                            3.314132
1          4.0                                 6.366885                                       12.040107                                        12.145956                                           13.076667
2          8.0                                12.553697                                       24.849409                                        25.374995                                           26.903765
3         16.0                                24.715289                                       51.812482                                        52.185418                                           58.090831
4         32.0                                48.028603                                      112.720793                                       112.340099                                          124.548956
5         64.0                                82.307603                                      254.879254                                       255.288303                                          286.840157
6        128.0                               102.186604                                      508.802833                                       516.822541                                          589.504305
7        256.0                               174.132917                                      911.293421                                       894.546024                                         1051.602984
8        512.0                               178.920750                                     1035.064555                                      1035.623753                                         1173.549701
9       1024.0                               183.584385                                     1063.630655                                      1056.408253                                         1182.738811
10      2048.0                               183.915529                                     1020.651440                                      1007.379060                                         1176.767236
11      4096.0                               208.025854                                     1109.603691                                      1091.484727                                         1219.344681
12      8192.0                               221.946270                                     1133.675025                                      1117.710095                                         1270.162932
Llama-3.1-70B, N=57344 K=8192, HAD=32, BF16 vs MXFP4 GEMMs TFLOP/s:
BF16 vs MXFP4 GEMMs:
    batch_size  torch-bf16 (TFLOP/s (larger is better))  mxfp4-cutlass-had (TFLOP/s (larger is better))  mxfp4-cutlass-wush (TFLOP/s (larger is better))  mxfp4-cutlass-noquant (TFLOP/s (larger is better))
0          1.0                                 1.703765                                        3.456568                                         3.470778                                            4.102649
1          4.0                                 6.634405                                       13.827267                                        13.905614                                           13.944784
2          8.0                                13.203205                                       28.672360                                        28.764811                                           29.439854
3         16.0                                26.238431                                       60.490299                                        60.450431                                           70.527219
4         32.0                                26.690673                                      133.776626                                       133.485657                                          136.353066
5         64.0                                53.409705                                      334.333654                                       332.014956                                          353.791622
6        128.0                               149.890228                                      696.458180                                       695.409855                                          707.025765
7        256.0                               205.368776                                      916.751009                                       917.237351                                          937.197547
8        512.0                               208.314315                                     1092.457330                                      1087.729523                                         1108.739081
9       1024.0                               224.888626                                     1156.370655                                      1156.874644                                         1179.480958
10      2048.0                               226.349477                                     1213.557207                                      1202.935014                                         1227.210988
11      4096.0                               232.352383                                     1249.683069                                      1249.161550                                         1275.909357
12      8192.0                               232.797026                                     1271.009683                                      1270.315837                                         1296.834016
Llama-3.1-70B, N=8192 K=28672, HAD=32, BF16 vs MXFP4 GEMMs TFLOP/s:
BF16 vs MXFP4 GEMMs:
    batch_size  torch-bf16 (TFLOP/s (larger is better))  mxfp4-cutlass-had (TFLOP/s (larger is better))  mxfp4-cutlass-wush (TFLOP/s (larger is better))  mxfp4-cutlass-noquant (TFLOP/s (larger is better))
0          1.0                                 1.622564                                        3.529729                                         3.490130                                            3.662038
1          4.0                                 6.487812                                       13.815220                                        13.679456                                           14.413114
2          8.0                                12.351109                                       29.174722                                        28.900015                                           29.685529
3         16.0                                24.737062                                       60.222056                                        59.377586                                           62.399584
4         32.0                                49.094774                                      129.044987                                       127.340913                                          137.206190
5         64.0                                73.086390                                      285.203335                                       278.777389                                          302.050007
6        128.0                               107.323839                                      557.991373                                       546.327515                                          599.622214
7        256.0                               207.104709                                      835.906071                                       823.163487                                          907.613152
8        512.0                               217.530097                                     1015.176202                                       996.922151                                         1116.007649
9       1024.0                               185.479004                                     1035.237720                                      1018.911943                                         1153.598714
10      2048.0                               184.563154                                     1062.830523                                      1043.666518                                         1179.255611
11      4096.0                               208.896145                                     1115.274552                                      1096.501604                                         1240.059399
12      8192.0                               221.257721                                     1064.758572                                      1045.742965                                         1192.034715

######################################################################################################################################################

Qwen3-8B, N=4096 K=4096, HAD=32, BF16 vs MXFP4 GEMMs time (usec):
BF16 vs MXFP4 GEMMs:
    batch_size  torch-bf16 (usec (larger is worse))  mxfp4-had (usec (larger is worse))  mxfp4-wush (usec (larger is worse))
0          1.0                                 1.422007                                2.354622                                 1.628021
1          4.0                                 1.578632                                2.386453                                 1.628790
2          8.0                                 1.616135                                2.402952                                 1.867750
3         16.0                                 1.510022                                2.301684                                 2.154494
4         32.0                                 1.737455                                2.464400                                 2.735397
5         64.0                                 2.032870                                2.586974                                 2.789151
6        128.0                                 2.297851                                2.875415                                 2.828635
7        256.0                                 2.879941                                3.621308                                 3.541432
8        512.0                                 3.419795                                4.821603                                 5.593421
9       1024.0                                 5.256875                                7.352617                                 7.757982
10      2048.0                                10.973598                               12.138616                                12.402096
11      4096.0                                21.441794                               21.571841                                22.874825
12      8192.0                                69.257462                               40.194965                                44.426892
Qwen3-8B, N=4096 K=4096, HAD=32, BF16 vs MXFP4 GEMMs time (usec):
BF16 vs MXFP4 GEMMs:
    batch_size  torch-bf16 (usec (larger is worse))  mxfp4-had (usec (larger is worse))  mxfp4-wush (usec (larger is worse))
0          1.0                                 1.420335                                2.371109                                 1.630822
1          4.0                                 1.581332                                2.393470                                 1.631086
2          8.0                                 1.620056                                2.409941                                 1.870324
3         16.0                                 1.512252                                2.384396                                 2.047386
4         32.0                                 1.741421                                2.472238                                 2.747269
5         64.0                                 2.038507                                2.597810                                 2.797011
6        128.0                                 2.300280                                2.881333                                 2.834259
7        256.0                                 2.774019                                3.631087                                 3.544683
8        512.0                                 3.437306                                4.821873                                 5.609316
9       1024.0                                 5.286751                                7.270321                                 7.772890
10      2048.0                                11.084812                               12.152502                                12.462680
11      4096.0                                21.763739                               21.568369                                22.972372
12      8192.0                                69.266047                               40.383647                                44.666059
Qwen3-8B, N=24576 K=4096, HAD=32, BF16 vs MXFP4 GEMMs time (usec):
BF16 vs MXFP4 GEMMs:
    batch_size  torch-bf16 (usec (larger is worse))  mxfp4-had (usec (larger is worse))  mxfp4-wush (usec (larger is worse))
0          1.0                                 1.417846                                2.365053                                 1.631377
1          4.0                                 1.580891                                2.393077                                 1.632123
2          8.0                                 1.618361                                2.412239                                 1.866237
3         16.0                                 1.513844                                2.425544                                 2.046296
4         32.0                                 1.743500                                2.473079                                 2.746865
5         64.0                                 2.038301                                2.593684                                 2.797236
6        128.0                                 2.302851                                2.882427                                 2.860666
7        256.0                                 2.776146                                3.632908                                 3.543251
8        512.0                                 3.444967                                4.834728                                 5.613317
9       1024.0                                 5.296475                                7.269300                                 7.752942
10      2048.0                                11.121200                               12.145305                                12.503139
11      4096.0                                21.823431                               21.661422                                23.027289
12      8192.0                                69.538352                               40.482926                                44.737621
Qwen3-8B, N=4096 K=12288, HAD=32, BF16 vs MXFP4 GEMMs time (usec):
BF16 vs MXFP4 GEMMs:
    batch_size  torch-bf16 (usec (larger is worse))  mxfp4-had (usec (larger is worse))  mxfp4-wush (usec (larger is worse))
0          1.0                                 1.459411                                2.381860                                 2.076258
1          4.0                                 1.678447                                2.417383                                 2.077967
2          8.0                                 1.618461                                2.428109                                 2.081319
3         16.0                                 2.025721                                2.422212                                 2.238824
4         32.0                                 2.223509                                2.729533                                 2.931644
5         64.0                                 2.661339                                3.410290                                 3.424150
6        128.0                                 2.869485                                4.297555                                 3.856589
7        256.0                                 4.356715                                6.171122                                 6.815776
8        512.0                                 7.551473                                9.598120                                10.274083
9       1024.0                                14.699552                               16.941548                                18.033284
10      2048.0                                50.941548                               30.961281                                34.586868
11      4096.0                               130.874005                               76.134562                                90.763910
12      8192.0                               262.151814                              149.739229                               181.046646
Qwen3-14B, N=5120 K=5120, HAD=32, BF16 vs MXFP4 GEMMs time (usec):
BF16 vs MXFP4 GEMMs:
    batch_size  torch-bf16 (usec (larger is worse))  mxfp4-had (usec (larger is worse))  mxfp4-wush (usec (larger is worse))
0          1.0                                 1.545723                                2.253903                                 1.523314
1          4.0                                 1.595398                                2.396813                                 1.745431
2          8.0                                 1.658033                                2.416916                                 1.855546
3         16.0                                 1.557334                                2.421108                                 2.154051
4         32.0                                 1.744675                                2.381165                                 2.747375
5         64.0                                 2.073022                                2.658274                                 2.821803
6        128.0                                 2.446342                                3.034702                                 3.169057
7        256.0                                 3.181400                                3.784820                                 3.893793
8        512.0                                 3.865860                                5.415031                                 6.106715
9       1024.0                                 6.496753                                8.451290                                 9.161178
10      2048.0                                13.859297                               14.589829                                15.033817
11      4096.0                                26.782783                               26.345028                                28.495771
12      8192.0                               109.067984                               58.067288                                65.030567
Qwen3-14B, N=5120 K=5120, HAD=32, BF16 vs MXFP4 GEMMs time (usec):
BF16 vs MXFP4 GEMMs:
    batch_size  torch-bf16 (usec (larger is worse))  mxfp4-had (usec (larger is worse))  mxfp4-wush (usec (larger is worse))
0          1.0                                 1.430422                                2.368941                                 1.639257
1          4.0                                 1.591719                                2.400372                                 1.745308
2          8.0                                 1.654503                                2.420732                                 1.850487
3         16.0                                 1.559176                                2.424143                                 2.040047
4         32.0                                 1.857787                                2.494766                                 2.746926
5         64.0                                 2.073123                                2.657492                                 2.820789
6        128.0                                 2.445120                                3.037292                                 3.065827
7        256.0                                 3.121809                                3.901689                                 3.894817
8        512.0                                 3.872906                                5.414945                                 6.109537
9       1024.0                                 6.427995                                8.508511                                 9.278142
10      2048.0                                13.854922                               14.547731                                15.025911
11      4096.0                                26.939033                               26.348437                                28.564995
12      8192.0                               108.903489                               58.330461                                65.010473
Qwen3-14B, N=34816 K=5120, HAD=32, BF16 vs MXFP4 GEMMs time (usec):
BF16 vs MXFP4 GEMMs:
    batch_size  torch-bf16 (usec (larger is worse))  mxfp4-had (usec (larger is worse))  mxfp4-wush (usec (larger is worse))
0          1.0                                 1.428298                                2.367708                                 1.643347
1          4.0                                 1.592616                                2.399417                                 1.741838
2          8.0                                 1.655104                                2.423950                                 1.851354
3         16.0                                 1.444531                                2.420918                                 2.154979
4         32.0                                 1.858303                                2.495126                                 2.747058
5         64.0                                 2.073197                                2.657128                                 2.820156
6        128.0                                 2.444265                                2.916684                                 3.061068
7        256.0                                 3.228319                                3.901576                                 3.894663
8        512.0                                 3.866363                                5.415511                                 5.991999
9       1024.0                                 6.493126                                8.619847                                 9.274606
10      2048.0                                13.831544                               14.489163                                15.125574
11      4096.0                                26.925590                               26.348603                                28.386780
12      8192.0                               109.139186                               58.318636                                64.802511
Qwen3-14B, N=5120 K=17408, HAD=32, BF16 vs MXFP4 GEMMs time (usec):
BF16 vs MXFP4 GEMMs:
    batch_size  torch-bf16 (usec (larger is worse))  mxfp4-had (usec (larger is worse))  mxfp4-wush (usec (larger is worse))
0          1.0                                 1.592621                                2.396876                                 2.808610
1          4.0                                 1.520897                                2.424286                                 2.845764
2          8.0                                 1.652493                                2.470835                                 2.986666
3         16.0                                 2.056514                                2.609414                                 3.606658
4         32.0                                 2.443496                                2.913269                                 4.744671
5         64.0                                 2.939176                                3.704429                                 5.204518
6        128.0                                 3.545732                                5.009682                                 6.625288
7        256.0                                 5.589496                                7.672363                                 9.517546
8        512.0                                11.882211                               12.729873                                15.494283
9       1024.0                                23.431063                               22.826649                                27.642309
10      2048.0                                79.843844                               43.245430                                51.003725
11      4096.0                               185.543177                              106.558860                               129.074882
12      8192.0                               371.669148                              222.581605                               258.925992
Llama-3.1-70B, N=8192 K=8192, HAD=32, BF16 vs MXFP4 GEMMs time (usec):
BF16 vs MXFP4 GEMMs:
    batch_size  torch-bf16 (usec (larger is worse))  mxfp4-had (usec (larger is worse))  mxfp4-wush (usec (larger is worse))
0          1.0                                 1.563621                                2.275481                                 1.698368
1          4.0                                 1.617506                                2.409927                                 1.877581
2          8.0                                 1.517010                                2.423872                                 1.893373
3         16.0                                 1.743410                                2.463727                                 2.195974
4         32.0                                 1.938568                                2.488738                                 2.766293
5         64.0                                 2.301333                                2.881694                                 2.857417
6        128.0                                 2.918030                                3.648872                                 3.080530
7        256.0                                 3.373083                                4.724595                                 5.600542
8        512.0                                 5.423138                                7.390730                                 7.536857
9       1024.0                                11.164986                               12.186777                                12.574421
10      2048.0                                22.050236                               21.532840                                22.841756
11      4096.0                                69.588275                               40.796096                                43.899534
12      8192.0                               174.816159                              100.824348                               111.773497
Llama-3.1-70B, N=57344 K=8192, HAD=32, BF16 vs MXFP4 GEMMs time (usec):
BF16 vs MXFP4 GEMMs:
    batch_size  torch-bf16 (usec (larger is worse))  mxfp4-had (usec (larger is worse))  mxfp4-wush (usec (larger is worse))
0          1.0                                 1.564379                                2.387103                                 1.819581
1          4.0                                 1.619089                                2.412328                                 1.874156
2          8.0                                 1.480557                                2.306541                                 1.892881
3         16.0                                 1.744015                                2.461782                                 2.196791
4         32.0                                 2.049276                                2.602782                                 2.773648
5         64.0                                 2.300586                                2.878107                                 2.745355
6        128.0                                 2.915565                                3.650406                                 3.080675
7        256.0                                 3.471783                                4.841010                                 5.619205
8        512.0                                 5.325181                                7.390643                                 7.563679
9       1024.0                                11.204928                               12.184443                                12.470604
10      2048.0                                22.080400                               21.652333                                22.939675
11      4096.0                                69.420484                               40.642301                                43.979604
12      8192.0                               174.626302                              100.729494                               111.835336
Llama-3.1-70B, N=8192 K=28672, HAD=32, BF16 vs MXFP4 GEMMs time (usec):
BF16 vs MXFP4 GEMMs:
    batch_size  torch-bf16 (usec (larger is worse))  mxfp4-had (usec (larger is worse))  mxfp4-wush (usec (larger is worse))
0          1.0                                 1.493283                                2.427836                                 3.405943
1          4.0                                 1.675769                                2.452935                                 3.390396
2          8.0                                 2.033186                                2.569877                                 3.428617
3         16.0                                 2.198451                                2.692320                                 3.963310
4         32.0                                 2.847471                                3.471167                                 4.970208
5         64.0                                 3.119583                                4.597460                                 5.566523
6        128.0                                 4.929105                                6.792581                                 6.817099
7        256.0                                 9.921723                               10.880892                                12.463545
8        512.0                                19.389554                               19.270768                                21.312904
9       1024.0                                54.719840                               35.901764                                40.389480
10      2048.0                               153.194796                               88.680647                               101.340674
11      4096.0                               305.307985                              177.801683                               212.575468
12      8192.0                               612.175991                              372.529718                               426.565817