ROCmFP6 Strix Quality Research Report

HermesAgent-20

0.78

New FP6 vs Q6 0.76 and old FP6 0.60

HumanEval+

155/164

New FP6 vs Q6 153/164

Served Decode Range

15.72-30.73 tok/s

New FP6 ROCm across 512 to 64k prompt rows

Chadrock v2 Qwen3.6 27B ROCmFP6 Strix Quality AMD-tuned GGUF release card

Executive Finding

The old Strix Speed recipe was too small for the quality target: it used FP4-fast bulk routing and landed at 4.82 BPW. The new Q6_0_ROCMFPX_STRIX_QUALITY recipe moves bulk tensors back to Q6 ROCmFPX and protects attention/output plus selected FFN down/gate tensors with Q8 ROCmFPX. That raised the model to 7.37 BPW and recovered Q6-class behavior on HermesAgent-20 while also improving served MTP decode speed against Q6.

Recipe Size and Tensor Routing

Tensor Type Mix

What Changed In The Code

The implementation added a new public quantization ftype, exposed it through llama-quantize and the wrapper script, taught the loader its display name, and added a new quant-selection path in src/llama-quant.cpp.

Old Strix Speed / Lean: default tensor type was Q4_0_ROCMFP4_FAST, with selected tensors promoted to Q6. This produced strong size/speed pressure but failed the Q6 quality floor.
New Strix Quality: default tensor type is Q6_0_ROCMFPX; token/output, attention Q/K/V/O or fused QKV, and selected first/last/high-sensitivity FFN down/gate tensors are promoted to Q8_0_ROCMFPX.
Resulting model: 24018.32 MiB, 7.37 BPW, SHA256 144062b43fade17c....

Quality Benchmarks

Perplexity Check

HermesAgent-20 Category Profile

HermesAgent-20 Scenario Grid

Select a scenario to compare exactly how New FP6, Unsloth Q6, and Old FP6 passed, partially passed, or failed.

HermesAgent-20 Interpretation

HermesAgent-20 is the benchmark that exposed the original problem: the old FP6 speed recipe scored 0.60 while the Unsloth Q6 baseline scored 0.76. The new quality recipe scored 0.78, so the recipe change closed the gap and slightly exceeded Q6 on the overall score.

Where the new recipe improved over Q6: memory/recall rose from 80 to 84, scheduling/delivery rose from 55 to 75, and HA-02 plus HA-15 were major scenario wins.
Where Q6 still has an edge: skills/procedural memory was 100 on Q6 and 88 on the new FP6. HA-04 and HA-11 were regressions for the new recipe.
Why old FP6 failed: the old speed recipe was not merely a smaller Q6; it routed 388 tensors through FP4-fast bulk. It passed some scenarios, but collapsed on several procedural/tool-memory cases.
Runtime caveat: the new HA-20 run took 1541.5 s versus Q6 at 1037.5 s. That wall time reflects benchmark-specific long responses, especially HA-02, and should be read separately from the controlled served speed matrix.

Served MTP Decode: FP6 vs Q6 on ROCm

End-to-End Request Time

ROCm vs Vulkan Decode

ROCm vs Vulkan Prompt Processing

Speed Takeaways

The new FP6 recipe is faster than Q6 end-to-end on every controlled ROCm served row, with request totals 2.3% to 21.5% lower depending on prompt length.
Decode ranges from 29.52-30.73 tok/s on the 512, 4k, and 16k prompt rows, and drops to 15.72 tok/s at 64k. The 64k row is a long-context stress point, not the headline speed for normal short and mid-context serving.
The speed win is mainly decode and long-context behavior. Short-prompt prefill at 512 and 2k is still slower than Q6, but generation throughput is higher at every prompt length.
ROCm is the correct backend for this recipe. Vulkan matched short-prompt prefill but lost decode on every row and was slower end-to-end at every prompt length.
All controlled speed rows reported cache_n=0, so the comparison was not a prompt-cache artifact.

ROCmFP6 Strix Quality: Q6-Level Quality With Served-Speed Gains

Executive Finding

Recipe Size and Tensor Routing

Tensor Type Mix

What Changed In The Code

Quality Benchmarks

Perplexity Check

HermesAgent-20 Category Profile

HermesAgent-20 Scenario Grid

HermesAgent-20 Interpretation

Served MTP Decode: FP6 vs Q6 on ROCm

End-to-End Request Time

ROCm vs Vulkan Decode

ROCm vs Vulkan Prompt Processing

Speed Takeaways

Detailed Tables

Dry-Run Recipe Audit

Quality Results

Served Speed Rows

Release Links