Executive Finding
Q6_0_ROCMFPX_STRIX_QUALITY recipe moves bulk tensors back to Q6 ROCmFPX and protects attention/output plus selected FFN down/gate tensors with Q8 ROCmFPX. That raised the model to 7.37 BPW and recovered Q6-class behavior on HermesAgent-20 while also improving served MTP decode speed against Q6.
Recipe Size and Tensor Routing
Tensor Type Mix
What Changed In The Code
The implementation added a new public quantization ftype, exposed it through llama-quantize and the wrapper script, taught the loader its display name, and added a new quant-selection path in src/llama-quant.cpp.
- Old Strix Speed / Lean: default tensor type was
Q4_0_ROCMFP4_FAST, with selected tensors promoted to Q6. This produced strong size/speed pressure but failed the Q6 quality floor. - New Strix Quality: default tensor type is
Q6_0_ROCMFPX; token/output, attention Q/K/V/O or fused QKV, and selected first/last/high-sensitivity FFN down/gate tensors are promoted toQ8_0_ROCMFPX. - Resulting model:
24018.32 MiB,7.37 BPW, SHA256144062b43fade17c....
Quality Benchmarks
Perplexity Check
HermesAgent-20 Category Profile
HermesAgent-20 Scenario Grid
Select a scenario to compare exactly how New FP6, Unsloth Q6, and Old FP6 passed, partially passed, or failed.
HermesAgent-20 Interpretation
HermesAgent-20 is the benchmark that exposed the original problem: the old FP6 speed recipe scored 0.60 while the Unsloth Q6 baseline scored 0.76. The new quality recipe scored 0.78, so the recipe change closed the gap and slightly exceeded Q6 on the overall score.
- Where the new recipe improved over Q6: memory/recall rose from 80 to 84, scheduling/delivery rose from 55 to 75, and HA-02 plus HA-15 were major scenario wins.
- Where Q6 still has an edge: skills/procedural memory was 100 on Q6 and 88 on the new FP6. HA-04 and HA-11 were regressions for the new recipe.
- Why old FP6 failed: the old speed recipe was not merely a smaller Q6; it routed 388 tensors through FP4-fast bulk. It passed some scenarios, but collapsed on several procedural/tool-memory cases.
- Runtime caveat: the new HA-20 run took 1541.5 s versus Q6 at 1037.5 s. That wall time reflects benchmark-specific long responses, especially HA-02, and should be read separately from the controlled served speed matrix.
Served MTP Decode: FP6 vs Q6 on ROCm
End-to-End Request Time
ROCm vs Vulkan Decode
ROCm vs Vulkan Prompt Processing
Speed Takeaways
- The new FP6 recipe is faster than Q6 end-to-end on every controlled ROCm served row, with request totals 2.3% to 21.5% lower depending on prompt length.
- Decode ranges from
29.52-30.73 tok/son the 512, 4k, and 16k prompt rows, and drops to15.72 tok/sat 64k. The 64k row is a long-context stress point, not the headline speed for normal short and mid-context serving. - The speed win is mainly decode and long-context behavior. Short-prompt prefill at 512 and 2k is still slower than Q6, but generation throughput is higher at every prompt length.
- ROCm is the correct backend for this recipe. Vulkan matched short-prompt prefill but lost decode on every row and was slower end-to-end at every prompt length.
- All controlled speed rows reported
cache_n=0, so the comparison was not a prompt-cache artifact.
Detailed Tables
Dry-Run Recipe Audit
Quality Results
Served Speed Rows
Release Links
- Hugging Face model page: jcbtc/Chadrockv2-Qwen3.6-27B-ROCmFP6-STRIX-QUALITY
- ROCmFPX recipe branch: github.com/ciru-ai/ROCmFPX/tree/rocmfp6-strix-quality
- Upstream compare link: charlie12345/ROCmFPX compare view
