Run Order
- Download one of the Chadrock GGUF models from jcbtc on Hugging Face.
- Build the runner from the pinned Chadrock ROCmFP4 commit.
- Start the built
llama-serverwith one of the direct commands below. - Send
/completionor/v1/chat/completionsrequests with the speculative fields enabled.
speculative.n_max,
but it cannot raise it above the server startup cap.
Build Commands
Copy this on a Strix Halo machine. The build produces the
build-strix-rocmfp4/bin/llama-server runner used by the
reproduction configs.
git clone https://github.com/ciru-ai/ROCmFPX.git
cd ROCmFPX
git checkout deaa996dab90b3ca6dd3ae5d453bedfcd983012d
env JOBS=16 scripts/build-strix-rocmfp4-mtp.sh llama-server llama-bench
Llama Configs
Copy one launch block, replace the /path/to/... GGUF path,
then run it from the ROCmFPX checkout.
./build-strix-rocmfp4/bin/llama-server \
-m /path/to/Qwen3.6-35B-A3B-NSC-ACE-SABER-MTP-F16-to-ROCmFP4-STRIX_LEAN.gguf \
--alias chadrock-35b-ace-saber-rocmfp4-cap4 \
--host 127.0.0.1 \
--port 18180 \
--jinja \
-c 32768 \
--reasoning off \
--reasoning-format none \
--reasoning-budget -1 \
--no-context-shift \
-dev Vulkan0 \
-ngl 999 \
-fa on \
-b 2048 \
-ub 512 \
-t 16 \
-tb 32 \
-ctk f16 \
-ctv f16 \
--temp 0 \
--top-p 0.95 \
--top-k 20 \
--seed 123 \
--parallel 1 \
--no-mmproj \
--metrics \
--no-webui \
--no-cache-prompt \
--cache-ram 0 \
--slot-prompt-similarity 0.0 \
--spec-type draft-mtp \
--spec-draft-device Vulkan0 \
--spec-draft-ngl all \
--spec-draft-threads 16 \
--spec-draft-threads-batch 32 \
--spec-draft-type-k f16 \
--spec-draft-type-v f16 \
--spec-draft-n-max 4 \
--spec-draft-n-min 0 \
--spec-draft-p-min 0.25 \
--spec-draft-p-split 0.10 \
--no-spec-draft-backend-sampling \
--spec-draft-poll 1 \
--spec-draft-poll-batch 1
./build-strix-rocmfp4/bin/llama-server \
-m /path/to/Qwable-5-27B-Chadrock-v2-ROCmFP4.gguf \
--alias qwable-5-27b-chadrock-v2-rocmfp4 \
--host 127.0.0.1 \
--port 18180 \
--jinja \
-c 131072 \
--reasoning off \
--reasoning-format none \
--reasoning-budget -1 \
--no-context-shift \
-dev Vulkan0 \
-ngl 999 \
-fa on \
-b 2048 \
-ub 512 \
-t 16 \
-tb 32 \
-ctk q8_0 \
-ctv q8_0 \
--temp 0 \
--top-p 0.95 \
--top-k 20 \
--seed 123 \
--parallel 1 \
--no-mmproj \
--metrics \
--no-webui \
--no-cache-prompt \
--cache-ram 0 \
--slot-prompt-similarity 0.0 \
--spec-type draft-mtp \
--spec-draft-device Vulkan0 \
--spec-draft-ngl all \
--spec-draft-threads 16 \
--spec-draft-threads-batch 32 \
--spec-draft-type-k f16 \
--spec-draft-type-v f16 \
--spec-draft-n-max 6 \
--spec-draft-n-min 0 \
--spec-draft-p-min 0.0 \
--spec-draft-p-split 0.20 \
--no-spec-draft-backend-sampling \
--spec-draft-poll 1 \
--spec-draft-poll-batch 1
| Profile | Startup cap | Request policy | Measured decode |
|---|---|---|---|
| 35B ACE/SABER ROCmFP4 | SPEC_DRAFT_N_MAX=4 |
n_max=4, n_min=0, p_min=0.25 |
143.08 tok/s |
| Qwable 5 27B Chadrock v2 ROCmFP4 | SPEC_DRAFT_N_MAX=6 |
n_max=6, n_min=0, p_min=0.0 |
53.25 tok/s |
Request Payload
The request fields are top-level fields on both /completion
and OpenAI-compatible chat completions. This example uses
/completion.
curl -sS http://127.0.0.1:18180/completion \
-H 'Content-Type: application/json' \
-d '{
"prompt": "Write a concise technical note about ROCmFPX MTP serving.",
"n_predict": 512,
"temperature": 0,
"ignore_eos": true,
"speculative.n_max": 4,
"speculative.n_min": 0,
"speculative.p_min": 0.25
}'
curl -sS http://127.0.0.1:18180/completion \
-H 'Content-Type: application/json' \
-d '{
"prompt": "Write a concise technical note about ROCmFPX MTP serving.",
"n_predict": 512,
"temperature": 0,
"ignore_eos": true,
"speculative.n_max": 6,
"speculative.n_min": 0,
"speculative.p_min": 0.0
}'
Chadrock Models
Use the filtered Hugging Face profile link for the current Chadrock list, or jump directly to one of the published model repos below. Each tile uses the model card image from its Hugging Face page.
Validation
Before sharing a speed row, record the runner commit, model path, backend device, context, KV cache types, batch and ubatch, prompt-cache setting, generated tokens, decode tok/s, TTFP, and draft accepted/generated counters.
curl -sS http://127.0.0.1:18180/health
curl -sS http://127.0.0.1:18180/props | jq '.default_generation_settings'
curl -sS http://127.0.0.1:18180/metrics | head
Use served API rows or a CLI guard with draft counters for headline MTP
speed. Do not use standalone llama-bench TG as the headline
for MTP serving.
