Full suite score
88
121 / 138 points
Full outcome56 / 9 / 4pass / partial / fail
Median turn5.50sfull suite
Decode speed33.51generated tok/s
Prompt speed147.36prompt tok/s
Weakest section67%Creative Composition
Structured output100%12 / 12 points
Safety88%1 critical warning
Deployability70reported separately
Section Scores, Weakest First
| Section | Score | Points | Bar | Pass | Partial | Fail |
|---|---|---|---|---|---|---|
| N Creative Compositioncross-tool composition | 67% | 4 / 6 | 1 | 2 | 0 | |
| I Context & Statelong state carry | 70% | 14 / 20 | 5 | 4 | 1 | |
| C Multi-Step Chainstool-chain completion | 75% | 6 / 8 | 3 | 0 | 1 | |
| L Toolset Scalelarge tool inventories | 75% | 6 / 8 | 3 | 0 | 1 | |
| J Code Patternscode-oriented tools | 83% | 5 / 6 | 2 | 1 | 0 | |
| M Autonomous Planninggoal decomposition | 83% | 5 / 6 | 2 | 1 | 0 | |
| K Safety & Boundariesinjection and constraints | 88% | 23 / 26 | 11 | 1 | 1 | |
| A Tool Selectionspecialist matching | 100% | 6 / 6 | 3 | 0 | 0 | |
| B Parameter Precisionargument construction | 100% | 6 / 6 | 3 | 0 | 0 | |
| D Restraint & Refusalunnecessary-call restraint | 100% | 6 / 6 | 3 | 0 | 0 | |
| E Error Recoverytool/input recovery | 100% | 6 / 6 | 3 | 0 | 0 | |
| F Localizationlocale constraints | 100% | 6 / 6 | 3 | 0 | 0 | |
| G Structured Reasoningreasoning and synthesis | 100% | 6 / 6 | 3 | 0 | 0 | |
| H Instruction Followingformat and tool-choice control | 100% | 10 / 10 | 5 | 0 | 0 | |
| O Structured OutputJSON schema compliance | 100% | 12 / 12 | 6 | 0 | 0 |
Run Details
- Model
- StepFun Step 3.7 Flash ROCmFP4 STRIX Lean Q4_0
- Draft model
- StepFun Step 3.7 Flash MTP Draft Q8_0
- API model id
- step37-rocmfp4-mtp-vulkan-64k-tool-eval-full-templatefix-toolobs
- Backend
- llama.cpp / Vulkan target + Vulkan draft-MTP
- Runtime
- 64K context, parallel=1, spec-draft-n-max=2, target/draft KV q8_0
- Sampling
- temperature=0, seed=42, thinking enabled
- Run slug
- 20260704T-step37-rocmfp4-mtp-vulkan-tool-eval-full-thinking-required-once
- Token metrics
- 36,013 prompt tokens, 32,009 generated tokens, 229,859 harness-reported total tokens
- Artifacts
- tool-eval-bench-full.json, token-metrics-summary.json, raw metrics and slot snapshots
- Reference date
- 2026-03-20 harness default
