Ciru Inference Lab
AgentWorld as a Public-Contract World Model
We tested Qwen-AgentWorld as an environment simulator and public-contract sidecar for agent benchmarks. The traces were useful, but the current experiment did not prove a general model-quality lift. The cleanest BigCodeBench slice stayed at 3/6, and the HermesAgent-20 HA-17 variants stayed at 70/100.
Useful Trace, Unproven Lift
The experiment should be preserved as a negative result with useful diagnostics. AgentWorld made public-contract and environment assumptions visible, but once custom gates were removed, the benchmark evidence did not support a broad quality claim.
Original Hypotheses
The HermesAgent-20 HA-17 experiment framed AgentWorld as an advisor beside a separate executor, not as a replacement benchmark runner.
| ID | Claim | Prediction | Observed state |
|---|---|---|---|
| H1 | AgentWorld is best used as a simulator/advisor, not as the tool executor. | Advice logged at the harness boundary should be easier to score and debug. | Supported for debugging: traces were inspectable and helped isolate failure layers. |
| H2 | HA-17 failure is orchestration and event shape, not arithmetic difficulty. | A single batched delegation event plus exact summary schema should score 100. | Partly supported: batching worked and values were right, but the schema was wrong. |
| H3 | AgentWorld FP4 lean is the first two-model profile to try. | 128k or 262k context should fit beside another model and avoid card-context mismatch. | Fit tested: served successfully, but quality proof did not follow. |
| H4 | The harness must test real controller behavior. | Every proposed tool call should pass through AgentWorld preflight, then real artifacts should be scored. | Validated as methodology: real artifacts exposed schema and content failures. |
Source: archived HermesAgent-20 HA-17 experiment notes and evaluator outputs.
AgentWorld Stayed Outside the Score
The correct design boundary was: public task in, AgentWorld observation out, controller contract compiled, executor writes artifacts, official verifier scores. The reportable score must come from the real benchmark evaluator or verifier, never from AgentWorld's own judgment.
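The boundary can be sketched as a small pipeline. This is an illustrative sketch only: `RunRecord`, `run_task`, and the four stage callables are hypothetical names, not the harness's actual API. The point it shows is that the verifier is handed the task and the real artifact, and never the AgentWorld observation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RunRecord:
    task: str
    observation: str  # AgentWorld output: advisory, never a score
    contract: dict    # compiled by the controller from the observation
    artifact: str     # what the executor actually wrote
    score: float      # assigned by the official verifier only


def run_task(
    task: str,
    simulate: Callable[[str], str],                # hypothetical AgentWorld call
    compile_contract: Callable[[str, str], dict],  # hypothetical controller step
    execute: Callable[[str, dict], str],           # hypothetical executor
    verify: Callable[[str, str], float],           # official evaluator or verifier
) -> RunRecord:
    observation = simulate(task)
    contract = compile_contract(task, observation)
    artifact = execute(task, contract)
    # The verifier receives only the task and the real artifact, so
    # AgentWorld's judgment cannot leak into the reported score.
    score = verify(task, artifact)
    return RunRecord(task, observation, contract, artifact, score)
```

Keeping every stage output in one record also makes it straightforward to persist the raw observation and compiled contract alongside the official score.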
What We Tried and Why
The project moved from raw AgentWorld generation, to official-shaped world-model prompting, to controller/constraint experiments. Each step improved observability, but the final score evidence remained insufficient.
| Attempt | Reason | Settings or change | Observed result | Lesson |
|---|---|---|---|---|
| Direct BigCodeBench AgentWorld | Check whether AgentWorld could solve code tasks directly. | Old direct codegen, low output cap. | Invalid: cap saturation, submitted <think> blocks, partial reasoning, missing task_func. | Do not report the old 5/148; add validation before scoring. |
| Guarded direct smoke | Separate parser/cap failures from real task failures. | MAX_NEW_TOKENS=65536, validation for syntax, entry point, no think tags. | BCB13 and BCB15 generated valid code but failed official tests. | Some failures became legitimate model/task failures rather than harness failures. |
| Official-shaped AgentWorld + Ace loop | Use AgentWorld as intended: state/action to observation. | system_str plus Action plus "Predict the next Environment Observation". | Traces captured public contracts and model handoffs. | Observability improved, but prose advice was too weak unless compiled. |
| Custom gated v5 diagnostic | See if task-specific public smokes could repair known failure modes. | Controller behavior gates and public validation smokes. | 5/6 passed (diagnostic pass@1=0.8333). | Diagnostic only; gates were too task-shaped for a fair benchmark method. |
| Structural-only clean slice | Remove custom behavior gates. | No custom behavioral gates; structural checks only. | 3/6 passed (pass@1=0.5), no generation failures. | No clear evidence that AgentWorld reliably improved Ace. |
| HA-16/HA-17 controller-assisted runs | Test real agent workflow rather than synthetic BCB only. | AgentWorld preflight, controller decisions, real tool execution. | HA-16 rose from 30 to 85 but still failed; HA-17 stayed at 70. | Traces solved subproblems; output-contract enforcement was still missing. |
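The structural guards named in the table (valid syntax, entry point present, no think tags, no cap saturation) can be sketched as one pre-scoring check. This is a minimal illustration, not the guarded runner's code: `structural_check` is a hypothetical name, and treating `finish_reason == "length"` as cap saturation is an assumption based on the common OpenAI-compatible response convention.

```python
import ast


def structural_check(
    code: str,
    entry_point: str = "task_func",
    finish_reason: str = "stop",
) -> list[str]:
    """Return structural problems; an empty list means the sample may be scored."""
    problems = []
    # Assumption: the server reports "length" when the output cap was hit.
    if finish_reason == "length":
        problems.append("cap saturation")
    if "<think>" in code or "</think>" in code:
        problems.append("think tags present")
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        problems.append(f"syntax error: {exc.msg}")
        return problems
    # Only top-level function definitions count as an entry point here.
    defined = {
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }
    if entry_point not in defined:
        problems.append(f"missing entry point: {entry_point}")
    return problems
```

Checks of this kind say nothing about task behavior, which is what distinguishes them from the custom behavioral gates used in the v5 diagnostic.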
HA-16 and HA-17: Solved Subproblems, Not the Benchmarks
The HermesAgent-20 run is the best local baseline for agent workflow shape. The full official-20 run averaged a raw score of 78.5, imported into the quality store as 0.79. HA-16 and HA-17 remained useful because they isolate target-resolution and output-schema failures.
HA-16: Message Target Resolution
| Run | Score | What failed before | What AgentWorld/controller fixed | Why it still failed |
|---|---|---|---|---|
| Full HermesAgent-20 HA-16 | 30 | Delivery was undefined even after listing targets. | - | The intended engineering target was not reached. |
| AgentWorld-assisted HA-16 | 85 | Target name/id ambiguity. | AgentWorld observation exposed notify_engineering and told the controller to use the target id, not display name. The run sent one message successfully to notify_engineering. | readSummary=false and contentScore=0; the harness required reading the file content first. |
HA-17: Parallel Delegation
| Variant | Duration | Status | Score | What happened |
|---|---|---|---|---|
| 128k assisted run | 159.89s | final | 70 | Single batched delegation, correct values, wrong output keys. |
| 262k card-parameter run | 230.506s | final | 70 | Three AgentWorld runs, three controller runs, thirteen tool events; still wrote sum_a, sorted_b, duplicates_c. |
| No-thinking contract-prompt run | 185.512s | final | 70 | Contract prompting did not fix hidden output-schema ambiguity. |
| Tolerant controller-parse run | 32.546s | final | 70 | Parser tolerance fixed controller fragility, not verifier schema. |
The Clean Slice Is the Reportable Result
BigCodeBench is not a native AgentWorldBench task. The safest preserved comparison is the structural-only slice where custom behavioral gates were removed and the official evaluator decided pass/fail.
Clean Structural-Only Slice
| Metric | Value | Evidence basis |
|---|---|---|
| Run tag | Clean six-task structural-only slice | Preserved metrics and evaluator output |
| Samples | 6 | Preserved metrics summary |
| Passed / failed | 3 / 3 | Preserved metrics summary and official evaluator output |
| Generation failures | 0 | generation_summary.failures=[] |
| Codegen / eval wall time | 140s / 24s | Preserved metrics summary |
| Ground-truth pass rate | 1.0 | Preserved metrics summary |
Per-Task Outcome
| Task | Status | AgentWorld turns | AgentWorld TG (tok/s) | Coder TG (tok/s) | Main failure or pass note |
|---|---|---|---|---|---|
| BigCodeBench/13 | fail | 3 | 58.98 | 119.48 | FTP listing/download behavior did not match official tests; eval asserted 0 != 2. |
| BigCodeBench/15 | fail | 2 | 59.64 | 113.03 | Error text missed exact phrase: Error executing command. |
| BigCodeBench/17 | fail | 2 | 59.64 | 102.39 | Process lookup and subprocess.Popen call shape differed from tests. |
| BigCodeBench/19 | pass | 2 | 59.65 | 102.53 | Zip files task passed. |
| BigCodeBench/34 | pass | 2 | 59.54 | 84.79 | WordCloud plotting task passed. |
| BigCodeBench/37 | pass | 2 | 59.29 | 100.28 | RandomForest feature-importance plot task passed. |
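The slice's pass@1 follows directly from the per-task statuses. The sketch below assumes one sample per task, which matches the six samples reported for six tasks; `pass_at_1` is a hypothetical helper name.

```python
# Statuses copied from the per-task table; one sample per task is assumed.
statuses = {
    "BigCodeBench/13": "fail",
    "BigCodeBench/15": "fail",
    "BigCodeBench/17": "fail",
    "BigCodeBench/19": "pass",
    "BigCodeBench/34": "pass",
    "BigCodeBench/37": "pass",
}


def pass_at_1(results: dict[str, str]) -> float:
    """Fraction of tasks whose single sample passed the official evaluator."""
    if not results:
        raise ValueError("no results to score")
    passed = sum(status == "pass" for status in results.values())
    return passed / len(results)


clean_slice = pass_at_1(statuses)
print(clean_slice)  # 0.5
```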
Where AgentWorld Helped
These examples are the strongest positive evidence, but each one has a boundary. They show traces solving or isolating subproblems that previous runs missed; they do not prove a general benchmark lift.
| Case | Previously failed | Trace/help signal | Later result | Claim boundary |
|---|---|---|---|---|
| HA-16 | Full run score 30; delivery undefined. | AgentWorld listed notify_engineering and instructed the controller to use target id, not display name. | Assisted score 85, one successful send to notify_engineering. | Target-resolution subproblem solved; benchmark still failed because the file was not read. |
| BigCodeBench/17 | Structural-only failed process-detection and Popen([process_name]) shape tests. | Trace exposed exact public strings and process-management surfaces, making the API-shape failure obvious. | Targeted process-management reprises reached pass@1=1.0. | Diagnostic reprise, not a clean official-suite improvement. |
| BigCodeBench/13 | Structural-only failed FTP mocked download count with AssertionError: 0 != 2. | Trace captured the public wget requirement, exact exception strings, and unknown remote file risks. | Targeted FTP download reprise reached pass@1=1.0. | Targeted diagnostic, not reportable as a general benchmark score. |
| BigCodeBench/199 | Older Ace/SABER batch failed three tests with ValueError: Not naive datetime (tzinfo is already set) from pytz.tzname. | AgentWorld guidance emphasized local-time format YYYY-MM-DD HH:MM:SS ZZZ; submitted code used strftime('%Z') instead of tzname(aware_datetime). | AgentWorld+Ace sanity run and current Ace overlap rerun both reached pass@1=1.0. | Likely public-contract shaping; not isolated proof because the current Ace overlap rerun also passed. |
| BigCodeBench/82 | Ace hard-depended on undeclared Flask templates. | AgentWorld identified template filenames as unknown/undeclared, which inspired the controller-contract design. | All preserved BCB82 trace variants still failed official evaluation. | Excellent failure-analysis example; not a solved example. |
Overhead Was Real
The two-model loop adds calls, tokens, and wall time. The preserved logs contain good prompt-processing (PP) and token-generation (TG) throughput data and token-usage evidence. They do not contain reliable peak VRAM/RAM deltas, so memory is reported only as startup snapshots, model file sizes, and checkpoint sizes.
AgentWorld Serving Profile
| Field | Value |
|---|---|
| Model | Qwen-AgentWorld 35B A3B ROCmFP4 Strix lean, no-MTP profile |
| Model file size | 18,597,338,016 bytes |
| Runtime | llama-server, ROCm0, FP4 lean, no MTP, --metrics |
| Context | n_ctx=131072, model train context 262144 |
| KV/cache | ctk q8_0, ctv q8_0, prompt cache disabled |
| Batch / ubatch | -b 2048, -ub 512 |
| Device snapshot | ROCm0 Radeon 8060S Graphics: 126976 MiB total, 92929 MiB free at startup |
| CPU memory snapshot | AMD Ryzen AI Max+ 395: 126431 MiB free at startup |
| Context checkpoint size | 62.813 MiB per checkpoint in log snippets |
Observed PP/TG From Server Log
| Prompt tokens | Prompt eval time | PP tok/s | Generated tokens | Eval time | TG tok/s |
|---|---|---|---|---|---|
| 1186 | 1026.55 ms | 1155.33 | 768 | 12382.72 ms | 62.02 |
| 526 | 498.36 ms | 1055.47 | 768 | 12352.43 ms | 62.17 |
| 432 | 384.99 ms | 1122.10 | 768 | 12346.63 ms | 62.20 |
| 431 | 713.21 ms | 604.31 | 4741 | 77787.73 ms | 60.95 |
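The throughput columns are token counts divided by elapsed time. The sketch below recomputes them from the table rows as a sanity check; `tokens_per_second` is a hypothetical helper, and small differences from the logged values are expected because the logged millisecond timings are themselves rounded.

```python
def tokens_per_second(tokens: int, elapsed_ms: float) -> float:
    """Throughput in tokens per second from a token count and a time in ms."""
    return tokens / (elapsed_ms / 1000.0)


# (prompt tokens, prompt eval ms, logged PP, generated tokens, eval ms, logged TG)
rows = [
    (1186, 1026.55, 1155.33, 768, 12382.72, 62.02),
    (526, 498.36, 1055.47, 768, 12352.43, 62.17),
    (432, 384.99, 1122.10, 768, 12346.63, 62.20),
    (431, 713.21, 604.31, 4741, 77787.73, 60.95),
]

# Largest absolute gap between recomputed and logged throughput, in tok/s.
max_gap = 0.0
for p_tok, p_ms, pp_logged, g_tok, g_ms, tg_logged in rows:
    pp = tokens_per_second(p_tok, p_ms)
    tg = tokens_per_second(g_tok, g_ms)
    max_gap = max(max_gap, abs(pp - pp_logged), abs(tg - tg_logged))
    print(f"PP {pp:8.2f} tok/s   TG {tg:6.2f} tok/s")
```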
AgentWorld Token Usage Examples
| Run | Turn | Finish | Prompt tokens | Completion tokens | Total tokens | Cached tokens |
|---|---|---|---|---|---|---|
| HA-17 assisted run | 1 | stop | 1189 | 4753 | 5942 | 0 |
| HA-17 assisted run | 2 | stop | 1102 | 4239 | 5341 | 673 |
| HA-17 assisted run | 3 | stop | 1186 | 2821 | 4007 | 673 |
| HA-16 assisted | 1 | stop | 865 | 3923 | 4788 | 0 |
| HA-16 assisted | 2 | stop | 907 | 2724 | 3631 | 349 |
BigCodeBench Trace Metrics
| Task / run | Pass@1 | AgentWorld generated tokens | AgentWorld TG (tok/s) | Coder generated tokens | Coder TG (tok/s) | Wall note |
|---|---|---|---|---|---|---|
| Clean BCB13 | 0.0 | 1586 | 58.98 | 779 | 119.48 | Failed official evaluator. |
| Clean BCB17 | 0.0 | 436 | 59.64 | 429 | 102.39 | Failed official evaluator. |
| Targeted BCB17 process-management reprise | 1.0 | 680 | 59.23 | 409 | 112.67 | Diagnostic reprise only. |
| BCB199 AgentWorld sanity | 1.0 | 1444 | 57.55 | 717 | 123.81 | Public-contract success case. |
What We Learned
The experiment produced a clear harness lesson: AgentWorld's observations become useful only when transformed into typed, enforceable public contracts.
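One way to picture a typed, enforceable contract is a required-key map that is checked against the executor's output before the verifier runs. This is a sketch under stated assumptions: `OutputContract` is a hypothetical class, and the required key names below are invented placeholders, because the real HA-17 summary schema is not given in this report. Only the written keys (`sum_a`, `sorted_b`, `duplicates_c`) come from the preserved runs; their values here are placeholders too.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutputContract:
    required: dict[str, type]  # key name -> expected Python type

    def violations(self, payload: dict) -> list[str]:
        """List every way the payload departs from the contract."""
        issues = []
        for key, expected in self.required.items():
            if key not in payload:
                issues.append(f"missing key: {key}")
            elif not isinstance(payload[key], expected):
                issues.append(f"wrong type for {key}: {type(payload[key]).__name__}")
        for key in payload:
            if key not in self.required:
                issues.append(f"unexpected key: {key}")
        return issues


# Hypothetical schema: placeholder key names, NOT the real HA-17 contract.
contract = OutputContract({"total": int, "sorted": list, "duplicates": list})

# Key names as written by the HA-17 runs; the values are placeholders.
written = {"sum_a": 0, "sorted_b": [], "duplicates_c": []}

for issue in contract.violations(written):
    print(issue)
```

A check like this fails loudly at the harness boundary, whereas prose advice about the schema can be silently ignored by the executor.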
Next Clean Experiment
| Requirement | Reason |
|---|---|
| Ace alone vs Ace+AgentWorld | Same sampler, context, evaluator, and task set are required to claim lift. |
| No custom behavioral gates | Use structural guards only: sample count, valid code, no think blocks, entry point present, no cap saturation. |
| Pre-registered success criteria | Avoid deciding after the fact which diagnostic passes count. |
| Persist controller contracts | Every task should preserve raw observation, compiled contract, executor prompt, code, validation, and official eval. |
| Report overhead with quality | The intended method runs two models; extra tokens and wall time are part of the result. |
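The first and third requirements can be combined into a paired comparison whose threshold is fixed before any run. The sketch below is illustrative only: `paired_lift` and `min_delta` are hypothetical names, and no results from this report are used in it.

```python
def paired_lift(
    baseline: dict[str, bool],
    treatment: dict[str, bool],
    min_delta: float,
) -> dict:
    """Compare two arms on the same task set against a pre-registered threshold."""
    if not baseline or baseline.keys() != treatment.keys():
        raise ValueError("both arms must cover the same non-empty task set")
    n = len(baseline)
    base_rate = sum(baseline.values()) / n
    treat_rate = sum(treatment.values()) / n
    delta = treat_rate - base_rate
    return {
        "baseline_pass_at_1": base_rate,
        "treatment_pass_at_1": treat_rate,
        "delta": delta,
        # Per-task flips show whether a flat delta hides churn.
        "gained": sorted(t for t in baseline if treatment[t] and not baseline[t]),
        "lost": sorted(t for t in baseline if baseline[t] and not treatment[t]),
        "meets_criterion": delta >= min_delta,
    }
```

Here `baseline` would hold Ace-alone outcomes and `treatment` the Ace+AgentWorld outcomes from the official evaluator. Reporting the gained and lost tasks next to the delta guards against reading task-level churn as lift.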
Evidence Sources
These are the evidence classes used to compile the report. Public labels are used here; internal paths and artifact filenames are intentionally omitted.
- AgentWorld research archive - preserved conclusion, negative-result framing, and do-not-reuse guidance.
- BigCodeBench postmortem - clean structural-only slice, diagnostic gated slice, and caveats.
- Clean six-task metrics summary - sample counts, pass/fail counts, pass@1, wall time, and generation-failure checks.
- Representative AgentWorld traces - BCB13 and BCB17 public-contract observations and failure analysis.
- HermesAgent-20 full run summary - official-20 average, pass/fail counts, and HA-16/HA-17 baseline behavior.
- HA-16 assisted run trace - target-resolution improvement and remaining file-read/content failure.
- HA-17 assisted run traces - delegation shape, controller behavior, schema mismatch, and timing.
- AgentWorld serving telemetry - PP/TG measurements and startup device-memory snapshots.
- Guarded-runner notes - traced-loop settings, validation guards, and invalid direct-run caveats.
- Prior BCB199 evaluator output - datetime/time-zone failure evidence used for the public-contract comparison.
- HermesAgent-20 HA-17 autoresearch notes - original hypotheses, experiment log, and findings.