Crown Citadel Group Ciru Inference Lab llm.ciru.ai / research
Crown Citadel Research Report

AgentWorld as a Public-Contract World Model

We tested Qwen-AgentWorld as an environment simulator and public-contract sidecar for agent benchmarks. The traces were useful, but the current experiment did not prove a general model-quality lift. The cleanest BigCodeBench slice stayed at 3/6, and the HermesAgent-20 HA-17 variants stayed at 70/100.

STATUS: NOT PROVEN
AgentWorld role: sidecar / observation model
Report date: 2026-06-26
Primary evidence: local traces, JSON scores, llama-server logs
3/6: Clean structural BigCodeBench slice
0.50: Clean slice pass@1
70: HA-17 assisted score
85: HA-16 assisted score (still a fail)
60.95-62.20: AgentWorld FP4 NOMTP TG tok/s observed

Executive Finding

Useful Trace, Unproven Lift

The experiment should be preserved as a negative result with useful diagnostics. AgentWorld made public-contract and environment assumptions visible, but once custom gates were removed, the benchmark evidence did not support a broad quality claim.

What worked: AgentWorld produced inspectable observations: public requirements, declared resources, unknown resources, exact strings, and risk notes. Those traces helped explain why models failed and where controller contracts were missing.
What failed: The harness drifted toward task-shaped validators. The best-looking BigCodeBench number, 5/6, came from custom gates and is diagnostic only.
Current conclusion: We have not proved the model. The valid next step is an apples-to-apples Ace-alone versus Ace+AgentWorld rerun with no behavioral gates and official scoring only.
Stop condition: do not claim "AgentWorld improved Ace" from the gated 5/6 slice, the targeted BCB13/17 reprise passes, or HA-16/HA-17 partial scores. Those are evidence for diagnostics and harness design, not a leaderboard-quality result.
Autoresearch Frame

Original Hypotheses

The HermesAgent-20 HA-17 experiment framed AgentWorld as an advisor beside a separate executor, not as a replacement benchmark runner.

ID | Claim | Prediction | Observed state
H1 | AgentWorld is best used as a simulator/advisor, not as the tool executor. | Advice logged at the harness boundary should be easier to score and debug. | Supported for debugging: traces were inspectable and helped isolate failure layers.
H2 | HA-17 failure is orchestration and event shape, not arithmetic difficulty. | A single batched delegation event plus exact summary schema should score 100. | Partly supported: batching worked and values were right, but the schema was wrong.
H3 | AgentWorld FP4 lean is the first two-model profile to try. | 128k or 262k context should fit beside another model and avoid card-context mismatch. | Fit tested: served successfully; quality proof did not follow.
H4 | The harness must test real controller behavior. | Every proposed tool call should pass through AgentWorld preflight, then real artifacts should be scored. | Validated as methodology: real artifacts exposed schema and content failures.

Source: archived HermesAgent-20 HA-17 experiment notes and evaluator outputs.

System Design

AgentWorld Stayed Outside the Score

The correct design boundary was: public task in, AgentWorld observation out, controller contract compiled, executor writes artifacts, official verifier scores. The reportable score must come from the real benchmark evaluator or verifier, never from AgentWorld's own judgment.

Public task: Prompt, starter code, visible files, declared tool schemas.
Controller: Builds state and proposes an action or contract extraction.
AgentWorld: Predicts environment observation in official world-model shape.
Compiler: Turns observation into typed constraints and risk notes.
Executor: Ace/Hermes performs real tool calls or code generation.
Verifier: Official BigCodeBench or HA verifier decides pass/fail.
Ground truth boundary: AgentWorld traces can explain public-contract risks, but they cannot decide pass/fail for BigCodeBench or HermesAgent-20. The score source stays the official evaluator.
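
The boundary described above can be sketched in a few lines of Python. This is a minimal illustration, not the harness used in the experiment: the names `Observation`, `Contract`, `RunRecord`, and `run_task` are hypothetical, the controller step is folded into the caller, and the three callables stand in for AgentWorld, the executor, and the official verifier. The point it shows is structural: the `passed` field is written only from the verifier's return value, so nothing the sidecar produces can reach the score.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Observation:
    """Sidecar output (hypothetical shape): public-contract facts, never a score."""
    exact_strings: tuple[str, ...] = ()
    risk_notes: tuple[str, ...] = ()


@dataclass(frozen=True)
class Contract:
    """Typed constraints compiled from an observation (hypothetical shape)."""
    required_strings: tuple[str, ...] = ()
    notes: tuple[str, ...] = ()


@dataclass(frozen=True)
class RunRecord:
    observation: Observation
    contract: Contract
    artifact: str
    passed: bool  # set only from the official verifier


def run_task(
    public_task: str,
    simulate: Callable[[str], Observation],     # stands in for AgentWorld
    execute: Callable[[str, Contract], str],    # stands in for Ace/Hermes
    verify: Callable[[str], bool],              # stands in for the official evaluator
) -> RunRecord:
    observation = simulate(public_task)
    # "Compiler" step: copy observation fields into typed constraints.
    contract = Contract(observation.exact_strings, observation.risk_notes)
    artifact = execute(public_task, contract)
    # The score source is the verifier alone; the observation is kept for audit.
    return RunRecord(observation, contract, artifact, passed=verify(artifact))


# Usage with stub callables (all values are placeholders).
record = run_task(
    "public prompt",
    simulate=lambda task: Observation(exact_strings=("Error executing command",)),
    execute=lambda task, contract: "def task_func(): ...",
    verify=lambda artifact: False,
)
```

Keeping the observation and contract on the record, rather than discarding them after execution, is what makes a failed run auditable later.
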
Experiment Map

What We Tried and Why

The project moved from raw AgentWorld generation, to official-shaped world-model prompting, to controller/constraint experiments. Each step improved observability, but the final score evidence remained insufficient.

Attempt | Reason | Settings or change | Observed result | Lesson
Direct BigCodeBench AgentWorld | Check whether AgentWorld could solve code tasks directly. | Old direct codegen, low output cap. | Invalid: cap saturation, submitted <think>, partial reasoning, missing task_func. | Do not report the old 5/148; add validation before scoring.
Guarded direct smoke | Separate parser/cap failures from real task failures. | MAX_NEW_TOKENS=65536, validation for syntax, entry point, no think tags. | BCB13 and BCB15 generated valid code but failed official tests. | Some failures became legitimate model/task failures rather than harness failures.
Official-shaped AgentWorld + Ace loop | Use AgentWorld as intended: state/action to observation. | system_str plus Action plus "Predict the next Environment Observation". | Traces captured public contracts and model handoffs. | Observability improved, but prose advice was too weak unless compiled.
Custom gated v5 diagnostic | See if task-specific public smokes could repair known failure modes. | Controller behavior gates and public validation smokes. | 5/6 diagnostic, pass@1=0.8333. | Diagnostic only; gates were too task-shaped for a fair benchmark method.
Structural-only clean slice | Remove custom behavior gates. | No custom behavioral gates; structural checks only. | 3/6, pass@1=0.5, no generation failures. | No clear evidence that AgentWorld reliably improved Ace.
HA-16/HA-17 controller-assisted runs | Test real agent workflow rather than synthetic BCB only. | AgentWorld preflight, controller decisions, real tool execution. | HA-16 moved from 30 to 85 but failed; HA-17 stayed 70. | Traces solved subproblems; output-contract enforcement was still missing.

HermesAgent Evidence

HA-16 and HA-17: Solved Subproblems, Not the Benchmarks

The HermesAgent-20 run is the best local baseline for agent workflow shape. The full official-20 run averaged 78.5 raw, imported into the quality store as 0.79. HA-16 and HA-17 remained useful because they isolate target resolution and output schema failures.

78.5: Full HermesAgent-20 average score
12 / 20: Pass count in full run
363.842s: Full official-20 elapsed time
0.79: Imported quality-store average

HA-16: Message Target Resolution

Run | Score | What failed before | What AgentWorld/controller fixed | Why it still failed
Full HermesAgent-20 HA-16 | 30 | Delivery was undefined even after listing targets. | - | The intended engineering target was not reached.
AgentWorld-assisted HA-16 | 85 | Target name/id ambiguity. | AgentWorld observation exposed notify_engineering and told the controller to use the target id, not display name. The run sent one message successfully to notify_engineering. | readSummary=false and contentScore=0; the harness required reading the file content first.

HA-17: Parallel Delegation

Variant | Duration | Status | Score | What happened
128k assisted run | 159.89s | final | 70 | Single batched delegation, correct values, wrong output keys.
262k card-parameter run | 230.506s | final | 70 | Three AgentWorld runs, three controller runs, thirteen tool events; still wrote sum_a, sorted_b, duplicates_c.
No-thinking contract-prompt run | 185.512s | final | 70 | Contract prompting did not fix hidden output-schema ambiguity.
Tolerant controller-parse run | 32.546s | final | 70 | Parser tolerance fixed controller fragility, not verifier schema.
HA-17 exactMatch: false
delegateCount: 1
batchedDelegate: true
HA-17 lesson: AgentWorld/controller work preserved the batched delegation shape and computed the values. It did not know the verifier's normalized schema. The next harness needs an explicit public output-contract layer, not more sampling.
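
An explicit output-contract layer of the kind this lesson calls for could start as a plain key comparison run before the artifact is handed to the verifier. The sketch below is illustrative only. The written keys (`sum_a`, `sorted_b`, `duplicates_c`) come from the runs above, but the verifier's normalized schema is not recorded in this report, so `HYPOTHETICAL_REQUIRED` is an invented placeholder, and `check_output_keys` is a hypothetical helper name.

```python
def check_output_keys(output: dict, required_keys: frozenset) -> dict:
    """Compare the keys an executor wrote against a declared output contract."""
    written = set(output)
    return {
        "missing": sorted(required_keys - written),
        "unexpected": sorted(written - required_keys),
        "ok": written == set(required_keys),
    }


# Keys the HA-17 runs wrote, per the report; the values here are placeholders.
written = {"sum_a": 0, "sorted_b": [], "duplicates_c": []}

# HYPOTHETICAL: the real normalized schema is not given in this report.
HYPOTHETICAL_REQUIRED = frozenset({"a", "b", "c"})

report = check_output_keys(written, HYPOTHETICAL_REQUIRED)
```

A check like this only helps if the required keys are derivable from the public task. If they are not, the contract has to be declared by the benchmark rather than guessed by the harness.
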
BigCodeBench Evidence

The Clean Slice Is the Reportable Result

BigCodeBench is not a native AgentWorldBench task. The safest preserved comparison is the structural-only slice where custom behavioral gates were removed and the official evaluator decided pass/fail.

2/6: Early no-thinking slice, pass@1 0.3333
5/6: Custom gated v5 diagnostic, pass@1 0.8333
3/6: Clean structural-only slice, pass@1 0.5
Chart: pass@1 by slice (early no-thinking 0.333, custom gated v5 0.833, structural-only 0.500).
Interpretation: the custom-gated result is not fair official benchmark evidence. It showed that hand-built validators could push the executor toward hidden or semi-hidden expectations. The structural-only slice is the honest result.
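
For reference, the pass@1 figures for the three slices follow directly from the pass counts, assuming one sample per task (six samples over six tasks). The helper name `pass_at_1` is ours, not the evaluator's.

```python
def pass_at_1(passed: int, total: int) -> float:
    """With one sample per task, pass@1 is the fraction of tasks that passed."""
    if total <= 0:
        raise ValueError("total must be positive")
    return passed / total


# (passed, total) per slice, as reported above.
slices = {
    "early no-thinking": (2, 6),
    "custom gated v5": (5, 6),
    "structural-only": (3, 6),
}

rates = {name: round(pass_at_1(p, n), 4) for name, (p, n) in slices.items()}
for name, rate in rates.items():
    print(f"{name}: {rate}")
```

With only six tasks, one task changes pass@1 by about 0.167, so these differences are too coarse to support a claim on their own.
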

Clean Structural-Only Slice

Metric | Value | Evidence basis
Run tag | Clean six-task structural-only slice | Preserved metrics and evaluator output
Samples | 6 | Preserved metrics summary
Passed / failed | 3 / 3 | Preserved metrics summary and official evaluator output
Generation failures | 0 | generation_summary.failures=[]
Codegen / eval wall time | 140s / 24s | Preserved metrics summary
Ground-truth pass rate | 1.0 | Preserved metrics summary

Per-Task Outcome

Task | Status | AgentWorld turns | AgentWorld TG (tok/s) | Coder TG (tok/s) | Main failure or pass note
BigCodeBench/13 | fail | 3 | 58.98 | 119.48 | FTP listing/download behavior did not match official tests; eval asserted 0 != 2.
BigCodeBench/15 | fail | 2 | 59.64 | 113.03 | Error text missed exact phrase: Error executing command.
BigCodeBench/17 | fail | 2 | 59.64 | 102.39 | Process lookup and subprocess.Popen call shape differed from tests.
BigCodeBench/19 | pass | 2 | 59.65 | 102.53 | Zip files task passed.
BigCodeBench/34 | pass | 2 | 59.54 | 84.79 | WordCloud plotting task passed.
BigCodeBench/37 | pass | 2 | 59.29 | 100.28 | RandomForest feature-importance plot task passed.

Trace Examples

Where AgentWorld Helped

These examples are the strongest positive evidence, but each one has a boundary. They show traces solving or isolating subproblems that previous runs missed; they do not prove a general benchmark lift.

Case | Previously failed | Trace/help signal | Later result | Claim boundary
HA-16 | Full run score 30; delivery undefined. | AgentWorld listed notify_engineering and instructed the controller to use target id, not display name. | Assisted score 85, one successful send to notify_engineering. | Target-resolution subproblem solved; benchmark still failed because the file was not read.
BigCodeBench/17 | Structural-only failed process-detection and Popen([process_name]) shape tests. | Trace exposed exact public strings and process-management surfaces, making the API-shape failure obvious. | Targeted process-management reprises reached pass@1=1.0. | Diagnostic reprise, not a clean official-suite improvement.
BigCodeBench/13 | Structural-only failed FTP mocked download count with AssertionError: 0 != 2. | Trace captured the public wget requirement, exact exception strings, and unknown remote file risks. | Targeted FTP download reprise reached pass@1=1.0. | Targeted diagnostic, not reportable as a general benchmark score.
BigCodeBench/199 | Older Ace/SABER batch failed three tests with ValueError: Not naive datetime (tzinfo is already set) from pytz.tzname. | AgentWorld guidance emphasized local-time format YYYY-MM-DD HH:MM:SS ZZZ; submitted code used strftime('%Z') instead of tzname(aware_datetime). | AgentWorld+Ace sanity run and current Ace overlap rerun both reached pass@1=1.0. | Likely public-contract shaping; not isolated proof because the current Ace overlap rerun also passed.
BigCodeBench/82 | Ace hard-depended on undeclared Flask templates. | AgentWorld identified template filenames as unknown/undeclared, which inspired the controller-contract design. | All preserved BCB82 trace variants still failed official evaluation. | Excellent failure-analysis example; not a solved example.
Most useful positive result: the traces made invisible contract failures visible. The result was better diagnosis and targeted repairs, not a validated general sidecar method.
Speed, Tokens, Memory

Overhead Was Real

The two-model loop adds calls, tokens, and wall time. The preserved logs contain solid prompt-processing (PP) and token-generation (TG) throughput figures and token-usage evidence. They do not contain reliable peak VRAM/RAM deltas, so memory is reported only as startup snapshots, model file sizes, and checkpoint sizes.

AgentWorld Serving Profile

Field | Value
Model | Qwen-AgentWorld 35B A3B ROCmFP4 Strix lean, no-MTP profile
Model file size | 18,597,338,016 bytes
Runtime | llama-server, ROCm0, FP4 lean, no MTP, --metrics
Context | n_ctx=131072, model train context 262144
KV/cache | ctk q8_0, ctv q8_0, prompt cache disabled
Batch / ubatch | -b 2048, -ub 512
Device snapshot | ROCm0 Radeon 8060S Graphics: 126976 MiB total, 92929 MiB free at startup
CPU memory snapshot | AMD Ryzen AI Max+ 395: 126431 MiB free at startup
Context checkpoint size | 62.813 MiB per checkpoint in log snippets

Observed PP/TG From Server Log

Prompt tokens | Prompt eval time | PP tok/s | Generated tokens | Eval time | TG tok/s
1186 | 1026.55 ms | 1155.33 | 768 | 12382.72 ms | 62.02
526 | 498.36 ms | 1055.47 | 768 | 12352.43 ms | 62.17
432 | 384.99 ms | 1122.10 | 768 | 12346.63 ms | 62.20
431 | 713.21 ms | 604.31 | 4741 | 77787.73 ms | 60.95
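
The throughput columns can be re-derived from the token counts and timings in the same log lines, which is a useful sanity check when reading the figures. The short sketch below recomputes them; `tokens_per_second` is a hypothetical helper name, and small differences in the last digit are expected because the logged times are themselves rounded.

```python
def tokens_per_second(tokens: int, elapsed_ms: float) -> float:
    """Throughput in tokens per second from a token count and a time in ms."""
    return tokens / (elapsed_ms / 1000.0)


# (prompt tokens, prompt eval ms, generated tokens, eval ms) from the server log.
rows = [
    (1186, 1026.55, 768, 12382.72),
    (526, 498.36, 768, 12352.43),
    (432, 384.99, 768, 12346.63),
    (431, 713.21, 4741, 77787.73),
]

recomputed = [
    (tokens_per_second(pp_tok, pp_ms), tokens_per_second(tg_tok, tg_ms))
    for pp_tok, pp_ms, tg_tok, tg_ms in rows
]

for pp, tg in recomputed:
    print(f"PP {pp:.2f} tok/s, TG {tg:.2f} tok/s")
```
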

AgentWorld Token Usage Examples

Run | Turn | Finish | Prompt tokens | Completion tokens | Total tokens | Cached tokens
HA-17 assisted run | 1 | stop | 1189 | 4753 | 5942 | 0
HA-17 assisted run | 2 | stop | 1102 | 4239 | 5341 | 673
HA-17 assisted run | 3 | stop | 1186 | 2821 | 4007 | 673
HA-16 assisted | 1 | stop | 865 | 3923 | 4788 | 0
HA-16 assisted | 2 | stop | 907 | 2724 | 3631 | 349

BigCodeBench Trace Metrics

Task / run | Pass@1 | AgentWorld generated tokens | AgentWorld TG (tok/s) | Coder generated tokens | Coder TG (tok/s) | Wall note
Clean BCB13 | 0.0 | 1586 | 58.98 | 779 | 119.48 | Failed official evaluator.
Clean BCB17 | 0.0 | 436 | 59.64 | 429 | 102.39 | Failed official evaluator.
Targeted BCB17 process-management reprise | 1.0 | 680 | 59.23 | 409 | 112.67 | Diagnostic reprise only.
BCB199 AgentWorld sanity | 1.0 | 1444 | 57.55 | 717 | 123.81 | Public-contract success case.
Memory caveat: no reliable idle/peak VRAM or system-RAM deltas were preserved in the AgentWorld/HermesAgent run JSONs. The report therefore lists startup device snapshots, model file sizes, and context-checkpoint sizes instead of inventing a peak-memory metric.
Lessons

What We Learned

The experiment produced a clear harness lesson: AgentWorld's observations become useful only when transformed into typed, enforceable public contracts.

Simulated observations are not benchmark observations. AgentWorld can predict plausible environment state, but BigCodeBench and HA verifiers decide score.
Public contracts must be structured. Free-form guidance was too easy for the executor to ignore or misapply.
Custom gates can become surrogate scorers. The v5 gates improved a slice score, but they were not acceptable general benchmark evidence.
Trace data is valuable. It separated parser failures, cap failures, API-shape failures, output-schema failures, and official evaluator failures.
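
As a rough illustration of what "typed, enforceable" could mean in practice, the sketch below turns two kinds of trace facts into checks on generated code: exact public strings that must appear, and undeclared resources the code must not depend on. Everything here is an assumption for illustration: `PublicContract` and `violations` are hypothetical names, the substring test is deliberately naive, and `login.html` is an invented filename (the BCB82 template names are not listed in this report). The exact phrase comes from the BigCodeBench/15 failure note.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PublicContract:
    """Constraints derived only from public task text (hypothetical shape)."""
    exact_strings: tuple[str, ...] = ()
    undeclared_resources: tuple[str, ...] = ()


def violations(contract: PublicContract, code: str) -> list[str]:
    """Return human-readable contract violations found in generated code."""
    problems = []
    for text in contract.exact_strings:
        if text not in code:
            problems.append(f"missing exact string: {text!r}")
    for resource in contract.undeclared_resources:
        if resource in code:
            problems.append(f"depends on undeclared resource: {resource!r}")
    return problems


contract = PublicContract(
    exact_strings=("Error executing command",),  # phrase from the BCB15 note
    undeclared_resources=("login.html",),        # HYPOTHETICAL filename
)

code = 'def task_func():\n    return "Command failed"\n'
found = violations(contract, code)
```

Checks like these are diagnostics for the executor prompt and the trace record. They must stay outside the reported score, which still comes from the official evaluator.
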

Next Clean Experiment

Requirement | Reason
Ace alone vs Ace+AgentWorld | Same sampler, context, evaluator, and task set are required to claim lift.
No custom behavioral gates | Use structural guards only: sample count, valid code, no think blocks, entry point present, no cap saturation.
Pre-registered success criteria | Avoid deciding after the fact which diagnostic passes count.
Persist controller contracts | Every task should preserve raw observation, compiled contract, executor prompt, code, validation, and official eval.
Report overhead with quality | The intended method runs two models; extra tokens and wall time are part of the result.
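
The structural guards named in the table are task-independent, which is what separates them from behavioral gates. A minimal per-sample sketch is shown below, under stated assumptions: `structural_problems` is a hypothetical helper, cap saturation is detected through an OpenAI-style `finish_reason` of `"length"` (an assumption about how the serving layer reports truncation), and the sample-count guard is left out because it applies to the whole run rather than to one submission.

```python
import ast
import re

THINK_TAG = re.compile(r"</?think>", re.IGNORECASE)


def structural_problems(code: str, entry_point: str, finish_reason: str) -> list[str]:
    """Task-independent checks; an empty list means the sample may be scored."""
    problems = []
    # ASSUMPTION: "length" marks a generation that hit the output cap.
    if finish_reason == "length":
        problems.append("cap saturation")
    if THINK_TAG.search(code):
        problems.append("think tags in submission")
    try:
        tree = ast.parse(code)
    except SyntaxError:
        problems.append("syntax error")
        return problems
    # Only top-level function definitions count as the entry point.
    defined = {
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }
    if entry_point not in defined:
        problems.append(f"entry point {entry_point!r} missing")
    return problems


clean = structural_problems("def task_func():\n    return 1\n", "task_func", "stop")
broken = structural_problems("<think>hmm</think>", "task_func", "length")
```

None of these checks looks at what the code does, so they cannot steer the executor toward hidden test expectations.
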
Evidence Index

Evidence Sources

These are the evidence classes used to compile the report. Public labels are used here; internal paths and artifact filenames are intentionally omitted.

Final conclusion: the current evidence supports continuing AgentWorld as an audit and public-contract trace tool. It does not yet prove that AgentWorld improves the model or the benchmark score under clean, controlled conditions.