Ciru Inference Lab llm.ciru.ai / research

Crown Citadel Research Report

AgentWorld as a Public-Contract World Model

We tested Qwen-AgentWorld as an environment simulator and public-contract sidecar for agent benchmarks. The traces were useful, but the current experiment did not prove a general model-quality lift. The cleanest BigCodeBench slice stayed at 3/6, and the HermesAgent-20 HA-17 variants stayed at 70/100.

STATUS: NOT PROVEN AgentWorld role: sidecar / observation model Report date: 2026-06-26 Primary evidence: local traces, JSON scores, llama-server logs

3/6Clean structural BigCodeBench slice

0.50Clean slice pass@1

70HA-17 assisted score

85HA-16 assisted score, still fail

60.95-62.20AgentWorld FP4 NOMTP TG tok/s observed

Executive Finding

Useful Trace, Unproven Lift

The experiment should be preserved as a negative result with useful diagnostics. AgentWorld made public-contract and environment assumptions visible, but once custom gates were removed, the benchmark evidence did not support a broad quality claim.

What worked AgentWorld produced inspectable observations: public requirements, declared resources, unknown resources, exact strings, and risk notes. Those traces helped explain why models failed and where controller contracts were missing.

What failed The harness drifted toward task-shaped validators. The best-looking BigCodeBench number, 5/6, came from custom gates and is diagnostic only.

Current conclusion We have not proved the model. The valid next step is an apples-to-apples Ace-alone versus Ace+AgentWorld rerun with no behavioral gates and official scoring only.

Stop condition: do not claim "AgentWorld improved Ace" from the gated 5/6 slice, the targeted BCB13/17 reprise passes, or HA-16/HA-17 partial scores. Those are evidence for diagnostics and harness design, not a leaderboard-quality result.

Autoresearch Frame

Original Hypotheses

The HermesAgent-20 HA-17 experiment framed AgentWorld as an advisor beside a separate executor, not as a replacement benchmark runner.

ID	Claim	Prediction	Observed state
H1	AgentWorld is best used as a simulator/advisor, not as the tool executor.	Advice logged at the harness boundary should be easier to score and debug.	supported for debugging Traces were inspectable and helped isolate failure layers.
H2	HA-17 failure is orchestration and event shape, not arithmetic difficulty.	A single batched delegation event plus exact summary schema should score 100.	partly supported Batching worked, values were right, schema was wrong.
H3	AgentWorld FP4 lean is the first two-model profile to try.	128k or 262k context should fit beside another model and avoid card-context mismatch.	fit tested Served successfully; quality proof did not follow.
H4	The harness must test real controller behavior.	Every proposed tool call should pass through AgentWorld preflight, then real artifacts should be scored.	validated as methodology Real artifacts exposed schema and content failures.

Source: archived HermesAgent-20 HA-17 experiment notes and evaluator outputs.

System Design

AgentWorld Stayed Outside the Score

The correct design boundary was: public task in, AgentWorld observation out, controller contract compiled, executor writes artifacts, official verifier scores. The reportable score must come from the real benchmark evaluator or verifier, never from AgentWorld's own judgment.

Public taskPrompt, starter code, visible files, declared tool schemas.

ControllerBuilds state and proposes an action or contract extraction.

AgentWorldPredicts environment observation in official world-model shape.

CompilerTurns observation into typed constraints and risk notes.

ExecutorAce/Hermes performs real tool calls or code generation.

VerifierOfficial BigCodeBench or HA verifier decides pass/fail.

Ground truth boundary: AgentWorld traces can explain public-contract risks, but they cannot decide pass/fail for BigCodeBench or HermesAgent-20. The score source stays the official evaluator.

Experiment Map

What We Tried and Why

The project moved from raw AgentWorld generation, to official-shaped world-model prompting, to controller/constraint experiments. Each step improved observability, but the final score evidence remained insufficient.

Attempt	Reason	Settings or change	Observed result	Lesson
Direct BigCodeBench AgentWorld	Check whether AgentWorld could solve code tasks directly.	Old direct codegen, low output cap.	invalid Cap saturation, submitted <think>, partial reasoning, missing task_func.	Do not report the old 5/148; add validation before scoring.
Guarded direct smoke	Separate parser/cap failures from real task failures.	MAX_NEW_TOKENS=65536, validation for syntax, entry point, no think tags.	BCB13 and BCB15 generated valid code but failed official tests.	Some failures became legitimate model/task failures rather than harness failures.
Official-shaped AgentWorld + Ace loop	Use AgentWorld as intended: state/action to observation.	system_str plus Action plus "Predict the next Environment Observation".	Traces captured public contracts and model handoffs.	Observability improved, but prose advice was too weak unless compiled.
Custom gated v5 diagnostic	See if task-specific public smokes could repair known failure modes.	Controller behavior gates and public validation smokes.	5/6 diagnostic pass@1=0.8333.	Diagnostic only; gates were too task-shaped for a fair benchmark method.
Structural-only clean slice	Remove custom behavior gates.	No custom behavioral gates; structural checks only.	3/6 pass@1=0.5, no generation failures.	No clear evidence that AgentWorld reliably improved Ace.
HA-16/HA-17 controller-assisted runs	Test real agent workflow rather than synthetic BCB only.	AgentWorld preflight, controller decisions, real tool execution.	HA-16 moved from 30 to 85 but failed; HA-17 stayed 70.	Traces solved subproblems; output-contract enforcement was still missing.

HermesAgent Evidence

HA-16 and HA-17: Solved Subproblems, Not the Benchmarks

The HermesAgent-20 run is the best local baseline for agent workflow shape. The full official-20 run averaged 78.5 raw, imported into the quality store as 0.79. HA-16 and HA-17 remained useful because they isolate target resolution and output schema failures.

78.5Full HermesAgent-20 average score

12 / 20Pass count in full run

363.842sFull official-20 elapsed time

0.79Imported quality-store average

HA-16: Message Target Resolution

Run	Score	What failed before	What AgentWorld/controller fixed	Why it still failed
Full HermesAgent-20 HA-16	30	Delivery was undefined even after listing targets.	-	The intended engineering target was not reached.
AgentWorld-assisted HA-16	85	Target name/id ambiguity.	AgentWorld observation exposed notify_engineering and told the controller to use the target id, not display name. The run sent one message successfully to notify_engineering.	readSummary=false and contentScore=0; the harness required reading the file content first.

HA-17: Parallel Delegation

Variant	Duration	Status	Score	What happened
128k assisted run	159.89s	final	70	Single batched delegation, correct values, wrong output keys.
262k card-parameter run	230.506s	final	70	Three AgentWorld runs, three controller runs, thirteen tool events; still wrote sum_a, sorted_b, duplicates_c.
No-thinking contract-prompt run	185.512s	final	70	Contract prompting did not fix hidden output-schema ambiguity.
Tolerant controller-parse run	32.546s	final	70	Parser tolerance fixed controller fragility, not verifier schema.

falseHA-17 exactMatch

1delegateCount

truebatchedDelegate

HA-17 lesson: AgentWorld/controller work preserved the batched delegation shape and computed the values. It did not know the verifier's normalized schema. The next harness needs an explicit public output-contract layer, not more sampling.

BigCodeBench Evidence

The Clean Slice Is the Reportable Result

BigCodeBench is not a native AgentWorldBench task. The safest preserved comparison is the structural-only slice where custom behavioral gates were removed and the official evaluator decided pass/fail.

2/6Early no-thinking slice, pass@1 0.3333

5/6Custom gated v5 diagnostic, pass@1 0.8333

3/6Clean structural-only slice, pass@1 0.5

Early no-thinking

0.333

Custom gated v5

0.833

Structural-only

0.500

Interpretation: the custom-gated result is not fair official benchmark evidence. It showed that hand-built validators could push the executor toward hidden or semi-hidden expectations. The structural-only slice is the honest result.

Clean Structural-Only Slice

Metric	Value	Evidence basis
Run tag	Clean six-task structural-only slice	Preserved metrics and evaluator output
Samples	6	Preserved metrics summary
Passed / failed	3 / 3	Preserved metrics summary and official evaluator output
Generation failures	0	generation_summary.failures=[]
Codegen / eval wall time	140s / 24s	Preserved metrics summary
Ground-truth pass rate	1.0	Preserved metrics summary

Per-Task Outcome

Task	Status	AgentWorld turns	AgentWorld TG	Coder TG	Main failure or pass note
BigCodeBench/13	fail	3	58.98	119.48	FTP listing/download behavior did not match official tests; eval asserted 0 != 2.
BigCodeBench/15	fail	2	59.64	113.03	Error text missed exact phrase: Error executing command.
BigCodeBench/17	fail	2	59.64	102.39	Process lookup and subprocess.Popen call shape differed from tests.
BigCodeBench/19	pass	2	59.65	102.53	Zip files task passed.
BigCodeBench/34	pass	2	59.54	84.79	WordCloud plotting task passed.
BigCodeBench/37	pass	2	59.29	100.28	RandomForest feature-importance plot task passed.

Trace Examples

Where AgentWorld Helped

These examples are the strongest positive evidence, but each one has a boundary. They show traces solving or isolating subproblems that previous runs missed; they do not prove a general benchmark lift.

Case	Previously failed	Trace/help signal	Later result	Claim boundary
HA-16	Full run score 30; delivery undefined.	AgentWorld listed notify_engineering and instructed the controller to use target id, not display name.	Assisted score 85, one successful send to notify_engineering.	Target-resolution subproblem solved; benchmark still failed because the file was not read.
BigCodeBench/17	Structural-only failed process-detection and Popen([process_name]) shape tests.	Trace exposed exact public strings and process-management surfaces, making the API-shape failure obvious.	Targeted process-management reprises reached pass@1=1.0.	Diagnostic reprise, not a clean official-suite improvement.
BigCodeBench/13	Structural-only failed FTP mocked download count with AssertionError: 0 != 2.	Trace captured the public wget requirement, exact exception strings, and unknown remote file risks.	Targeted FTP download reprise reached pass@1=1.0.	Targeted diagnostic, not reportable as a general benchmark score.
BigCodeBench/199	Older Ace/SABER batch failed three tests with ValueError: Not naive datetime (tzinfo is already set) from pytz.tzname.	AgentWorld guidance emphasized local-time format YYYY-MM-DD HH:MM:SS ZZZ; submitted code used strftime('%Z') instead of tzname(aware_datetime).	AgentWorld+Ace sanity run and current Ace overlap rerun both reached pass@1=1.0.	Likely public-contract shaping; not isolated proof because the current Ace overlap rerun also passed.
BigCodeBench/82	Ace hard-depended on undeclared Flask templates.	AgentWorld identified template filenames as unknown/undeclared, which inspired the controller-contract design.	All preserved BCB82 trace variants still failed official evaluation.	Excellent failure-analysis example; not a solved example.

Most useful positive result: the traces made invisible contract failures visible. The result was better diagnosis and targeted repairs, not a validated general sidecar method.

Speed, Tokens, Memory

Overhead Was Real

The two-model loop adds calls, tokens, and wall time. The preserved logs have good PP/TG and token-usage evidence. They do not contain reliable peak VRAM/RAM deltas, so memory is reported as startup snapshots, model file sizes, and checkpoint sizes only.

AgentWorld Serving Profile

Field	Value
Model	Qwen-AgentWorld 35B A3B ROCmFP4 Strix lean, no-MTP profile
Model file size	18,597,338,016 bytes
Runtime	llama-server, ROCm0, FP4 lean, no MTP, --metrics
Context	n_ctx=131072, model train context 262144
KV/cache	ctk q8_0, ctv q8_0, prompt cache disabled
Batch / ubatch	-b 2048, -ub 512
Device snapshot	ROCm0 Radeon 8060S Graphics: 126976 MiB total, 92929 MiB free at startup
CPU memory snapshot	AMD Ryzen AI Max+ 395: 126431 MiB free at startup
Context checkpoint size	62.813 MiB per checkpoint in log snippets

Observed PP/TG From Server Log

Prompt tokens	Prompt eval time	PP tok/s	Generated tokens	Eval time	TG tok/s
1186	1026.55 ms	1155.33	768	12382.72 ms	62.02
526	498.36 ms	1055.47	768	12352.43 ms	62.17
432	384.99 ms	1122.10	768	12346.63 ms	62.20
431	713.21 ms	604.31	4741	77787.73 ms	60.95

AgentWorld Token Usage Examples

Run	Turn	Finish	Prompt tokens	Completion tokens	Total tokens	Cached tokens
HA-17 assisted run	1	stop	1189	4753	5942	0
HA-17 assisted run	2	stop	1102	4239	5341	673
HA-17 assisted run	3	stop	1186	2821	4007	673
HA-16 assisted	1	stop	865	3923	4788	0
HA-16 assisted	2	stop	907	2724	3631	349

BigCodeBench Trace Metrics

Task / run	Pass@1	AgentWorld generated tokens	AgentWorld TG	Coder generated tokens	Coder TG	Wall note
Clean BCB13	0.0	1586	58.98	779	119.48	Failed official evaluator.
Clean BCB17	0.0	436	59.64	429	102.39	Failed official evaluator.
Targeted BCB17 process-management reprise	1.0	680	59.23	409	112.67	Diagnostic reprise only.
BCB199 AgentWorld sanity	1.0	1444	57.55	717	123.81	Public-contract success case.

Memory caveat: no reliable idle/peak VRAM or system-RAM deltas were preserved in the AgentWorld/HermesAgent run JSONs. The report therefore lists startup device snapshots, model file sizes, and context-checkpoint sizes instead of inventing a peak-memory metric.

Lessons

What We Learned

The experiment produced a clear harness lesson: AgentWorld's observations become useful only when transformed into typed, enforceable public contracts.

Simulated observations are not benchmark observations. AgentWorld can predict plausible environment state, but BigCodeBench and HA verifiers decide score.

Public contracts must be structured. Free-form guidance was too easy for the executor to ignore or misapply.

Custom gates can become surrogate scorers. The v5 gates improved a slice score, but they were not acceptable general benchmark evidence.

Trace data is valuable. It separated parser failures, cap failures, API-shape failures, output-schema failures, and official evaluator failures.

Next Clean Experiment

Requirement	Reason
Ace alone vs Ace+AgentWorld	Same sampler, context, evaluator, and task set are required to claim lift.
No custom behavioral gates	Use structural guards only: sample count, valid code, no think blocks, entry point present, no cap saturation.
Pre-registered success criteria	Avoid deciding after the fact which diagnostic passes count.
Persist controller contracts	Every task should preserve raw observation, compiled contract, executor prompt, code, validation, and official eval.
Report overhead with quality	The intended method runs two models; extra tokens and wall time are part of the result.

Evidence Index

Evidence Sources

These are the evidence classes used to compile the report. Public labels are used here; internal paths and artifact filenames are intentionally omitted.

AgentWorld research archive - preserved conclusion, negative-result framing, and do-not-reuse guidance.
BigCodeBench postmortem - clean structural-only slice, diagnostic gated slice, and caveats.
Clean six-task metrics summary - sample counts, pass/fail counts, pass@1, wall time, and generation-failure checks.
Representative AgentWorld traces - BCB13 and BCB17 public-contract observations and failure analysis.
HermesAgent-20 full run summary - official-20 average, pass/fail counts, and HA-16/HA-17 baseline behavior.
HA-16 assisted run trace - target-resolution improvement and remaining file-read/content failure.
HA-17 assisted run traces - delegation shape, controller behavior, schema mismatch, and timing.
AgentWorld serving telemetry - PP/TG measurements and startup device-memory snapshots.
Guarded-runner notes - traced-loop settings, validation guards, and invalid direct-run caveats.
Prior BCB199 evaluator output - datetime/time-zone failure evidence used for the public-contract comparison.
HermesAgent-20 HA-17 autoresearch notes - original hypotheses, experiment log, and findings.

Final conclusion: the current evidence supports continuing AgentWorld as an audit and public-contract trace tool. It does not yet prove that AgentWorld improves the model or the benchmark score under clean, controlled conditions.