Hermes Tool-Loop Findings Report

Executive Summary

Hermes got stuck in a loop where the assistant said it was about to update a skill, but the structured tool call repeatedly selected the read-only skill_view tool instead of a mutation tool such as skill_manage or patch. The loop was reproduced from persisted chat and tool state in a local replay sandbox.

The strongest finding is that the loop was a context-induced function-name attractor. Repeated historical <function=skill_view> traces and repeated large successful tool results taught the current prompt to continue with the same function name. Natural-language intent was still present, but at the tool-name token boundary it lost to the repeated action trace.

42 persisted messages were exported for the loop window, message ids 12080 through 12121.

20 skill_view tool results appeared in that window; 16 were full 10,774-character skill bodies.

0.930 approximate late-loop probability of the bad skill_view path in the quantized Crown v7 probe.

5 traces were enough to make _view dominant again in the historical marker ablation.

Main conclusion: this was not primarily a WhatsApp transport issue, database-write issue, prompt-cache issue, MTP issue, or simple turn-limit issue. Those factors can change frequency, but the root behavior is schema-grounded action selection collapsing under repeated in-context action traces.

What Happened

The user asked Hermes to update a skill so that all changes would be committed. Hermes correctly inspected the relevant skill and related files. After it had enough information, it repeatedly wrote assistant text like "Now I'll add the git commit requirement to the skill", but then emitted another skill_view call instead of the patch/update call.

Plain-language failure signature

The user asks for a change.
The assistant reads the skill with skill_view.
The assistant says it will now update or patch the skill.
The structured tool call is still another read-only skill_view.
The model receives the same large skill text again, making the pattern easier to repeat.

Concrete loop excerpt

user: Please update the skill to reflect that all changes need to be committed now
assistant: Now I'll add the git commit requirement to the skill.
tool_call: skill_view(...)
tool_result: full retatrutide-tracking skill content

assistant: Now I'll add the git commit requirement to the skill.
tool_call: skill_view(...)
tool_result: full retatrutide-tracking skill content

user: You got stuck trying to add to the skill
assistant: You're right, I got stuck in a loop calling skill_view repeatedly.
           Let me just patch the skill directly.
tool_call: skill_view(...)

That last turn was especially diagnostic: even after the assistant explicitly named the error and said it would patch directly, the structured action again chose skill_view.

Evidence Base

The replay sandbox is local and isolated. It was built from read-only SQLite exports against Hermes state. The harness can replay the model request and replace real tool execution with fake local stubs, so experiments do not modify production skills, the production database, or production services.

Primary fixture

Session 20260610_170626_2b75cab0, source whatsapp, loop window 12080..12121.

/home/crown/hermes-loop-replay-sandbox/fixtures/messages_12080_12121.json

Replay lab

The sandbox asks whether the same persisted chat state chooses another read-only call or a mutation path.

/home/crown/hermes-loop-replay-sandbox

Tool schema

The full WhatsApp runtime-filtered schema exposed 32 tools, not just skill_view and skill_manage. This mattered.

fixtures/whatsapp_tool_schemas.json

Low-level probes

The token probes rendered the prompt through llama.cpp and measured one-token probabilities at the exact function-name decision points.

tools/tool_name_logprob_probe.py

What the fixture showed

Quantity	Observed value	Why it matters
Persisted messages in loop window	42	Enough history to preserve the exact repeated action pattern.
Total `skill_view` tool results	20	The repeated read-only action was not an isolated mistake.
Full skill-body returns	16	Large repeated successful results made the prompt dominated by the same action/result pair.
Total `skill_view` result characters	187,599	The context was heavily polluted by repeated read results.

Mechanism: Why The Loop Happened

The key interface problem is that two opposite actions had the same first token in the tool name.

patch        -> [patch]
terminal     -> [terminal]
skill_view   -> [skill, _view]
skill_manage -> [skill, _manage]

This means the model first chooses the broad skill family. Only after that does it choose between read-only _view and write-capable _manage. Once the model has generated skill, the history of repeated skill_view calls exerts strong pressure on the next token.

The model was not simply choosing from the current schema according to current intent. It was doing next-token continuation over the whole transcript. When historical skill_view markers were rewritten to the fake non-schema name inspect_skill, the model emitted inspect with probability 0.9399.

Context pollution pathway

skill_view is safe and easy to call.
The tool returns useful content, so the transcript marks it as successful.
The assistant says it will now mutate, but the next tool call repeats the previous function name.
The repeated call returns another large copy of the same skill content.
The prompt becomes a stronger induction target for yet another skill_view.

What We Tried

The work progressed from whole-turn replay to token-level diagnosis and then to model-side mitigation attempts. The table below separates what was tested from what was observed.

Experiment	Purpose	Observed result	Interpretation
Two-tool replay	Expose only `skill_view` and `skill_manage`.	Mostly chose `skill_manage`; occasional mismatch.	Under-fit the real environment because production had many visible tools.
Full WhatsApp schema replay	Use the 32-tool runtime-filtered production-like schema.	At cut 12090, 2/8 mismatches; first tools included `skill_view`, `patch`, and `skill_manage`.	The real schema made the failure easier to trigger and introduced mutation-route ambiguity.
Isolated MTP on/off server runs	Check whether speculative MTP was the root cause.	Expanded full-tool sample: MTP on 5/20 mismatch; MTP off 2/20 mismatch.	MTP can amplify marginal decisions, but it is not required.
Prompt-cache tests	Check whether cache reuse caused the loop.	Mismatches occurred with cache disabled and with cache reused.	Prompt cache is not sufficient as a root cause, though it may affect frequency.
Low-level logprob probes	Measure probabilities at `<function=` and `<function=skill`.	At late cut 12107, combined bad path was about 0.930.	The loop became near-argmax at the tool-name level.
Schema ablation: remove `skill_view`	Test whether hiding the tool fixes the raw prompt tendency.	With history unchanged, `_view` still had 0.908 probability after `skill`.	Schema visibility alone is not enough without constrained decoding.
Historical marker rewrite	Rewrite old `skill_view` markers to `inspect_skill`.	The model copied `inspect` with probability 0.9399.	Repeated historical function names were causal copy targets.
Action-first naming	Rename `skill_view/skill_manage` to `view_skill/patch_skill`.	First tool token became `patch` with probability 1.0000 at both early and late cuts.	Tool-name geometry is a strong interface lever.

Probability Findings

Early versus late loop

At the start, the model was still recoverable. After repeated read-only calls, the bad branch became dominant.

Cut	Context state	`skill_manage`	`skill_view`	`patch`
`12090`	First repeated decision point	0.609	0.202	0.189
`12107`	After repeated read-only calls and one user correction	0.055	0.930	0.014

Early bad path

0.202

Late bad path

0.930

How hard it became to break

The cleanest causal curve came from rewriting historical skill_view markers to inspect_skill, while keeping only the latest K exact skill_view markers. The probe then forced <function=skill and measured whether the next token became _view or _manage.

Exact historical `skill_view` markers kept	P(`_view`)	P(`_manage`)	Meaning
0	0.0000	1.0000	Bad branch removed; model chooses mutation continuation.
1	0.4089	0.4821	One exact trace already makes the bad branch competitive.
5	0.9842	0.0158	Five recent traces make the bad branch dominant.
13 / baseline	0.9441	0.0559	Full loop history remains strongly captured by read-only continuation.

Tokens required to break the loop

In the original naming scheme, the model had two practical exits:

Choose patch immediately after <function=.
If it already chose skill, choose _manage instead of _view.

The problem was that by cut 12107, skill was near-certain and _view was the dominant continuation. Action-first names changed the first decision to view versus patch, which moved the late-loop first-token probability to patch=1.0000 in the prompt-level simulation.

Cross-Model Observations

The behavior was not limited to one quantized model. Different models and serving profiles showed different margins, but the same basic repeated historical marker risk appeared repeatedly.

Model or run	P(skill)	P(_view \| skill)	P_bad	Replay / interpretation
BF16 source target-only	0.990897	0.781368	0.774	Clean repeat on sampled seeds; quantization not required.
Quantized Crown v7 prior	~0.9856	~0.9441	~0.930	Clean repeat in prior replay.
Chadrock Saber ROCmFP4/Vulkan d2	0.991836	0.892114	0.885	1/4 poisoned seeds repeated.
Qwopus 27B Chadrock STRIX_LEAN	0.984066	0.913365	0.899	3/4 poisoned seeds repeated across three steps.
StepFun exact profile	0.655835	0.795741	0.521874	Lower first-token risk, but still high bad-path score.
LFM2.5 8B-A1B Q8	0.347585 skill-like	1.000000	0.347585	Often chose write/no-tool in open replay, but forced `skill` completed as `_view` in 8/8 seeds.

A low open-replay failure rate is not proof of safety. LFM sometimes avoided tools entirely, but once the tool-name path entered the historical skill attractor, it still collapsed to read-only skill_view.

Model-Improvement Attempts

After the sandbox identified the action-token failure, follow-up work tried small LoRA-style interventions on the Crown v8 path. These were useful diagnostically, but no adapter was promoted as a clean fix.

What worked partially

The best diagnostic adapter, crown-v8-tail-l39-sixprompt-plus-suffix-s128-44step-lr2e-5, passed the high-signal loop gate. It moved high-risk mutation cases toward skill_manage and drove the measured high-signal loop metrics to zero.

Candidate	Positive result	Why it was not promotable
`sixprompt-plus-suffix`	All four mutation controls passed as `skill_manage`; high-signal loop gate passed.	Read/direct controls regressed; read prompts also got pulled toward `_manage`.
Layer 38:40 suffix-context continuation	Validated that wider late-layer coverage was mechanically possible.	Preserved long-context loop fix but did not repair read-only behavior and damaged ordinary mutation retention.
Branch-preference adapter	Corrected the immediate first branch token in the forced probe.	Free continuation degenerated into repeated `skill` tokens; first-token success was a false sense of progress.

Lessons from rejected adapters

First-token branch probes and full free generation must both pass.
Average training loss or forced margin success can hide continuation collapse.
Positive examples that emphasize generic skill can make the shared-prefix boundary worse.
A real model fix needs context-sensitive behavior: read prompts should allow _view, mutation prompts should prefer _manage.
The training objective must score complete continuations, clean tool-call completion, and anti-repetition, not only the first branch token.

Recommendations

Interface and runtime mitigations

Rename antagonistic tools so read/write intent competes on the first token. Prefer view_skill and patch_skill over skill_view and skill_manage.
Add strict tool-name constrained decoding or a hard grammar/logit mask over currently valid tool names. Schema changes should be enforced during generation, not only parsed afterward.
Compact repeated identical tool-call traces. Replace many full skill_view calls with a semantic state fact such as skill_view(retatrutide-tracking) was called 13 times; latest result unchanged.
Detect plan/action mismatch. If assistant prose says "patch", "update", or "write" and the tool is read-only with no new information need, block or ask for a different action.

Model training and evaluation

Build an adversarial tool-action eval around repeated read-only histories, marker counts, current user intent, assistant plan text, and expected tool class.
Include SFT and preference examples where the same assistant prose is paired with a chosen mutation tool and a rejected repeated read-only tool.
Score P(_manage | <function=skill) versus P(_view | <function=skill) across marker counts 0, 1, 2, 5, and 10+.
Treat first-token probes as diagnostic gates, not final acceptance tests. Full continuation quality must pass before a model or adapter is promoted.

What not to rely on

Do not rely on temperature or top-p changes once the bad branch is near argmax.
Do not treat MTP, KV type, or GGUF quantization as the primary root cause.
Do not hide a stale tool from the schema unless decoding is constrained to valid current tools.
Do not accept an adapter just because it passes a first-token branch probe.

Limitations

The original raw Jun 10 request dumps or stream traces were not found, so the replay uses durable persisted state: system prompt, model config, assistant tool calls, tool ids, and tool results from SQLite.
Several marker-count ablations are one sample per K. The threshold shape is strong, but exact non-monotonic differences such as 5 versus 13 traces should not be over-interpreted.
Replay is seed-sensitive. Low-level logprob probes are therefore more useful for causal diagnosis, while full replay remains necessary for acceptance.
Some later invalid eval directories contained connection errors after a prompt-cache crash and should not be treated as model-quality results.

Artifact Map

This report is derived from local artifacts and notes. Key files:

Artifact	Path
Replay sandbox README	`/home/crown/hermes-loop-replay-sandbox/README.md`
Loop message fixture	`/home/crown/hermes-loop-replay-sandbox/fixtures/messages_12080_12121.json`
Initial findings	`/home/crown/hermes-loop-replay-sandbox/research/INITIAL_FINDINGS_2026-06-10.md`
Low-level probing report	`/home/crown/hermes-loop-replay-sandbox/research/LOW_LEVEL_PROBING_2026-06-11.md`
Model-improvement guidance	`/home/crown/hermes-loop-replay-sandbox/research/MODEL_IMPROVEMENT_GUIDANCE_2026-06-11.md`
Experiment index	`/home/crown/hermes-loop-replay-sandbox/research/EXPERIMENT_INDEX.md`
BF16 source repro notes	`/home/crown/hermes-loop-replay-sandbox/experiments/bf16_source_live/20260611T085112Z/NOTES.md`
Qwopus 27B smell test	`/home/crown/hermes-loop-replay-sandbox/experiments/model_smell_tests/qwopus3.6-27b-v2-chadrock-strix-lean-mtp/20260611T114120Z/NOTES.md`
LFM smell test	`/home/crown/hermes-loop-replay-sandbox/experiments/model_smell_tests/lfm25-8b-a1b-q8/20260611T165725Z_no_mmap_tool_test/NOTES.md`
Crown v8 adapter status	`/srv/desktop-data/cirudata/research/hermes-tool-loop/crown_v8_tool_loop_eval/STATUS.md`
Crown model-improvement path report	`/srv/desktop-data/cirudata/research/hermes-tool-loop/crown_v8_tool_loop_eval/reports/crown_model_improvement_path_2026-06-11.md`