while i cant speak to arbitrary prompt injections, ive been using a simple approach i add to any llm harness i use that seems to make turn or role confusion no longer remotely viable.
i really need to test my toolkit (carterkit) augmented harnesses on some of the more respectable benchmarks
im pretty stoked about the llm harness theyre using, cause i wrote all the code thats not mono pi code in that fork!
despite its paucity of features, the changes i landed in it from my design notes have made the comparative ux and llm behavior so smooth that its been my daily driver since i stood it up.
since early december ive had to run a patch script on every update of claude code to make it stop undermining me. i didnt need a hilarious code leak to find the problematic strings in the minified js ;)
i regard punkin-pi as a first stab at translating ideas ive had over the past 6 months for reliable llm harnesses. i hit some walls in the mono pi architecture that block much further improvement there.
so Im working on the next gen of agent harnesses! stay tuned!
yeah its honestly full of vibe fixes to vibe hacks with no overarching design. some great little empirical observations though! i think the only clever bit relative to my own designs is just tracking time since last cache hit to check ttl. idk why i hadnt thought of that, but makes perfect sense
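the last-hit ttl idea can be sketched like this — purely illustrative, not the actual code from that fork; the class and names are made up:

```python
import time

# illustrative sketch of ttl-via-last-hit: instead of expiring from
# creation time, record the timestamp of the last cache hit and check
# elapsed time against the ttl on each lookup
class TtlEntry:
    def __init__(self, value, ttl_seconds):
        self.value = value
        self.ttl = ttl_seconds
        self.last_hit = time.monotonic()  # refreshed on every hit

    def get(self):
        now = time.monotonic()
        if now - self.last_hit > self.ttl:
            return None  # expired: too long since the last hit
        self.last_hit = now  # each hit restarts the ttl window
        return self.value
```

the upshot is that an entry stays warm as long as its being used, and only falls out of cache after a full ttl of silence.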
the main thing ive been hacking on recently is what i consider to be the first next gen llm harness. ive got a demonstrator that implements like 40 percent of what ive got pretty complete specs for, built on top of mono pi. theres some pretty big differences in overall reasoning and reliability when i run most of the useful sota frontier models with all my pieces. early users have reported the models actually feel more cozy, more reliable, and show a teeny bit more reasoning capacity
omg this is so cool.
because im writing my own harness and i need some cognitive benchmarks. i have a bunch of harness level infra around llm interactions that seems to help with reasoning, but i dont have a structured way to evaluate things
thx for sharing your test setup, i really appreciate the time you took. this will help me so much
they just released the first small models that i would consider even vaguely articulate for edge inference involving a human. maybe they want to do a mistral and raise a kajillion and work from their home town?