while i cant speak to arbitrary prompt injections, ive been using a simple approach i add to any llm harness i use that seems to make turn or role confusion no longer remotely viable.
i really need to test my toolkit (carterkit) augmented harnesses on some of the more respectable benchmarks
im pretty stoked about the llm harness theyre using, cause i wrote all the code thats not mono pi code in that fork!
despite its paucity of features, the changes i landed in it from my design notes have made the comparative ux and llm behavior so smooth that its been my daily driver since i stood it up.
since early december ive had to run a patch script on every update of claude code to make it stop undermining me. i didnt need a hilarious code leak to find the problematic strings in the minified js ;)
i regard punkin-pi as a first stab at translating ideas ive had over the past 6 months for reliable llm harnesses. i hit some walls in the mono pi architecture that block much further improvement there.
so Im working on the next gen of agent harnesses! stay tuned!
yeah its honestly full of vibe fixes to vibe hacks with no overarching design. some great little empirical observations though! i think the only clever bit relative to my own designs is just tracking time since last cache hit to check ttl. idk why i hadnt thought of that, but makes perfect sense
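the last-hit ttl idea can be sketched like this — purely illustrative, not the actual code from that fork; the class and names are made up:

```python
import time

# illustrative sketch of ttl-via-last-hit: instead of expiring from
# creation time, record the timestamp of the last cache hit and check
# elapsed time against the ttl on each lookup
class TtlEntry:
    def __init__(self, value, ttl_seconds):
        self.value = value
        self.ttl = ttl_seconds
        self.last_hit = time.monotonic()  # refreshed on every hit

    def get(self):
        now = time.monotonic()
        if now - self.last_hit > self.ttl:
            return None  # expired: too long since the last hit
        self.last_hit = now  # each hit restarts the ttl window
        return self.value
```

the upshot is that an entry stays warm as long as its being used, and only falls out of cache after a full ttl of silence.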
the main thing ive been hacking on recently is what i consider to be the first next gen llm harness. ive got a demonstrator that implements like 40 percent of what ive got pretty complete specs for, built on top of mono pi. theres some pretty big differences in overall reasoning and reliability when i run most of the useful sota frontier models with all my pieces. early users have reported the models actually feel more cozy, more reliable, and show a teeny bit more reasoning capacity
omg this is so cool.
because im writing my own harness and i need some cognitive benchmarks. i have a bunch of harness level infra around llm interactions that seems to help with reasoning, but i dont have a structured way to evaluate things
thx for sharing your test setup, i really appreciate the time you took. this will help me so much
they just released the first small models that i would consider even vaguely articulate for edge inference involving a human. maybe they want to do a mistral and raise a kajillion and work from their home town?