Hacker Newsnew | past | comments | ask | show | jobs | submit | carterschonwald's commentslogin

while i cant speak regarding arbitrary prompt injections, ive been using a simple approach i add to any llm harness i use, that seems to solve turn or role confusion being remotely viable.

i really need to test my toolkit (carterkit) augmented harnesses on some of the more respectavle benchmarks


im pretty stoked about the llm harness theyre using. cause I wrote all the code thats not monopi code in that fork!

despite it’s paucity of features, the changes i landed in it from my design notes actually have been so smooth in terms of comparative ux/ llm behavior that its my daily driver since ive stood it up.

Previously, since early december, ive had to run a patch script on every update of claude code to make it stop undermining me. I didnt need a hilarious code leak to find the problematic strings in the minified js ;)

I regard punkin-pi as a first stab at translating ideas ive had over the past 6 months for reliable llm harnesses. I hit some walls in terms of mono pi architecture for doing much more improvement with mono pi.

so Im working on the next gen of agent harnesses! stay tuned!


yeah its honestly full of vibe fixes to vibe hacks with no overarching desig. . some great little empirical observations though!i think the only clever bit relative to my own designs is just tracking time since last cache ht to check ttl. idk why i hadnt thought of that, but makes perfect sense


this is black bar grade great. give us black bar


the main thing ive been hacking on recently is what i consider to be the first next gen llm harness, ive a demonstrator that implements like 40percent of what ive pretty complete specs for on top of mono pi. theres some pretty big differences in overall reasoning and reliability when i run most useful sota frontier models with all my pieces. early users have reported the models actually are more cozy, reliable and have a teeny bit more reasoning capacity


omg this is so cool. because im writing my own harness and i need some cognitive benchmarks. i have a bunch of harness level infra around llm interactions that seems to help with reasoning, but i dont have a structured way evaluate things

thx for sharing your test setup, i really appreciate the time you took. this will help me so much


they just released the first small models that i would consider even vaguely articulate for edge inference involving a human. maybe they want to do a mistral and raise a kajillion and work from their home town?


What does do a mistral mean?


MistralAI is known for their smaller models on the edge, to avoid competing with Gemini & OpenAI directly.


Who knows if OpenAI will do a refresh, but gpt-oss-20B/120B are still some of the best edge models so far.


oh?! what do they handle well? how do they fail?

the 3.5 9b model on my laptop at full fp8 is outlandish in its seeming reasoning capacity, though i haven’t really stress tested it


you need to merge updated tool call docs into your prompt


somehow this counts like model cot.


static linking va dynamic but we dont know the actual config and setup. and also the choice of totally changes the problem


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: