We reached a similar conclusion for GFQL (oss graph dataframe query language), where we needed an LLM-friendly interface to our visualization & analytics stack, especially without requiring a code sandbox. We realized we can do quite rich GPU visual analytics pipelines with some basic extensions to opencypher . Doing SQL for the tabular world makes a lot of sense for the same reasons!
For the GFQL version (OpenCypher), an example of data loading, shaping, algorithmic enrichment, visual encodings, and first-class pipelines:
We have this issue in GFQL right now. We wrote the first OSS GPU cypher query language impl, where we make a query plan of gpu-friendly collective operations... But today their steps are coordinated via the python, which has high constant overheads.
We are looking to shed something of the python<->c++<->GPU overheads by pushing macro steps out of python and into C++. However, it'd probably be way better to skip all the CPU<>GPU back-and-forth by coordinating the task queue in the GPU to beginwith . It's 2026 so ideally we can use modern tools and type as safety for this.
Note: I looked at the company's GitHub and didn't see any relevant oss, which changes the calculus for a team like our's. Sustainable infra is hard!
This is great work by Dawn Song 's team. A huge part of botsbench.com for comparing agents & models for investigation has been in protecting against this kind of thing. As AI & agents keep getting more effective & tenacious, some of the things we've had to add protections against:
- Contamination: AI models knowing the answers out of the gate b/c pretraining on the internet and everything big teams can afford to touch. At RSAC for example, we announced Anthropic's 4.6 series is the first frontier model to have serious training set contamination on Splunk BOTS.
- Sandboxing: Agents attacking the harness, as is done here - so run the agent in a sandbox, and keep the test harness's code & answerset outside
- Isolation: Frontier agent harnesses persist memory all over the place, where work done on one question might be used to accelerate the next. To protect against that, we do fresh sandboxing per question. This is a real feature for our work in unlocking long-horizon AI for investigations, so stay tuned for what's happening here :)
"You cannot improve what you cannot measure" - Lord Kelvin
Instead of scanning more code, afaict what you seem to want is instead, scan on the same small area, and compare on how many FPs are found there. A common measure here is what % of the reported issues got labeled as security issues and fixed. I don't see Mythos publishing on relative FP rate, so dunno how to compare those. Maybe something substantively changed?
At the same time, I'm not sure that really changes anything because I don't see a reason to believe attacks are constrained by the quality of source code vulnerability finding tools, at least for the last 10-15 years after open source fuzzing tools got a lot better, popular, and industrialized.
This might sound like a grumpy reply, but as someone on both sides here, it's easy to maintain two positions:
1. This stuff is great, and doing code reviews has been one of my favorite claude code use cases for a year now, including security review. It is both easier to use than traditional tools, and opens up higher-level analysis too.
2. Finding bugs in source code was sufficiently cheap already for attackers. They don't need the ease of use or high-level thing in practice, there's enough tooling out there that makes enough of these. Likewise, groups have already industrialized.
There's an element of vuln-pocalypse that may be coming with the ease of use going further than already happening with existing out-of-the-box blackbox & source code scanning tools . That's not really what I worry about though.
Scarier to me, instead, is what this does to today's reliance on human response. AI rapidly industrializes what how attackers escalate access and wedge in once they're in. Even without AI, that's been getting faster and more comprehensive, and with AI, the higher-level orchestration can get much more aggressive for much less capable people. So the steady stream of existing vulns & takeovers into much more industrialized escalations is what worries me more. As coordination keeps moving into machine speed, the current reliance on human response is becoming less and less of an option.
We find it true in Louie.ai evals (ai for investigations), about a 10-20% lift which meaningful. It'd measured here: botsbench.com .
Unfortunately, undesirable in practice due to people being token-constrained even before. One case is retrying only on failure, but even that is a bit tricky...
I've found value in architectural research before r&d tier projects like big changes to gfql, our oss gpu cypher implementation. It ends up multistage:
- deep research for papers, projects etc. I prefer ChatGPT Pro Deep Research here As it can quickly survey hundreds of sources for overall relevance
- deep dives into specific papers and projects, where an AI coding agent downloads relevant papers and projects for local analysis loops, performs technical breakdowns into essentially a markdown wiki, and then reduces over all of them into a findings report. Claude code is a bit nicer here because it supports parallel subagents well.
- iterative design phase where the agent iterates between the papers repos and our own project to refine suggestions and ideas
Fundamentally, this is both exciting, but also limiting: It's an example of 'Software Collapse' where we get to ensure best practices and good ideas from relevant communities, but the LLM is not doing the creativity here, just mashing up and helping pick.
Tools to automate the stuff seems nice. I'd expect it to be trained into the agents soon as it's not far from their existing capabilities already. Eg, 'iteratively optimize function foobar, prefer GPU literature for how.'
Quality indie software in a niche that Ikea is not addressing can make a decent income unlike a lemonade stand.
And unlike at (this hypothetical) Ikea, you wouldn't have to maintain the impression of 20x AI-augmented output to avoid being fired. Well, you could still use AI as much as you want, but you wouldn't have to keep proving you're not underusing it.
Evals let us agree on the baseline, measurement, etc, and compare if simple things others do perform just as well. For same reason, instead of 'works on my box' and 'my coding style', use one of the many community evals vs making up your own benchmark.
That helps head off much of many of the unfalsifiable discussions & claims happening and moves everyone forward.
a rust version of that compiler (that the project runs on) ran at 480k claims/sec and it was able to deterministically resolve 83% of conflicts across 1 million concurrent agents (also 393,275x compression reduction @ 1m agents on input vs output, but different topics can make the compression vary)
natively claude (and other LLM) will resolve conflicting claims at about 51% rate (based on internal research)
the built in byzantine fault tolerance (again, in the compiler) is also pretty remarkable, it can correctly find the right answer even if 93% of the agents/data are malicious (with only 7% of agents/data telling us the correct information)
basically the idea here is if you want to build autonomous at scale, you need to be able to resolve disagreement at scale and this project does a pretty nice job at doing that
My question was on claims like "5x productivity boost in merged PRs (lots of open PR & merge rate goes down, but net positive)", eg, does this change anything on swe-bench or any other standard coding eval?
The ecosystem is 8 tools plus a claude code plugin, the unlock was composing those tools (I don't regularly use all 9). The 5x claim was from /insights (claude code)
Not for everyone, but it radically changed how I build. Senior engineer, 10+ years
Now it's trivial to run multiple projects in parallel across claude sessions (this was not really manageable before using wheat)
Genuinely don't remember the last time I opened a file locally
It sounds like the answer is "No, there is no repeatable eval of the core AI coding productivity claim, definitely not on one of the many AI coding benchmarks in the community used for understanding & comparison, and there will not be"
Maybe there's a fundamental miscommunication here of what evals are?
Evals apply not just to LLMs but to skills, prompts, tools, and most things changing the behavior of compound AI systems, and especially like the productivity claims being put forth in this thread.
The features in the post relate directly to heavily researched areas of agents that are regularly benchmarked and evaluated. They're not obscure, eg, another recent HN frontpage item benchmarked on research and planning.
your question makes sense, it's just not in current scope
we are still benchmarking the compiler at scale and the LLM tools that were made were created as functional prototypes to showcase a single example of the compiler's use case
since much of the unlock here is finding different applications for the compiler itself, we simply don't have the bandwidth to do much benchmarking on these projects on top of maintaining the repos themselves
all the code is open source and there is nothing stopping anyone from running their own benchmarks if they were curious
We reached a similar conclusion for GFQL (oss graph dataframe query language), where we needed an LLM-friendly interface to our visualization & analytics stack, especially without requiring a code sandbox. We realized we can do quite rich GPU visual analytics pipelines with some basic extensions to opencypher . Doing SQL for the tabular world makes a lot of sense for the same reasons!
For the GFQL version (OpenCypher), an example of data loading, shaping, algorithmic enrichment, visual encodings, and first-class pipelines:
- overall pipelines: https://pygraphistry.readthedocs.io/en/latest/gfql/benchmark...
- declarative visual encodings as simple calls: https://pygraphistry.readthedocs.io/en/latest/gfql/builtin_c...
reply