Not misunderstanding. And I had assumed what you described at first as well.
All I see now is celebration of how agents run for hours and handle “long-time horizons.”
Although the original definition is also flawed for coding. How do you estimate the time it takes to complete a coding task in hours? If we had that formula, why have we been playing estimation poker or resorting to the Fibonacci sequence for predicting software tasks? Because you can’t. It’s a made-up metric.
Then why did you write "Also, it’s super easy to game. Insert random lags, reduce tokens/sec, there you have a model that maintains attention over “long-time horizons”"?
The wall-clock time the LLM spends per task isn't the metric. How long you can leave the LLM alone, in wall-clock time, without intervention isn't "long-time horizons"; it's more like "I gave it a list of tasks and it worked through them". Which is neat when it works, but different.
> All I see now is celebration of how agents run for hours and handle “long-time horizons.”
Yes? And? The long time horizon is measured with reference *to how long it would take humans to do*. Of course this is celebrated. When I've experimented with them, quite often after finishing one task from the plan, they'll go right on to the next task. Each task may take minutes, but the plan can have hundreds of items in it, and hundreds of minute-by-the-clock tasks do indeed add up to hours.
You're literally, in your opening sentence, complaining about 2 + 2 taking longer to solve. That isn't even close to the point of the "time horizons" metric.
> How do you estimate the time it takes to complete a coding task in hours? If we had that formula, why have we been playing estimation poker or resorting to the Fibonacci sequence for predicting software tasks? Because you can’t. It’s a made-up metric.
Mostly it wasn't estimated, but rather *measured*:
> 2.2 Baselining
> In order to ground AI agent performance, we also measure the performance of multiple human “baseliners” on most tasks and recorded the duration of their attempts. In total, we use over 800 baselines totaling 2,529 hours, of which 558 baselines (286 successful) come from HCAST and RE-Bench, and 249 (236 successful) from the shorter SWAA tasks. 148 of the 169 tasks have human baselines, but we rely on researcher estimates for 21 tasks in HCAST.
> Our baseliners are skilled professionals in software engineering, machine learning, and cybersecurity, with the majority having attended world top-100 universities. They have an average of about 5 years of relevant experience, with software engineering baseliners having more experience than ML or cybersecurity baseliners. For more details about baselines, see Appendix C.1.
As with all the other metrics, this is now basically saturated, as nobody seems to want to pay METR $4M to hire a statistically significant number of engineers to spend 4h-1w on each of another 800 baselines for longer tasks. Or if they are, it's being kept very quiet.
If you do sprint planning, you can figure out which tasks have a short enough time horizon that the AI is competent enough to actually do them correctly, and break down the tasks which it can't do reliably. If this sentence is confusing, see my answer to you in the other thread.
Sprints also give you insight into when development velocity (per token rather than per time, though for both humans and AI this works out as money) slows down from technical debt.
Or at least it can in principle; I think the speed of AI is likely to make management less of an art and more of a science, with things like "technical debt" and "task estimation" going from gut feeling to something quantifiable, and this may in turn end up replacing Agile (and all the others) with something new.
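To make the sprint-planning triage above concrete, here is a minimal sketch, not anything from the METR paper. It assumes you already have a human-time figure per task (measured or estimated) and a horizon in hours below which you trust the agent. The `Task` class, the `triage` function, the backlog entries and the 2-hour threshold are all hypothetical, made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    human_hours: float  # assumed: time a skilled human would need for this task


def triage(tasks, horizon_hours):
    """Split tasks into those within the horizon and those that need breaking down."""
    delegate = [t for t in tasks if t.human_hours <= horizon_hours]
    break_down = [t for t in tasks if t.human_hours > horizon_hours]
    return delegate, break_down


# Hypothetical sprint backlog; names and hours are invented for this example.
backlog = [
    Task("rename config flag", 0.25),
    Task("add pagination to list endpoint", 2.0),
    Task("migrate auth to new provider", 16.0),
]

# Assumed horizon: tasks a human would finish in 2 hours or less go to the agent.
delegate, break_down = triage(backlog, horizon_hours=2.0)
print("hand to the agent:", [t.name for t in delegate])
print("break down first:", [t.name for t in break_down])
```

The threshold is the part you'd have to calibrate yourself, by tracking which task sizes the agent actually completes correctly in your own codebase.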
You may be attributing way too much to what you are doing. And that will make it hard to accept the inevitable negative chance outcomes that will be entirely out of your control.
I know parents whose first kid slept through the night at 3 months, and whose second one still wasn't sleeping through the night at age 3. Skill issue? I don’t think so. And these people are such routine enforcers that they described themselves as “stubborn.”
And then there is sickness, and the amount of sun and physical activity the child gets during the day, which will depend on geography and the kid’s personality.
Our 6 year old daughter sits down and does a ton of art. Her 2 year old sister runs laps around the house for fun. Her favorite activity is running and slamming herself into the couch. Do you think these kids get similar physical activity? What if I told you they go to sleep around the same time and have no trouble waking up?
Edit: Forgot to mention night terrors. Doctor told us about it for the first one. Had no idea what he meant, and didn’t even care to look it up because it didn’t happen. Until the 2nd one hit 15 months or so. Imagine a barely 1 year old in an extremely confused state while asleep, sitting in her bed, screaming, sometimes hitting her head on the sides of the bed, getting more agitated if you pick her up. I read that it can last up to 30 minutes. Thank god ours were no longer than 5 minutes. It’s horrific when it happens for the first time. Straight out of the Exorcist movie.
I remember talking to a parent recently and their kid didn't sleep for a long time. It turned out to be some undiagnosed something-or-other (allergy?). Can't recall the specifics but sleeping issues cleared up for the kid - and parents - when it was treated.
Sorry, can't remember the specifics. I'm sure I will recall them the moment the edit link disappears on this post... :)
What could a 6-24 month old possibly do from their bed in their room, to disturb your sleep in your bed in your room? Bring a trumpet to bed and badly play Miles Davis?
What happened to lights off, door closed, do whatever you want in complete darkness in the bed that you aren’t able to climb out of?
I might have too French of an attitude towards parenting for American taste, but as long as the crying and screaming isn’t based on anything real (and as long as you’ve childproofed the children’s room well enough, it shouldn’t be), the child will be fine, and what y’all need is sufficient distance between the bedrooms, some nice, solid brick walls in between the rooms, and some earplugs.
What is real? Is poop in the diaper real enough to cry for? Is fever real enough? How do you know whether the kid is not having a fever seizure or crying because of poop in diaper without going in there? How about night terrors where the kid is throwing themselves against their crib?
The fever seizure happened to a friend’s kid. Ambulance showing up, etc. Seizures can cause permanent damage. This kid wasn’t premature and he has no other obvious differences than our kids. Am I supposed to believe/pray it can’t happen to me and just sleep?
I like sleep, but I don’t like it that much. What’s the point of having a kid if I am supposed to ignore their needs? Make sure they pay into social security so I can retire or something?
The key phrase from the article is "review content filmed on its smart glasses when people shared it with Meta AI". I take that to mean the user took some action to actively share the footage with Meta (although knowing Meta, that could also mean they just didn't opt out)
I’m not being facetious when I say: are they that slow or really suffering from Messiah Complex?
I have no problem that they are doing what they’re doing. Someone was going to do it. But to be so oblivious to it is a problem. One would argue that it’s a national security problem.
There is no need for a new name. It’s called a high-impact change. As opposed to a low-impact change, where one changes or adds the least number of lines necessary to achieve the goal.
Not surprised to see this. Because some of us didn’t like history as a subject, lines of code is once again a performance measure, like a pissing contest.