Maybe there's a fundamental miscommunication here about what evals are?
Evals apply not just to LLMs but to skills, prompts, tools, and most things that change the behavior of compound AI systems, and especially to productivity claims like the ones being put forth in this thread.
The features in the post relate directly to heavily researched areas of agents that are regularly benchmarked and evaluated. They're not obscure; e.g., another recent HN frontpage item benchmarked on research and planning.
your question makes sense, it's just not in current scope
we are still benchmarking the compiler at scale, and the LLM tools were built as functional prototypes to showcase a single use case of the compiler
since much of the unlock here is finding different applications for the compiler itself, we simply don't have the bandwidth to do much benchmarking on these projects on top of maintaining the repos themselves
all the code is open source and there is nothing stopping anyone from running their own benchmarks if they are curious