LLMs work OK for "Mostly iterative and mostly one-off" tasks like codegen, where you can effectively "review the result into existence", and that's where most of the buzz is at the moment.
Where they don't work at all well is for hands-off repeatable tasks that have to be correct each time. If you ask a LLM for advice, it will tell you that you need to bound such tasks with a deterministic input contract and a deterministic output contract, and then externally validate the output for correctness. If you need to do that, you can probably do the whole thing old-skool with not much more effort, especially if you use a LLM to help gen the code, as above. That's not a criticism of LLMs, it's just a consequence of the way they work.
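To make the contract pattern concrete, here's a minimal Python sketch (call_llm and the "total" field are placeholders, not any particular SDK): the output must parse and validate before it's trusted, and anything else is retried or rejected.

```python
import json

def run_task(call_llm, prompt: str, max_attempts: int = 3) -> dict:
    """call_llm: any function that takes a prompt string and returns text."""
    for _ in range(max_attempts):
        raw = call_llm(prompt)
        try:
            result = json.loads(raw)  # output contract: must be valid JSON
        except json.JSONDecodeError:
            continue  # contract violated, retry
        # external validation: check the shape before trusting the content
        if isinstance(result, dict) and isinstance(result.get("total"), int):
            return result
    raise ValueError("LLM output failed validation on every attempt")
```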
They are also prone to the most massive brain farts even in areas like coding - I asked a LLM to look for issues in some heavily multithreaded code. Its "High priority fix" for an infrequently used slow path that checked for uniqueness under a lock before creating an object was to replace that with: take a read lock, copy the entire data structure under the lock, drop the lock, check for uniqueness outside of any lock, then take a write lock and insert the new object. Of course as soon as I told it it was a dumbass it instantly agreed, but if I'd told it to JFDI its suggestions it would have changed correct code into badly broken code.
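For anyone who doesn't write multithreaded code, here's a cut-down hypothetical illustration (not the original code) of why that suggestion is broken - checking uniqueness outside the lock opens a race window where two threads can both decide the key is absent and both insert:

```python
import threading

_lock = threading.Lock()
_objects: dict[str, object] = {}

def create_unique(key: str) -> object:
    # Correct: the uniqueness check and the insert happen under one lock.
    with _lock:
        if key not in _objects:
            _objects[key] = object()
        return _objects[key]

# The LLM's suggested rewrite amounted to:
#   with read_lock: snapshot = copy(_objects)
#   if key not in snapshot:        # <- race: another thread can insert here
#       with write_lock: _objects[key] = object()
# which turns a correct slow path into a duplicate-insertion bug.
```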
Like anything else that's new in the IT world, a useful tool that's over-hyped as sweeping away everything that came before it and that's gleefully jumped on by PHBs as a reason to get rid of those annoying humans. Things will settle down eventually and it will find its place. I'm just thankful I'm in the run up (down?) to retirement ;-)
He's also missed a major step, which is to feed your skill into the LLM and ask it to critique it - after all, it's the LLM that's going to act on it, so asking it to assess first is kinda important. I've done that for his skills, here's the assessment:
==========
Bottom line
Against the agentskills.io guidance, they look more like workflow specs than polished agent skills.
The largest gap is not correctness. It is skill design discipline:
- stronger descriptions,
- lighter defaults,
- less mandatory process,
- better degraded-mode handling,
- clearer evidence that the skills were refined through trigger/output evals.
Skill Score/10
write-a-prd 5.4
prd-to-issues 6.8
issues-to-tasks 6.0
code-review 7.6
final-audit 6.3
==========
LLM metaprogramming is extremely important. I've just finished a LLM-assisted design doc authoring session where one of the LLM's own recommendations was "Don't use a LLM for that part, it won't be reliable enough".
> "Don't use a LLM for that part, it won't be reliable enough".
You should now ask if the LLM is reliable enough when it says that.
Jokes aside, how is this a major step he is missing? He is using those skills to be more efficient. How important is going against agentskills.io guidance?
Because he's asking the LLM to interpret those instructions to drive his process. If the skills are poorly defined or incomplete then the process will be as well, and the LLM may misinterpret, choose to ignore, or add its own parts.
Skills are just another kind of programming, albeit at a pretty abstract level. A good initial review process for a Skill is to ask the LLM what it thinks the Skill means and where it thinks there are holes. Just writing it and then running it isn't sufficient.
Another tip is to give the Skill the same input in multiple new sessions - to stop state carryover - collect the output from each session and then feed it back into the LLM and ask it to assess where and why the output was different.
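A sketch of what that loop can look like in Python (run_skill is a placeholder for however you invoke the Skill in a fresh session):

```python
def consistency_probe(run_skill, skill_input: str, runs: int = 5) -> str:
    """Run the identical input N times in separate sessions, then build a
    prompt asking the LLM to assess where and why the outputs differ."""
    outputs = [run_skill(skill_input) for _ in range(runs)]
    report = "\n\n".join(f"--- run {i + 1} ---\n{out}"
                         for i, out in enumerate(outputs))
    return ("These are outputs from identical inputs in separate sessions. "
            "Identify where they differ and why:\n\n" + report)
```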
Oh dear, I thought you were merely being sarcastic in your first comment. But you seem to have been fully converted to the LLM-religion, and to actually believe they "think" or "know" anything?
People have applied "think" to the actions of software for decades. Of course LLMs don't "think" in the human sense, but "what the output of the model indicates in an approximate way about its current internal state" is a bit long-winded...
Maybe people who don't understand technology did, I can see that - my grandpa also thought the computer was thinking when the Windows hourglass showed up. Today maybe it's the case again with the folks who don't know anything about it - you know the meme: ChatGPT always gives me correct answers for the domains I am not an expert in!
"Not in the slightest" is an overreach, the paper the second level down from that link doesn't really support the conclusion in the blog post - the paper is much more nuanced.
Are they going to fib to you sometimes? Yes of course, but that doesn't mean there's no value in behavioural metaqueries.
Like most new tech, the discussion tends to polarise into "Best thing evah!" and "Utter shite!" The truth is somewhere in between.
It's nothing like "most new tech".
Most new tech tends to be adopted early by young people and experienced techies. In this case it is mostly the opposite: The teens absolutely hate it, probably because the shitty AI content does not inspire the young mind, and the experienced techies see it for what it is. I've never seen such "new tech" which was cheered on by the proverbial average "boomers" (i.e. old people doing "office jobs", not the literal age bracket) and despised by the young folks and experienced experts of all ages.
Judging from Claude Code and the sheer number of “Make Your Favorite Anime Crush Into An AI” SaaSes on the market, I’d posit that both the young and experienced are quite enthusiastic about the new tech.
No mate, this tech is marketed as superintelligence. A nation of PhDs in a datacenter. Yadda, yadda, yadda. No in-betweens please. Why is it not delivering after so many years and hundreds of billions in investment?
Name me a new bit of tech that hasn't been hyped beyond reasonable bounds. And yes, this is one of the worst examples. But saying it doesn't have its uses isn't reasonable either.
None was hyped like this ever before. What are you talking about? Mac was about "it just works" (and it f*ing did), iPhone was "a phone, an iPod and an Internet access device". Need more? Microsoft Excel - actually more powerful, if you know the tool, than the bullshit machine. C#, the programming language: "Java done right". And it bloody was! What they have in common: none of these techs was hyped beyond reasonable bounds. They were hyped a bit, but not to the level of bullshit LLMs. And none of these techs claimed to do incredible stuff only to underdeliver. After so much money burnt, yes, I want to see that nation of PhDs. I want to see AI "writing all the code" in six months (Anthropic claimed this in January this year). Enough of the bullshit, and of people being told they are stupid for not knowing how to win the lottery system, and of comparing lottery systems. Show me the superintelligence or shut the f. up.
Do these scores actually mean anything? Isn’t the LLM just making up something? If you ran the exact same prompt through 10 times would you get those same scores every single time?
Yes, I'd be interested in that answer too - these scores are most likely generated in an arbitrary way, given how LLMs work. In generating the text it didn't actually keep a running score and add to it each time it found a plus point in the skill, as a human might when evaluating something.
At this point I'd discount most advice given by people using LLMs, because most of them don't recognise the inadequacies and failure modes of these machines (like the OP here) and just assume that because output is superficially convincing it is correct and based on something.
Do these skills meaningfully improve performance? Should we even need them when interacting with LLMs?
They aren't arbitrary. As I said earlier, I got the LLM to do a detailed analysis first, then summarise. If I was doing this "properly" for something I was doing myself, I'd go through the LLM summary point by point, challenge anything I didn't think was right, and fix things in the skill where I thought it was correct.
You aren't going to have much success with LLMs if you don't understand that their primary goal is to produce plausible and coherent responses rather than ones that are necessarily correct (although they may be - hopefully).
And yes, Skills *do* make a significant difference to performance, in exactly the same way that well written prompts do - because that's all they really are. If you just throw something at a LLM and tell it "do something with this" it will, but it probably won't be what you want and it will probably be different each time you ask.
It would be interesting to see one of these evals and how it generated the score, to work out whether it is in fact arbitrary or based on some scale of points.
I found the summary above devoid of useful advice, what did you see as useful advice in it?
> if you don't understand that their primary goal is to produce plausible and coherent responses rather than ones that are necessarily correct (although they may be - hopefully).
If you really believe this you should perhaps re-evaluate the trust you appear to place in the conclusions of LLMs, particularly about their own workings and what makes a good skill or prompt for them.
> It would be interesting to see one of these evals and how it generated the score, to work out whether it is in fact arbitrary or based on some scale of points.
So go repeat the exercise yourself. I've already said this was a short-enough-to-post rollup of a much longer LLM assessment of the skills and that while most of the points were fair, some were questionable. If you were doing this "for real" you'd need to assess the full response point-by-point and decide which ones were valid.
> If you really believe this you should perhaps re-evaluate the trust you appear to place in the conclusions of LLMs, particularly about their own workings and what makes a good skill or prompt for them.
What on earth are you on about? The whole point of the sentence you were replying to was that you can't blindly trust what comes out of them.
I'm saying that your agreement that they produce plausible but sometimes false text is contradicted by the trust you seem to have in their output and self-analysis, which is plausible but unlikely to be correct.
Yes of course there's a risk it may still be incorrect but querying the LLM with the limited facilities it provides for introspection is more likely to have at least some connection with facts than the alternative that some people use, which is to simply guess as to why it produced the output it did.
If you have an alternative approach, please share.
No, of course you wouldn't - LLMs are nondeterministic. But the scores would likely be in the same ballpark. The scores I posted are the result of a much more detailed analysis done by the LLM, which was far too long to post. I eyeballed it, most of the points seemed fair, so I asked it to summarise and convert into scores.
There is no evidence of this. Evals are quite different from "self-evals". The only robust way of determining if LLM instructions are "good" is to run them through the intended model lots of times and see if you consistently get the result you want. Asking the model if the instructions are good shows a very deep misunderstanding of how LLMs work.
When you give prompt P to model M, when your goal is for the model to actually execute those instructions, the model will be in state S.
When you give the same prompt to the same model, when your goal is for the model to introspect on those instructions, the model is still in state S. It's the exact same input, and therefore the exact same model state as the starting point.
Introspection-mode state only diverges from execution-mode state at the point at which you subsequently give it an introspection command.
At that point, asking the model to e.g. note any ambiguities about the task at hand is exactly equivalent to asking it to evaluate any input, and there is overwhelming evidence that frontier models do this very well, and have for some time.
Asking the model, while it's in state S, to introspect and surface any points of confusion or ambiguities it's experiencing about what it's being asked to do, is an extremely valuable part of the prompt engineering toolkit.
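Concretely, that step is nothing more exotic than appending an introspection instruction to the same input the model would otherwise execute - a minimal sketch, with the wording purely illustrative:

```python
def introspection_prompt(task_text: str) -> str:
    """Give the model the task exactly as it would receive it, then ask for
    points of confusion *before* any execution command is issued."""
    return (task_text
            + "\n\nBefore doing anything: list every ambiguity, gap or "
              "contradiction in the instructions above. Do not execute "
              "the task yet.")
```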
I didn't, and don't, assert that "asking the model if the instructions are good" is a replacement for evals – that's a strawman argument you seem to be constructing on your own and misattributing to me.
> At that point, asking the model to e.g. note any ambiguities about the task at hand is exactly equivalent to asking it to evaluate any input
This point is load-bearing for your position, and it is completely wrong.
Prompt P at state S leads to a new state SP'. The "common jumping off point" you describe is effectively useless, because we instantly diverge from it by using different prompts.
And even if it weren't useless for that reason, LLMs don't "query" their "state" in the way that humans reflect on their state of mind.
The idea that hallucinations are somehow less likely because you're asking meta-questions about LLM output is completely without basis.
Nicely put. I haven't seen anyone say that the introspection abilities of LLMs are up to much, but claiming that it's completely impossible to get a glimpse behind the curtain is untrue.
Is that based on your "deep understanding" of how LLMs work or have you actually tried it? If you watch the execution trace of a Skill in action, you can see that it's doing exactly this inspection when the skill runs - how could it possibly work any other way?
Skills are just textual instructions, LLMs are perfectly capable of spotting inconsistencies, gaps and contradictions in them. Is that sufficient to create a good skill? No, of course not, you need to actually test them. To use an analogy, asking a LLM to critique a skill is like running lint on C code first to pick up egregious problems, running testcases is vital.
You gotta love the randomly assigned scores, as if an LLM were actually able to measure anything. But then again, we now call a blob of text a "skill", so I guess it matches the overall bullshit pattern.
What does this even mean? It looks like typical LLM bloviation to me: 'skill design discipline', 'stronger descriptions' and 'lighter defaults'??!? This is meaningless pablum masquerading as advice.
What specifically would this cause you to actually do to improve the skills in question? How would you measure that improvement in a non hand-wavy way? What do these scores mean and how were they calculated?
Or perhaps you would ask your LLM how it would improve these skills? It will of course come up with some changes, but are they the right changes, and how would you know?
Great points, but I imagine it's a bit too heavy on the rigour requirement for the LLM crowd. The folks are high on this stuff and I am beginning to notice it's like trying to get a heavy pothead or crackhead off their stuff. Don't you see it - if you just wave your hands a lot, and tell the LLM to be serious about it, the scores will just appear :) It's true in their own frame of reference.
I'm not going to repeat myself, I've already explained the context to you - funny how you seem to have ignored that. If you want to find out, do the experiment yourself.
It’s all vibes based, we are not trying to be scientific here. /s
I discard most LLM advice and skills because either a script is better (as the work is routine enough) or it could be expressed better with bullet points (generating tickets).
Go even further, and add this into the skill-creator skill, and let the agent improve the skill regularly. I do this for determinism, and have my skills try to identify steps which can be scripted.
A timely link - I've just spent the last week failing to get a ChatGPT Skill to produce a reproducible management reporting workflow. I've figured out why, and this article pretty much confirms my conclusions about the strengths & weaknesses of "pure" LLMs, and how to work around them. This article is for a slightly different problem domain, but the general problems and architecture needed to address them seem very similar.
"SSD improves Qwen3-30B-Instruct from 42.4% to 55.3% pass@1 on LiveCodeBench v6"
I know virtually nothing about this area but my naive take is that something that means it still only passes tests around half the time doesn't seem like a particularly big jump forwards.
There's no shortage of benchmarks (coding or otherwise) that any competent coding model will now pass with ~100%.
But no-one quotes those any more, because if everyone passes them they don't serve any useful purpose in discriminating between different models or identifying advancements.
So people switch to new benchmarks which either have more difficult tasks or some other artificial constraints that make them in some way harder to pass, until the scores are low enough that they're actually discriminating between models. And a 50% score is in some sense ideal for that - there's lots of room for variance around 50%.
(whether the thing they're measuring is something that well correlates to real coding performance is another question)
So you can't infer anything in isolation from a given benchmark score being only 50%, other than that benchmarks are calibrated to make such scores the likely outcome.
Think of it less like a test suite and more like an exam. If you're trying to differentiate between the performance of different people/systems/models, you need to calibrate the difficulty accordingly.
When designing a benchmark, a pass rate of roughly 50% is useful because it gives you the most information about the relative performance of different models. If the pass rate is 90%+ too often, that means the test is too easy: you're wasting questions asking the model to do things we already know it can do, and getting no extra information. And if it's too low then you're wasting questions at the other end, trying to make it do impossible tasks.
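The statistics behind that intuition are simple: a single pass/fail observation has variance p(1-p), which peaks at p = 0.5, so each question discriminates hardest there. A quick illustrative check:

```python
# Per-question variance of a pass/fail outcome at various pass rates p.
# It is maximised at p = 0.5, which is why ~50% scores carry the most
# information about relative model performance.
for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    print(f"pass rate {p:.1f}: per-question variance {p * (1 - p):.2f}")
```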
Things have moved on since 1994 - not only can you still embed it in C and a load of other languages, you can even run it directly in your browser as there's a WASM port.
I would formulate it as: The software development lifecycle is inevitable, or you will not have any software. The lifecycle is just not acknowledged and thus implicit to many people. If you hack in Notepad, FTP it to your webserver, then your lifecycle lasts till you switch it all off. A simple lifecycle, but unavoidable to have one.
You are using mutexes, they are on the Actor message queues, amongst other places. "Just use mutexes" suggests a lack of experience of using them, they are very difficult to get both correct and scalable. By keeping them inside the Actor system, a lot of complexity is removed from the layers above. Actors are not always the right choice, but when they are they are a very useful and simplifying abstraction.
I've written a non-distributed app that uses the Actor model and it's been very successful. It concurrently collects data from hundreds of REST endpoints, a typical run may make 500,000 REST requests, with 250 actors making simultaneous requests - I've tested with 1,000 but that tends to pound the REST servers into the ground. Any failed requests are re-queued. The requests aren't independent, request type C may depend on request types A & B being completed first as it requires data from them, so there's a declarative dependency graph mechanism that does the scheduling.
I started off using Akka but then the license changed and Pekko wasn't a thing yet, so I wrote my own single-process minimalist Actor framework - I only needed message queues, actor pools & supervision to handle scheduling and request failures, so that's all I wrote. It can easily handle 1m messages a second.
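For flavour, a toy single-process actor in Python - this is a hypothetical sketch, not the framework described above, but it shows the same ingredients: a shared mailbox, a pool of worker threads draining it, and failed messages re-queued.

```python
import queue
import threading

class Actor(threading.Thread):
    """One actor = one thread draining a mailbox."""
    def __init__(self, inbox: queue.Queue, handler):
        super().__init__(daemon=True)
        self.inbox = inbox
        self.handler = handler

    def run(self):
        while True:
            msg = self.inbox.get()
            if msg is None:          # poison pill: shut the actor down
                break
            try:
                self.handler(msg)
            except Exception:
                self.inbox.put(msg)  # crude supervision: re-queue failures

# A pool of actors sharing one mailbox gives bounded concurrency,
# like the 250-actor request pool described above.
inbox = queue.Queue()
pool = [Actor(inbox, handler=print) for _ in range(4)]
for a in pool:
    a.start()
for i in range(10):
    inbox.put(f"request-{i}")
for _ in pool:
    inbox.put(None)
for a in pool:
    a.join()
```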
I have no idea why that's a "huge dead end", Actors are a model that's a very close fit to my use case, why on earth wouldn't I use it? That "nurseries" link is way TL;DR but it appears to be rubbishing other options in order to promote its particular model. The level of concurrency it provides seems to be very limited and some of it is just plain wrong - "in most concurrency systems, unhandled errors in background tasks are simply discarded". Err, no.
Big Rule 0: No Dogmas: Use The Right Tool For The Job.
Is there a similarly short/simple solution not using all of the built ins? Haven't worked with prolog in a while but should be easy enough with primitives (albeit with more duplication)?
He hates on C++ pretty much the same as he does on Rust. Your argument seems to be that Rust is better than C++, which is akin to trying to make the case that Cholera is better than Smallpox.
Language wars are boring and pointless, they all have areas of suckage. The right approach is to pick whichever one is the least worst for the job at hand.