That's nice, I've had the issue where LLMs would return non-existent uids. But does this package actually help with that? Token savings are nice, but not really my main concern. If this can measurably reduce hallucinations, it would be really useful.
> Where UUIDs cost ~23 tokens and get hallucinated by LLMs, id-agent produces memorable word-based IDs at ~14 tokens with equivalent collision resistance.
My gut feeling is that the hallucinations are caused by the entropy. A UUID has unlikely character sequences. But the entropy is a core feature. Turning the UUID into words keeps the same entropy, you just have surprising words instead of surprising hex sequences.
I would be surprised if this actually helped with hallucinations. Happy to be proven wrong though, and this seems like an easy experiment to run: just take a tiny model (below 1B) and have it transcribe a couple thousand ids in both formats, then check where it made more mistakes.
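A rough sketch of how that experiment could be scored, in Python. Everything here is an assumption for illustration: `transcribe` is a hypothetical callable wrapping whatever small model is under test, and `WORDS` is a tiny stand-in list rather than the package's real vocabulary, so the two formats do not carry equal entropy in this toy version.

```python
import random

# Hypothetical stand-in word list; a real run would use the package's own
# generator so both formats carry comparable entropy.
WORDS = ["badger", "yellow", "alternate", "copper", "meadow", "signal", "harbor", "violet"]


def make_hex_id(rng):
    """12 uppercase hex characters, like 0B9A26F3C74D."""
    return f"{rng.getrandbits(48):012X}"


def make_word_id(rng, n_words=3):
    """Hyphen-joined words, like badger-yellow-alternate."""
    return "-".join(rng.choice(WORDS) for _ in range(n_words))


def error_rate(ids, transcribe):
    """Fraction of IDs the model fails to reproduce exactly."""
    mistakes = sum(1 for original in ids if transcribe(original).strip() != original)
    return mistakes / len(ids)


def compare_formats(transcribe, n=2000, seed=0):
    """Run the same transcription task on both ID formats and report error rates."""
    rng = random.Random(seed)
    hex_ids = [make_hex_id(rng) for _ in range(n)]
    word_ids = [make_word_id(rng) for _ in range(n)]
    return {
        "hex": error_rate(hex_ids, transcribe),
        "words": error_rate(word_ids, transcribe),
    }
```

Passing a function that calls the model and returns its raw text output would give one error rate per format; repeating that across model sizes would give the comparison discussed in this thread.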
I had similar thoughts. The readme intro explicitly mentions hallucinations, that's why I thought I'd ask.
If you're dealing with uid in -> uid out, where you're hoping to get the same uid out, intuitively the entropy would be greatly reduced anyways. Then the question becomes, are words conducive to keeping input->output consistent, given the way LLMs work (e.g. attention mechanism)? I could see it go either way, that's why I'm supporting the idea of running your experiment.
But within the surprising words, the adjacent tokens are common. I can see an argument for having fewer transcription errors on badger-yellow-alternate than 0B9A26F3C74D.
Your test with small models makes tons of sense. Would be interesting to graph the two approaches against model size and recency.
We wrote a simple internal tool that looks at the transcript and replaces all UUID and BSON IDs with lower cardinality placeholders (e.g., id-1), including replacing them in the output, and it instantly brought down common error and hallucination rates. I figure this tool lets you apply semantic tokens to the IDs too, e.g. user-1 instead of id-1. Stuff like this is useful for my team because we only use small, fast, highly available models for bulk classification, so we measure error and hallucinations where we can.
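For anyone curious, the placeholder idea can be sketched in a few lines of Python. This is not the internal tool itself, just a minimal version under assumptions: the function names `mask_ids` and `unmask_ids` are made up, and IDs are detected with a plain regex for 8-4-4-4-12 hex UUIDs and 24-character hex ObjectIds.

```python
import re

# UUIDs (8-4-4-4-12 hex groups) or BSON ObjectIds (24 hex characters).
ID_PATTERN = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
    r"|\b[0-9a-fA-F]{24}\b"
)


def mask_ids(text, prefix="id"):
    """Replace each distinct ID with a short placeholder; return text and reverse map."""
    forward = {}  # real id -> placeholder

    def substitute(match):
        real = match.group(0)
        if real not in forward:
            forward[real] = f"{prefix}-{len(forward) + 1}"
        return forward[real]

    masked = ID_PATTERN.sub(substitute, text)
    reverse = {placeholder: real for real, placeholder in forward.items()}
    return masked, reverse


def unmask_ids(text, reverse, prefix="id"):
    """Restore real IDs in model output; collect placeholders the model invented."""
    unknown = []

    def restore(match):
        placeholder = match.group(0)
        if placeholder in reverse:
            return reverse[placeholder]
        unknown.append(placeholder)
        return placeholder

    pattern = re.compile(rf"\b{re.escape(prefix)}-\d+\b")
    return pattern.sub(restore, text), unknown
```

The `unknown` list is the useful part for measurement: any placeholder in the output that was never handed to the model is a hallucinated ID by construction. Passing `prefix="user"` gives the semantic variant (user-1 instead of id-1).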
Okay, but you can also validate uids. What I'm asking is whether the human readable uids cause fewer hallucinations, as that would be the real win imo.
Didn't expect it to get hammered like that, just added caching for the sheets request. Thanks, my guy ;)
Backfilling it further is definitely in the cards, I just want to stabilize the methodology first.
If a comment just mentions Opus without being more specific and in the absence of relevant context clues, it gets mapped to Opus Latest. So it's saying more about the model family than a specific version. Tbh I'll probably remove all "-latest" data points going forward, as I mentioned in another comment.
Yes! Going forward I'm definitely doing that, once there is enough data. Might even backfill the data more into the past. I just want to stabilize the methodology before burning more tokens.
And it's probably a good idea to create a list of model release dates, so older comments can't accidentally map to models that weren't released yet.
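A minimal sketch of that check, assuming each comment comes with a posting date. The model names and dates below are placeholders for illustration only, not real release data, and `plausible_models` is a made-up helper name.

```python
from datetime import date

# Placeholder names and dates for illustration only; not real release data.
RELEASE_DATES = {
    "model-a": date(2024, 1, 15),
    "model-b": date(2025, 6, 1),
}


def plausible_models(candidates, comment_date, release_dates=RELEASE_DATES):
    """Keep only candidate models that were already released when the comment was posted."""
    return [
        model
        for model in candidates
        if model in release_dates and release_dates[model] <= comment_date
    ]
```

Models missing from the table are dropped rather than guessed, so an incomplete list fails closed instead of letting an old comment map to a newer model.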
From the comments that I've checked manually it's pretty good. You can go to the "User Ratings" tab in the Google Sheet and check some comments to get an idea.
That's fair, my immediate concern would be that there would be very few comments comparing any two models, so the data would be very anecdotal.
The context would be really nice to have, but reading the comments myself, it often just isn't very clear what exactly users are building or which programming language they are using.
I think analyzing more comments is promising. If you get enough data, you can generalize across use cases and get more meaningful ratings. The obvious lever is including more posts, although it might hit diminishing returns. I'll play around with it.
For the context, I want to try giving Gemini a "scratch pad", where it can note down strengths and weaknesses per model that it finds in the comments. Something like "some users say that model x is good for writing tests". Then on each run, I let it update the scratch pad and publish the results as more of a qualitative analysis.
For the wording, I'd like to keep a certain amount of click bait, sorry ;)
Calling it sota might be a bit provocative, but what actually is the "state of the art"? We have benchmarks, but those are getting increasingly gamed and don't necessarily reflect the actual performance of a model, see Opus 4.7. So I think it's useful to have real world data from actual users as an additional data point.