Hacker News | yunusabd's comments

That's nice, I've had the issue where LLMs would return non-existent uids. But does this package actually help with that? Token savings are nice, but not really my main concern. If this can measurably reduce hallucinations, it would be really useful.

> Where UUIDs cost ~23 tokens and get hallucinated by LLMs, id-agent produces memorable word-based IDs at ~14 tokens with equivalent collision resistance.


My gut feeling is that the hallucinations are caused by the entropy. A UUID has unlikely character sequences. But the entropy is a core feature. Turning the UUID into words keeps the same entropy, you just have surprising words instead of surprising hex sequences.
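
A rough back-of-the-envelope sketch of that entropy point: a random (version 4) UUID carries 122 random bits, so a word-based ID with the same collision resistance has to pack the same number of bits into its words. The wordlist sizes below are illustrative assumptions only, not anything taken from id-agent.

```python
import math

UUID4_RANDOM_BITS = 122  # 128 bits minus 6 fixed version/variant bits


def words_needed(wordlist_size: int, bits: int = UUID4_RANDOM_BITS) -> int:
    """Smallest number of uniformly chosen words carrying at least `bits` of entropy."""
    bits_per_word = math.log2(wordlist_size)
    return math.ceil(bits / bits_per_word)


# Hypothetical wordlist sizes, chosen only for illustration.
for size in (2048, 7776, 65536):
    print(size, words_needed(size))
```

Whatever the wordlist, the total randomness stays the same; only the way it is split into tokens changes.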

I would be surprised if this actually helped with hallucinations. Happy to be proven wrong though, and this seems like an easy experiment to run: just take a tiny model (below 1B) and have it transcribe a couple thousand ids in both formats, then check where it makes more mistakes.
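
A minimal sketch of what such an experiment harness could look like. The `transcribe` callable is a hypothetical stand-in for "ask the small model to copy this ID back", and the tiny wordlist is made up for illustration; a fair comparison would use word IDs with entropy comparable to the UUIDs.

```python
import random
import uuid
from typing import Callable

# Tiny illustrative wordlist (an assumption); a real test would use a much larger one.
WORDS = ["badger", "yellow", "alternate", "copper", "violet", "harbor", "maple", "signal"]


def make_uuid(rng: random.Random) -> str:
    """Seeded random version-4 UUID, so runs are reproducible."""
    return str(uuid.UUID(int=rng.getrandbits(128), version=4))


def make_word_id(rng: random.Random, n_words: int = 3) -> str:
    return "-".join(rng.choice(WORDS) for _ in range(n_words))


def error_rate(ids: list[str], transcribe: Callable[[str], str]) -> float:
    """Fraction of IDs that are not copied back exactly."""
    wrong = sum(1 for i in ids if transcribe(i).strip() != i)
    return wrong / len(ids)


def run_experiment(transcribe: Callable[[str], str], n: int = 2000, seed: int = 0) -> dict[str, float]:
    rng = random.Random(seed)
    uuids = [make_uuid(rng) for _ in range(n)]
    word_ids = [make_word_id(rng) for _ in range(n)]
    return {
        "uuid": error_rate(uuids, transcribe),
        "words": error_rate(word_ids, transcribe),
    }
```

Plugging in a function that prompts the model and returns its raw answer would give one error rate per format, which could then be compared across model sizes.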


I had similar thoughts. The readme intro explicitly mentions hallucinations, that's why I thought I'd ask.

If you're dealing with uid in -> uid out, where you're hoping to get the same uid out, intuitively the entropy would be greatly reduced anyways. Then the question becomes, are words conducive to keeping input->output consistent, given the way LLMs work (e.g. attention mechanism)? I could see it go either way, that's why I'm supporting the idea of running your experiment.


But within the surprising words, the adjacent tokens are common. I can see an argument for having fewer transcription errors on badger-yellow-alternate than 0B9A26F3C74D.

Your test with small models makes tons of sense. Would be interesting to graph the two approaches against model size and recency.


We wrote a simple internal tool that looks at the transcript and replaces all UUID and BSON IDs with lower cardinality placeholders (e.g., id-1), including replacing them in the output, and it instantly brought down common error and hallucination rates. I figure this tool lets you apply semantic tokens to the IDs too, e.g. user-1 instead of id-1. Stuff like this is useful for my team because we only use small, fast, highly available models for bulk classification, so we measure error and hallucinations where we can.
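
To illustrate the general idea (this is not the internal tool itself, just a minimal sketch under assumptions), placeholder substitution could look roughly like this. The regexes, function names, and the `id-N` placeholder format are all hypothetical choices.

```python
import re

# Canonical UUIDs (8-4-4-4-12 hex) or 24-hex-character BSON ObjectIds.
ID_PATTERN = re.compile(
    r"\b(?:[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
    r"|[0-9a-fA-F]{24})\b"
)
PLACEHOLDER_PATTERN = re.compile(r"\bid-\d+\b")


def mask_ids(text):
    """Replace each distinct ID with id-1, id-2, ...; return masked text and a reverse map."""
    forward = {}  # real id -> placeholder

    def repl(match):
        real = match.group(0)
        if real not in forward:
            forward[real] = f"id-{len(forward) + 1}"
        return forward[real]

    masked = ID_PATTERN.sub(repl, text)
    reverse = {placeholder: real for real, placeholder in forward.items()}
    return masked, reverse


def unmask_ids(text, reverse):
    """Restore real IDs in model output; unknown placeholders are left untouched."""
    return PLACEHOLDER_PATTERN.sub(lambda m: reverse.get(m.group(0), m.group(0)), text)
```

A side benefit of this design is that a placeholder the model made up (one that is missing from the reverse map) is trivial to detect before it reaches anything downstream.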

Yes, we have the validation methods to verify the output. https://github.com/vostride/id-agent/#validateid

Random "-"-separated words will fail the validation check.


Okay, but you can also validate uids. What I'm asking is whether the human readable uids cause fewer hallucinations, as that would be the real win imo.

It seems like the right solution is around the corner: placeholders for these kinds of strings (uuid, hash, etc)

Why should an LLM even have these types of IDs anywhere in the prediction pipeline?


Didn't expect it to get hammered like that, just added caching for the sheets request. Thanks, my guy ;)

Backfilling it further is definitely in the cards, I just want to stabilize the methodology first.

If a comment just mentions Opus without being more specific and in the absence of relevant context clues, it gets mapped to Opus Latest. So it's saying more about the model family than a specific version. Tbh I'll probably remove all "-latest" data points going forward, as I mentioned in another comment.


> If a comment just mentions Opus without being more specific and in the absence of relevant context clues, it gets mapped to Opus Latest

Consider keeping this data point but instead calling it something like "Opus Unspecified". Let the user decide how to interpret it.
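
As a sketch of how that labeling could work (the regex, the family list, and the `-unspecified` suffix are assumptions for illustration, not the site's actual mapping logic):

```python
import re

# Hypothetical pattern: a family name, optionally followed by a version number.
MENTION = re.compile(r"\b(opus|sonnet|gpt)[\s-]*(\d+(?:\.\d+)?)?", re.IGNORECASE)


def normalize_mention(text):
    """Map the first model mention to 'family-version', or 'family-unspecified' if no version is given."""
    match = MENTION.search(text)
    if match is None:
        return None
    family = match.group(1).lower()
    version = match.group(2)
    return f"{family}-{version}" if version else f"{family}-unspecified"
```

This naive version would misread something like "Opus 3 times in a row" as a version, so context clues would still need separate handling.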


you prob just want to map ALL opuses to "opus-all" or something - do we really care about 4.5 vs 4.6 vs 4.7, we just want to see the trendline over time


There is one mention of Mimo V2.5 Pro in the data by... you! In the UserRatings tab in the sheet, if you want to have a look.

Searching for it on HN shows very few results, that's why it's not showing up in the analysis yet. But it might in the future, once it gains traction.

I'll keep an eye on it, thanks for bringing it up!

https://news.ycombinator.com/item?id=47911464


Yes! Going forward I'm definitely doing that, once there is enough data. Might even backfill the data more into the past. I just want to stabilize the methodology before burning more tokens.

And it's probably a good idea to create a list of model release dates, so older comments can't accidentally map to models that weren't released yet.
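
A minimal sketch of that release-date guard. The model names and dates below are placeholders for illustration, not real release dates, and `plausible_models` is a hypothetical helper name.

```python
from datetime import date

# Placeholder entries for illustration only; NOT real models or release dates.
RELEASE_DATES = {
    "model-a": date(2025, 1, 15),
    "model-b": date(2025, 9, 1),
}


def plausible_models(candidates, comment_date, release_dates=RELEASE_DATES):
    """Keep only candidates that were already released when the comment was posted."""
    return [
        model
        for model in candidates
        if model in release_dates and release_dates[model] <= comment_date
    ]
```

Candidates missing from the table are dropped rather than guessed at, which keeps the guard conservative.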


From the comments that I've checked manually it's pretty good. You can go to the "User Ratings" tab in the Google Sheet and check some comments to get an idea.


Yeah, so often people just mention "Opus" or "GPT" without a version, and those get mapped to the "-latest" suffix.

I thought I'd keep these as a rating for model families rather than specific models. But tbh it's probably better to remove them, too confusing.


That's fair, my immediate concern would be that there would be very few comments comparing any two models, so the data would be very anecdotal.

The context would be really nice to have, but reading the comments myself, it often just isn't very clear what exactly users are building or which programming language they are using.

I think analyzing more comments is promising. If you get enough data, you can generalize across use cases and get more meaningful ratings. The obvious lever is including more posts, although it might hit diminishing returns. I'll play around with it.

For the context, I want to try giving Gemini a "scratch pad", where it can note down strengths and weaknesses per model that it finds in the comments. Something like "some users say that model x is good for writing tests". Then on each run, I let it update the scratch pad and publish the results as more of a qualitative analysis.

For the wording, I'd like to keep a certain amount of click bait, sorry ;)


Thanks for the comment, should be fixed now.


Thanks, I replaced it with a custom graph, should be easier to read now.


Calling it sota might be a bit provocative, but what actually is the "state of the art"? We have benchmarks, but those are getting increasingly gamed and don't necessarily reflect the actual performance of a model, see Opus 4.7. So I think it's useful to have real world data from actual users as an additional data point.


Maybe you shouldn't be relying on something if you can't even tell how good it is?

