The quantitative UX research team at Google was created for exactly this problem: a service that became popular before the right metrics existed, meaning the metrics had to be derived first, then optimized. We would observe users IRL, read their logs, generate experiments to improve the behavior as measured by those logs, then return to see whether the experiment improved the IRL experience. There were never many of us, and we're still around :)
I've worked with Boris in the past, and in my experience he cares deeply about the customer. I'd vouch that he genuinely cares about the issues people are running into.
The idea is that Claude Code is surprisingly buggy and unrefined for something created by the very tool and processes that are supposed to be replacing us as we speak.
Sure they can. The solution is pretty simple and in your own post. Choose either:
* Make the product good to the point code is no longer slop and shit.
* Stop hyping the quality when it isn’t there.
* Do a hybrid approach. Use their own product but actually have competent humans in the loop to make the code good.
This is not hard. Be honest and humble and that criticism goes away. It's no one's fault but Anthropic's that they hype their product beyond what it can do and use it carelessly to build itself. It's not a no-win scenario if you're the one causing your own obviously avoidable problems.
If you mean Google website login, that step is needed because the email address is used to determine which identity provider to use. E.g. I have three different accounts that branch off from that same initial login flow.
One is my personal "gmail.com" account, and the other two go through enterprise identity providers tied to my employment and their G-Suite licenses. So after I put in one of those three email addresses, I get prompted for the appropriate next step. Only one of them involves giving a password to a Google server. The other two are redirects to completely separate login systems operated by my employer.
I mean I get it logically makes sense. But it still seems like a waste of time for a small percentage of use cases.
Maybe a better approach: you type in your login, the form automatically detects whether that address requires an identity provider, grays out the password field to signal that no password is needed, and redirects automatically.
Less clicking, no broken flow, a smoother solution overall.
Short answer: I use multiple metrics and never rely on just one.
Long answer: Is the metric for people with subject-matter knowledge? Then (weighted) RMSSE, or the MASE alternative for a median forecast. WRMSSE is very nice: it can deal with zeroes, is scale-invariant, and is symmetrical in penalizing under- and over-forecasting.
The above metrics are completely uninterpretable to people outside the forecasting sphere, though. For those cases I tend to just stick with raw errors; if a percentage metric is really necessary, then a weighted MAPE/RMSE. The weighting is still graspable for most people, and it doesn't explode with zeroes.
I've also been exploring FVA (Forecast Value Added), compared against a second decent forecast. FVA is very intuitive, provided your baseline measures are reliable. Aside from that, I always look at forecast plots. It's tedious, but they often tell you a lot that gets lost in the numbers.
RMSLE I haven't used much. From what I've read it looks interesting, though more for very specific scenarios (many outliers, high variance, nonlinear data?).
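For illustration, a minimal NumPy sketch of two of the metrics mentioned above, MASE and weighted MAPE, using one common formulation of each (the toy arrays are made up):

```python
import numpy as np

def mase(actual, forecast, train):
    """Forecast MAE scaled by the in-sample MAE of a naive lag-1 forecast.
    Values below 1 mean you beat the naive baseline; zero actuals are fine."""
    naive_mae = np.mean(np.abs(np.diff(train)))
    return np.mean(np.abs(actual - forecast)) / naive_mae

def wmape(actual, forecast):
    """Weighted MAPE: total absolute error over total actuals.
    Unlike plain MAPE it never divides by an individual (possibly zero) actual."""
    return np.sum(np.abs(actual - forecast)) / np.sum(np.abs(actual))

train = np.array([10.0, 12.0, 9.0, 11.0, 10.0])   # history used for scaling
actual = np.array([10.0, 0.0, 12.0])              # note the zero actual
forecast = np.array([11.0, 1.0, 10.0])

print(mase(actual, forecast, train))   # ~0.67: better than the naive baseline
print(wmape(actual, forecast))         # ~0.18, despite the zero actual
```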
MAPE can also be a problem when rare excursions are what you want to predict and the cost of missing an event is much higher than the cost of predicting a non-event. A model that just predicts no change will have a very low MAPE, because most of the time nothing happens. When the event does happen, however, the error of predicting the status quo ante is far worse than the small baseline errors.
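A quick illustration of that failure mode, with made-up numbers (a flat series at 100 with a handful of rare spikes to 1000):

```python
import numpy as np

rng = np.random.default_rng(0)

# Mostly-flat series with 10 rare, large excursions
actual = np.full(1000, 100.0)
actual[rng.choice(1000, size=10, replace=False)] = 1000.0

# A "predict no change" model that always forecasts the quiet baseline
naive = np.full(1000, 100.0)

mape = np.mean(np.abs((actual - naive) / actual)) * 100
print(f"MAPE of the do-nothing model: {mape:.1f}%")  # 0.9% -- looks great, misses every event
```

The metric looks excellent precisely because the events it failed on are rare.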
Thanks for the reply! I am outside the forecasting sphere.
RMSLE gives proportional error (so, scale-invariant) without MAPE's systematic under-prediction bias. It does require all-positive values, for the logarithm step.
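To make those two properties concrete, a toy single-point comparison (my own example, using plain `log` to show the exact ratio symmetry; RMSLE as usually defined uses `log(1+x)`, which is why it needs non-negative values):

```python
import numpy as np

actual = 100.0

# Absolute percentage error: under-prediction is capped at 100%, while
# over-prediction is unbounded -- so optimizing MAPE rewards forecasting low
ape_under = abs(actual - 50.0) / actual    # predicted half   -> 0.5
ape_over  = abs(actual - 200.0) / actual   # predicted double -> 1.0

# Squared log error scores the same *ratio* miss equally in both directions
sle_under = (np.log(50.0) - np.log(actual)) ** 2
sle_over  = (np.log(200.0) - np.log(actual)) ** 2
print(ape_under, ape_over)   # 0.5 vs 1.0: asymmetric
print(sle_under, sle_over)   # both ~0.48 (= ln(2)^2): symmetric
```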
Is it, even when applied to trivial classifiers (possibly "classical" ones)?
I feel that we're wrong to be focusing so much on the conversational/inference aspect of LLMs. The way I see it, the true "magic" hides in the model itself. It's effectively a computational representation of understanding. I feel there's a lot of unrealized value hidden in the structure of the latent space itself. We need to spend more time studying it, make more diverse and hands-on tools to explore it, and mine it for all kinds of insights.
I recommend "Augustus: First Emperor of Rome" by Adrian Goldsworthy.
Given that "democratic" didn't mean the same thing then that it does now (with suffrage limited to a small group of the uber-rich), and that some of the problems he was fixing were "the threat of this army I happen to have" and "this war I actively participated in", I don't think it is wrong. He wasn't in Rome when the Senate awarded him power and the Vestal Virgins drank in his name, which isn't something that could have been commanded.
After decades of war and strife and food shortages, peace under one warlord looked more appealing than having three who would likely eventually be at each other's throats.
Well, first, you have to consider that the Roman Republic was never really democratic; it was in the hands of a small aristocracy which sometimes, but rarely, had the interests of the Romans at heart at all, let alone those of the non-citizen inhabitants.
But even then, it is definitely not accurate. Augustus gained power through the numerous conflicts that followed Caesar's murder, at a time when the Republic was already challenged, thanks to legions he more or less inherited (an oversimplification) from Caesar. He was given powers by the Senate through what we would call rubber-stamping, and only after his military power was inescapable.
You have to do something with your life anyway, right? I always envied people who have a calling they are good at and work on essentially until they die (especially in academia and art), since I'm not sure if I have one and if I do (designing 4x God sim games?) I'm unlikely to be paid for it even if I was good, which is itself also unlikely.
Then there's also the case where following your passion is near impossible without a large organization, anything from space to medicine.
But even forgetting all that, there is no reason engineering challenges, team dynamics and sense of accomplishment at a work project can't be higher than for the personal projects you'd do by yourself. Granted, most jobs aren't like that (for myself or for most people) but some of my most challenging and exciting projects were at work.
If you're gonna spend time until you die doing tech things you might as well get paid for it. The less you need the latter the pickier you can be, with your own thing becoming /another option/ at some point.
Disagree with the first piece about only using the top 0.1%. I grew up (through my 20s) shooting on a Pentax K1000, a cheap workhorse of a camera, and I preferred its ergonomics to the top-end mirrorless cameras I use today.
The K1000 is generally considered among the best film SLRs ever made, especially for the price, and easily falls into the top 0.1% category in my mind. There's a reason why it was in continuous production for 20 years with hardly any changes to its design.