I think the fair side of this is that you have to make tradeoffs when you design things. Scaling problems are design problems, but whether they were mistakes or not really depends on how predictable that scaling was.
Car analogies are traditional, so I'll add one here.
My car can take the four of us, and we can load it up with things from the shops. I can put a bunch of heavy tins of food in there, or some DIY things, but if I put several tons of stones in the boot it'll totally fuck it up.
Is that a design problem?
Not really, it's a relatively cheap regular car, and it failed at a certain scale.
It would be a design problem if it were a flatbed truck, despite it being the same scaling that showed the problem.
Making my car resilient enough to take that weight would require tradeoffs that would either make it worse for other jobs I want it to do or at least add significantly to the cost.
This is similar in engineering software systems too, you can make it handle scaling up better, but this can require a much more complex architecture that may make it slower at smaller scales. It can make it more complicated to work with, add additional risks of failure as well.
Isn't this the massive problem, though? You're trying to do everything for everyone all at once, you can't, and you've tied it all together so tightly that scaling up gets worse. If it's more than twice as hard to cope with twice the use, then you have to charge customers a bunch more as you grow - and your customers get no actual benefit for it.
> GitHub is not perfect but I don't think it's "degraded faster" at all. It's _grown_ faster.
The experience has degraded. It's really, really bad. I've seen companies spending thousands and thousands of dollars weekly in developer time *hitting rerun on broken actions*. It's expensive to start with, and then expensive again in how awful it is to use.
Something I really don't get I guess is what out of all of this actually needs to be cross-project. How much of my github use needs access to something that isn't running on the same machine? I worked with a team building things actively, maybe 20 devs? That's not really a large set of users. Let's say 10 devs with the workload of 20, the cheapest plan would be $40/mo, enterprise would be ~$200. Would ten heavy users really max out a 64GB ram, 6+8 core new i5 with dual nvme drives, a gigabit connection and unlimited traffic? That's about $40 at hetzner for a box.
I'm not arguing for some big federated position, I just don't really get why some of these enormous companies need to be so centralised. It feels like the problem is trying to be one big interlinked thing, and failing at it. The only things I can think of that need to be global are
1. Links between issues
2. Accounts
3. Search
The first is mostly solved with literally just links, accounts aren't a huge problem, and search is fair enough - but search is already so awful that I can't reliably find things within a single repo or organisation, so the global stuff is irrelevant.
> And it's had to expand into the AI field, which is not an incremental thing like "hey let's launch a new feature or better dashboards." Nobody knows what AI wants to be when it grows up
If github persists in being utterly shit for developers, it won't be around to find out. I'm not sure at all what part of the AI stuff needs to make everything else bad, and I'm extremely bullish on AI and agentic coding.
To really hammer this last point home: as agentic coding means we can do a lot more, faster, the unreliability of GitHub has become much more apparent and impactful. Unreliable tests, unreliable code pulling and pushing, unreliable diffs. You're making the agents' jobs harder and the devs' jobs harder exactly in the place they now spend much more of their time.
It makes github dramatically more expensive as a place to work. Also just really fucking annoying.
Buying hardware is paying a "random corporation" too. Make the massive hardware purchase after finding out whether you have enough demand to buy rather than rent.
And even less than someone who wrote an interpreter for the script, less than someone who also chanted times tables while doing it.
More thinking isn’t a simple good thing. Given a limit to how much thought I can give any specific task, adding extra work may mean less where it’s most useful.
It is a good-faith argument; my point is exactly that the actual scripting was no more part of the relevant thought than the interpreter would have been.
It depends if the interesting part of the solution is the website for you. Maybe it is and that’s fine but for others it isn’t. Maybe they’ve got a cool backend thing and the ui isn’t the key part.
If it helps to compare: you might genuinely want to manage a tricky server and all its various parts. It'd be removing the fun to just put a site on GitHub Pages rather than hosting it on a PDP-11. But if you want to show off your demoscene work, you wouldn't feel you'd missed out on the fun by just putting it up on a regular site.
I had a look and they seemed to be about £15 here; I couldn't easily find second-hand ones in the UK (though they're not uncommon at shops). For £40 I can get a 7.5 inch black and white screen setup (trmnl byod xaio https://www.aliexpress.com/item/1005009532501677.html)
Lots of the tags I see though do have Bluetooth or maybe WiFi for updating as well.
I do really like eink things. I want to set up a nice 13 inch one, which is now more like £160, so it's becoming more realistic for me to buy for fun.
I’m going to have to look more into these tags because if there’s cheap second hand ones they’d be awesome.
The other explanation is that these are often just the mistakes that occur when a team of experts in their field - but not in data management, and without a budget for building a more robust system - does a lot of things with data manually. It's so easy to copy and paste something into the wrong place, to sort by one field and get things out of order, all kinds of issues like that.
On the other hand, any time a hypothesis appears significant, the first reaction should be to verify that all the data going into the calculation is correct, rather than just assume it is. In my day-to-day industry experience, significant results come far more often from incorrect data than an actual discovery.
Even years after moving away from raw data work, way too much of my brain is still dedicated to "ways of dealing with CSV from random places".
I can already hear the people who like CSV coming, so to get some of my bottled-up anger about CSV out and to forestall the responses I've seen before:
* It's not standardised
* Yes, I know you found an RFC written long after many generators and parsers already existed. It's not a standard, it's regularly not followed, and it doesn't specify UTF-8 (lmao, in 2005 no less) or any other character set - files are just bytes. I have learned about many new character sets from data submitted by real users. I have had to split up files written in multiple different character sets because users concatenated files.
* "You can edit it in a text editor", which feels like a monkey's-paw wish: "I want to edit the file easily" - "Granted: your users can now edit the files easily". Users editing the files in text editors produces broken CSV, because your text editor isn't checking that the result is standards-compliant or correctly typed, and couldn't even if it wanted to.
* Errors are not even detectable in many cases.
* Parsers are often either strict and so fail to deal with real world cases or deal with real world cases but let through broken files.
* Literally no types. Nice date field you have there, shame if someone were to add a mixture of different dd/mm/yy and mm/dd/yy into it.
* You can blame excel for being excel, but at some point if that csv file leaves an automated data handling system and a user can do something to it, it's getting loaded into excel and rewritten out. Say goodbye to prefixed 0s, a variety of gene names, dates and more in a fully unrecoverable fashion.
* "Ah, just use tabs" - no, your users will put tabs in. "That's why I use pipes" - yes, pipes too. I have written code to use the actual unit and record separators that exist in ASCII, and still a user found some way of adding them mid-word in some arbitrary data. The only three places I've ever seen these characters are 1. lists of ASCII characters, where I found them; 2. my code; 3. this user's data. It must have been crafted deliberately to break things.
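For anyone who hasn't met them: the separators I mean are the ASCII control characters US (0x1F, unit separator) and RS (0x1E, record separator). A minimal Python sketch of the approach (the helper names are mine, not any library's) - which works beautifully right up until a user types the control characters in by hand:

```python
# ASCII control characters designed exactly for this job:
FIELD_SEP = "\x1f"   # US, unit separator
RECORD_SEP = "\x1e"  # RS, record separator

def encode(records):
    """Join fields with US and records with RS. No quoting or escaping is
    needed, as long as the data itself never contains these characters."""
    return RECORD_SEP.join(FIELD_SEP.join(fields) for fields in records)

def decode(blob):
    """Split records on RS, then fields on US."""
    return [record.split(FIELD_SEP) for record in blob.split(RECORD_SEP)]

# Commas, pipes, tabs and newlines in the data are all harmless:
rows = [["id", "note"], ["1", "commas, pipes | and\ttabs are fine"]]
assert decode(encode(rows)) == rows
```

The catch described above still applies: nothing stops a determined user smuggling a literal 0x1F into a field, at which point you're back where you started.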
This, Excel, and other things are enormous issues. The fact that there are any manual steps along the path introduces so many places for errors. People write things down, then enter them into Excel or whatever. Data gets moved between files. You ran some analysis and got graphs - are those the ones in the paper? Are they based on the same datasets? You later updated something - are all the downstream things updated?
This occurs in all kinds of papers. I've seen clear and obvious issues in datasets covering many billions in spending, trillions in aggregate. I can only assume the same is true in many other fields, as those processes exist there too.
There is so much scope to improve things, and yet so much of this work is done by people who don't know what the options are, often working late hours in personal time, so it's rarely addressed. My wife was still working, unpaid, on papers for a research position years after she had left it, because the whole research-to-publication process is so slow. What time is there then for learning and designing a better way of tracking and recording data, and for teaching all the other people how to update it and generate stats? I built things which helped, but there's only so much of the workflow I could manage.
While I appreciate a good rant just as much as the next person, most of these points have nothing to do with CSV. They are a general problem with underspecifying data, which is exactly what happens when you move data between systems.
The number of hours I have wasted on unifying character sets across single database tables is horrifying to even think about. And the months it took to make usable an important national dataset that supposedly many people use, across several types of business, was staggering. The fact that that XML came with a DTD was apparently no hindrance to doing unspeakable horrors with both attributes and CDATA constructs.
Sure, you can specify MM/DD/YY in a table, but if people put DD/MM/YY in there, what are you going to do about it? And that's exactly what happens in the real world when people move data across systems. That's why mojibake is still a thing in 2026.
I disagree: they are absolutely related to CSV, in that these are all problems CSV has. Other formats can have these problems, but CSV is almost uniquely bad because it has a lot of them and they compound.
> They are a general problem with underspecifying data,
Which CSV provides essentially no tools to solve, unlike many other formats.
Also, several of these problems are not even about underspecified data but about the format itself - you can have totally fine data which gets utterly fucked, to the point of not parsing as a CSV file, by minor changes.
It's not even a fully specified format! Someone adds a comma in a field and then one of the following happens:
* Something generating the csv doesn't add quotes
* Something reading the csv doesn't understand quotes
And the classic
* Something sorted the file
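The comma-in-a-field failure is trivially easy to reproduce. A small Python sketch (the example row is made up) contrasting a naive join-on-commas generator with a quoting-aware one:

```python
import csv
import io

row = ["ACME Ltd", "London, UK", "100"]

# Failure mode 1: the generator just joins on commas, no quoting.
naive = ",".join(row)           # 'ACME Ltd,London, UK,100'
# A reader now sees four fields instead of three - the record is corrupted:
assert naive.split(",") == ["ACME Ltd", "London", " UK", "100"]

# A quoting-aware writer/reader pair round-trips the same row correctly:
buf = io.StringIO()
csv.writer(buf).writerow(row)   # emits 'ACME Ltd,"London, UK",100'
parsed = next(csv.reader(io.StringIO(buf.getvalue())))
assert parsed == row
```

Failure mode 2 is the mirror image (a reader that splits on commas without understanding quotes mangles correctly quoted output), and the sorting classic follows from quoted fields being allowed to contain newlines: records are not lines, so a line sort tears records apart.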
> Sure, you can specify MM/DD/YY in a table, but it people put DD/MM/YY in there, what are you going to do about it?
If you've got something with actual date types, you can have interfaces show actual calendars, and for many formats you will at least get an error if the field is defined as DD/MM/YY and someone puts in 01/13/26. CSV gives you no ability to do this - all data is just strings. And string dates with no restrictions are why I have had to deal with mixtures of 01/13/26 and 13/01/26, meaning everything goes just fine until you try to parse it. Or, like one of my personal favourites, "Winter 2019".
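To sketch what a declared date type buys you, using Python's stdlib (the helper name is mine; the `strptime` behaviour isn't):

```python
from datetime import date, datetime

def parse_ddmmyy(s: str) -> date:
    """Parse a field declared as DD/MM/YY. With a real declared format,
    out-of-range values fail loudly at entry time, not at analysis time."""
    return datetime.strptime(s, "%d/%m/%y").date()

# A valid DD/MM/YY value parses as expected:
assert parse_ddmmyy("13/01/26") == date(2026, 1, 13)

# "01/13/26" was presumably meant as MM/DD/YY; against a declared
# DD/MM/YY format it's rejected immediately, because month 13 is invalid:
try:
    parse_ddmmyy("01/13/26")
    raise AssertionError("should have been rejected")
except ValueError:
    pass
```

Note the limit: a genuinely ambiguous value like 05/06/26 parses fine under both conventions, so a declared format catches the mixture only where the day exceeds 12 - which is still infinitely more than a bare string column catches.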
CSV is not one format, lacks verification of any useful kind, is almost uniquely easy for users to completely fuck up, and the lack of types means that programs do their own type inference which adds to things getting messed up.
You're blaming a lot of normal ETL problems on DSVs.
Like, specifying date as a type for a field in JSON isn't going to ensure that people format it correctly and uniformly. You still have parsing issues, except now you're duplicating the ignored schema for every data point. The benefit you get for all of that overhead is more useful for network issues than ensuring a file is well formed before sending it. The people who send garbage will be more likely to send garbage when the format isn't tabular.
There are types and there is a spec WHEN YOU DEFINE IT.
You define a spec. You deal with garbage that doesn't match the spec. You adjust your tools if the garbage-sending account is big. You warn or fire them if they're small. You shit-talk the garbage senders after hours to blow off steam. That's what ETL is.
DSVs aren't the problem. Or maybe they are for you because you're unable to address problems in your process, so you need a heavy unreadable format that enforces things that could be handled elsewhere.
We are talking here in the context of scientific datasets. Of course ETL plays a part. But here it is really the interplay of Excel with CSV, which is often output by scientific instruments or scientific assistants.
You get your raw sensor data as a CSV and just want to take a look in Excel. It understandably mangles the data in an attempt to infer column types - because of course it does, it's CSV! Then you mistakenly hit save and boom, all your data on disk is now an unrecoverable mangled mess.
Of course this is also the fault of not having good clean data practices, but with CSV and Excel it is just so, so easy to hold it wrong, simply because there is no right way to hold it.
> so you need a heavy unreadable format
I prefer human unreadable if it means I get machine readable without any guesswork.
No, it's Excel trying to be too clever. It does the same thing with manual input if you don't proactively change the field type.
You can import a DSV into Excel without mangling datatypes in a few different ways. Probably the best way is using Power Query.
A DSV generally does have a schema. It's just not in the file format itself. Just because it isn't self-describing doesn't mean it isn't described. It just means the schema is communicated outside of the data interchange.
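As a sketch of that out-of-band approach (the field names and types here are made up for illustration): the sender and receiver agree a schema once, and the receiver applies it on every load, so a CSV stays cheap on the wire while still getting typed, validated values out:

```python
import csv
import io
from datetime import datetime

# Hypothetical schema, agreed outside the file itself - it is NOT part of
# the CSV; it lives in documentation, a contract, or shared code.
SCHEMA = {
    "id": int,
    "price": float,
    "shipped": lambda s: datetime.strptime(s, "%Y-%m-%d").date(),
}

def load(text: str) -> list[dict]:
    """Parse a CSV feed and apply the shared schema; a row that violates
    the schema raises instead of silently passing through as strings."""
    reader = csv.DictReader(io.StringIO(text))
    return [{k: SCHEMA[k](v) for k, v in row.items()} for row in reader]

rows = load("id,price,shipped\n7,19.99,2026-01-13\n")
assert rows[0]["id"] == 7 and rows[0]["price"] == 19.99
```

The obvious weakness, as the rest of this thread argues, is that every consumer has to know the schema exists and bother to apply it; Excel double-clicks don't.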
If you get an .xls which doesn't have very esoteric functions, I expect it to open about the same way in any Excel program and any other office suite.
With CSV I do not have that expectation. I know that for some random user-submitted CSVs, I will have to fiddle. Even if that means finding the one row in thousand rows which has some null value placeholder, messing up the whole automatic inference.
No. That's not at all what I'm saying. I am saying that a fixed CSV file will open differently depending on the program you open it with.
You don't even need to transfer it. Opening a CSV in pandas can differ from opening it with Polars, which can differ from DuckDB, which can differ from Excel.
You've got no guarantees. There's no spec, and how edge cases (if you want to call serialising and deserialising a float an edge case) are handled is open to the implementation.
It's both of their faults. CSV is not blameless here: Excel is doing broadly what users expect - dates as dates and numbers as numbers, not everything as strings. If CSV had types, Excel would not have to guess what they are.
It does have types if you define them in the schema. Not every format needs to be self-describing. It's often more efficient to share the schema once outside of the data feed than have the overhead of restating it for every data point.
It's completely Excel's fault for pushing their type-inference and making it difficult for users to define or supply their own.
Power Query does a better job handling it, but you should be able to just supply a schema on import, like you can with Polars or DuckDb.
It's another example of MS babying their userbase too much. Like how VBA is single threaded only because threads are hard. They're making their product less usable and making it harder for their users to learn how stuff works.
CSV doesn’t have a schema; it has a barely adhered-to, post-hoc “not a specification”, and everything is strings.
That you can solve some of these problems by using something alongside the CSV file is not anywhere near as helpful, and the need to is itself a clear problem with CSV. There is no universally followed schema format, for a start, so now we’re at unique solutions all over the place.
> It's often more efficient to share the schema once outside of the data feed than have the overhead of restating it for every data point.
You cannot be suggesting that CSV files are efficient, surely - they’re atrociously inefficient. Having the same format with a tied-in schema would solve a lot and add barely any overhead. If you want efficiency, do not use CSV.
Asking users to manually load the right schema every time they open a file is asking for trouble. Why wouldn’t you combine them?
> It's completely Excel's fault for pushing their type-inference and making it difficult for users to define or supply their own.
It’s not entirely Excel’s fault that CSV doesn’t have types. They didn’t invent and promote a new standard, but then why would you? There are better formats out there. I’m sure they would argue that Excel files are a better format, for a start.
And people did make better formats. That’s why I think csv should be consigned to the bin of history.
> "You can edit it in a text editor" which feels like a monkeys-paw wish
Yes :) Although I will note that some editors are good enough to maintain the structure as the user edits. Consider Emacs with `csv-mode`, for example. Of course most users don’t have Emacs so they’ll just end up using notepad (or worse, Word).