Hacker News | tensor's comments

That depends entirely on where you are. In Ontario, electricity is mostly hydro, nuclear, and renewables. But even compared to burning gas directly in the car, EVs are more efficient: burning that gas in a power plant to charge the EV uses less gas overall.

Thank goodness Canada doesn't use its past mistakes as a bar that it's ok to go back to.

That's not true. There were always different roles for older people. They didn't just keep doing the same job their whole lives.

And people who were injured to the point where they couldn't "work" anymore were still cared for by their community.

I mean, that just isn't true. There are Amazon tribes today where they just send them down the river to die... your ideas are a Disney-fied version of a false past that never existed.

They're right. We've found remains that show how thousands of years ago people took care of people that would have died without external assistance.

https://phys.org/news/2025-10-ancient-patagonian-hunter-disa...


Unspecified Amazon tribes don't represent the lion's share of historical treatment of aging populations. One negative example doesn't undermine the point.

Yes, humanity is full of various societies that do things differently. These ideas aren't Disney-fied - they're just accurate representations of the fact that people care for each other, most of the time.

I appreciate your anecdote, but here are a few counter-examples:

- Neanderthals took care of their elderly: https://theconversation.com/neanderthals-cared-for-each-othe...

- Neanderthals took care of a child that likely had a developmental condition: https://www.science.org/doi/10.1126/sciadv.adn9310

- other Hominids also did this at some point in the last few million years: https://www.nationalgeographic.com/science/article/deformed-...

- A 2,500-year-old woman had a jaw prosthetic made: https://www.vice.com/en/article/mummified-skull-reveals-iron...

- 15k years ago, someone with a broken femur was cared for well enough to heal: https://www.forbes.com/sites/remyblumenfeld/2020/03/21/how-a...

- Neanderthals pre-chewed food or provided soft foods for someone who lost their teeth: https://www.sciencenews.org/article/care-worn-fossils

- 4000 years ago, a man who was almost certainly a quadriplegic was still being cared for: https://www.npr.org/sections/goatsandsoda/2020/06/17/8788963...


Do you have anything more interesting to say on the topic than "No U wrong"? The OP had a lot of thoughtful comments about the issues with having things to do after retiring.

You hit the nail squarely on the head. In days past when people retired they'd still help raise kids or look after households. When we moved past requiring that sort of thing, we left the elderly without engagement.

I'm not sure what the solution is, but perhaps as a society we could be more intentional about creating roles where the elderly can still help and feel useful, but also have flexibility and a more relaxed lifestyle.


There's not necessarily money in it, but in the current era, parents still find the grandparents' availability for minding children incredibly useful. If they also cleaned my house for free or cheap, I'd be thrilled!

I mean, we're about to enter a demographic reversal, and to hear economists tell it, corporations are going to really struggle to find the workers they need.

I guess we're about to find out if they're desperate enough to offer genuine flexibility or not.

If I could work 2d/wk remote as a software developer, I'd probably do it the rest of my life. Something tells me that most CEOs are still gonna insist on 50+hrs/wk RTO though...


They shouldn’t just feel useful, they need roles that actually are useful. They’re not dumb.

Of course, though I still think it's important to remember that people need to feel useful. E.g. you don't want to force someone into a job that may be useful but leaves the person feeling "why am I doing this? It's not needed." The goal is also not to fill time or a money quota. It's to do something helpful such that the person actually feels helpful.

Either:

1. They are "dumb" and the original statement stands

2. They are not "dumb" and a role that is actually useful is a necessary condition for them feeling useful and the original statement stands.


There are useful roles that could either be done by a human or a machine and the machine is usually more efficient.

Interestingly, this recent study using ChatGPT Health gave quite a different outcome (https://www.nature.com/articles/s41591-026-04297-7). Here it was wrong about emergency triage 50% of the time.

Not on topic, but wow the internet has very quickly devolved into: click -> "making sure you're not a bot", click -> "making sure you're a human", click -> "COOKIES COOKIES COOKIES", click -> "cloudflare something something"

We had to set it up on the parts of VideoLAN infra so the service would remain usable.

Otherwise it was under a constant DDoS by the AI bots.


Maybe I'm naive about this, but I didn't expect AI scrapers to be that big of a load? I mean, it's not like they need to scrape the same content at 1000+ QPS, and even then I wouldn't expect them to download all the media and images either?

What am I missing that explains the gap between this and “constant DDoS” of the site?


You can't really cache the dynamic content produced by forges like GitLab and, say, web forums like phpBB. So every request goes through the slow path. Media/JS is of course cached on the edge, so that's not an issue.

Even when the amount of AI requests isn't that high - generally it's in the hundreds per second, tops, for our services combined - that's still a load that causes issues for legitimate users/developers. We've seen it grow from somewhat reasonable to being pretty much 99% of the responses we serve.

Can it be solved by throwing more hardware at the problem? Sure. But it's not sustainable, and the reasonable approach in our case is to filter off the parasitic traffic.


Thanks, appreciate the details. 99% is far above the amount I expected, and if it specifically hits hard to cache data then I can see how that brings a system to its knees.

You kind of can, though. You serve the cached page and then use JavaScript to modify it for the individual user. The specific user actions can't be cached, but the rest of it can.
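A minimal sketch of that split, assuming an Express-style app (the routes, the tiny /api/session endpoint, and the demo header are all made up for illustration): the page shell gets cache headers a CDN can honour, and only one small per-user endpoint stays on the slow path.

    // Hypothetical sketch: cacheable page shell + small uncacheable per-user API.
    import express from "express";

    const app = express();

    // The HTML shell is identical for everyone, so an edge cache/CDN can absorb the bots.
    app.get("/forum/:topic", (req, res) => {
      res.set("Cache-Control", "public, max-age=300, stale-while-revalidate=3600");
      res.type("html").send(`<h1>${req.params.topic}</h1><div id="whoami"></div>`);
    });

    // Only this endpoint is per-user and uncacheable; everything else stays cheap.
    app.get("/api/session", (req, res) => {
      res.set("Cache-Control", "private, no-store");
      res.json({ username: req.header("x-demo-user") ?? null }); // stand-in for real auth
    });

    app.listen(8080);

The cached shell then only needs a couple of lines of client-side JS to fetch /api/session and fill in the username, so crawlers hammering the HTML never touch the dynamic path.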

Totally. Remember that Slashdot in the 1990s served a dynamic page to a user base capable of bringing major properties down, on a handful of servers whose combined horsepower is dwarfed by a Nintendo Switch.

The "can't" comes from the fact that VLC is not going to rewrite their forum software or software forge.

Software written in PHP is in most cases frankly still abysmally slow and inefficient. WordPress runs like 70% of the web and you can really feel it from the 1500ms+ TTFB most sites have. phpBB is not much better. Pathetic throughput at best, and it has not gotten better in decades now.

I don't know how GitLab became so disgustingly slow. But yeah, I'm not surprised bots can easily bring it to its knees.


> WordPress runs like 70% of the web and you can really feel it from the 1500ms+ TTFB most sites have. phpBB is not much better.

At least phpBB died 15 years ago, with most communities migrating to XenForo. I'm not quite sure how or why WP is still around with so many SSGs and SaaS site builders floating around these days.


XenForo is not much better and has many "administrators" whining about bot traffic as well.

The funniest part about WordPress is that you can usually achieve at least a 50% speed boost or more by adding a plugin that just minifies and caches the ridiculous number of dynamic CSS and JS files that most themes and plugins add to every page. Set those up with HTTP 103 Early Hints preload headers (so the browser can start sending subresource requests in the background before the HTML is even sent out, exactly the kind of thing HTTP/2 and /3 were designed to make possible) and then throw Cloudflare or another decent CDN on top, and you're suddenly getting TTFBs much closer to a more "modern" stack.
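To make the Early Hints part concrete, here's a minimal Node sketch (Node 18.11+; the asset paths are invented for the example). The 103 goes out immediately, so the browser can start fetching the minified CSS/JS while the slow dynamic HTML is still being generated:

    // Hypothetical sketch: send a 103 Early Hints response with preload hints
    // before the real response, then stream the page as usual.
    import { createServer } from "node:http";

    createServer((req, res) => {
      res.writeEarlyHints({
        link: [
          "</assets/site.min.css>; rel=preload; as=style",
          "</assets/site.min.js>; rel=preload; as=script",
        ],
      });

      // ...the slow part: render the page (WordPress/PHP would be doing this)...
      res.writeHead(200, { "content-type": "text/html" });
      res.end("<html><!-- page body --></html>");
    }).listen(8080);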

The bizarre thing is that pretty much no CMS, even the "new" ones, seems to automate all of that by default. None of those steps are that difficult to implement, and they provide a serious speed boost to everything from WordPress to MediaWiki in my experience, and yet the only service that seems to come close to offering it is Cloudflare.

Even then, Cloudflare's tooling only works best if you're already emitting minified and compressed files and custom-written preload headers on the origin side: the hit of decompressing all the origin traffic to make those adjustments and analyses is far worse for performance than just forwarding your compressed responses directly. That's why they removed Auto Minify[1] and encourage sending pre-compressed Brotli level 11 responses from the origin[2], so people on recent browsers get pass-through compression without extra cycles being spent on Cloudflare's servers.

The solution seems pretty clear: aim to get as much stuff served statically, preferably pre-compressed, as you can. But it's weird that actually implementing that is still a manual process on most CMSes, when it shouldn't be that hard to make it a standard feature.

And as for Git web interfaces, the correct solution is to require logins to view complete history. Nobody likes saying it, nobody likes hearing it. But Git is not efficient enough on its own to handle the constant bombardment of random history paginations and diffs that AI crawlers seem to love. It wasn't an issue before, because old crawlers for things like search engines were smart enough to ignore those types of pages, or at least to accept it when the sysadmin said those pages should be ignored. AI crawlers have no limits, ignore signals from site operators, make no attempt to skip redundant content, and in general are very dumb about how they send requests. (This is a large part of why Anubis works so well; it's not a particularly complex or hard-to-bypass proof-of-work system[3], but AI bots genuinely don't care about anything but consuming as many HTTP 200s as a server can return, and give up at the slightest hint of pushback, though they do at least try randomizing IPs and User-Agents, since those are effectively zero-cost to attempt.)

[1]: https://community.cloudflare.com/t/deprecating-auto-minify/6...

[2]: https://blog.cloudflare.com/this-is-brotli-from-origin/

[3]: https://lock.cmpxchg8b.com/anubis.html but see also https://news.ycombinator.com/item?id=45787775 and then https://news.ycombinator.com/item?id=43668433 and https://news.ycombinator.com/item?id=43864108 for how it's working in the real world. Clearly Anubis actually does work, given testimonials from admins and wide deployment numbers, but that can only mean that AI scrapers aren't actually implementing effective bypass measures. Which does seem pretty in line with what I've heard about AI scrapers, summarized well in https://news.ycombinator.com/item?id=43397361, in that they are basically making no attempt to actually optimize how they're crawling. The general consensus seems to be that if they were going to crawl optimally, they'd just pull down a copy of Common Crawl like every other major data analysis project has done for the last two decades, but all the AI companies are so desperate to get just slightly more training data than their competitors that they're repeatedly crawling near-identical Git diffs just on the off-chance they reveal some slightly different permutation of text to use. This is also why open-source models have been able to almost keep pace with the state-of-the-art models coming out of the big firms: they're just designing way more efficient training processes, while the big guys are desperately throwing hardware and crawlers at the problem in the hope that they can will it into an Amazon model instead of a Ben and Jerry's model[4].

[4]: https://www.joelonsoftware.com/2000/05/12/strategy-letter-i-... - still probably the single greatest blog post ever written, 26 years later.


> And as for Git web interfaces, the correct solution is to require logins to view complete history.

Why logins, exactly? Who would have such logins; developers only, or anyone who signs up? I'm not sure if this is an effective long-term mitigation, or simply a “wall of minimal height” like you point out that Anubis is.


I think there are a few things at play here:

- AI scrapers will pull a bunch of docs from many sites in parallel (so instead of a human request where someone picks a single Google result, it hits a bunch of sites)

- AI will crawl the site looking for the correct answer which may hit a handful of pages

- AI sends requests in quick succession (big bursts instead of small trickle over longer time)

- Personal assistants may crawl the site repeatedly scraping everything (we saw a fair bit of this at work, they announced themselves with user agents)

- At work (B2B SaaS webapp) we also found that the personal-assistant variety tended to hammer really computationally expensive data export and reporting endpoints, generally without filters. While our app technically supported it, it was very inorganic traffic

That said, I don't think the solution is blanket blocks. Really, it's exposing that sites are poorly optimized for emerging technology.


Also, relevant for forges: AI doesn't understand what it's clicking on. Git forges tend to e.g. have a lot of links like “download a tarball at this revision” which are super-expensive as far as resources go, and AI crawlers will click on those because they click on every link that looks shiny. (And there are a lot of revisions in a project like VLC!) Much, much more often than humans do.

This is also irrelevant to the original comment, which is complaining about bot checks for looking at the root of the repository - which is probably the most requested resource and should be 100% served from cache at a cost much lower than running the bot checks.

It's simply bad, inefficient software and we shouldn't keep making excuses for it.


Agree. Did some basic searching and it looks like GitLab is particularly bad. It ships with built-in rate limiting, but the backend marks all pages as uncacheable on top of them being somewhat dynamically generated (I guess it caches "page fragments").

The only issues I found amounted to "here's how to use Anubis to block everything"

There's also some new but poorly supported standards around agents setting `Accept: text/markdown` and https://github.com/cloudflare/web-bot-auth


They are a scourge: they never rate-limit themselves, there are a hundred of them, and a significant number don't respect robots.txt. Many of them also end up on our meta:no-index,no-follow search pages, leading to cost overruns on our Algolia usage. We spend far more time adjusting WAF and other bot controls than we should.


Thanks. I imagine there is (a) a lot of interest in scraping source code, and (b) many requests to forges hitting expensive paths. 99% of volume though, wow, much more than expected.

You've gotten several comprehensive responses so far, so I want to add a niche corner that people might assume doesn't have the bot problem but still does.

I run a website that hosts tools for my family: games and a TV interface for the kids, remote access to our family cloud and cameras, etc. Sensitive things require log in and have additional parameters required for access of course.

I specifically blocked bots from search engines so my site is never indexed, as I'm not selling anything nor want any attention, as well as some other public non-malicious bots in case they communicate with Google, just to be safe there, and my robots.txt doesn't allow anything.

I assume, then, that the only way a bot could even find my site is to do what the indexers do: brute-force every possible IPv4 address hoping to hear something back, as my domain should not be known (and isn't simple enough to be quickly guessed). So most traffic must be malicious or indexing (AI overviews and other scrapers won't be finding it via web search).

Since it isn't indexed, and keeping everything in simple black-and-white boxes, my remaining traffic is either family or malicious bots, and 99.9% isn't family.

I currently have the most strict bot-blocking setup I could come up with, which nicely cut down on quite a bit of traffic, but I do still receive ~2k attempts per day, which, as you can imagine, is still around 99% not family traffic, as I have fewer than 20 kids and my kids aren't using the site nonstop.

Conveniently, my setup has never accidentally blocked a family member, so I'm pleased with the setup.


> I assume then, that the only way a bot could even find my site is to do what the indexers do: brute force try every single possible ipv4 address hoping to hear something back, as my domain should not be known

If your site uses https, they could also get your domain from the certificate transparency logs for the certificate you use.


I didn't think of that, but that makes complete sense, as it is https. I think my info was sold by my registrar as well, because solicitors call or email me on occasion because they "accidentally came across my site" and want to provide design/JS/etc. help.

You can get around this by grabbing a wildcard certificate and then using a hard-to-guess subdomain.

While I do sympathize with the AI DDoS situation, it'd be nice if there were a solution that allows them to work so they can pull official docs.

For instance, MCP, static sites that are easy to scale, a cache in front of a dynamic site engine


Of course, static websites are the best solution to that problem.

Our documentation and main website are not fronted by this protection, so they're still accessible to the scrapers.


Have you considered making the pages mostly static and cacheable for non-logged-in users first? There is no reason a repository listing needs to be this resource-intensive.

I highly doubt there is no other technically feasible option to block the AI bots. You end up blocking not just bots, but many humans too. When I clicked on the link and the bot block came up, I just clicked back. I think HN posts should have warnings when the site blocks you from seeing it until you somehow, maybe, prove you are human.

I'm sure there are many solutions for many problems, but expecting a small FOSS development team to know or implement them all is rather unreasonable.

I think the world gains more if the VideoLAN team focuses on their amazing, free contribution to the world than if they spend the same time trying to figure out how to save you two clicks.

We all hate that this is happening, but you don't need to attack everyone that is unfortunately caught up in it.


> I highly doubt there is no other technically feasible option to block the AI bots.

If you have discovered such an option, you could get very wealthy: minimizing friction for humans in e-commerce is valuable. If you're a drive-by critic not vested in the project, then yours is an instance of talk being cheap.


I'm all ears on how we can fix it otherwise.

Keep in mind that those kinds of services:

- should not be MITMed by CDNs

- are generally run by volunteers with zero budget, money- and time-wise


First off, don't block the first connection of the day from a given IP. Rate-limit/block from there, for example the way sshguard does it.

I've seen several posts on HN and elsewhere showing many bots can be fingerprinted and blocked based on HTTP headers and TLS.

For the bots that perfectly match the fingerprint of an interactive browser and don't trigger rate limits, use hidden links to tarpits and zip bombs. Many of these have been discussed on HN. Here's the first one that came to memory: https://news.ycombinator.com/item?id=42725147
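As a rough sketch of the first suggestion only (the thresholds and the Express middleware shape are assumptions here, not how sshguard or any particular forge does it): let every IP's first handful of requests through untouched and only start pushing back after a burst.

    // Hypothetical sketch: no friction on the first requests from an IP,
    // 429s only once it exceeds a per-minute burst budget.
    import express from "express";

    const WINDOW_MS = 60_000; // 1-minute window
    const BURST = 30;         // requests allowed per window before pushing back
    const hits = new Map<string, { count: number; reset: number }>();

    const app = express();

    app.use((req, res, next) => {
      const ip = req.ip ?? "unknown";
      const now = Date.now();
      const entry = hits.get(ip);
      if (!entry || now > entry.reset) {
        hits.set(ip, { count: 1, reset: now + WINDOW_MS });
        return next(); // first request in the window: no challenge, no delay
      }
      entry.count++;
      if (entry.count <= BURST) return next();
      res.status(429).set("Retry-After", "60").send("Too many requests");
    });

    app.get("/", (_req, res) => res.send("ok"));
    app.listen(8080);

Of course this only covers the "don't punish the first click" half; on its own it does little against crawlers that rotate through huge address ranges.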


> don't block the first connection of the day from a given IP.

The bots come from a huge number of IP addresses, so that won't really help much. And it doesn't solve the UX problem either, because most pages require multiple requests for additional assets, and requiring human verification then is a lot more complicated than for the initial request.

> For the bots that perfectly match the fingerprint of an interactive browser

That requires properly fingerprinting the browser, which will almost certainly have false positives from users who use unusual browsers or use anti-fingerprinting.

> use hidden links to tarpits and zip bombs.

That can waste the bot users' resources, but it doesn't necessarily protect your site from bots. And it also requires quite a bit of work that small projects don't have time for.

Unless there is a prebuilt solution that is at least as easy to deploy and at least as effective as something like Anubis, it isn't really practical for most sites.


Nearly every single website I'm not logged into these days wants me to "confirm I'm not a bot".

It is incredibly annoying, but what can you do? AI scrapers ruined the web.


The internet is such a Tragedy of the Commons… its citizens that act selfishly and in bad faith will slowly make it unusable.

It's pretty explicitly not a tragedy of the commons. It's a tragedy of the ruling class abusing the resources of the 'commons' to extract value. There is nothing 'commons' about trillion-dollar companies extracting all available value from the labor of the working class. That's just the tragedy that'll bring around the death of society, the same tragedy that brings all other tragedies.

The commons in question is the internet itself.

Thank you for describing the tragedy of the commons

The commons were never unregulated. This is a tragedy of enclosure.

https://en.wikipedia.org/wiki/Enclosure


We might get stuck discussing semantics. Large parts of the internet remain public and open; but an incredibly vast part of it is a toxic wasteland.

It’s true that many spaces people frequent are ‘enclosed’, but these are also less subject to the abuse taking place in the public and open areas.


There are definitely lots of problems with the ruling class and wealth disparity. Perhaps the defining problems of our current age.

That being said, so many of the plebs suck. Like 2% will ruin everything for everyone.


While a lot of the plebs do suck, a pleb who sucks causes far fewer problems than a big corp that sucks, simply by virtue of not having as many resources.

I agree.

But whether you agree with me or not, most paradigm-shifting changes come from billionaires/corps because they are the only ones with the money to pull off massive shifts. Most innovation is not grassroots; it's heavily funded by the "elites". This is how most successful countries have worked for at least the last 100 years. So billionaires add a lot of value even as they cause a lot of pain.

The solution in my mind is we absolutely need uncapped billionaires but they need to be effectively taxed (not like 90% but closer to 50%) and they have to have absolutely no influence on the government.


tragedy of the commons with your ideological buzzwords sprinkled in, truly innovative

No, it is because citizens allow themselves to be treated like this.

> its citizens that act selfishly and in bad faith will slowly make it unusable

It's rarely been the citizens that have been the problem, but the governments and companies that seek to use the network connection for their overwhelming benefit.

Re (above):

> Not on topic, but wow the internet has very quickly devolved into: click -> "making sure you're not a bot", click -> "making sure you're a human", click -> "COOKIES COOKIES COOKIES", click -> "cloudflare something something"


wat. The protections in place that the OP is talking about are almost entirely due to (not government and company) bad actors.

No one's even clicking anymore, everything implores me to tap or swipe these days, and everything is optimised for humans with one eye above the other.

Then I press the X to close the all-caps banner commanding me to install the app, upon which I get sent to the app store. Users of the website refer to it as an app.


Their bot-detection page took more than 40 seconds to complete on my low-end smartphone. This sucks.

Wow I’m glad it’s not just me. I thought my IP block had gotten caught up in some known spamming or something.

At least this one was significantly faster than Cloudflare and required no action on my part.

I get exactly none of that. Is your adblocker still working?

renders your gigabit connection pointless

AI is a gift that keeps on giving.

High hardware prices, locked information sources, plenty of AI slop etc.


Rather, it's people unable to set up static websites where needed.

I hate that I can't do a curl, or automate my curls to retrieve data from the web, because I either see some Cloudflare protection or some CAPTCHA.

Information is blocked in walled gardens.


I still find the idea that "learning" from code is "stealing" kind of ridiculous.

The "learning" isn't learning really. I mean it might be, but if you define learning to be a human endeavor, then AI can't learn.

It's perfectly reasonable to say it's okay for humans to do something but not okay for a computer program to do the same thing. We don't have to equate AI to humans, that's a choice and usually a bad one.


It's also perfectly reasonable to say it's ok for a program or machine to do the same thing as a human. This has been the basis for the technological revolution since the dawn of technology.

It's legal and perfectly reasonable for a human being to combine organic fuels with oxygen from the air to create energy and CO2. Any law restricting that would be the worst form of tyranny.

It would not be reasonable to allow machines to do that at unlimited scale without restrictions.

(Hopefully the fossil fuels industry won't draw inspiration from the legal arguments made by AI companies...)


> It's legal and perfectly reasonable for a human being to combine organic fuels with oxygen from the air to create energy and CO2.

Is there any line past which it becomes unreasonable?

> It would not be reasonable to allow machines to do that at unlimited scale without restrictions.

If the machines were a replacement for a damaged respiratory system in a human would it reasonable?

What about if the machine were being used by a human to do something else that was important?

Where is the line where it becomes reasonable?


> Is there any line past which it becomes unreasonable?

That's exactly the question we should be asking about AI and fair use.


Are you refusing to engage with your own metaphor?

You're taking the metaphor much too seriously. It was only an example to illustrate that human rights don't automatically apply to machines. Let's not read too much into it.

You made a claim and used a metaphor to demonstrate that claim. I asked a very simple question about the bounds of the metaphor and thus the claim. You are dodging the questions, which means you cannot defend the logic of your claim. Thus you have forfeited the claim's validity, and 'human rights don't automatically apply to machines' has not been illustrated.

Fortunately I don't care whether you're convinced. I doubt our discussion here will change policy in any way.

What's your strategy for solving problems where there are diverse viewpoints if there is no desire to convince anyone else? Rhetoric is a time-proven set of communication standards that allows us to demonstrate the validity of our positions and thus gives us a way to find agreement, or at least understand what others think. Few people are completely irrational, and understanding why they think what they do, even if one does not agree with them, is important in a system where people have to co-exist with the decisions that affect everyone.

Because the alternative would be to just railroad people who don't agree, and even when it does work in one's favor the pendulum tends to swing back hard in response.


If one defines 'flying' to be a bird's endeavor, then humans can't fly.

Now, if you'll excuse me, I need to catch a metal shuttle that chucks itself through the air on wings.


Sure, as a word it can be broad; as a concept in our legal system it should be much more nuanced.

The relevant extension of your analogy is should birds be required to obey FAA rules? Or should plane factories be protected as nesting sites?



It's a relevant extension if you think the ability to learn from a work is a right people have that exempts them from the more general lockdown copyright would impose.

If you come at it from the view of copyright being a limited set of control over some areas but not others, then if copyright doesn't block human learning it shouldn't affect anything similar either, unless a specific rule is added to make those situations be handled differently.


Yes I guess there's also no such thing as stealing in torrents since the computer "learns" the data and returns it in a transcoded fashion so it's technically not a reproduction. Yes LLMs can reproduce passages from copyrighted works verbatim but that's only because it "learned" it and it's just telling you what it "knows".

The mental calisthenics required to justify this stuff must be exhausting.


> The mental calisthenics required to justify this stuff must be exhausting.

It's only exhausting if you think copyright ever reasonably settled the matter of ownership of knowledge and want to morally justify an incoherent set of outcomes that you personally favor. In practice it's primarily been a tool for the powerful party in any dispute to hammer others for disrupting their business model. I think that's pretty much the only way attempting to apply ownership semantics to knowledge or information can end up.


Correct.

Knowledge consists of, roughly speaking, thoughts.

(a "justified true belief" - per https://plato.stanford.edu/entries/knowledge-analysis/ - is a kind of thought)

The "thinking" part of a "thinking being" - that also consists of thoughts.

If your knowledge is someone's property, you are someone's property.

A society where all knowledge is proprietary, is a society of ubiquitous slavery.

Maybe multi-layered, maybe fractional, maybe with a smiley-face drawn on top.

Doesn't matter.


Humans have been known to recite entire parts from plays from memory, live in front of audiences even.

And they are legally required to license the play to do that, if it's still in copyright.

Only to perform it, not learn it.

And LLMs perform when you prompt them.

> Yes LLMs can reproduce passages from copyrighted works verbatim but that's only because it "learned" it and it's just telling you what it "knows".

Are you finding people that actually say this?

When it can quote something like that, it's a training error. A popular enough work gets quoted and copied by people online, and then it's not properly deduplicated. It's a very small fraction of works it can do that with, and the cleaner your data the less it happens.

I'll once again note that Stable Diffusion launched with fewer weights than training images. It had some accidental memorizations, but there wasn't room for its core functionality to be memorization-based.


This is a perfect example of 'begging the question': arriving at a conclusion from a premise assumed true without evidence. Your reductio does not actually demonstrate that copyright applies to LLMs, because you did not demonstrate how transcoding is comparable to inference, just that LLMs can reproduce some passages from copyrighted works. You could also produce passages from copyrighted works by generating enough random sequences of words, but no one is arguing that is comparable to transcoding. That the people who do not share this conclusion are engaging in motivated reasoning is based only on your assumption and has no logical backing, and is therefore begging the question.

"Learning" for LLMs is just as goofy and propagandistic a metaphor as "stealing" for copyright. I find it predictive of your position that you'll accept one dumb metaphor for something that we didn't need a metaphor for, but not the other.

Are you for stealing and against learning?

We know exactly what is happening in both cases. We can talk about that, or we can use obfuscating euphemisms that make our preferred position seem obviously true.


I find it more ridiculous to equate the act of a human learning with for-profit AI training without recompense to the authors of the training material.

I think that it's absurd that we've jumped to the conclusion backpropagation in neural networks should be legally treated the same as human learning.

I mean, I don't think I could find a better description for following the derivatives of error in reproducing a set of works than creating a "derivative work".


>> ... we've jumped to the conclusion backpropagation in neural networks should be legally treated the same as human learning.

I agree. However, the reverse is also likely true, i.e., it cannot currently be denied that learning in humans is different from learning in artificial neural networks from the point of view of production of works that mix ideas/memes from several works processed/read. Surely, as the article says, copyright law talks exclusively about humans, not machines, not animals.


I understand the article - the point about 'learning' is that if the model and its outputs are derivative works, then the copyright belongs to the human creators of the works it was trained on.

Edit: Or perhaps, put more pseudo-legally, that the created works infringe on the copyrights of the original human creators.


The part I agree with is that copyright law calls out humans specifically as the potential owners of copyright. So what you suggest seems to be the only way out. Calling out humans could imply that when a human reads a thousand books and then writes something based on them, but which is not a substantial copy of anything explicitly read, that human owns the copyright to the text written. Whereas, if an artificial neural network does the same (hypothetically writing the same text), it would not.

The above does not follow from, imply or conclude anything about learning in artificial neural networks and humans being similar or dissimilar.


Learning, probably not.

Copy/pasting at scale, yes


It is learning though. It’s not just copying the code.

Code gets turned into tokens and then it learns the next most likely token.

The issue that I see most people talk about is the scale at which it's learnt.

A human will learn from other people's code, but not from every person's code.


The issue is that of copyright law with respect to derivative works. Machine transformations of original works do not create a new copyright for the person that directed the transformation. That's why you can't pirate a bunch of media by simply adding a red pixel to the right-hand corner or by color-shifting the video.

Copyright law is very clear that if a machine does it, the original copyright on the input is kept. This is why your distributed binaries are still copyrighted, because the machine transformed, very significantly, the source code into binary which maintains the copyright throughout.

It would be inconsistent for the courts to suddenly decide that "actually, this specific type of machine transformation is actually innovative."

I know this is generally really bad for the AI industry, so they just ignore it until a court tells them they can't anymore. And they might get away with it as I don't have faith that the courts will be consistent.


Shredding is a machine transformation. Does it mean that shreds retain original copyright even if the content can't be restored and the provenance can't be traced? Just an example that treating all machine transformations equally with no regard to the specifics doesn't make much sense.

And the specifics of autoregressive pretraining is that it is lossy compression. Good luck finding which copyrighted materials have made it into the final weights.


> Does it mean that shreds retain original copyright even if the content can't be restored?

Yup, it absolutely does. In fact, that's why you are still violating copyright law by using bittorrent even though each of the users is only giving out a small slice or shred of the original content.

The US grants a defense for cases like shredding, called "fair use", but that doesn't mean or imply that a copyright is void simply because of a fair use claim.

> And the specifics of autoregressive pretraining is that it is lossy compression.

That doesn't matter. Why would it? If I take a FLAC recording and change it to an MP3. The fact that it was a lossy transform doesn't suddenly give me the legal right to distribute the MP3.

> Good luck finding which copyrighted materials have made it into the final weights.

That's what the NYT v. OpenAI lawsuit is all about. And for earlier models they could, in fact, pull out full NYT articles which proved they made it into the final weights.

Further, the NYT is currently in discovery which means OpenAI must open up to the NYT what goes into their weights. A move that, if OpenAI loses, other litigants can also use because there's a real good shot that OpenAI also included their works in the dataset.


> Yup, it absolutely does

Well, it's not the first time the law has contradicted the laws of nature (to the entertainment of future generations). BitTorrent is not a relevant example, because the system is designed to restore the work in its fullness.

> in fact, pull out full NYT articles

That's when they used their knowledge of the exact text they wanted to "retrieve" to get the text? It wouldn't be so efficient with a random number generator, but it's doable.


> Bittorent is not a relevant example, because the system is designed to restore the work in its fullness.

You can restore shredded documents with enough time and effort. And if you did that and started making photo copies, even if they are incomplete, you will run afoul of copyright law.

Bittorrent is a relevant example because it shows that shredding doesn't destroy copyright.

Remember, copyright is about the right to copy something. Simply shredding or destroying a thing isn't applicable to copyright. Nor is giving that thing away. What's applicable is when you start to actually copy the thing.


I meant idealized shredding: a destructive transformation, which is still a machine transformation (think blender instead of shredder). When you need the exact knowledge of a thing to make its (imperfect) copy using some mechanism, it doesn't mean that the mechanism violates copyright.

EDIT: I'm not saying that neural networks can't rote-learn extensive passages (it's an effect of data duplication). I'm saying that they are not designed to do that and that it's possible to prevent it (as demonstrated by the latest models).


I'd assume it's still a copyright violation if you copied and distributed the shredded copy.

The way I arrive at that: imagine you add just one pixel of static to a video; that'd still be a copyright violation. Now imagine you slowly keep adding random pixels. Eventually you get to the point where the whole video is just static, but at some point it wasn't.

Now, would any media company or court sue over that? Probably not. But I believe that still falls under copyright (but maybe fair use?).

The issue with neural networks is they aren't people. Even when you point your LLM at a website and say "summarize this", the output of that summary would be owned by the website itself, by nature of it being a machine-transformed work.

Remember, it's not just rote recitation that violates the law; any transformation counts as well. The fact that AI companies are preventing it doesn't really solve the problem that they are, in fact, transforming multiple copyrighted works into their responses.


When you point your browser at a website the browser creates a (transformed) local copy of the information that is owned by the website itself. The browser needs to do that to render the website on your screen. Is it a violation of copyright (that the website is willing to tolerate because it profits from advertisements)?

No, because your browser is dealing with the distribution of data in a way intended by the copyright holder. You also aren't redistributing the webpage after rendering. Client side modifications fall under fair use which is what keeps the likes of ad blockers and other page modifiers legal.

What would violate copyright is if you took that rendered page, turned it into a JPEG, and then hosted that JPEG from your own servers. That's the copying that would run afoul of copyright law.


LLMs seem to be so devoid of intelligence, I think it's arguable if that's learning: https://machinelearning.apple.com/research/illusion-of-think... Typically, you would imply a level of understanding when you say learning. LLMs apparently can't do that, by design.

A human is not a commercial product. Here we have a commercial product that was created using a lot of copyrighted and protected IP, without licensing agreements, without paying, without even citing it.

Copy/pasting at scale is how tons of software has been written for a long time, or have we all forgotten the jokes people used to make about StackOverflow?

If you can set a copyright trap and an LLM reproduces it I think it's pretty clear cut that it's more than just "learning".

I have seen LLMs do all sorts of crap which was clearly reproduction of training material.

This is also why people are most impressed with how much better it is at reproducing boilerplate rather than, say, imaginative new ideas.


Remember last year (?) when one of the major AIs produced a bit of code that included Jeff Geerling's name in a comment?

Is "learning" the correct term?

Or is it "plagiarism"?


If that were the case, then imagine having to give it back!

If I “learned” your essay and handed it in, would you be happy with that?

The US is sure becoming an unfree scary place just like Russia. Keep it up following those role models!

>It's literally the largest registrar in the world, by a large margin. When you're a business and want something reliable, picking the most popular provider is usually a strategy that works decently well. They're more likely to have established processes that work for all sorts of cases.

It's also literally one of the most criticized and awful registrars in the world, by a large margin. If decades of stories like this don't convince you to go with a more reliable registrar then I have very little sympathy.

This story is not egregious, it's in fact typical of GoDaddy. Every so often we get a HN post with a GoDaddy horror story. You'd think people would have learned by now.

