
The entire basis of the current generation of AI is in stolen materials. Stolen writing, stolen art, stolen music, all of it taken because it was "publicly available" meaning "there was nothing in existing law that said we couldn't take it for training a learning model." Now they've done the same with Dropbox contents.

AI is very cool technology. The incredible overstepping of any and all ethics in acquiring training data (just grab all of the data, from any source, as fast as possible, right now), in this mad dash to create it at all costs, is not cool, and it should mandate a complete restart on the part of the people behind it. For this one, and for so many other ethical lapses on the part of OpenAI, the models as they exist are tainted beyond any ethical use as far as I'm concerned.

If your product uses this stuff, you are not getting one red cent from me for it. Period, paragraph.



This is a silly argument. Public sharing on the internet has nothing in common with storing on a private storage service. Posting something online, you are literally inviting people to look at it, and there is copyright law that governs how it can be copied. You can argue that training AI models on internet data violates copyright (though it almost certainly doesn't), or that the law needs to be changed. But none of that has anything to do with training on private data. It's the same kind of "you wouldn't steal a car" false comparison.


> Public sharing on the internet has nothing in common with storing on a private storage service.

No shit.

> Posting something online, you are literally inviting people to look at it,

Yes, PEOPLE. Posting things for people to see is why the Internet exists, is why USENET was created, is why web forums were created, is why social media was created, is why 9/10ths of the Internet as we know it today came into being.

It was not put there so that people who do not know any of those people, and do not give a rat's ass about what they made, could take millions of images, writings, and sounds and shove them into their product without their creators' consent, for purposes it was never meant for, so they could automate art. That is categorically not what any of that is for, and you, and everyone else making this tired point, damn well know that.

If you have no issue at all with your creative output being used to train data models, more power to you! That's how consent works! You consent and that's completely, 100% fine. That consent should not have been presumed as it was, and even if you assume complete and total innocence on the part of the AI creators, once it became extremely fucking obvious that tons, and tons, and tons of creatives absolutely would not have consented if asked, then their data models should've been re-trained with that misused data removed. That is the ETHICAL thing to do: when someone says "hey I really don't like what you're doing with my material, please stop" you STOP, not because that's legally binding, not because you'll be sued, not because you're infringing copyrights, but because you have a fundamental respect for your fellow human being who says "I don't like this, please don't do it" and then you, you know, don't do it.

Unless, of course, what you actually are is an ethics-free for-profit entity that needs to get to market as soon as possible with a product to sell that you probably can't be sued over, in which case you tell those people whose work your product could not exist without to eat shit, and proceed anyway. Which is basically exactly what happened, and continues to happen.

And before you even go there with the "well, how could they possibly ask about the entire dataset's contents" line: I DON'T CARE. I'm not the one doing this; it is not my problem to solve. Just because the ethical way to do a thing is hard, time-consuming, expensive, or otherwise difficult, you don't get to waltz past the ethics involved, even if you're a research project! I personally wouldn't want to get permission from a few million artists to use their work in this way, I don't think most of them would be comfortable with it, and even if they were, I don't really want to do that, it sounds like a ton of work. SO I DIDN'T.


Not to mention that, a lot of the time, it wasn't just that they didn't ask first: they ignored specific, widely used, machine-readable license identifiers and copyright symbols attached to the content.
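
(To make "machine readable" concrete, here's a minimal sketch of how cheaply a scraper could honor those markers. The SPDX-License-Identifier convention is real and widely used; the specific deny-list and function names are hypothetical, purely for illustration.)

    import re

    # hypothetical deny-list; the identifiers themselves are real SPDX IDs
    DO_NOT_TRAIN = {"AGPL-3.0-or-later", "CC-BY-NC-4.0", "CC-BY-ND-4.0"}

    def spdx_id(text):
        # SPDX tags are one-line comments, e.g.
        #   # SPDX-License-Identifier: AGPL-3.0-or-later
        m = re.search(r"SPDX-License-Identifier:\s*([\w.+-]+)", text)
        return m.group(1) if m else None

    def ok_to_ingest(text):
        # honoring the tag costs one regex per file
        return spdx_id(text) not in DO_NOT_TRAIN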

If a corporation argues that it can ignore your AGPL because it didn't have to blow the bloody doors off to get hold of your code and its training process is "just like your browser cache" or "a person learning" and the derivative stuff is completely novel, why would you trust them not to deploy the same "but it's not exactly copying" arguments when given access to other stuff that has third-party "no copying" agreements wrapped around it, like your Dropbox?


Agreed, it feels like people's copyright wishes are being flagrantly ignored, with a sense of "well, it's too late now, just live with it".

And I do not buy the "just like a person learning" argument. At least, not fully.

I could see that if you have a fully functioning AI system, then handing it a new article to ingest could be "just like a person learning".

But many people graduate high school having read just a few dozen books (or fewer), having been around maybe a few dozen people (or fewer), and having watched a few dozen movies (or fewer). A person does not need to ingest a nontrivial percentage of the entire wealth of human knowledge just to become intelligent enough to read an article in the first place.
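
(A rough back-of-envelope on that scale gap. The human-side figures are loose assumptions; the model-side figure is the ~300 billion training tokens reported in the GPT-3 paper.)

    # all human-side numbers are loose assumptions
    human_tokens = 50 * 90_000 * 1.3   # 50 books, 90k words each, ~1.3 tokens/word
    gpt3_tokens = 300e9                # ~300B tokens reported for GPT-3 training
    print(gpt3_tokens / human_tokens)  # ~50,000x more text than the person read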

There may be people who do not care about this distinction. That's fine. But I am quite convinced that the distinction exists. And thus I do not believe that training an AI system is just like teaching a person -- and making copyright decisions on the basis that the two things are identical does not make sense.


> And thus I do not believe that training an AI system is just like teaching a person

100%. I love the way you put this, and I just want to expand on it a little, to remind everyone that there have been numerous, flagrant examples of creators across various media whose names/handles get typed into these models, and whose style is then reproduced in output ludicrously similar to their work. Even though it's technically original, it is not original in any way meaningful to the topic, or defensible by anyone debating in good faith. That you need to add prompt tags like "unreal engine" or "featured on artstation" proves this: you're telling the machine to aim for works you know are of higher quality in the dataset to get a better result.

Now if you're just fine with that and content to fuck over artists like that for no other reason than you can, I can't stop you. But please spare me the righteous indignation of objecting to the characterization of such behavior. It's fucking obvious, do not insult the intelligence of your opponents by insisting otherwise.


And tbh, even if people are absolutely fine with that, think the analogies and legal arguments they make are absolutely sound, and maybe think copyright's a terrible idea anyway, I still can't see why they'd expect Big AI to suddenly drop the "don't care what you think about how we use your stuff; if it's not explicitly illegal, we're going to use it" stance when it comes to stuff that's supposed to be 'private' rather than stuff that's supposed to be 'property'.

Sure, maybe you care more about whether OpenAI keeps stuff derived from the contents of your Dropbox on their servers (which is technically neither "training a model" nor the actual "copy" they were required to delete after 30 days) than you ever did about copyrighted stuff. But why would OpenAI?


It's well past time for the end of the Digital Millennium of Copyright.

The problem here is that these corporations are given carte blanche to make any derivative works they want. They get the exclusive freedom to ignore copyright law.

The rest of us don't.

The worst part is that they get to turn around and say their models are protected by copyright!

This is copyright laundering. There are only two reasonable avenues of response:

1. Make "AI" companies respect existing copyright law when compiling training datasets.

2. Get rid of digital copyright for everyone.

I vote option #2.


> Now they've done the same with Dropbox contents.

This is what I mean by the "trust crisis".

Dropbox very clearly denies that Dropbox content is being used to train AI models, whether by them or by OpenAI.

You don't believe them, because you don't trust them.


It's almost like acting completely unethically in the public space has consequences or something



