What are AI companies doing to respect open source licenses and copyright?
I'm sure they train their models on open source software, so how do I know that LLM generated code doesn't reproduce substantial chunks of, for example, GPL licensed code? If indeed there are GPL violations, what are AI companies doing to police themselves?
I wonder if open source licenses will start to include "not to be used for LLM training" clauses.
> I wonder if open source licenses will start to include "not to be used for LLM training" clauses
As if the LLM trainers would care. They've ignored every single license and copyright policy out there because "fair transformative use". It's undergoing litigation in various jurisdictions, and the chaotic side of me really wants to see what happens if the UK or California decides that training an LLM on pirated copyrighted material is not fair use, and the rights holders have to be compensated.
> For instance the Mozilla DOM experiments seems to use a special JS variant with a 'use component' header
As per the article, that's temporary until Component Model 1.0 is implemented natively in the browser. In the meantime, jco can be used:
> The groundwork for browser implementations is being laid today: jco’s transpile command already converts any component into equivalent core Wasm and JavaScript glue, making components runnable in any browser without native support.
That's no longer needed once native support is there.
Or people don’t want to be trained in it because while you’re doing it the industry keeps on inventing new things you’re supposed to know.
I’ve been offered a job doing COBOL and another legacy language on core banking systems, and I’m going to take it. I’m getting toward the end of my career, so the risk is low, and the work might be more interesting than fighting npm or feeding questions into a clanker.
Bryan Cantrill warns, "Do not fall into the trap of anthropomorphizing Larry Ellison. You need to think of Larry Ellison the way you think of a lawnmower. You don't anthropomorphize your lawnmower, the lawnmower just mows the lawn. You stick your hand in there and it'll chop it off, the end. You don't think 'oh, the lawnmower hates me' -- the lawnmower doesn't give a shit about you, the lawnmower can't hate you. Don't anthropomorphize the lawnmower. Don't fall into that trap about Oracle.":
While that's pithy, I think it's also incorrect, because it implies that Oracle / Ellison is controllable by us, in the same way a tool / lawnmower is. That's absolutely not true. It has its own motivations that are best-case neutral to our goals.
It might be better to think of ourselves as individual fish in a school, and Oracle as the boat with a mile-long dragnet. It doesn't care about the individual fish; it's not worth its time to consider us individually. It's thinking in terms of tonnage.
> For a while, I tried not going into Nazi allegory when talking about Oracle, but I actually think it does a disservice to not go into Nazi allegory because if I don't use Nazi allegory when referring to Oracle, there are some critical understanding that I have left on the table. There's an element of the story you can't possibly understand. If you had to explain the Nazis to someone who had never heard of World War 2, but was an Oracle customer, there's a very good chance that you would actually explain the Nazis in Oracle allegory. So, it's like: "Really, wow, a whole country?" "Yes, Larry Ellison has an entire country." "Oh my god, the humanity! The license audits!" "Yes, we should talk to Poland about it, it was bad."
Eh. That quote conflates Ellison and Oracle, and I don't think that's correct. I think there's a danger in just accepting that a human being is abhorrent. It _should_ outrage us that Ellison is the way he is. It's silly to think "the lawnmower hates me" because it isn't capable of hate. Ellison is capable of hate, and it's not deluded to think he might hate you and me and want to control our lives.