• ell1e@leminal.space
    link
    fedilink
    English
    arrow-up
    20
    arrow-down
    2
    ·
    edit-2
    4 hours ago

    Ultimately, the policy legally anchors every single line of AI-generated code

    How would that even be possible? Given the state of things:

    https://dl.acm.org/doi/10.1145/3543507.3583199

    Our results suggest that […] three types of plagiarism widely exist in LMs beyond memorization, […] Given that a majority of LMs’ training data is scraped from the Web without informing content owners, their reiteration of words, phrases, and even core ideas from training sets into generated texts has ethical implications. Their patterns are likely to exacerbate as both the size of LMs and their training data increase, […] Plagiarized content can also contain individuals’ personal and sensitive information.

    https://www.theatlantic.com/technology/2026/01/ai-memorization-research/685552/

    Four popular large language models—OpenAI’s GPT, Anthropic’s Claude, Google’s Gemini, and xAI’s Grok—have stored large portions of some of the books they’ve been trained on, and can reproduce long excerpts from those books. […] This phenomenon has been called “memorization,” and AI companies have long denied that it happens on a large scale. […]The Stanford study proves that there are such copies in AI models, and it is just the latest of several studies to do so.

    https://www.twobirds.com/en/insights/2025/landmark-ruling-of-the-munich-regional-court-(gema-v-openai)-on-copyright-and-ai-training

    The court confirmed that training large language models will generally fall within the scope of application of the text and data mining barriers, […] the court found that the reproduction of the disputed song lyrics in the models does not constitute text and data mining, as text and data mining aims at the evaluation of information such as abstract syntactic regulations, common terms and semantic relationships, whereas the memorisation of the song lyrics at issue exceeds such an evaluation and is therefore not mere text and data mining

    https://www.sciencedirect.com/science/article/pii/S2949719123000213#b7

    In this work we explored the relationship between discourse quality and memorization for LLMs. We found that the models that consistently output the highest-quality text are also the ones that have the highest memorization rate.

    https://arxiv.org/abs/2601.02671

    recent work shows that substantial amounts of copyrighted text can be extracted from open-weight models. However, it remains an open question if similar extraction is feasible for production LLMs, given the safety measures […]. We investigate this question […] our work highlights that, even with model- and system-level safeguards, extraction of (in-copyright) training data remains a risk for production LLMs.

    How does merely tagging the apparently stolen content make it less problematic, given I’m guessing it still won’t have any attribution of the actual source (which for all we know, might often even be GPL incompatible)?

    But I’m not a lawyer, so I guess what do I know. But even from a non-legal angle, what is this road the Linux Foundation seems to embrace of just ignoring the license of projects? Why even have the kernel be GPL then, rather than CC0?

    I don’t get it. And the article calling this “pragmatism” seems absurd to me.

    • FauxLiving@lemmy.world
      link
      fedilink
      English
      arrow-up
      4
      arrow-down
      3
      ·
      edit-2
      53 minutes ago

      Given the research that you’ve done here I’m going to assume that you’re looking for an answer and not simply taking us on a gish gallop.

      Your premise, and what appears to be the primary source of confusion, is built on the idea that this is ‘stolen’ work which, from a legal point of view, is untrue. If you want to dig into why that is, look into the precedent setting case of Authors Guild, Inc. v. Google, Inc. (2015). The TL;DR is that training AI on copyrighted works falls under the Fair Use exemptions in copyright law. i.e. It is legal, not stealing.

      The case you linked from Munich shows that other country’s legal systems are interpreting AI training in the same way. Training AI isn’t about memorization and plagiarism of existing work, it’s using existing work to learn the underlying patterns.

      That isn’t to say that memorization doesn’t happen, but it is more of a point of interest to AI scientists that are working on understanding how AI represents knowledge internally than a point that lands in a courtrooom.

      We all memorize copyrighted data as part of our learning. You, too, can quote Disney movies or Stephen King novels if prompted in the right way. This doesn’t make any work you create automatically become plagarism, it just means that you have viewed copyrighted work as part of your learning process. In the same way, artists have the capability to create works which violate the copyright of others and they consumed copyrighted works as part of their learning process. These facts don’t taint all of their work, either morally or legally… only the output that literally violates copyright laws.

      The pragmatism here is recognizing that these tools exist and that people use them. The current legal landscape is such that the output of these tools is as if they were the output of the users. If an image generator generates a copyrighted image then the rightsholder can sue the person, not the software. If a code generator generates licensed code then the tool user is responsible.

      This is much like how we don’t restrict the usage of Photoshop despite the fact that it can be used to violate copyright. We, instead, put the burden on the person who operates the tool

      That’s what is happening here. Linus isn’t using his position to promote/enforce/encourage LLM use, nor is he using his position to prevent/restrict/disallow any AI use at all. He is recognizing that this is a tool that exists in the world in 2026 and that his project needs to have procedures that acknowledge this while also ensuring that a human is the one responsible for their submissions.

      This is the definition of pragmatism (def: action or policy dictated by consideration of the immediate practical consequences rather than by theory or dogma).

      e: precedent, not president (I’m blaming the AI/autocorrect on this one)

        • anarchiddy@lemmy.dbzer0.com
          link
          fedilink
          English
          arrow-up
          1
          arrow-down
          1
          ·
          2 hours ago

          The Linux Kernel is under a copyleft license - it isnt being copyrighted.

          But the policy being discussed isn’t allowing the use of copyrighted code - they’re simply requiring any code submitted by AI be tagged as such so that the human using the agent is ultimately responsible for any infringing code, instead of allowing that code go undisclosed (and even ‘certified’ by the dev submitting it even if they didnt write or review it themselves)

          Submissions are still subject to copyright law - the law just doesnt function the way you or OP are suggesting.

        • anarchiddy@lemmy.dbzer0.com
          link
          fedilink
          English
          arrow-up
          1
          ·
          3 hours ago

          LLMs themselves being products of copyright isnt the legal question at issue, it’s the downstream use of that product.

          If I use a copyright-infringing work as a part of a new creative work, does that new work infringe copyright by default? Or does the new work need to be judged itself as to the question of infringing a copyrighted work?

          And if it is judged as infringing, who is responsible for the damage done? Can I pass the damages back to the original infringing work? Or should I be held responsible for not performing due diligence?