• Aceticon@lemmy.dbzer0.com · 11 hours ago

    It’s even simpler than that: using an LLM to write a C compiler is the same as downloading an existing open source C compiler from the Internet, but with extra steps. The LLM was fed that very code during training and is just reassembling it, with extra bugs added - plagiarism hidden behind an automated text-parrot interface.

    A human can beat the LLM at that by simply finding and downloading an implementation of that more-than-solved problem from the Internet, which at worst would take maybe an hour.

    The LLM can “solve” simple and well-defined problems because it’s basically plagiarizing existing code that solves those problems.

    • MagicShel@lemmy.zip · 10 hours ago

      Hey, so I started this comment to disagree with you and correct some common misunderstandings I’ve been fighting against for years. Instead, as I was formulating my response, I realized you’re substantially right and I’ve been wrong, or at least my thinking was incomplete. I figured I’d mention it, because the common perception is that arguing with strangers on the internet never accomplishes anything.

      LLMs are not fundamentally the plagiarism machines everyone claims they are. If a model reproduces any substantial text verbatim, it’s because the model was overtrained on too small a data set, and the solution is, somewhat paradoxically, to feed it more relevant text. That has been the crux of my argument for years.
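
      To make that concrete, here’s a rough sketch of the kind of verbatim-reproduction check I have in mind (Python; the training corpus, the output string, and the span threshold are all made up for illustration). An overtrained model would light this up; a well-trained one mostly wouldn’t:

          def longest_verbatim_span(sample, corpus, min_words=4):
              """Longest run of words from `sample` found verbatim in `corpus`."""
              words = sample.split()
              best = ""
              for i in range(len(words)):
                  for j in range(i + min_words, len(words) + 1):
                      span = " ".join(words[i:j])
                      if span in corpus and len(span) > len(best):
                          best = span
              return best

          # Hypothetical training text and model output, for illustration only.
          training_corpus = "int main(void) { return 0; } /* ... */"
          model_output = "int main(void) { return 0; }"
          print(longest_verbatim_span(model_output, training_corpus) or "no long overlap")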

      That being said, Anthropic and OpenAI don’t ship bare LLMs. Their models are backed by RAG pipelines, which insert verbatim text into the context when it’s relevant to the task at hand. That fact had been escaping my consideration until now. Thank you.
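
      For anyone unfamiliar, a RAG pipeline is roughly this toy sketch (the word-overlap scoring stands in for a real embedding search, and the documents are invented):

          # Toy retrieval-augmented generation: fetch the most relevant stored
          # passage and paste it verbatim into the prompt the LLM sees.
          documents = [
              "A recursive-descent parser handles one grammar rule per function.",
              "Register allocation maps temporaries onto a finite register set.",
          ]

          def score(query, doc):
              # Crude relevance: count shared lowercase words.
              return len(set(query.lower().split()) & set(doc.lower().split()))

          def build_prompt(query, k=1):
              top = sorted(documents, key=lambda d: score(query, d), reverse=True)[:k]
              context = "\n".join(top)  # verbatim text, inserted as-is
              return f"Context:\n{context}\n\nTask: {query}"

          print(build_prompt("How does a recursive-descent parser work?"))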

      • Aceticon@lemmy.dbzer0.com · 7 hours ago

        Even the LLM part might be considered Plagiarism.

        Basically, unlike a human, it cannot assemble an output from logical principles (i.e. build a logical model of the flows in a piece of code and then translate it to code); it can only produce text from an N-dimensional space of probabilities derived from the works of others it has “read” (i.e. that were fed to it during training).
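
        To illustrate what I mean, here’s a deliberately tiny sketch (Python, with an invented bigram table standing in for the real learned weights, which condition on the whole context rather than one token):

            import random

            # Hypothetical learned probabilities: P(next token | current token).
            bigram = {
                "int": {"main": 0.9, "x": 0.1},
                "main": {"(": 1.0},
                "(": {")": 1.0},
            }

            def generate(token, steps):
                # Generation is just repeated sampling from a distribution.
                out = [token]
                for _ in range(steps):
                    dist = bigram.get(out[-1])
                    if not dist:
                        break
                    out.append(random.choices(list(dist), weights=list(dist.values()))[0])
                return out

            print(" ".join(generate("int", 3)))  # e.g. "int main ( )"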

        That text assembly could be the machine equivalent of Inspiration (much like how most programmers include elements they’ve seen in others’ code), but it could also be Plagiarism.

        Ultimately it boils down to where the boundary between Inspiration and Plagiarism stands.

        As I see it, if for specific tasks there is overwhelming dominance of trained weights derived from a handful of works (which one would expect to be the case for a C compiler coded in Rust), then that sits a lot closer to the Plagiarism side than the Inspiration side.

        Granted, it’s not the verbatim copying of an entire codebase that would legally be deemed Plagiarism, but if the output is almost entirely a montage of pieces from a handful of codebases, could it not be considered a variant of Plagiarism - one that is incredibly hard for a human to pull off, but not for an automated system?

        Note that the LLM obviously has no “intention to copy”, since it has no will or cognition at all. What I’m saying is that the people who made it have intentionally built an automated system that copies elements of existing works. Normally that system assembles its results from very small textual elements (the same way a person who has learned how letters and words work can create a unique work from letters and words). But its makers know that in some situations it will produce output based on so few sources that, even though it assembles the output token by token, it is effectively copying whole blocks from those sources, the same as a human manually copying text from one document into another.

        In summary, IMHO LLMs don’t always plagiarize, but they can when the number of sources that ended up creating the volume of the N-dimensional probabilistic space the LLM is following for that output is very low.
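
        If one wanted to test that boundary empirically, a crude sketch might look like this (the “sources” and the output here are invented; a real test would run over actual training data):

            # For each n-gram of an output, count how many distinct sources
            # contain it verbatim. Spans traceable to one source lean towards
            # Plagiarism; spans common to many lean towards shared idiom.
            sources = {
                "tinycc": "while (tok != EOF) next_token();",
                "lcc": "while (tok != EOF) emit(parse_stmt());",
                "blogpost": "every parser loops while (tok != EOF) in some form",
            }

            def source_counts(output, n=4):
                words = output.split()
                grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
                return [(g, sum(g in text for text in sources.values())) for g in grams]

            for gram, count in source_counts("while (tok != EOF) next_token();"):
                print(f"{count} source(s): {gram}")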

        • MagicShel@lemmy.zip · 7 hours ago

          I agree with you on a technical level. I still think LLMs are transformative of the original text, and if, as you say,

              the number of sources that ended up creating the volume of the N-dimensional probabilistic space the LLM is following for that output is very low,

          then the solution is to feed it even more relevant data. But I appreciate your perspective. I still disagree, but I respect your point of view.

          I’ll give what you’ve written some more thought and maybe respond in greater depth later, but I’m getting pulled away. Just wanted to say thanks for the detailed and thorough response.