In case you didn’t know, you can’t reliably train an AI on content generated by another AI, because it causes distortion that reduces the quality of the output. This phenomenon is known as model collapse (sometimes called AI collapse). It is also very difficult to filter AI text out of human text in a database.

So if people were to start using AI to generate comments and posts on Reddit, Reddit’s database would be less useful for training AI, and therefore the company wouldn’t be able to sell it for that purpose.
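For anyone who wants to see the effect in miniature, here is a toy sketch. It is not how language models are trained: the "model" is just a Gaussian fitted to a handful of samples, and each generation is trained on the previous generation's output. The function name `simulate_generations`, the `real_fraction` parameter, and all the default values are assumptions made up for this illustration.

```python
import random
import statistics


def simulate_generations(generations=200, samples_per_gen=10,
                         real_fraction=0.0, seed=0):
    """Refit a Gaussian to its own samples, generation after generation.

    real_fraction is the (hypothetical) share of each generation's
    training set that still comes from the original N(0, 1) data.
    Returns the fitted standard deviation after each generation.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0          # generation 0 matches the real data
    history = [sigma]
    n_real = round(samples_per_gen * real_fraction)
    n_synth = samples_per_gen - n_real
    for _ in range(generations):
        # Fresh "human" data drawn from the true distribution.
        data = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        # "AI" data drawn from the previous generation's fitted model.
        data += [rng.gauss(mu, sigma) for _ in range(n_synth)]
        # Fit the next model by maximum likelihood.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        history.append(sigma)
    return history


if __name__ == "__main__":
    pure = simulate_generations(real_fraction=0.0)
    mixed = simulate_generations(real_fraction=0.5)
    print("synthetic only, final std:", pure[-1])
    print("half real data, final std:", mixed[-1])
```

With only synthetic samples, the fitted spread tends to shrink from one generation to the next, so the model gradually loses the variety of the original data. Keeping some real data in every generation anchors the fit in this toy setting.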

  • FaceDeer
    24 months ago

    Which is why nobody trains on ONLY AI-generated data.

    Really, experts have thought of this stuff already. Because they’re experts. Synthetic data means that the amount of “real” data required is much less, so giant repositories like Reddit aren’t so important.

    • @Natanael@slrpnk.net
      14 months ago

      No, “much less” training data isn’t possible with synthetic data. That’s not what it’s there for. The experts would tell you as much if you asked them.