• Zier@fedia.io
    3 months ago

    I guess we know what he did with all the Government database information he copied.

  • vrek@programming.dev
    3 months ago

    I mean, Elon Musk is an asshole, but is this really an issue? There were the yellow pages, which technically doxxed everyone…

    • AnarchistArtificer@slrpnk.net
      3 months ago

      Leaking people’s personally identifiable information (PII) is harmful in general, even if this particular instance of leakage weren’t itself harmful.

      When proponents of AI respond to the argument from creatives that training Generative AI involves stealing creative works, they often assert that the training method means the original works are not contained within the end model, and that the process is analogous to how humans learn. In a technical sense, I do agree with this characterisation of training as a sort of informational distillation. However, there appear to be instances where an unreasonable amount of the original work is still retained in the final model. An analogy I’d draw here: in determining whether a derivative work that draws on an existing one is fair use, one of the factors is how much of the original work is contained within the derivative, and in what context. If a model is able to regurgitate data that it was trained on, then morally speaking, it’s harder to justify this as fair use (I say “morally” because I’m drawing on the ethical theme of fair use rather than using it in its strict legal sense). Of course, the question here isn’t about the theft of art or other copyright concerns, but considering that separate problem is useful for understanding why this leakage is problematic.

      One of the big problems with AI, whether we’re talking about training on creative works or the leakage of PII, is that these models are incredibly opaque. It is exceptionally hard, if not impossible, to determine what data from the training set has been preserved in the final model — I don’t even know whether the AI companies themselves are able to glean that information. These models are so complex, and are trained on such unfathomable amounts of data, that we see more and more instances of inappropriate levels of reproduction of the training data.
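      To make the memorization point concrete, here’s a toy sketch (purely illustrative, with made-up names and data — not any vendor’s actual tooling) of the principle behind extraction audits: compare model output against a known training record and flag long word n-grams that match verbatim. A long shared n-gram is strong evidence the record was reproduced rather than paraphrased.

```python
# Toy memorization check: flag verbatim word n-grams shared between a
# model's output and a known training record. All data here is invented.

def ngram_overlap(output: str, training_text: str, n: int = 6) -> set[str]:
    """Return word n-grams that appear verbatim in both texts."""
    def ngrams(text: str) -> set[str]:
        words = text.lower().split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
    return ngrams(output) & ngrams(training_text)

# Hypothetical PII record that was (in this scenario) in the training data:
training_record = (
    "John Doe, 123 Elm Street, Springfield, phone 555-0123, "
    "on file since 1998"
)
# Hypothetical model output that regurgitates part of that record:
model_output = (
    "Sure! John Doe, 123 Elm Street, Springfield, phone 555-0123 "
    "is a registered resident."
)

leaked = ngram_overlap(model_output, training_record, n=6)
print(bool(leaked))  # prints True: a 6-word run was reproduced verbatim
```

      Real audits do this at scale against huge corpora (and with fuzzier matching), but even this toy version shows why the problem is detectable only from the outside: you need the training data in hand to check, which is exactly what outsiders don’t have.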

      The key questions are:

      • If the model can reproduce this, are there more harmful things that could plausibly be retrievable via the AI? (Given that we have been seeing models trained on extremely sensitive medical or legal data, the answer is “almost certainly”);
      • How can we know what PII or other sensitive data may have been contained in the training data? I.e., how do we gauge the severity of the risk of sensitive material being reproduced? (Certainly we can’t, and I’m doubtful that even the engineers behind the models could effectively answer this.)
      • If we know for certain that sensitive materials have been included in the training data, how do we stop (or reduce the likelihood of) that data being reproduced? Is it even possible to train a general-purpose AI on sensitive data without significant risk of said data being reproduced? (Speaking as someone who has done a lot of nitty-gritty data work and coding with machine-learning systems, and who tries to keep up with the literature: to my knowledge, we can’t, and we might never be able to.)

      I consider this leakage of PII to be pretty serious already, but it’s just one example of why people are so concerned about these systems being rolled out the way they have been. This particular instance barely scratches the surface of a much wider and deeper problem.

    • onnekas@sopuli.xyz
      3 months ago

      I still find it crazy that those books existed in the first place. When I was growing up, all you needed was someone’s name: you could look it up in the yellow pages and get their phone number and address.

      However, where I lived it was possible to opt out of this.

      • NormalOnNSFW@lemmynsfw.com
        3 months ago

        Back in the AOL days, the first iterations of Google had a built-in white-pages lookup for everyone: if you put in a landline phone number, you’d get the owner’s name and address. One of my first experiences on the internet as a kid was talking people from AOL chatrooms into sending me their phone number, googling it, and sending back their name and address with some nonsense about being from the FBI. It really freaked people out.