Facebook parent company Meta is currently fighting a class action lawsuit alleging copyright infringement and unfair competition, among other claims, with regard to how it trained LLaMA. According to an X (formerly Twitter) post by vx-underground, court records reveal that the social media company torrented 81.7TB of pirated data from shadow libraries including Anna’s Archive, Z-Library, and LibGen. It then used this data to train its AI models.
The evidence, in the form of written communications, shows that researchers were concerned about Meta’s use of pirated materials. As far back as October 2022, one senior AI researcher said, “I don’t think we should use pirated material. I really need to draw a line here.” Another said, “Using pirated material should be beyond our ethical threshold,” and added, “SciHub, ResearchGate, LibGen are basically like PirateBay or something like that, they are distributing content that is protected by copyright and they’re infringing it.”
Then, in January 2023, Mark Zuckerberg himself attended a meeting where he said, “We need to move this stuff forward… we need to find a way to unblock all this.” Some three months later, a Meta employee messaged a colleague to say they were concerned about Meta IP addresses being used “to load through pirate content.” They added, “torrenting from a corporate laptop doesn’t feel right,” followed by a laughing-out-loud emoji.
Aside from those messages, documents also revealed that the company took steps to keep its own infrastructure out of these downloading and seeding operations so that the activity wouldn’t be traced back to Meta. The court documents say that this constitutes evidence of Meta’s unlawful activity, as the company appears to have taken deliberate steps to circumvent copyright laws.
However, this isn’t the first time a company training AI models has been accused of stealing information off the internet. OpenAI was sued by novelists as far back as June 2023 for using their books to train its large language models, with The New York Times following suit in December. Nvidia has also been on the receiving end of a lawsuit filed by writers for using 196,640 books to train its NeMo model, which has since been taken down. A former Nvidia employee blew the whistle on the company in August of last year, saying that it scraped more than 426 thousand hours of videos daily for use in AI training. More recently, OpenAI has been investigating whether DeepSeek illegally obtained data from ChatGPT, which just shows how ironic things can get.
The case against Meta is still ongoing, so we will have to wait until the court releases its decision to know whether the company committed direct infringement. And even if the writers win this case, Meta, with its huge financial war chest, will likely appeal the decision, meaning we will have to wait several months, if not years, to see the final court judgment.