We’re witnessing a massive revolution in artificial intelligence. Tools like ChatGPT, Midjourney, Stable Diffusion, Dall-E and Copilot allow users to generate text, imagery, computer code and other content based solely on natural language prompts. The creativity, detail and human-like quality of the AI outputs are astonishing, leading to widespread excitement, fear, and ire. Thinking machines force us to question long-held assumptions about the supposed uniqueness of human ingenuity and ponder whether AI is bringing us closer to utopia, human obsolescence, or something in between. Thankfully, this is just a law blog, so we can leave these existential questions about the coming singularity for an episode of Black Mirror.
The legal implications of AI are varied, evolving, and complex, but the focus of this post is AI’s intersection with intellectual property—in particular, the IP and IP-adjacent issues raised by AI’s input—i.e., the content used to train AI. (A different set of issues relates to the IP protection, or lack thereof, afforded to the output of AI. We’ve previously blogged about the output issue, which is not the focus of this post.)
The fundamental question is this: do content creators have the right to authorize or block AI systems from collecting and using their content as training data? This question implicates not only complex issues of law but important matters of public policy. What rules around AI will optimally “promote the Progress of Science and useful Arts,” the prime directive of the U.S. Constitution’s copyright mandate?
These questions are at the center of two class action lawsuits brought against generative AI providers accused of ingesting content without permission. Both cases were filed in the United States District Court for the Northern District of California by the same lawyers (the Joseph Saveri Law Firm and Matthew Butterick). A third lawsuit brought by Getty Images in the United Kingdom raises similar claims.
By establishing the legal parameters of AI machine learning, these disputes have the potential to profoundly impact the acceleration, adoption, and quality of AI systems. Below is a summary of the three lawsuits followed by a discussion of some key legal issues—namely, copyright, web scraping, and copyright preemption—likely to be fiercely litigated in these cases.
The Three New AI Cases
1/ Doe v. GitHub, 22 Civ. 6823 (N.D. Cal. Nov. 10, 2022)
In the first action, a putative class of developers sued GitHub (a popular open-source code repository), OpenAI (owner of GPT-3, the language model behind ChatGPT) and Microsoft (GitHub’s owner and OpenAI’s lead investor). The action targets Copilot, a subscription-based AI tool co-developed by GitHub and OpenAI using GPT-3 that “promises to assist software coders by providing or filling in blocks of code using AI.” The developers claim that the defendants used copyright protected source code, which the developers posted to GitHub under various open-source licenses, as training data for Copilot.
The complaint asserts various causes of action, including violations of the “copyright management information” section of the Digital Millennium Copyright Act (for removing CMI from the source code), 17 U.S.C. § 1202; and breach of contract (for failing to provide attribution, copyright notices, and a copy of the source code’s open-source license). Interestingly, the complaint does not allege straight copyright infringement.
Based on the current schedule, the deadline for the defendants’ response—likely, a motion to dismiss—is January 26, 2023. The developers’ opposition is due March 9, 2023, and the reply is due April 6, 2023. We’ll keep you posted as this potentially landmark case develops.
2/ Andersen v. Stability AI et al., No. 23 Civ. 201 (N.D. Cal. Jan. 13, 2023)
The second case, brought by three visual artists, targets the providers of three generative AI art tools: Stability AI (the developer behind Stable Diffusion), Midjourney (a popular image generator) and DeviantArt (developer of the DreamUp app). Similar to the GitHub case, the artists allege that the defendants used the artists’ works without permission to train AI systems, which then use that machine learning to generate new, and allegedly infringing, derivative works. Unlike the GitHub case, the complaint against the AI art generators asserts claims for copyright infringement (under both direct and vicarious theories of liability). Interestingly, the copyright claims take aim at the ability of certain tools to create art “in the style of” a particular artist. Defendants have not yet appeared in this brand-new case.
3/ Getty Images v. Stability AI
On January 17, 2023, stock image licensor Getty Images announced that it was filing its own lawsuit against Stability AI in the United Kingdom. At the time of publication, the complaint was not yet available. But in a press statement, Getty Images argued that “Stability AI unlawfully copied and processed millions of images protected by copyright and the associated metadata owned or represented by Getty Images absent a license to benefit Stability AI’s commercial interests and to the detriment of the content creators.” Getty further stated that it offers licenses “related to training artificial intelligence systems in a manner that respects personal and intellectual property rights” but that, instead of obtaining one, Stability AI pursued its own “stand-alone commercial interests.” Getty Images suggested the complaint would include claims for copyright infringement and unlawful web scraping.
Copyright and Fair Use
Copyright will be central to all three cases, even if it is not a direct claim in the GitHub case.
In the United States, copyright subsists in original works of authorship fixed in a tangible medium of expression. This covers all the content at issue in the three lawsuits, including visual designs, photographs, and source code. Putting aside contractual or other rights that may protect these works (more on this below), the extent to which AI generators may use copyright protected works without permission likely will come down to a question of fair use. The determination of that question almost certainly will involve an examination of three landmark fair use cases. Each has obvious import for the issue of unauthorized machine learning.
a/ The Authors Guild v. HathiTrust (2d Cir. 2014) and Google (2d Cir. 2015)
In two cases, an authors’ rights organization and a group of individual authors sued Google and several research libraries for copyright infringement after they scanned and indexed millions of copyright protected books for the purpose of making the books searchable online. The Second Circuit found fair use in both cases. It did not matter that millions of copyright protected works were used without the permission of their authors or that Google sought to monetize these works. In HathiTrust, the court held that “the creation of a full-text searchable database is a quintessentially transformative use” because it does not “supersede the objects or purposes of the original creation,” but rather adds “something new with a different purpose and a different character.” In the case against Google, the court held that “Google’s making of a digital copy to provide a search function . . . augments public knowledge by making available information about [p]laintiffs’ books without providing the public with a substantial substitute for matter protected by the [p]laintiffs’ copyright interests in the original works or derivatives of them.” The HathiTrust court rejected the authors’ argument that the libraries’ unauthorized use of their books deprived the authors of a licensing opportunity, holding that “the full-text search function does not serve as a substitute for the books that are being searched,” so it was “irrelevant that the Libraries might be willing to purchase licenses in order to engage in their transformative use (if the use were deemed unfair).”
(Disclosure: Our firm represented The Authors Guild in these cases.)
b/ Google v. Oracle (U.S. 2021)
In this case, Oracle, which owns the copyright in the Java programming language, sued Google for copyright infringement based upon Google’s unauthorized use of roughly 11,500 lines of code from Java SE, which were part of an application programming interface (API), to build Google’s Android mobile operating system. The lawsuit considered whether the API code was subject to copyright protection and whether, if it was, Google’s use constituted fair use. The Supreme Court held that, even assuming that the API code is copyrightable (the Court did not answer the question), Google’s use of the API code was fair use. The Court looked closely at the nature of the API code, finding that its purpose—allowing programmers to access other code—distinguished it from other more “expressive” code and favored fair use. The Court also found that Google’s use of the API code to reimplement a user interface was transformative because it would further the development of computer programs, thereby fulfilling copyright’s prime directive.
c/ Andy Warhol v. Goldsmith (2d Cir. 2021), cert granted
In this case, the Supreme Court is considering the application and continued viability of fair use’s “transformative use” test, which the Supreme Court established nearly thirty years ago in Campbell v. Acuff-Rose. In 1981, photographer Lynn Goldsmith took a photograph of Prince. Andy Warhol created a series of silkscreen prints and illustrations based on the photograph. Warhol’s works made some visual changes to the photograph, but they remained “recognizably derived” from the original. Goldsmith sued the Andy Warhol Foundation for copyright infringement. Reversing the district court, the Second Circuit held that the Warhol series was not sufficiently transformative to constitute fair use. The Foundation appealed, and the Supreme Court agreed to hear the case. Both the factual context of the case—an artistic take on a pre-existing work—and the broader doctrinal question—the delineation or replacement of the transformative use test—could have significant implications for cases considering the inputs and outputs of AI systems.
“Need more input.” - Johnny 5, Short Circuit
AI has a rapacious appetite for information. But where and how do AI systems get their treasured input? Do AI systems need permission from content owners to collect and use their content as training data? Should they need permission?
Fortunately, there is a mountain of helpful precedent in the realm of web scraping—the act of harvesting content from web sites for use in third-party applications. Unfortunately, the law governing web scraping is nuanced and fact intensive, implicating numerous overlapping legal doctrines. Complaints against web scrapers frequently include claims under the Computer Fraud and Abuse Act (“CFAA”) (for gaining unauthorized access to a computer system), breach of contract (for violating terms of service), trespass to chattels (for entering virtual property without permission), copyright infringement (for reproducing protected content), and unfair trade practices.
Outcomes in web scraping cases are mixed. As a general matter, web scraping has been a critical component, and unavoidable reality, of web development since the beginning of the Internet. Most platforms tolerate, and sometimes even embrace, web scraping, and typically have taken legal action only when scrapers engage in abusive conduct by circumventing technical controls, over-taxing a platform’s servers, or using a platform’s own data to compete directly against it.
The common law rules that have emerged around web scraping provide useful guidance in assessing the boundaries of AI content ingestion, and the cases that established those rules will undoubtedly play an important role in the recently filed AI cases. But the more interesting and vexing question is one of policy: Will we as a society benefit more by restricting or maximizing the public data sources available to AI systems? Should publicly available information be open for AI consumption because it will make the tools that much better, or should folks be able to control whether their data is ingested notwithstanding the potential disservice to the public interest?
In a recent ruling rejecting CFAA claims that LinkedIn brought against a web scraping competitor, the Ninth Circuit recognized this important policy question:
We agree with the district court that giving companies like LinkedIn free rein to decide, on any basis, who can collect and use data—data that the companies do not own, that they otherwise make publicly available to viewers, and that the companies themselves collect and use—risks the possible creation of information monopolies that would disserve the public interest.
HiQ Labs, Inc. v. LinkedIn Corp. (9th Cir. 2019). While this decision related only to the CFAA claim on a motion for preliminary injunction (the breach of contract claims remained intact and ultimately were successful), the court’s acknowledgement of the public interest in “maximizing the free flow of information on the Internet” is nonetheless noteworthy. In a case involving web scraping, not copyright, the Ninth Circuit echoed the Constitutional directive to “promote the Progress of Science and useful Arts,” aligned with the reasoning of the fair use decisions discussed above, and provided a guiding principle for courts adjudicating disputes involving machine learning.
Copyright Preemption
A doctrine less sexy than fair use, but no less significant to the issues raised by unauthorized AI machine learning, is copyright preemption. As suggested by the web scraping cases discussed above, copyright is not the only tool available to protect content. Online content is frequently published on websites governed by terms of service, and those terms typically purport to restrict commercial use and web scraping of their content. When someone collects and uses content in violation of those terms, the argument goes, it is a breach of contract, even if the use would not constitute copyright infringement (because the use is licensed or fair use, for example). Copyright preemption is often an important defense to that claim.
Section 301(a) of the Copyright Act provides: “all legal or equitable rights that are equivalent to any of the exclusive rights within the general scope of copyright . . . are governed exclusively by this title.” Based on this statute, courts have frequently dismissed unjust enrichment, right of publicity, and other state law claims as preempted by the Copyright Act. If a claim is held preempted, the defendant then may assert the panoply of defenses available under the Copyright Act, including express or implied license, lack of copyrightability and fair use.
Are claims for breach of contract based upon use of content in violation of terms of service preempted by the Copyright Act? Sometimes yes, sometimes no. Courts have struggled with this question. In many cases, including in the Second Circuit, the outcome turns on whether the breach claim is found to include any “extra elements that make it qualitatively different from a copyright infringement claim.” Briarpatch Ltd., L.P. v. Phoenix Pictures, Inc., 373 F.3d 296 (2d Cir. 2004). For example, a contractual promise to pay has been found to constitute an “extra element” that precludes copyright preemption.
The Second Circuit found no such “extra element” in ML Genius Holdings LLC v. Google LLC (2d Cir. 2022). In that case, song lyrics website Genius.com sued Google for scraping and displaying its lyrics in search results, allegedly in violation of Genius’ terms of service. Genius did not (and could not) sue for copyright infringement because Genius does not hold the copyright to the lyrics. However, Genius’ terms of service prohibit users from commercially reproducing or distributing any portion of the content posted to Genius. Genius sued Google for breaching these terms, but the Second Circuit held that the claim was preempted, finding the breach claim was not “qualitatively different from a copyright claim.”
Genius petitioned the Supreme Court to review the decision, arguing that “[t]he circuits are intractably split” on the Copyright Act’s preemption of breach-of-contract claims. In response, Google argued that Genius is improperly attempting to use its terms of service to “invent new rights” that are equivalent to, and preempted by, copyright. On December 12, 2022, the Supreme Court invited the U.S. Solicitor General to file a brief sharing its views on the case, signaling a possible grant of certiorari. We blogged about this case here.
Depending on the outcome of the anticipated fair use defenses, Genius (especially if SCOTUS grants cert) and other copyright preemption cases may play a pivotal role in determining the viability of state law claims arising from the collection and use of publicly available data for AI machine learning.
This is all happening extremely fast. A few years ago, AI-generated works were strictly the domain of cutting-edge artists, academics, and copyright scholars. Suddenly, in the last few months, AI tools have become ubiquitous—not just for experimentation and media hype, but for commercial use, education, and beyond. The law always struggles to keep up with technology, and AI is no exception. When I started drafting this post, I wrote in hypotheticals—i.e., “it is only a matter of time before artists sue one of these tools for copyright infringement.” Mid-post, news of the Stable Diffusion case hit. Then Getty Images announced its lawsuit. Like AI, the questions were no longer academic.
Given the speed of AI’s development, AI’s never-ending need for “more input,” and the profound impact that AI already has had, and increasingly will have, on the creative industry, I have no doubt that this is just the beginning. While new technology brings a certain level of uncertainty, we are fortunate to have centuries of precedent to draw upon to help us navigate the coming onslaught. Hopefully, in establishing the rules of the road for AI content ingestion, policymakers and courts will consider the impact their decisions will have not just on the private parties before them, but on society’s overriding interest in the “Progress of Science and useful Arts.”