Some Context for AI Developers

Disclaimer: This post is not intended to be legal advice from the author.

AI developers focus on different types of work. Some focus on model creation; some optimize AI models; some focus on prompt engineering. Everyone needs lots and lots of data. With the flurry of lawsuits filed in the past year or so, developers have wondered, “Am I doing something that could get me or my company sued?” Companies at the forefront of the AI boom, such as OpenAI and Midjourney, have been sued in multiple jurisdictions on claims including copyright infringement, trademark infringement, and unfair competition. This article provides some legal context for developers seeking clarity about where the pitfalls lie in current AI development.

Image-based AI generation

Image-based AI generators were sued in 2023 on a variety of claims. For example, Getty Images has sued Stability AI for the following causes of action [1]:

  1. Copyright infringement

  2. Violations of the Digital Millennium Copyright Act (“DMCA”)

  3. Trademark infringement under federal law

  4. Unfair competition under federal and state laws

  5. Trademark dilution under federal and state laws

Stability AI, Midjourney, and DeviantArt were sued by individual plaintiffs for the following causes of action [2]:

  1. Copyright infringement, direct and vicarious

  2. Violations of the Digital Millennium Copyright Act

  3. Violations of rights of publicity

  4. Unfair competition under federal and state laws

The copyright infringement claims focus on the defendants’ use, reproduction, and storage of copyrighted images to train AI-based image generation products. Removing the copyright management information (“CMI”) from the copyrighted images, or causing the output images to omit such CMI, was alleged to violate the DMCA. With respect to direct copyright infringement, Getty has alleged that the outputs of the defendants’ AI tools are essentially copies of Getty stock images [3]. One of the biggest hurdles for Getty will be showing the frequency of copying and the number of “copies” of images generated by the defendants’ AI tools, which are most interesting for their ability to generate image composites. That is, the most interesting AI-generated images are rarely obvious copies of copyrighted images [4]. They are seemingly whimsical combinations of images, styles, backgrounds, colors, and contraptions, limited only by the imagination of the prompter. Finally, Getty will need to get past the latest jurisprudence expanding the fair use doctrine [5]. AI models are trained on so many images that most resulting image composites may not meet the “substantial similarity” test required under copyright law.

If the copyright-focused claims fail, plaintiffs must rely on unfair competition, which hinges on whether a defendant sought to gain advantage over rivals through misleading, deceptive, dishonest, or fraudulent conduct in trade or commerce. Central to this question will be whether posting images online provides implicit permission for third parties to scrape those images in order to train AI models. Reasonable minds may disagree, but the answer will determine where AI developers can go for “clean” image datasets to train their AI models.

Text-based AI generation

In 2022, Microsoft, GitHub, and OpenAI were sued in a class action that has garnered much press attention [6]. The causes of action are as follows:

  1. Violations of the Digital Millennium Copyright Act

  2. Breach of contract

  3. Tortious interference in a contractual relationship

  4. Fraud

  5. Trademark violations

  6. Unjust enrichment

  7. Unfair competition

  8. Violations of the California Consumer Privacy Act (“CCPA”)

  9. Negligent handling of personal data

Here, the Copilot product, developed by GitHub, OpenAI, and Microsoft, assists coders by autocompleting certain source code (suggesting individual lines of code and whole functions). GitHub claims that Copilot was trained on public code. The crux of the complaint is that OpenAI took code stored on GitHub and used it as training data to build Copilot and Codex. The code was subject to certain licensing and attribution requirements that were allegedly ignored when used in the training data.

Interestingly, the plaintiffs did not assert claims for copyright infringement. This makes sense, as the plaintiffs’ code was released under open source licenses, which generally do not restrict how the code can be used. Thus, plaintiffs could not now claim that using their code as training data constituted copyright infringement. Instead, plaintiffs focused on the fact that derivative works of their code required attribution, but no such attribution was provided by Copilot.

Recently, the court dismissed the claims based on the CCPA, tortious interference, false designation of origin, fraud, breach of the terms of service, unfair competition, negligence, and civil conspiracy. The court allowed certain claims concerning the DMCA and breach of the license agreements to proceed. Ultimately, the remaining claims appear to focus on issues like copyright attribution and the permissibility of using third-party data to train AI models absent explicit permission from those third parties.

Takeaways

A number of sources have provided advice to AI developers. For example, Meta has provided the following instruction to developers in connection with its release of Llama 2 [7].

Developing downstream applications of LLMs begins with taking steps to consider the potential limitations, privacy implications, and representativeness of data for a specific use case. Begin by preparing and preprocessing a clean dataset that is representative of the target domain.

This raises the question: Where do I get a clean dataset? Companies seem to be getting sued for using datasets that are “unclean” due to the inclusion of personal data, the removal of CMI, and/or the omission of open source attributions. At bottom, the “cleanest” datasets are those specifically licensed for use in AI training or development. You should ask questions like the following:

  • Who compiled this dataset?

  • What are the terms of use for this dataset? Is it exclusively for research purposes?

  • Is the dataset licensed by your company, and if yes, what are the restrictions and/or requirements?

  • For what purpose was the dataset compiled? Is there a geographic limitation on the data? (For example, personal data of individuals in the EU receives special protection regardless of where the data ends up.)

  • Are there indications of ownership in the dataset, such as watermarks, digital fingerprints, metadata, or other technological protections, and if so, what is their legal import?

As a developer, you should understand the provenance of the datasets you are using. Data that is scraped from the web bears the risk of future lawsuits. Whether you or your company can bear such risk depends on the robustness of your resources.
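One way to act on the questions above is to keep a provenance record alongside each dataset your team uses and flag anything that warrants a conversation with counsel before training begins. The following is a minimal sketch in Python; every field name, threshold, and example value here is hypothetical and illustrative only, not a legal standard, and an automated check like this is no substitute for attorney review.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetProvenance:
    """Illustrative record of where a training dataset came from and how it may be used."""
    name: str
    compiler: str                   # who compiled the dataset
    license_terms: str              # e.g. "research-only", "licensed for AI training"
    research_only: bool = False     # exclusively for research purposes?
    licensed_to_company: bool = False  # does your company hold a license?
    contains_eu_personal_data: bool = False  # special protections apply
    ownership_markers: list = field(default_factory=list)  # watermarks, CMI, metadata

    def flags(self) -> list:
        """Return issues that should be escalated before the dataset is used."""
        issues = []
        if self.research_only:
            issues.append("research-only terms: commercial training may exceed the license")
        if not self.licensed_to_company:
            issues.append("no license on file for this dataset")
        if self.contains_eu_personal_data:
            issues.append("EU personal data present: protections apply regardless of location")
        if self.ownership_markers:
            issues.append("ownership markers present: " + ", ".join(self.ownership_markers))
        return issues

# Example: a scraped dataset with no license and visible ownership markers
scraped = DatasetProvenance(
    name="web-images-2023",
    compiler="unknown (web scrape)",
    license_terms="none",
    ownership_markers=["watermarks", "CMI in EXIF metadata"],
)
for issue in scraped.flags():
    print("-", issue)
```

The point of the sketch is the habit, not the code: forcing each dataset through an explicit checklist makes gaps in provenance (unknown compiler, missing license, intact CMI) visible before they become litigation exposure.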

In addition to paying attention to the cleanliness of your datasets, you should try to obtain copyright registrations where possible, and as early as possible [8]. At minimum, this provides objective evidence of your ownership of your AI model(s), which could be important in the event you are sued. You should consult an attorney regularly as the law continues to evolve. For example, the Copyright Office recently made clear that although AI models cannot be considered authors of a copyrighted work, AI-generated content in a work must be disclosed in a copyright application. This recent Q&A with the Copyright Office provides a useful example [9]:

Question: … Caitlyn wrote the text, and the images that illustrate her story were generated by AI. How should Caitlyn complete the (copyright) application?
Answer: Caitlyn can register her original text, but the AI-generated artwork needs to be disclaimed.

In general, we would advise consulting an attorney about intellectual property issues (from permission to use third-party datasets to ownership of resulting work product), contract terms and limitations, and risk mitigation steps. Good luck as you continue your development work [10]!

Acknowledgements: Many thanks to the following individuals for contributing to this post: Aleksandra Podgorska, Ana Sofia Vazquez, Caroline Vazquez, Elodie Migliore, Marine Lipartia, Tammy Zhu.


References

[1] Getty Images (US), Inc. v. Stability AI, Inc., No. 1:23-cv-00135 (D. Del.)

[2] Andersen v. Stability AI Ltd., No. 3:23-cv-00201 (N.D. Cal.)

[3] “Stable Diffusion at times produces images that are highly similar to and derivative of the Getty Images proprietary content that Stability AI copied extensively in the course of training the model. Indeed, independent researchers have observed that Stable Diffusion sometimes memorizes and regenerates specific images that were used to train the model.” Complaint at para. 51.

[4] See also, Downing, Kate, “Battle of the AI Analogies,” June 21, 2023 post located at https://katedowninglaw.com/2023/06/21/battle-of-the-ai-analogies (distinguishing AI models from photocopiers or VCRs and explaining that “the vast majority of the output is not, in fact, a copy of the original, meaning that it’s even easier to find noninfringing uses with respect to ML/AI models than with any other piece of technology designed to create near exact copies”).

[5] See, e.g., Google LLC v. Oracle America, Inc., 593 U.S. ___ (2021), located at https://www.supremecourt.gov/opinions/20pdf/18-956_d18f.pdf ; see also Myers, Gary, “Muddy Waters: Fair Use Implications of Google LLC v Oracle America, Inc.”, Northwestern Journal of Technology and Intellectual Property, Vol. 19, Issue 2, located at https://scholarlycommons.law.northwestern.edu/cgi/viewcontent.cgi?article=1353&context=njtip (“It does appear that copyright protection for works that have a functional element will be narrower and more readily constrained under the fair use analysis”).

[6] Doe 1, et al. v. GitHub, Inc., Microsoft (owner of GitHub), and OpenAI, et al., No. 4:22-cv-06823 (N.D. Cal.).

[7] Llama 2 Responsible Use Guide, located at https://scontent-lax3-1.xx.fbcdn.net/v/t39.8562-6/365271716_1020127512503290_6433760642443597145_n.pdf?_nc_cat=102&ccb=1-7&_nc_sid=ae5e01&_nc_ohc=TnsYhTfqE28AX8MtpD-&_nc_ht=scontent-lax3-1.xx&oh=00_AfDKehVlbAX5Kt0SFXYeAOaz-xZTk07sKAl5EJgOpw7lDQ&oe=64D3B33A.

[8] Copyright laws provide a full range of remedies against infringers, including injunctive relief, destruction of infringing articles, and recovery of damages and profits, as well as recovery of statutory damages, court costs, and attorneys’ fees. A copyright owner’s failure to register copyrights on time will bar recovery of statutory damages and attorneys’ fees. See Southern Credentialing Support Servs., LLC v. Hammond Surgical Hosp., LLC, 2020 WL 104342 (5th Cir. Jan. 9, 2020) (barring statutory damages for post-registration infringements).

[9] See Transcript of June 28, 2023 webinar titled “Registration Guidance for Works Containing AI-generated Content”, located at https://www.copyright.gov/events/ai-application-process/Registration-of-Works-with-AI-Transcript.pdf.

[10] For further advice, see Zhu, Tammy, “What to Negotiate Before Deploying AI Coding Assistants at Work,” located at https://news.bloomberglaw.com/us-law-week/what-to-negotiate-before-deploying-ai-coding-assistants-at-work.