Compliance and Alignment: Ensuring Generative AI Stays Within the Bounds of Fair Use
Written by Lucas Coughlin
Depicted as a graph, the historical rate of human information transfer begins to climb following the printing press, grows steeper in the 1800s, takes a sharp rise around the 1950s, and ends with a huge leap at the dawn of the internet. [1] U.S. copyright law, having been codified in the era before the most recent information transfer explosion, is constantly tested by unforeseen modes of sharing, reusing, and repurposing information. [2]
The advent of generative AI poses yet another test of U.S. copyright law’s interpretation. Though the balance between AI’s immense promise and the threat it poses to artists reflects the motivating concerns of copyright law, the technology’s novelty and power complicate any analogy to earlier cases.
This post will evaluate whether AI content generation fits under the current interpretation of Fair Use. Though the highly variable output of these programs prevents an absolute conclusion, I argue that publicly available AI content generation systems run a significant risk of copyright violation. Despite this risk, practical considerations may prevent effective legal regulation of these platforms. A better alternative is a self-imposed regulatory scheme that combines legal compliance with the more theoretical discipline of AI alignment. The technology, incentive, and cultural norms for this kind of self-regulation already exist, and its implementation could help preserve a continued role for human artists.
Technical Background
Most AI content generation platforms are neural networks. [3] These networks convert images, text, or other data into sets of numbers, pass those sets through layers of weighted functions (whose adjustable values are known as parameters) that condense them to a predetermined size, and translate the output from numbers back into the desired medium. [4] In a training setting, the actual output is measured against an ideal output, and the parameters are adjusted based on the degree of error; these adjustments are the “learning” referred to in the AI context. [5]
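The training loop described above can be sketched in a few lines. This is a toy illustration only, not the code of any real platform: a single “parameter” is nudged round after round until the actual output matches the ideal output.

```python
# Toy illustration (not any real platform's code): a single parameter w
# is repeatedly adjusted so the network's actual output approaches the
# ideal output -- the trial-and-error "learning" described above.
def train(x, ideal, w=0.0, lr=0.1, rounds=100):
    for _ in range(rounds):
        actual = w * x            # pass the encoded input through the parameter
        error = actual - ideal    # measure actual output against ideal output
        w -= lr * error * x       # adjust the parameter based on the error
    return w

w = train(x=2.0, ideal=6.0)       # converges toward w = 3, since 3 * 2 = 6
```

Commercial systems differ only in scale: billions of parameters instead of one, adjusted over millions of training examples.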
By the time a neural network reaches the public, it will have undergone many thousands of rounds of training, and the response to a given input might have been shaped by millions of examples. [6] For example, an image-generating system might be trained to associate the text input “dog” with every example of the species available on Google, Flickr, Pinterest, Facebook, and wherever else images are available. [7]
While the use of copyrighted data for archetypes and ideas might be fair use in isolation, the complaint in Andersen v. Stability AI alleges that these systems are also learning the specific style and manner of artists. [8] Attributes like pattern, color scheme, and perspective, among other stylistic marks, may be protected by copyright. [9] The propensity of AI programs to replicate these elements is a potential source of trouble.
Fair Use Factors
The Copyright Act lists four non-exhaustive and overlapping factors that bear on whether an allegedly infringing use of copyrighted material is fair use and therefore protected. [10] Courts typically give the most weight to the first factor, the purpose and character of the use, and the fourth factor, the market harm inherent in the use. [11] Since infringement analyses are fact-dependent, any ruling would depend on the facts surrounding an AI platform’s output and training data. [12] The purpose of my analysis, limited to the most salient factors, is to suggest that these platforms create a substantial risk of infringement.
Factor 1: Purpose and Character
A first defense that AI platforms may attempt against copyright infringement is that the fair use statute explicitly permits use for purposes of teaching, scholarship, and research. [13] That exception arguably covered the early incarnations of machine learning models confined to computer science departments, and it was commonly cited as a likely protection for such models before AI platforms achieved mainstream use. But arguments that these programs are themselves “learning” or “researching” and thus entitled to fair use protection, dubious already in their blatant anthropomorphism, are made irrelevant by the commercialization taking place as the industry expands. [14] This commercialization is not fatal to a fair use analysis, but it closes off an easy escape route and colors the remaining analysis under this factor and those that follow. [15]
AI content generation fits more plausibly under the purpose-of-use theories of productive or transformative use. These interpretations of § 107(1), elaborated respectively in Sony Corp. of Am. v. Universal City Studios, Inc. and Campbell v. Acuff-Rose Music, Inc., offer two related routes to establishing a new purpose or character. [16] Productive use depends on furthering a social end like those mentioned in § 107(1). The Court in Campbell expanded on this interpretation, finding § 107(1) satisfied where a work was repurposed toward some socially beneficial end. [17]
AI’s potentially epoch-defining benefits would seem to demand favorable treatment under a productive use analysis. But a laudable social purpose is not a trump card over all other concerns. In American Geophysical Union v. Texaco, Inc., for example, the Second Circuit found that the publisher’s rights outweighed the benefits of the then-burgeoning photocopy technology. [18] That court found that direct copying on a large scale could not fit under the educational use exception. [19] This ruling, handed down despite the non-commercial nature of the copying, suggests that innovative and useful technology is not productive, and therefore transformative, per se. [20] Though AI may be more transformative in most cases, its increasing commercialization gives reason for heightened scrutiny. [21] Furthermore, the social benefits of generative AI are more speculative than proven, and artists could plausibly argue that the possible displacement of human artists should count against the purpose and character of these programs.
“Transformative use” analysis, deriving from Campbell v. Acuff-Rose and featuring prominently in subsequent cases, strengthens the case for copying where the final product is categorically different from the original. Its scope is illustrated by the Second Circuit’s ruling in Author’s Guild v. Google, Inc., where Google’s wholesale duplication of copyrighted books for partial reproduction as “snippets” was held to be transformative and therefore fair use. [22]
“Transformative” in this context means that the product adds something that was not part of the previous work. [23] AI platforms themselves certainly offer the public “something new,” and the stream of weighted equations that produces content is at least partially shaped by the copied works in the training set. However, it seems unlikely that any argument could induce a judge to ignore a suspicious similarity between an AI output and copyrighted material. Even the permissive standard set in Authors Guild v. HathiTrust would not permit a commercially available reproduction that infringed a copyrighted work by copying distinctive styles or attributes. [24] The Andersen case and other examples from the internet show that these platforms possess the capacity for such reproduction, creating a risk that cannot be explained away as transformative use. [25]
Factor 4: Market Harms
Market harm refers generally to a potentially infringing work replacing demand for the original, copyrighted work. With respect to AI, these fears are no longer merely speculative. The website BuzzFeed, famous for its short, pop-culture-focused blurbs and quizzes, made headlines by announcing plans to replace many of its writers with ChatGPT. [26] For illustration and design, Midjourney’s premium plan allows users to deploy AI-generated creations on commercial sites. [27] While courts are not in the business of pure protectionism, data like these will strengthen presumptions that AI content is generated for a commercial purpose and replicates its training data in important ways.
Courts have disagreed on the strength of the fourth factor, but all agree that it weighs heavily on the fair use determination. [28] Diminished sales and decreased opportunities, in conjunction with prevalent AI-generated content, would offer compelling evidence of replacement. Even if a plaintiff could not show direct replacement, lost opportunities to license or otherwise profit from derivative uses also weigh in plaintiffs’ favor under this factor. [29] BuzzFeed’s story shows that direct competition between humans and machines in some areas may not be a fair fight, and the increasing sophistication of these programs suggests that more replacement is inevitable. [30] As automation becomes more common, so too will evidence of direct replacement by programs engaged in unlicensed copying.
Possible Solution: Copyright Alignment
Though AI platforms appear to bear copyright liability risks, prospects of enforcement are diminished by the diffuse nature of the internet and the high cost of pursuing copyright claims against widely dispersed parties. Hardline rulings against AI platforms might drive this productive industry to other countries and would do little to prevent computer-savvy individuals around the world from distributing similar models based on open-source code. [31] A more likely solution is that AI companies, wishing to respect intellectual property and preserve the role of human artists, will self-regulate their platforms’ use of copyrighted material as part of their overall mission of human-friendly or “aligned” AI. [32]
Incentives to Self-Regulate
AI systems rely on human-created works, and their capacity to process more of these works has coincided with more sophisticated and useful outputs. [33] As the amount of available work declines, some researchers have speculated that the rate of improvement in these programs will be frustrated by a lack of new examples. [34] If human creative work is disincentivized by marketplace preference for AI, the ill effects of a constrained training set may be compounded.
Creators and executives in AI are also wary of the potential social risks posed by their products. That worry manifests in AI’s alignment research subfield, which studies how AI systems behave, what harms they could cause, and how those harms can be prevented or mitigated. Though its primary focus is on scenarios like an AI system independently accessing nuclear codes, there is no reason to exclude more immediate concerns from consideration as these platforms develop. [35] Art and creative work have profound social importance, and discouraging creative thinking through automation could impose unforeseen consequences on the future of humanity. [36]
A thoughtful compliance regime could prevent direct copying without imposing heavy burdens on AI systems. Part of the solution might be to configure AI platforms so that they are biased away from directly reproducing instances of training data. A recent paper by engineers Giorgio Franceschelli and Mirco Musolesi shows that language models can be trained to avoid outputting material that falls within a specified degree of similarity to copyrighted training data. [37] Though the experiments described in the paper were conducted on a language model much smaller than any commercially available model, and though the mathematical definition the paper uses for copyright violation might not withstand judicial scrutiny, the experiment nonetheless shows the feasibility of guardrails that protect against direct infringement of copyrighted works. [38]
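The shape of such a guardrail can be sketched simply. The following is a hypothetical illustration, not the method from the Franceschelli-Musolesi paper: a candidate output is rejected whenever its textual similarity to any training example exceeds a chosen threshold.

```python
from difflib import SequenceMatcher

# Hypothetical guardrail (not any platform's actual implementation):
# reject a candidate output whose similarity to any training example
# meets or exceeds a chosen threshold.
TRAINING_EXAMPLES = ["the quick brown fox jumps over the lazy dog"]

def too_similar(candidate, threshold=0.8):
    return any(
        SequenceMatcher(None, candidate, example).ratio() >= threshold
        for example in TRAINING_EXAMPLES
    )

too_similar("the quick brown fox jumps over the lazy dog")  # True: near copy
too_similar("an entirely unrelated sentence")               # False
```

A real system would need a similarity measure that could survive judicial scrutiny as a proxy for substantial similarity; the string-matching ratio here is only a stand-in for that harder problem.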
This solution may decrease the risk of an offending end product, but it does not address the underlying copying to which many artists object. Still, as opposed to a system that makes close copies of its training data, a system that self-regulates against near reproduction of its inputs improves its case for fair use. Even so, the vagaries of the fair use doctrine, the strength of the market harms factor, and imperfections in the implementation of any guardrails would all mean a continued threat of liability.

Aside from legal risks, socially conscious AI companies might try to maintain good relationships with human artists and respect their preferences by allowing them to “opt out” of data scraping operations. At least one AI company has signaled a willingness to honor artists who have added their works to a “do not scrape” database. [39] Though this shifts the burden to artists when it should arguably rest on the copying party, it is much easier for an artist to identify her own works than for employees at an AI company to check the copyright status of millions of images and texts. [40]

This method would also work only for AI companies conscientious enough to respect artists’ rights. As AI models multiply, the likelihood of widespread compliance with such self-imposed limitations diminishes. As of today, however, the high computing power needed to run a commercially available AI restricts such platforms’ operation to a few companies. These companies may be responsible for establishing the norms and culture, along with the code, that shape the future of AI.
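Honoring an opt-out list is, mechanically, the simpler of the two measures. The sketch below is hypothetical; the registry entries and identifiers are invented for illustration and do not come from any real service.

```python
# Hypothetical sketch of honoring a "do not scrape" registry when
# assembling training data. The identifiers below are invented for
# illustration only.
OPT_OUT_REGISTRY = {"artist-a.example/work-1", "artist-b.example/work-9"}

def filter_training_data(candidate_ids):
    """Drop any work whose identifier appears in the opt-out registry."""
    return [wid for wid in candidate_ids if wid not in OPT_OUT_REGISTRY]

filter_training_data(["artist-a.example/work-1", "open.example/photo-3"])
# keeps only "open.example/photo-3"
```

The legal and practical difficulty lies not in this filtering step but in reliably matching scraped works to the artists who opted out.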
References
[1] See generally JAMES GLEICK, THE INFORMATION: A HISTORY, A THEORY, A FLOOD (2011).
[2] 17 U.S.C. § 107.
[3] See generally Justin Wang, DALL·E: Creating Images from Text, OPENAI (Jan. 5, 2021), https://openai.com/research/dall-e [https://perma.cc/6VAC-J799] (describing the technical foundations of one of the most popular image generation platforms).
[4] Harsha Bommana, How Neural Networks Work, MEDIUM (May 26, 2019), https://medium.com/deep-learning-demystified/introduction-to-neural-networks-part-1-e13f132c6d7e [https://perma.cc/H7BM-X3CV].
[5] Id.
[6] Id.
[7] Gonzalo Martinez et al., Combining Generative Artificial Intelligence (AI) and the Internet: Heading Towards Evolution or Degradation?, ARXIV (Feb. 17, 2023), https://aps.arxiv.org/abs/2303.01255 [https://perma.cc/Q5X7-K4CW].
[8] Andersen et al. v. Stability AI Ltd. et al., No. 3:23-cv-00201 (N.D. Cal. Jan. 13, 2023).
[9] Malden Mills Inc. v. Regency Mills, Inc., 626 F.2d 1112 (2d Cir. 1980) (finding substantial similarity and ultimately infringement where a second fabric design used the same technique, balance of colors, shading, and appropriation of blank space as the first).
[10] 17 U.S.C. § 107.
[11] See generally Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569 (1994) (writing much more analysis on the first and fourth factors than the second and third).
[12] See Am. Geophysical Union v. Texaco, Inc., 60 F.3d 913, 921 (2d Cir. 1994) (“Fair use is a doctrine the application of which always depends on consideration of the precise facts at hand”).
[13] 17 U.S.C. § 107.
[14] See, e.g., Introducing ChatGPT Plus, OPENAI (Feb. 1, 2023), https://openai.com/blog/chatgpt-plus [https://perma.cc/3KUZ-NNSB] (announcing a pay model for the most popular AI text generation platform); see Arthur R. Miller, Copyright Protection for Computer Programs, Databases, and Computer-Generated Works: Is Anything New Since Contu?, 106 HARV. L. REV. 977, 1065 (1993) (arguing that computer generated work falls outside the purpose of copyright law).
[15] 17 U.S.C. § 107(1); see Campbell, 510 U.S. at 584-85 (finding that commercial use of copyrighted material is not presumptively unfair).
[16] Sony Corp. of Am. v. Universal City Studios, Inc., 464 U.S. 417, 479 (1984) (Blackmun, J., dissenting) (arguing that time-delayed videotaping should not be seen as fair use because, unlike the exceptions listed in § 107(1), such use is not “productive” toward any positive social end).
[17] Campbell, 510 U.S. at 574.
[18] Texaco, Inc., 60 F.3d at 921.
[19] Id.
[20] Id.
[21] 17 U.S.C. § 107(1) (and accompanying commentary).
[22] Author’s Guild v. Google, 804 F.3d 202, 232 (2d Cir. 2015) (discussing Campbell, 510 U.S. at 590).
[23] See id. at 232 (“Transformation” is not meant to be taken literally but stands in for creative repurposing).
[24] Id. at 218 (distinguishing between a purpose that augments the original and one that supersedes it); see Dr. Seuss Enters., L.P. v. Penguin Books USA, Inc., 109 F.3d 1395 (9th Cir. 1997) (Dr. Seuss’s distinct style of drawing is copyrighted).
[25] Andersen et al., No. 3:23-cv-00201.
[26] Alexandra Bruell, BuzzFeed to Use ChatGPT Creator OpenAI to Help Create Quizzes and Other Content, WALL ST. J. (Jan. 26, 2023).
[27] Midjourney’s Documentation and User Guide, MIDJOURNEY, https://docs.midjourney.com/v1/en [https://perma.cc/7GGP-3T3E] (last visited Mar. 10, 2023).
[28] See Harper & Row Publishers v. Nation Enters., 471 U.S. 539, 566 (1985) (holding the fourth factor to be the most important).
[29] Id. at 568 (plaintiff only needs to show that the allegedly infringing product creates potential market harm).
[30] See Bruell, supra note 26; see also Kevin Roose, An A.I.-Generated Picture Won an Art Prize. Artists Aren’t Happy, N.Y. TIMES (Sept. 2, 2022) (describing an AI image being chosen as the best entry in a prestigious art competition).
[31] Natalia Butrym, 1st Open Source AI Image Generator – Stable Diffusion, DEEP-IMAGE.AI (Oct. 25, 2022), https://deep-image.ai/blog/1st-open-source-ai-image-generator-stable-diffusion/ [https://perma.cc/T9P8-3XTP].
[32] Jan Leike et al., Our Approach to Alignment Research, OPENAI (Aug. 24, 2022), https://openai.com/blog/our-approach-to-alignment-research [https://perma.cc/3BUY-YWJ6] (outlining alignment research generally and the specific approach of one of the most prominent AI platforms).
[33] Matt White, A Timeline of Generative AI, MEDIUM (Jan. 7, 2023), https://matthewdwhite.medium.com/a-brief-history-of-generative-ai-cb1837e67106 [https://perma.cc/MDB2-9LRM].
[34] See Martinez, supra note 7.
[35] See, e.g., Roman V. Yampolskiy, Utility Function Security in Artificially Intelligent Agents, J. EXPERIMENTAL & THEORETICAL A.I. 373, 373-89 (Apr. 8, 2014) (example of main thrust of alignment research).
[36] See, e.g., Khalil Radwa et al., The Link Between Creativity, Cognition, and Creative Drives and Underlying Neural Mechanisms, 13 FRONTIERS IN NEURAL CIRS. 1 (2019) (noting the role of creative activity in improving overall health and cognitive function).
[37] Giorgio Franceschelli & Mirco Musolesi, Copyright in Generative Deep Learning, arXiv preprint arXiv:2105.09266v4 (2021), https://arxiv.org/abs/2105.09266 [https://perma.cc/XWM6-73X2].
[38] Id.
[39] HAVEIBEENTRAINED.COM, https://haveibeentrained.com [https://perma.cc/L25J-VPEC] (last accessed Mar. 23, 2023) (allows artists to search whether their work has been used for training and to add their name to an “opt-out” list).
[40] See Sayash Kapoor & Arvind Narayanan, Artists Can Now Opt Out of Generative AI. It’s Not Enough, AI SNAKE OIL (Mar. 9, 2023), https://aisnakeoil.substack.com/p/artists-can-now-opt-out-of-generative [https://perma.cc/ZYG7-P6ZD] (arguing that an “opt-out” option for artists externalizes costs that are more fairly borne by AI companies).