AI and Copyright Litigation: Technical Research Roundup
Recent technical advances in AI and Large Language Models (LLMs) are rapidly shaping the landscape of copyright litigation. The following roundup summarizes key research papers that provide empirical evidence and technical insights directly relevant to training data sourcing, infringement detection, and attribution.
1. Content Monitoring and Safety Bypass
- Early Signs of Steganographic Capabilities in Frontier LLMs
- Authors: Zolkowski Artur
- Finding: Frontier LLMs exhibit early signs of steganographic capabilities, allowing them to encode hidden, potentially malicious information within seemingly benign text outputs, posing a critical threat to content monitoring and safety systems. Tags: safety_bypass
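For readers unfamiliar with text steganography, the toy sketch below shows the idea in its simplest form: an acrostic channel that hides a payload in the first letters of benign-looking lines. This is our own illustration of the general concept, deliberately far cruder than the emergent capabilities the paper reports; the cover sentences and helper functions are invented for the example.

```python
# Toy acrostic channel (illustration only, not the paper's method): a monitor
# scanning each sentence for disallowed content sees only benign prose, yet
# the first letters of the lines spell out a hidden payload.

COVER = {
    "h": "Having read the draft, I think the argument holds up.",
    "i": "In particular, section two is well sourced.",
    "d": "Do send it on once the references are fixed.",
    "e": "Everything else looks ready to go.",
}

def encode(payload: str) -> str:
    """Emit one benign cover sentence per payload character."""
    return "\n".join(COVER[c] for c in payload)

def decode(text: str) -> str:
    """Read the payload back from the first letter of each non-empty line."""
    return "".join(line[0].lower() for line in text.splitlines() if line.strip())

stego_text = encode("hide")
assert decode(stego_text) == "hide"
```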
2. Training Data and Licensing
- The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models
- Authors: Gienapp Lukas
- Finding: The creation and release of ‘The German Commons,’ a 154-billion-token corpus of openly licensed text, demonstrates that large-scale, legally permissible training data can be sourced for LLM development. Tags: licensing, memorization, safety_bypass, training_data, empirical_evidence
- Teaching Models to Understand (but not Generate) High-risk Data
- Authors: Wang Ryan
- Finding: A selective training strategy allows LLMs to understand high-risk content (including copyrighted text) without being able to generate it, providing a technical path for safety and compliance during pre-training (see the loss-masking sketch at the end of this section). Tags: market_harm, safety_bypass, training_data, empirical_evidence
- Thunder-DeID: Accurate and Efficient De-identification Framework for Korean Court Judgments
- Authors: Hahm Sungeun
- Finding: Thunder-DeID is an accurate and efficient de-identification framework for Korean court judgments that balances public access to legal data with personal data protection mandates. Tags: training_data, copyright_theory, licensing, empirical_evidence
- Authors: Teklehaymanot Hailay Kidu
- Finding: Tokenization disparities act as ‘infrastructure bias,’ leading to higher computational costs and lower performance for low-resource languages, creating inequities in LLM access and deployment (see the token-count comparison sketch at the end of this section). Tags: training_data, empirical_evidence
- Authors: Xing Shuo
- Finding: Confirms the ‘LLM Brain Rot Hypothesis’: continual exposure to junk web text induces lasting cognitive decline in LLMs, underscoring the necessity of high-quality, curated training data. Tags: safety_bypass, training_data, infringement_detection, empirical_evidence
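To make the ‘understand but not generate’ idea concrete, here is a minimal sketch of one plausible mechanism, not necessarily the paper's own implementation: high-risk text stays in the model's context but is masked out of the next-token loss. The `build_labels` helper is a hypothetical name, and the sketch assumes Hugging Face-style causal LM training, where label `-100` is ignored by the loss.

```python
# Hypothetical sketch, not the paper's implementation: expose "high-risk" text
# to the model as context only, by masking its tokens out of the language-model
# loss. In Hugging Face-style causal LMs, label -100 is ignored by the loss, so
# the model conditions on these tokens but is never trained to produce them.

import torch

def build_labels(input_ids: torch.Tensor, risk_mask: torch.Tensor) -> torch.Tensor:
    """input_ids: (seq_len,) token ids; risk_mask: (seq_len,) bool, True = high-risk."""
    labels = input_ids.clone()
    labels[risk_mask] = -100  # no gradient for predicting high-risk tokens
    return labels

# Example: positions 3 through 6 hold a copyrighted passage quoted in the document.
input_ids = torch.tensor([11, 42, 7, 501, 502, 503, 504, 9, 13])
risk_mask = torch.zeros_like(input_ids, dtype=torch.bool)
risk_mask[3:7] = True
labels = build_labels(input_ids, risk_mask)
# loss = model(input_ids=input_ids.unsqueeze(0), labels=labels.unsqueeze(0)).loss
```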
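The tokenization-disparity finding can be checked informally with any subword tokenizer. The sketch below is our own measurement setup, not the paper's: the tokenizer name, example sentences, and rough translations are placeholders chosen only to show how token counts diverge across languages and scripts.

```python
# Informal measurement (tokenizer, sentences, and rough translations are
# placeholders, not the paper's setup): the same sentence tokenized across
# languages can cost very different numbers of tokens, which translates into
# higher inference cost and a shorter effective context window for
# low-resource languages and scripts.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # any subword tokenizer works here

sentences = {
    "English": "The court dismissed the claim for lack of evidence.",
    "German": "Das Gericht wies die Klage mangels Beweisen ab.",
    "Greek": "Το δικαστήριο απέρριψε την αγωγή λόγω έλλειψης αποδείξεων.",
}

for lang, text in sentences.items():
    print(f"{lang:8s} {len(tok.encode(text)):3d} tokens")
```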
3. Infringement Detection and Attribution
- Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences
- Authors: Minder Julian
- Finding: Narrow fine-tuning leaves ‘clearly readable traces’ in the activation differences of LLMs, suggesting that specific training data inputs can be forensically linked to internal model changes (see the layer-wise comparison sketch at the end of this section). Tags: infringement_detection, technical_verification, training_data, data_removal, safety_bypass, empirical_evidence
- How Sampling Affects the Detectability of Machine-written Texts: A Comprehensive Study
- Authors: Dubois Matthieu
- Finding: The choice of sampling strategy (e.g., nucleus sampling vs. greedy decoding) significantly impacts the detectability of machine-written texts, making reliable detection challenging for current forensic tools (see the decoding comparison sketch at the end of this section). Tags: training_data, infringement_detection, attribution, empirical_evidence
- LLM one-shot style transfer for Authorship Attribution and Verification
- Authors: Miralles-González Pablo
- Finding: LLM one-shot style transfer significantly degrades the performance of authorship attribution and verification models, making it difficult to distinguish human-written text from style-transferred, AI-generated text. Tags: infringement_detection, attribution, knowledge_distillation, empirical_evidence, stylistic_mimicry
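The activation-difference result suggests a simple forensic probe: run the same prompts through a base checkpoint and its narrowly fine-tuned counterpart and compare hidden states layer by layer. The sketch below uses placeholder model names and relative L2 distance as the comparison metric; it illustrates the general idea rather than the paper's exact protocol.

```python
# Sketch with placeholder checkpoints: "base-model" and "narrowly-finetuned-model"
# are assumed to share an architecture and tokenizer. For each probe prompt we
# compare hidden states layer by layer; the paper's finding is that narrow
# fine-tuning shifts activations in ways that stay readable even on prompts
# unrelated to the fine-tuning data.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE, TUNED = "base-model", "narrowly-finetuned-model"  # placeholders

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE, output_hidden_states=True).eval()
tuned = AutoModelForCausalLM.from_pretrained(TUNED, output_hidden_states=True).eval()

probes = ["The weather today is", "In summary, the report finds"]

with torch.no_grad():
    for prompt in probes:
        ids = tok(prompt, return_tensors="pt")
        h_base = base(**ids).hidden_states    # tuple: (layers+1) x (1, seq, dim)
        h_tuned = tuned(**ids).hidden_states
        diffs = [(b - t).norm() / b.norm() for b, t in zip(h_base, h_tuned)]
        print(prompt, [round(float(d), 3) for d in diffs])
```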
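To see why decoding strategy matters for detection, the sketch below generates the same continuation greedily and with nucleus sampling, then scores each with a crude likelihood-based detector (average token log-likelihood under a scoring model). The model name and the detector itself are our own simplifications, not the study's evaluation setup.

```python
# Crude sketch, not the paper's protocol: likelihood-based detectors rely on
# machine text looking "too probable"; decoding with more randomness (nucleus
# sampling) pushes outputs toward human-like likelihoods and weakens that signal.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder generator/scorer
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

prompt = tok("The contract was signed", return_tensors="pt")

greedy = model.generate(**prompt, max_new_tokens=40, do_sample=False)
nucleus = model.generate(**prompt, max_new_tokens=40, do_sample=True, top_p=0.95)

def avg_loglik(ids: torch.Tensor) -> float:
    """Average per-token log-likelihood; higher reads as more 'machine-like'."""
    with torch.no_grad():
        out = model(ids, labels=ids)
    return -out.loss.item()

print("greedy :", avg_loglik(greedy))
print("nucleus:", avg_loglik(nucleus))
```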
4. Model Behavior and Output Quality
- Authors: Ji Shihao
- Finding: LLM hallucination is fundamentally linked to the Transformer’s Softmax function, which creates ‘Artificial Certainty.’ The proposed Credal Transformer offers a principled way to quantify and mitigate this uncertainty (a worked numeric example follows at the end of this section). Tags: copyright_theory, empirical_evidence
- LLM Probability Concentration: How Alignment Shrinks the Generative Horizon
- Authors: Yang Chenghao
- Finding: Alignment techniques cause ‘probability concentration’ in LLMs, shrinking the generative horizon and resulting in outputs that lack diversity, potentially limiting creative applications (see the branching-factor sketch below). Tags: empirical_evidence, stylistic_mimicry
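A small worked example (ours, not the paper's) shows what ‘Artificial Certainty’ means in practice: softmax always allocates exactly 1.0 of probability mass across the options, so weakly supported logits still yield a definite-looking distribution, and simply scaling the same logits makes the output look confident without any new evidence.

```python
# Worked numeric example (ours, not the paper's): softmax cannot say "none of
# these" even when the evidence behind the logits is weak, and scaling the same
# weak preferences produces a confident-looking distribution without new evidence.

import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

weak = [2.0, 1.9, 1.8]            # nearly indistinguishable evidence
scaled = [x * 10 for x in weak]   # same ordering, larger magnitude

print(softmax(weak))       # ~[0.37, 0.33, 0.30]  -> looks appropriately unsure
print(softmax(scaled))     # ~[0.67, 0.24, 0.09]  -> looks confident
print(sum(softmax(weak)))  # always 1.0: no probability mass left for "don't know"
```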
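One informal way to quantify a ‘generative horizon’ is the branching factor exp(H) of the next-token distribution, i.e. the effective number of continuations the model entertains at a given step. The sketch below compares placeholder base and aligned checkpoints under this metric; it is an assumption-laden illustration of the claim, not the paper's measurement.

```python
# Sketch with placeholder checkpoints: the branching factor exp(entropy) of the
# next-token distribution is an effective count of continuations the model
# considers. The paper's claim is that alignment concentrates probability mass,
# shrinking this number relative to the base model.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE, ALIGNED = "base-model", "aligned-model"  # placeholders

def branching_factor(name: str, prompt: str) -> float:
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).eval()
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]     # next-token logits
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum()
    return float(torch.exp(entropy))          # effective number of continuations

prompt = "Write an opening line for a short story:"
print("base   :", branching_factor(BASE, prompt))
print("aligned:", branching_factor(ALIGNED, prompt))
```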