The surge in AI development has led to a desperate demand for large, high-quality training data. However, real-world data can be expensive to collect, difficult to access, and often subject to strict privacy and regulatory constraints.
Synthetic data consists of artificially generated records that replicate the statistical properties of real-world data without reproducing any specific individual’s information. It offers an appealing solution because datasets can be generated at scale without relying on identifiable personal information. It promises speed, cost efficiency, and a lighter regulatory burden, making it a sensible alternative for organizations seeking to reduce risks while maintaining data utility. When properly anonymized, synthetic datasets may fall outside the scope of laws such as the EU’s General Data Protection Regulation (GDPR) or Thailand’s Personal Data Protection Act (PDPA), reducing compliance burdens while still supporting high-quality model training.
However, relying on synthetic data without rigorous legal due diligence could be a strategic mistake. Doing so replaces one set of known risks (scraping, direct privacy liability) with a new set of complex liabilities. The narrative that synthetic data is a “silver bullet” for privacy and IP compliance is misleading and potentially dangerous.
While synthetic data addresses data scarcity, it also introduces new legal uncertainties. Legal counsel should anticipate downstream risks arising from compromised data sources. Models trained on unlawfully obtained data may need to be decommissioned, even if their outputs appear lawful.
What is synthetic data?
Synthetic data refers to artificially generated information created using AI techniques such as deep learning and generative models. Instead of copying real records, it reproduces the statistical patterns and relationships found in the original dataset.
Synthetic data generally falls into three categories:
- Fully synthetic data – Entirely new data points generated from learned patterns. The model studies the structure of the original data and produces records that resemble real-world behavior without replicating any specific individual.
- Partially synthetic data – Real datasets in which sensitive fields (names, ID numbers, contact details) are replaced with artificial values while nonsensitive attributes remain intact.
- Hybrid synthetic data – A combination of real and synthetic records, often used where some genuine information must be retained for accuracy or operational purposes.
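The difference between the first two categories can be illustrated with a minimal Python sketch. The records, field names, and both helper functions are hypothetical, and the independent per-field sampling used for the fully synthetic case is a deliberate simplification: real generators typically rely on trained generative models that also preserve relationships between fields.

```python
import random

# Hypothetical toy records; names, IDs, and field names are illustrative only.
REAL_RECORDS = [
    {"name": "Alice Tan", "national_id": "1103700012345", "age": 34, "city": "Bangkok"},
    {"name": "Bob Lee", "national_id": "3100500098765", "age": 51, "city": "Chiang Mai"},
    {"name": "Chai Wong", "national_id": "1509900054321", "age": 34, "city": "Bangkok"},
]
SENSITIVE_FIELDS = ("name", "national_id")


def partially_synthesize(records, rng):
    """Replace sensitive fields with artificial values; keep other fields intact."""
    out = []
    for i, rec in enumerate(records):
        new = dict(rec)  # copy so the source record is not modified
        new["name"] = f"Person {i + 1}"
        new["national_id"] = "".join(str(rng.randint(0, 9)) for _ in range(13))
        out.append(new)
    return out


def fully_synthesize(records, n, rng):
    """Draw each non-sensitive field independently from its observed values.

    Simplification: this keeps per-field frequencies but discards
    relationships between fields.
    """
    fields = [f for f in records[0] if f not in SENSITIVE_FIELDS]
    return [{f: rng.choice([r[f] for r in records]) for f in fields} for _ in range(n)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
partial = partially_synthesize(REAL_RECORDS, rng)
full = fully_synthesize(REAL_RECORDS, 5, rng)
```

Note that the partially synthetic records still carry each person's real age and city, which is why this category tends to retain more re-identification risk than fully synthetic output.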
The appeal of synthetic data lies in its protection of privacy and its operational efficiency.
Properly generated synthetic datasets exclude real personal identifiers and can often be used for development, testing, analytics, and model training without exposing the information of actual individuals. In highly regulated sectors such as healthcare and financial services, synthetic data allows organizations to work with large, realistic datasets while minimizing the legal and operational constraints associated with using real customer or patient information.
Synthetic data is often used in the following sectors:
- Healthcare: Synthetic patient records and images for safe model development.
- Finance: Simulated transactions for fraud detection and risk modeling.
- Mobility and autonomous vehicles: Generated driving scenarios to train for rare or dangerous events.
Each of these sectors leverages synthetic data to accelerate AI innovation. It provides realistic, varied training examples without leaking sensitive details.
Intellectual property considerations
Despite the clear benefits of using synthetic data, its use for AI training may still give rise to intellectual property risks. The main concerns relate to possible infringement and whether synthetic data can be protected by copyright.
Infringement Risks Arising from the Source Data
Although synthetic data can reduce privacy exposure, it does not eliminate IP risks. Every synthetic dataset starts with the same foundational step: an AI model must first access, copy, and analyze the original “source data.” If that source data is protected by copyright or contractual terms, training on it without permission may constitute infringement.
Some stakeholders adopt a more permissive view of AI training, characterizing it as a form of computational analysis that extracts abstract statistical patterns rather than protected expressive content, and therefore does not constitute infringement. However, this view reflects a policy-based interpretation rather than settled law.
Courts and regulators have increasingly indicated that using copyrighted works for AI training may amount to prima facie infringement, unless a specific legal exception applies. Developers often invoke defenses such as U.S. fair-use principles, but these are narrow, fact-dependent, and unsettled in the context of AI.
Recent U.S. cases, such as Bartz v. Anthropic and Thomson Reuters v. ROSS, have so far found fair use only where the underlying materials were lawfully acquired and the secondary use was genuinely transformative. Conversely, they have rejected fair use where the model was trained on pirated or unauthorized copies. In practice, this means that organic (real) data collected without permission still presents a significant copyright risk for model developers.
Copyrightability of Synthetic Data: Lack of Human Authorship
Even when synthetic data does not copy any specific protected work, it raises a different issue: copyright protection generally requires human authorship. Many copyright systems require a work to result from a human’s creative expression. Authorities in the U.S., U.K., and Thailand take a similar approach. The U.S. Copyright Office, for example, has repeatedly rejected registrations for fully AI-generated works on the basis that they lack human authorship. As a result, a fully synthetic dataset produced without meaningful human creative input may not be protected by copyright at all, meaning third parties could potentially reuse it freely. Nevertheless, when meaningful human judgment is involved in designing, selecting, or arranging synthetic samples, copyright may protect that creative selection or arrangement even if the individual records themselves are not protected.
Copyrightability of Synthetic Data: Originality and the Creativity Threshold
Aside from the issue of human authorship, synthetic data often fails the originality requirement. Modern copyright law does not protect works based solely on labor or investment (the “sweat of the brow” doctrine). Courts require at least a minimal degree of creativity.
In the U.S., Feist Publications v. Rural Telephone Service Co. confirmed that originality requires independent creation plus a “modicum of creativity.” EU courts apply a similar test, requiring that a work reflect the author’s “own intellectual creation.”
For synthetic data producers, this creativity threshold is difficult to meet. Many synthetic outputs simply replicate statistical patterns without meaningful human creative contribution, leaving them ineligible for copyright protection. Developers should not assume that large or expensive synthetic datasets are automatically protected. To support a claim to copyright protection, producers should clearly document the human creative decisions involved in designing or curating the synthetic data.
Compliance considerations
Synthetic data should not be presumed to fall outside privacy regulation. Under laws such as the EU’s General Data Protection Regulation and Thailand’s Personal Data Protection Act, information still qualifies as personal data if it relates directly or indirectly to an identifiable individual. Synthetic data may still fall within this scope when it is:
- Generated from real individuals’ records,
- Capable of being linked to a person when combined with other available information, or
- Structured in a way that allows specific traits or behaviors of an individual to be inferred.
In these situations, regulators are likely to treat the synthetic dataset as containing personal data, meaning full compliance obligations still apply.
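One basic technical check that can support this assessment is to test whether a generator has reproduced any source record verbatim. The Python sketch below is illustrative only: the records, field names, and the `verbatim_copies` helper are hypothetical. An exact match does not by itself prove that a person is identifiable, and the absence of matches does not establish anonymity.

```python
# Hypothetical toy data; field names and values are illustrative only.
REAL = [
    {"age": 34, "gender": "F", "zip": "10110", "diagnosis": "asthma"},
    {"age": 51, "gender": "M", "zip": "50200", "diagnosis": "diabetes"},
]
SYNTHETIC = [
    {"age": 34, "gender": "F", "zip": "10110", "diagnosis": "asthma"},  # same as a source record
    {"age": 47, "gender": "M", "zip": "50200", "diagnosis": "asthma"},
]
FIELDS = ("age", "gender", "zip", "diagnosis")


def verbatim_copies(real, synthetic, fields):
    """Return synthetic records that equal some real record on every listed field."""
    real_keys = {tuple(r[f] for f in fields) for r in real}
    return [s for s in synthetic if tuple(s[f] for f in fields) in real_keys]


copies = verbatim_copies(REAL, SYNTHETIC, FIELDS)
print(f"{len(copies)} of {len(SYNTHETIC)} synthetic records match a source record exactly")
```

In practice, near matches matter as well as exact ones, so a check like this is best treated as a first screen rather than a conclusion.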
Ensuring true anonymization is technically challenging. Studies have repeatedly shown that records in even heavily anonymized datasets can be linked back to the original individuals with high accuracy using only a few demographic attributes such as age, gender, and ZIP code. The same risks apply to synthetic datasets that replicate the structure of real-world data, especially in domains involving rare characteristics.
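The attribute-combination risk described above can be estimated with a simple uniqueness count. The Python sketch below uses hypothetical records and helper names. It groups records by the three attributes mentioned (age, gender, ZIP code) and reports the smallest group size, often called k in the k-anonymity literature, along with the share of records that are unique on those attributes.

```python
from collections import Counter

# Hypothetical toy records with direct identifiers already removed.
RECORDS = [
    {"age": 34, "gender": "F", "zip": "10110"},
    {"age": 34, "gender": "F", "zip": "10110"},
    {"age": 51, "gender": "M", "zip": "50200"},
    {"age": 29, "gender": "F", "zip": "10330"},
    {"age": 29, "gender": "M", "zip": "10330"},
]
QUASI_IDENTIFIERS = ("age", "gender", "zip")


def group_sizes(records, fields):
    """Count how many records share each combination of the given fields."""
    return Counter(tuple(r[f] for f in fields) for r in records)


def unique_fraction(records, fields):
    """Share of records whose combination of the given fields appears only once."""
    sizes = group_sizes(records, fields)
    unique = sum(1 for r in records if sizes[tuple(r[f] for f in fields)] == 1)
    return unique / len(records)


sizes = group_sizes(RECORDS, QUASI_IDENTIFIERS)
k = min(sizes.values())  # smallest group size across all combinations
print(f"k = {k}, unique fraction = {unique_fraction(RECORDS, QUASI_IDENTIFIERS):.2f}")
# prints: k = 1, unique fraction = 0.60
```

Here three of the five records are unique on just three attributes, so anyone holding an auxiliary dataset with the same attributes could single those individuals out. A low k is a warning sign, although a high k alone does not establish that a dataset is anonymous in the legal sense.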
Therefore, anonymization cannot be treated as a single, conclusive action. As computational methods advance, datasets considered anonymous today may become identifiable tomorrow. Synthetic data remains a valuable tool, but organizations should deploy it with a realistic understanding of these evolving risks.
This article was originally published by Dow Jones Risk Journal in April 2026.