Synthetic Data: The Good, the Bad and the Unsorted | by Tea Mustać | Jan, 2024



So, is synthetic data a friend or a foe? It is neither and it is both. Truth be told, here we have a classic example of a double-edged sword: synthetic data creates new problems while resolving some of the existing ones. This holds true not just for privacy but also for performance goals, where, for instance, the benefits of scalability and data augmentation have to be weighed against concerns about bias amplification and poor generalization. This is not a reason for either giving up or regurgitating the same old pro vs. con articles and analyses, which either overgeneralize or focus on just one minuscule point of the larger picture and so leave the reader unable to see the forest for the trees.

The utility and appropriateness of using synthetic data to train ML models will always depend on the particular circumstances of the case. It will depend on the type of data we need to train the model (personal, copyright protected, highly sensitive), the quantity of data needed, its availability, and the intended purpose of the model (inaccuracy or bias amplification will carry a different weight in a model assessing creditworthiness than in one optimizing a supply chain). So maybe we can start by answering these kinds of questions for any given context and then proceed to consider the various existing tradeoffs in a more appropriate setting.

Key takeaways:

· Synthetic data is never pseudonymous.

· Synthetic data should always be anonymous.

· Synthetic data does not solely revolve around privacy.

· Although it always helps preserve privacy, synthetic data causes other data protection issues.

· Privacy and data protection are not the same thing.

· Some data protection issues also happen to be performance issues. This is good because it means we are all (at least sometimes) trying to fix the same thing.

· All tradeoffs associated with synthetic data are very context-specific and should be discussed within their relevant context.

A banana on a table and an image of a banana on a laptop on the same table. Each of the two bananas has a white frame around it with the word ‘Banana’ stuck on top of it.
Max Gruber / Better Images of AI / Ceci n’est pas une banane / CC-BY 4.0

