As data-hungry machine learning models demand increasing amounts of information, the market for synthetic data continues to grow. But is it as good as the real deal?
One billion photos were used to train Meta’s latest photo-recognition algorithm, a powerful demonstration of the current appetite for data. For those companies without access to platforms like Instagram, there is another answer: synthetic data.
Synthetic data is artificially created by a computer, rather than collected from the real world. These computer-generated images can be automatically annotated by the machine that creates them. Annotation, an important part of AI training, is the process of labelling key elements of a photo, such as people or objects, to help machine learning models understand what the image depicts. Synthetic images also avoid compliance and privacy-related issues because they are original pictures that don’t feature real people.
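To see why a generated image can annotate itself, consider a minimal sketch in Python. It is a toy illustration, not any vendor’s pipeline: the function name `make_synthetic_image`, the single grey rectangle and the annotation format are all assumptions made for the example. The point is that the program picks the object’s position before drawing it, so the label is known without any human input.

```python
import random


def make_synthetic_image(width=32, height=32, seed=None):
    """Return (image, annotation) for one bright rectangle on a black background.

    Hypothetical toy generator: the image is a list of pixel rows and the
    annotation is a dict holding a label and an (x, y, width, height) box.
    """
    rng = random.Random(seed)

    # Choose the object's size and position first. These values are the
    # ground truth, so no one has to label the picture afterwards.
    box_w = rng.randint(4, width // 2)
    box_h = rng.randint(4, height // 2)
    x0 = rng.randint(0, width - box_w)
    y0 = rng.randint(0, height - box_h)

    # Draw the rectangle into an otherwise empty image.
    image = [[0] * width for _ in range(height)]
    for y in range(y0, y0 + box_h):
        for x in range(x0, x0 + box_w):
            image[y][x] = 255

    annotation = {"label": "object", "bbox": (x0, y0, box_w, box_h)}
    return image, annotation


image, annotation = make_synthetic_image(seed=1)
print(annotation)
```

Real platforms render far richer scenes, but the principle is the same: the generator already knows what it drew and where.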
Such technology spares companies the challenge of sourcing and collecting thousands of real-world images, while also avoiding issues around privacy, GDPR and copyright.
“AI’s biggest bottleneck is the scarcity of privacy-compliant real-world data,” says Steve Harris, CEO of UK-based synthetic data startup Mindtech Global. “Even a simple image recognition application needs up to 100,000 training images, and each image needs to be privacy-compliant and perfectly annotated by a human.” Sourcing, annotating and cleaning real-world data is “a monumental task”, he says, which can occupy up to 80% of a data scientist’s time.
Marek Rei is a machine learning professor at Imperial College London. “Collecting manual data is time-consuming and expensive,” he says. “If you’re able to generate data from scratch, you can essentially create endless amounts of it. For some rare events, obtaining even 10 real examples can be difficult, whereas synthetic data can potentially provide unlimited examples.”
Thanks to these benefits, 60% of the data used for the development of AI and analytics projects will be synthetically generated by 2024, Gartner predicts, leading the consulting firm to describe synthetic data as “the future of AI”.
With previous AI models, the development process involved collecting the data, training the model, testing it and making any necessary changes before testing it again.
The issue with this method is that the data used stays the same, according to Ofir Chakon, CEO and co-founder of synthetic data company Datagen.
“The increase in performance that you get from this model-centric approach is relatively low,” he says. “In order to really get a significant improvement on the performance of your AI algorithms, you need to change your mindset. Instead of iterating on the model’s parameters, you need to iterate on the data itself.”
Datagen produces synthetic data for a range of AI applications, from facial recognition technology to driver monitoring systems, security cameras and even gesture recognition. Chakon believes such applications will become increasingly popular as more companies expand into the metaverse.
To produce the computer-generated data for a facial recognition system, Datagen scans the faces of real people from a range of ages and demographic groups. Based on this 3D information, its AI learns the composite parts of the human face so it can then start generating images of completely new people. “From scanning 100 base identities, we can create millions of new identities,” Chakon says.
For example, with enough information, the generative model could be asked to create a face of a 30-year-old white male with brown hair; it will spit out a completely new image each time.
“Based on what it learns from the real-world scans and the conditions that are put in, it can generate a completely new identity that’s not at all related to what was in the original collection of faces,” Chakon says.
Proponents of synthetic data say this can help reduce the bias that often infiltrates algorithms at the training stage. “Biased training data can result in technology solutions and products that reinforce and perpetuate real-world discrimination,” says Harris. “For example, AI systems have on many occasions been found to be poor at recognising darker skin tones. This is because the AI in question has been trained on datasets lacking diversity.”
In 2015, Google’s image recognition algorithm was called out for mislabelling images of black people as “gorillas”. With synthetic data, it is theoretically possible for AI developers to generate an endless number of faces of people of different ethnicities to train their models, meaning such gaps in the AI’s understanding are less likely.
Harris claims that some of Mindtech’s customers use its AI training platform, Chameleon, to generate diverse data from scratch, while others use it to address the lack of diversity in their existing real-world datasets. “By using computers to train AIs, we’re removing the biggest roadblock to progress: human bias,” he says.
Computers training computers
There are inevitably issues with using computer-generated images to train AI for real-world applications. “Synthetic data almost never gives the same results as a comparable amount of real data,” Rei explains. “We normally have to make some assumptions and simplifications in order to model the data-generation process. Unfortunately, this also means losing a lot of the nuances and intricacies present in the real data.”
This is easy to identify from a cursory glance at some of the faces that have been synthetically generated – they’re unlikely to fool a person into thinking they’re real. Datagen is currently investing in its photorealism capabilities, but Chakon argues realism isn’t crucial for every application.
“If you are developing a blemish detection AI for makeup application, having detail is important,” he says. “But if you’re developing a security system, it’s much less relevant whether you can identify small details on a person’s face.”
Synthetic data also isn’t a silver bullet for AI bias; it relies on the people generating the data to use such platforms responsibly. Rei adds: “Any biases that are present in the data-generation process – whether intentionally or unintentionally – will be picked up by models trained on it.”
An Arizona State University study showed that a generative model trained on predominantly white, male images of engineering professors amplified the biases in the dataset, producing images from minority modes less frequently. Even worse, the AI began “lightening the skin colour of non-white faces and transforming female facial features to be masculine” when generating new faces.
Because synthetic data programmes give developers access to unlimited amounts of data, errors made at any point in the generation process could drastically exacerbate the issue of bias.
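One simple way to check for this kind of amplification is to compare each group’s share of the training data with its share of the generated output. The Python sketch below does that. The function names and the group labels are hypothetical, and the numbers are invented for illustration; they are not figures from the Arizona State University study.

```python
from collections import Counter


def group_shares(labels):
    """Return each group's fraction of the labelled examples."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


def amplification(train_labels, generated_labels):
    """Share in the generated set minus share in the training set, per group.

    A negative value means the generator produces that group less often
    than it appeared in the data the generator learned from.
    """
    train = group_shares(train_labels)
    generated = group_shares(generated_labels)
    return {group: generated.get(group, 0.0) - train[group] for group in train}


# Invented example: the minority group makes up 20% of the training set
# but only 10% of what the model generates.
train = ["majority"] * 80 + ["minority"] * 20
generated = ["majority"] * 90 + ["minority"] * 10
print(amplification(train, generated))
```

In this made-up case the minority share falls by about ten percentage points, the pattern the study describes. A check like this only catches shifts in how often a group appears; it would not detect the altered skin tones or facial features.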
If used correctly, synthetic data may still help to improve the diversity of some datasets. “If the data distribution is very unnatural – for example, it doesn’t contain any examples of people from a particular race – then synthetically creating these examples and adding them to the data can be better than doing nothing,” Rei says. “But it will likely not be as good as collecting real data with a more accurate coverage of all races.”
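The gap-filling Rei describes starts with counting what is missing. As a rough sketch, the Python below works out how many synthetic examples each group would need to match the best-represented one. The function `synthetic_top_up`, its parameters and the group names are assumptions made for this example. It only counts; it says nothing about whether the generated examples are any good.

```python
from collections import Counter


def synthetic_top_up(labels, groups, target=None):
    """Return how many synthetic examples to add for each group.

    labels: group label of every real example in the dataset.
    groups: every group the dataset should cover, including any that
            have no real examples at all.
    target: desired examples per group; defaults to the size of the
            largest group currently in the data.
    """
    counts = Counter(labels)
    if target is None:
        target = max(counts.values(), default=0)
    return {group: max(0, target - counts.get(group, 0)) for group in groups}


# Invented example: group "c" is absent from the real data entirely.
real_labels = ["a"] * 5 + ["b"] * 2
print(synthetic_top_up(real_labels, groups=["a", "b", "c"]))
```

Passing the expected groups explicitly matters: a group with no real examples never shows up in the counts, so it would otherwise be overlooked.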
While synthetic data can make the process of creating AI models quicker, cheaper and easier for programmers, it still comes with many of the same challenges as its real-world counterpart. “Whether synthetic is better than real-world data is not really the right question,” Harris argues. “What AI developers need to do is find or create adequate amounts of appropriate data to train their system.” Using a combination of both real and artificial data may be the answer.