Synthetic data’s potential to train AI systems is vast, and it helps to break down barriers to innovation. How can businesses access precisely annotated and privacy-compliant data?
In 2006, the British mathematician Clive Humby coined the now-familiar phrase, ‘Data is the new oil’. However, 16 years on from his pronouncement, the realities and consequences of this “new oil” have caused many to turn to a renewable alternative to power AI: synthetic data.
Move over software; it’s now artificial intelligence “eating the world” – and its seemingly insatiable appetite for training data means developers urgently need a new source of accurate, privacy-compliant data to fuel their models. Training a simple visual recognition AI requires upwards of 100,000 perfectly annotated, privacy-compliant images. The challenge for AI developers, then, is where to source this data – and in high enough volume.
The companies with the greatest access to real-world data – Google, Apple, Meta, et al. – don’t tend to make their data sets available to other companies. This leaves most companies developing AI short of the privacy-compliant and diverse training data they need to deliver smarter, safer, fairer AI systems.
Synthetic data is data generated by a computer to train an AI. The beauty of synthetic data is that it’s just data to an AI. Generating data using a computer solves the thorny issue of privacy compliance (there are no ‘real’ people involved), and it allows developers to broaden the diversity of their data while addressing edge cases – scenarios that are difficult, dangerous or impossible to gather real-world training data on.
It is well established that real-world data can reflect and perpetuate systemic bias within our societies. All too often, AI systems struggle to recognise darker skin tones. This is why we’re so focused at Mindtech on enabling our customers to create synthetic training data that reflects the diversity of people, and to build AI systems that protect and serve people equally.
The hidden AI roadblock
Whilst the potential for AI is high, barriers to training these artificial intelligences to operate correctly are impeding innovation and leading to sub-standard, often biased, AI solutions.
Gartner reports that engineers and data scientists spend up to 80% of their time gathering, cleaning, and manually annotating real-world data. This makes real-world data a scarce resource, and finding it in sufficient quantity and quality has become the biggest hidden roadblock to AI’s progress.
Synthetic data’s potential to train artificial intelligence systems is vast, and it helps to break down the barriers to AI innovation. With it, businesses can gain access to data that is automatically and precisely annotated, and privacy-compliant. And with the use of a synthetic data creation platform, they’re able to generate data in almost unlimited quantities.
Applying synthetic data to the real world
Training AI systems requires data to define the model’s parameters and determine its function. As a result, historical biases which have left minority groups marginalised need to be removed from training data to ensure non-biased systems.
Biases in data, such as an all-white sample, westernised features, or the lack of a skin tone scale in recognition software, can profoundly affect the efficacy and fairness of AI systems. It is up to us to make sure these systemic issues do not become ingrained into the fabric of the digital world as they have done in the real world. Google’s new Monk Skin Tone Scale is one way of addressing this, but more needs to be done.
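As a rough illustration of how a gap in representation might be spotted, the sketch below counts how a labelled dataset is spread across the ten tones of the Monk Skin Tone Scale and lists the tones that fall below a chosen share. The record format, the `skin_tone` field and the 5% threshold are assumptions made for this example only; they are not part of any Google or Mindtech tooling.

```python
from collections import Counter

# The Monk Skin Tone Scale defines ten tones; here they are numbered 1 to 10.
MONK_TONES = range(1, 11)


def tone_shares(records):
    """Return the share of records for each tone, including tones with no samples.

    Each record is assumed (hypothetically) to be a dict with an integer
    "skin_tone" field between 1 and 10.
    """
    counts = Counter(r["skin_tone"] for r in records)
    total = sum(counts[t] for t in MONK_TONES)
    if total == 0:
        raise ValueError("no records with a valid skin tone label")
    return {t: counts[t] / total for t in MONK_TONES}


def underrepresented(records, min_share=0.05):
    """List the tones whose share of the dataset is below min_share."""
    shares = tone_shares(records)
    return [t for t, share in shares.items() if share < min_share]
```

A dataset in which every sample is labelled tone 1 or tone 10 would, for instance, report tones 2 to 9 as under-represented. A check like this only measures balance in the labels; it says nothing about how accurate those labels are.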
Unlimited, renewable data
To create this synthetic data, developers need an artificially-generated 3D world to simulate situations and scenarios. With the power of these virtual scenes, businesses can test out scenarios involving millions of automated, customisable avatars. These avatars are able to automatically act out the various situations needed and generate the vast amounts of training data required to train AI systems.
This is why we created the Chameleon platform. With the use of the platform, businesses can make millions of customisable actors of every age, size, colour and gender, to play out almost unlimited scenarios so AI systems can be better trained.
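To make the idea of balanced, automatically annotated data more concrete, here is a minimal sketch. It is not the Chameleon platform’s API: the attribute lists, the `generate_actors` function and the random bounding box that stands in for rendering a 3D scene are all hypothetical. What it shows is the principle that, because the computer creates every actor, each sample arrives with its labels already attached and every combination of attributes can be represented equally.

```python
import itertools
import random

# Hypothetical attribute values, chosen only for illustration.
AGE_GROUPS = ["child", "adult", "senior"]
BODY_SIZES = ["small", "medium", "large"]
SKIN_TONES = list(range(1, 11))  # ten tones, as in the Monk Skin Tone Scale
GENDERS = ["female", "male", "non-binary"]

FRAME_WIDTH, FRAME_HEIGHT = 1920, 1080


def generate_actors(per_combination, seed=0):
    """Create the same number of annotated actors for every attribute combination."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    actors = []
    combos = itertools.product(AGE_GROUPS, BODY_SIZES, SKIN_TONES, GENDERS)
    for age, size, tone, gender in combos:
        for _ in range(per_combination):
            # Stand-in for placing the actor in a rendered scene:
            # a random bounding box that fits inside the frame.
            w = rng.randint(40, 200)
            h = rng.randint(80, 400)
            x = rng.randint(0, FRAME_WIDTH - w)
            y = rng.randint(0, FRAME_HEIGHT - h)
            actors.append({
                "age_group": age,
                "body_size": size,
                "skin_tone": tone,
                "gender": gender,
                "bbox": (x, y, w, h),  # known exactly, no manual annotation
            })
    return actors
```

With the lists above there are 270 attribute combinations, so `generate_actors(2)` returns 540 annotated actors, 54 for each skin tone. A real pipeline would render images from a simulated scene rather than draw random boxes, but the annotations would be produced in the same way: as a by-product of generation.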
Solving a systemic problem
Whilst society is making progress in tackling discrimination, poor-quality and biased real-world datasets risk amplifying and perpetuating inequality. We need to find new ways to ensure the innovations of the future do not fall foul of bias that belongs in the past. To do this, developers need to embrace the use of synthetic training data. By using computer-generated data to train AIs, we are able to remove the biggest roadblock of all… human bias.