26 September 2023

Why AI needs a steady diet of synthetic data

Ashleigh Hollowell* says synthetic data could be the answer to more successful AI.

Artificial intelligence (AI) may be eating the world as we know it, but experts say AI itself is also starving — and needs to change its diet.

One company says synthetic data is the answer.

“Data is food for AI, but AI today is underfed and malnourished,” said Kevin McNamara, CEO and founder of synthetic data platform provider Parallel Domain, which just raised $30 million in a Series B round led by March Capital.

“That’s why things are growing slowly.

“But if we can feed that AI better, models will grow faster and in a healthier way.

“Synthetic data is like nourishment for training AI.”

Research has shown that about 90 per cent of AI and machine learning (ML) deployments fail.

A Datagen report from earlier this year pointed out that a lot of failure is due to the lack of training data.

It found that 99 per cent of computer vision professionals say they have had an ML project axed specifically because of the lack of data to see it through.

Even projects that aren’t fully cancelled for lack of data experience significant delays that knock them off track, 100 per cent of respondents reported.

In that vein, Gartner predicts synthetic data will increasingly be used as a supplement for AI and ML training purposes.

The research giant projects that by 2024 synthetic data will be used to accelerate 60 per cent of AI projects.

Synthetic data is generated by machine learning algorithms that ingest real data to train on behavioural patterns and create simulated data that retains the statistical properties of the original dataset.

The resulting data replicates real-world circumstances, but unlike standard anonymized datasets, it’s not vulnerable to the same flaws as real data.
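To make "retains the statistical properties" concrete, here is a minimal Python sketch. It is not the method used by any company named in this article: it swaps the machine learning generator for the simplest possible stand-in, a two-variable Gaussian fitted to the data. The "real" dataset (vehicle speed and braking distance) is made up for illustration, and the function names are hypothetical.

```python
import math
import random


def fit_gaussian_2d(rows):
    """Estimate means, standard deviations and correlation of two columns."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in rows) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in rows) / (n - 1)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (n - 1)
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)
    return {
        "mean_x": mean_x, "mean_y": mean_y,
        "sd_x": sd_x, "sd_y": sd_y,
        "corr": cov / (sd_x * sd_y),
    }


def sample_synthetic(params, count, rng):
    """Draw new rows that follow the fitted means, spreads and correlation."""
    rho = params["corr"]
    # Guard against a tiny negative value caused by floating-point rounding.
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    rows = []
    for _ in range(count):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = params["mean_x"] + params["sd_x"] * z1
        y = params["mean_y"] + params["sd_y"] * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows


rng = random.Random(0)

# Made-up stand-in for a real dataset: (speed in km/h, braking distance in m).
real = []
for _ in range(2000):
    speed = rng.gauss(50, 10)
    real.append((speed, 0.6 * speed + rng.gauss(0, 4)))

params = fit_gaussian_2d(real)
synthetic = sample_synthetic(params, 2000, rng)
refit = fit_gaussian_2d(synthetic)

print("real corr:     ", round(params["corr"], 2))
print("synthetic corr:", round(refit["corr"], 2))
```

The synthetic rows are freshly sampled rather than copied from the original rows, yet their means, spreads and correlation should land close to the original's. Production generators are far richer than a single Gaussian, but the fit-then-sample pattern is the same idea.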

Pulling AI out of the ‘Stone Age’

It may sound unusual to hear that a technology as advanced as AI is stuck in a “Stone Age” of sorts, but that’s what McNamara sees — and without adoption of synthetic data, it will stay that way, he says.

“Right now AI development is kind of the way computer programming was in the ‘60s or ‘70s when people used punch card programming — a manual, labor-intensive process,” he said.

“Well, the world eventually moved away from this and to digital programming.

“We want to do that for AI development.”

The three biggest bottlenecks keeping AI in the Stone Age are the following, according to McNamara:

  1. Collecting real-world data, which is not always feasible. Even for something like jaywalking, which happens fairly often in cities around the world, if you need millions of examples to train your algorithm, that quickly becomes unattainable for companies to go out and get from the real world.
  2. Labelling, which often requires thousands of hours of human time and can be inaccurate because, well, humans make errors.
  3. Iterating on the data once it is labelled, which requires you to adjust sensor configurations and other settings and then apply the data to actually begin training your AI.

“That whole process is so slow,” McNamara said.

“If you can change those things really fast, you can actually discover better setups and better ways to develop your AI in the first place.”

Enter stage right: Synthetic data

Parallel Domain generates map-based virtual worlds, which it dubs “digital cousins” of real-world scenarios and geographies.

These worlds can be altered and manipulated to, for instance, have more jaywalking or rain, to aid with training autonomous vehicles.

Because the worlds are digital cousins and not digital twins, customization can simulate the sometimes harder-to-obtain — but essential for training — data that companies normally would have to go out and get themselves.

The platform allows users to tailor it to their needs via an API, so they can move or manipulate factors precisely the way they want.

This accelerates the AI training process and removes roadblocks of time and labor.

The company claims that in a matter of hours it can provide training datasets that are ready for its customers to use — customers that include the Toyota Research Institute, Google, Continental and Woven Planet.

“Customers can go into the simulated world and make things happen or pull data from that world,” McNamara said.

“We have knobs for different kinds of categories of assets and scenarios that could happen, as well as ways for customers to plug in their own logic for what they see, where they see it and how those things behave.”

Then, customers need a way to pull data from that world into the configuration that matches their setup, he explained.

“Our sensor configuration tools and label configuration tools allow us to replicate the exact camera setup or the exact lidar and radar and labeling setup that a customer would see,” he said.
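The "knobs" idea can be illustrated with a toy Python sketch. Everything in it is hypothetical: `ScenarioKnobs`, `generate_frames` and the parameter names are invented for this example and are not Parallel Domain's API, and the "world" is just random numbers rather than a rendered scene. It only shows why a simulator can dial up rare events and why labels come for free.

```python
import random
from dataclasses import dataclass


@dataclass
class ScenarioKnobs:
    """Hypothetical controls for a toy simulated world (not a real API)."""
    jaywalk_probability: float   # chance a pedestrian crosses outside a crosswalk
    rain_probability: float      # chance a frame is generated as rainy
    pedestrians_per_frame: int


def generate_frames(knobs, frame_count, rng):
    """Produce frames whose labels are known exactly, because we created them."""
    frames = []
    for frame_id in range(frame_count):
        pedestrians = [
            {
                "x_m": round(rng.uniform(-20, 20), 1),
                "jaywalking": rng.random() < knobs.jaywalk_probability,
            }
            for _ in range(knobs.pedestrians_per_frame)
        ]
        frames.append({
            "frame_id": frame_id,
            "raining": rng.random() < knobs.rain_probability,
            "pedestrians": pedestrians,
        })
    return frames


def jaywalk_share(frames):
    """Fraction of all pedestrians labelled as jaywalking."""
    pedestrians = [p for f in frames for p in f["pedestrians"]]
    return sum(p["jaywalking"] for p in pedestrians) / len(pedestrians)


rng = random.Random(7)

# A "typical" city versus a stress-test world with the rare event turned up.
typical = generate_frames(ScenarioKnobs(0.01, 0.1, 5), 1000, rng)
stressed = generate_frames(ScenarioKnobs(0.30, 0.6, 5), 1000, rng)

print("typical jaywalk share: ", round(jaywalk_share(typical), 3))
print("stressed jaywalk share:", round(jaywalk_share(stressed), 3))
```

Turning one number from 0.01 to 0.30 yields far more jaywalking examples from the same number of frames, each already labelled, which is the kind of fast iteration McNamara describes.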

Synthetic data, generative AI

Not only is synthetic data useful for AI and ML model training, it can be applied to make generative AI — an already rapidly growing use of the technology — develop even faster.

Parallel Domain is eyeing the field as the company enters 2023 with fresh capital.

It hopes to multiply the data that generative AI needs to train, so it can become an even more powerful tool for content creation.

Its R&D team is focusing on the variety and detail in the synthetic data simulations it can provide.

“I’m excited about generative AI in our space,” McNamara said.

“We’re not here to create an artistic interpretation of the world.

“We’re here to actually create a digital cousin of the world.

“I think generative AI is really powerful in looking at examples of images from around the world, then pulling those in and creating interesting examples and novel information inside of synthetic data.

“Because of that, generative AI will be a large part of the technology advancements that we’re investing in for the coming year.”

The value of synthetic data isn’t limited to AI.

Given the vast amount of data needed to create realistic virtual environments, it’s also the only practical approach to move the metaverse forward.

Parallel Domain is part of the fast-growing synthetic data startup sector, which Crunchbase previously reported is seeing a swath of funding.

Datagen, Gretel AI and Mostly AI are some of its competitors that have also raised multiple millions in the last year.

*Ashleigh Hollowell is a professional journalist who thoughtfully utilizes resources, analysis, creativity, and integrity.

This article first appeared at venturebeat.com
