Benchmarking Success: Synthetic Data Matches and Surpasses Real Data in AI Training

Data Machine Intelligence (DMI) and FlySight have conducted a significant test comparing the widely used Singapore Maritime Dataset with a synthetic dataset created using the DMI Labs platform. The results are promising, showing that synthetic data can effectively match and enhance real-world data in AI training.

The Challenge:

In AI development, especially in safety-critical areas like aerospace and defense, acquiring sufficient real-world data for training can be challenging. For instance, training AI to detect ships in distress or to avoid aircraft collisions requires real data from extreme conditions, which is rare, costly, and even risky to obtain. This scarcity of data often hinders the progress of AI applications in such environments.

The Test:

Data Machine Intelligence, in partnership with FlySight, conducted a comprehensive test using three datasets: 2,893 images of real data, 4,062 images of synthetic data, and a hybrid dataset combining both. The test environments simulated various real-world conditions, including different lighting, weather effects, and dynamic camera angles. This rigorous setup aimed to evaluate the performance of YOLO-based object detectors and validate the effectiveness of synthetic data.

The Results:

The results were promising. Training on the synthetic dataset achieved a mean Average Precision (mAP) of 32%, on par with the real dataset’s mAP of 31%. The hybrid dataset, combining real and synthetic data, achieved the best result with an mAP of 39%. This confirms that synthetic data can reliably substitute for or enhance real-world data, reducing costs, increasing speed, and enabling AI training in contexts where real data is unavailable.
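To put the reported figures in perspective, the short Python sketch below computes how far each dataset sits from the real-only baseline, both in percentage points and in relative terms. Only the three mAP values come from the test; the function name gain_over_baseline and the choice of the real dataset as the baseline are illustrative assumptions, not part of the DMI or FlySight tooling.

```python
# Reported mAP values (percent) from the test described above.
reported_map = {"real": 31.0, "synthetic": 32.0, "hybrid": 39.0}


def gain_over_baseline(scores, baseline="real"):
    """Hypothetical helper: map each non-baseline dataset to a tuple of
    (absolute gain in percentage points, relative gain in percent)."""
    base = scores[baseline]
    return {
        name: (value - base, (value - base) / base * 100.0)
        for name, value in scores.items()
        if name != baseline
    }


gains = gain_over_baseline(reported_map)
for name, (abs_pp, rel_pct) in gains.items():
    print(f"{name}: {abs_pp:+.1f} pp, {rel_pct:+.1f}% relative to real-only")
```

On these numbers, the hybrid dataset is 8 percentage points above the real-only baseline, roughly a 26% relative improvement, while the synthetic-only dataset is within about 1 percentage point of it.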

What the Experts Say:


Niccolò Camarlinghi, Head of Research at FlySight:
“The first thing that comes to mind when someone asks us to develop an AI method is: where do we get the data for training and testing [… and] considering privacy and ownership, can I use this data […]? Synthetic data generation has the potential to address all these issues and many more. Our tests with DMI have shown that this is not just a possibility for the future; it is a reality right now.”

Matteo Marone, CTO Synthetic Data at Data Machine Intelligence:
“The test shows the possibilities of synthetic data and has given us clear confirmation that […] it provides real value. […] Our mission is to accelerate the development of safe and robust AI systems – having a fine-tuned dataset generation engine at hand is a big step forward in this way.”

Max Najork, Co-Founder of Data Machine Intelligence:
“These results show the great impact synthetic data has for Safe AI development. It lowers costs and opens up training and validation opportunities for areas previously inaccessible. Reliable synthetic data generation facilitates training and retraining, enabling new players to enter the market, allowing existing systems to adapt to new areas more efficiently, and making it possible to validate systems for precisely defined operational design domains.”

Read the full press release with additional details on the test setup on FlySight’s website.

DMI will continue to work with FlySight to investigate corner cases further, and expand into other data types such as IR, Lidar, and Radar. This continued collaboration aims to refine the use of synthetic data in AI training and explore new applications and scenarios.
Reach out to us if you are interested in learning more or partnering with us.
Are you at ILA Berlin 2024? Let’s talk: visit our VR-Cockpit Demonstrator and learn more about DMI Labs, our end-to-end platform for safe AI development.
For more information, visit our ILA info page.