The promise of big data is well known, but privacy concerns often mean that scientists lack access to the kind of data they would like. A recent MIT-led study suggests that researchers can achieve similar results with synthetic data as they can with authentic data, thus bypassing potentially tricky conversations around privacy.
The approach, which uses machine learning to automatically generate the data, was born out of a desire to support scientific efforts that are denied the data they need. Although the synthetic data is entirely artificial, it was still found to be useful enough to allow scientists to test their algorithms and models.
"Once we model an entire database, we can sample and recreate a synthetic version of the data that very much looks like the original database, statistically speaking," the authors say. "If the original database has some missing values and some noise in it, we also embed that noise in the synthetic version… In a way, we are using machine learning to enable machine learning."
The system is known as a Synthetic Data Vault (SDV), and it utilizes machine learning to build models out of real databases in order to create artificial data. The algorithm itself is a form of recursive conditional parameter aggregation, which exploits the hierarchical nature of data. For instance, the researchers reveal that it can easily take a customer transaction table, and then form a multivariate model for each customer based on his or her transactions.
The model captures correlations between fields, so it would, for instance, capture the purchase amount and type together with the time of each transaction. After a model has been created for each customer, the system can then essentially model the entire database, replicating it with reliable, if made-up, data.
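To make the idea concrete, here is a minimal sketch of a single level of that process. It is not the SDV's actual code. It assumes each customer's transactions can be treated as numeric rows and described by a simple multivariate Gaussian, and the function names, the field layout (amount, hour of day) and the toy data are all invented for illustration.

```python
import numpy as np


def fit_gaussian(rows):
    """Fit a multivariate Gaussian to rows; return (mean, covariance)."""
    rows = np.asarray(rows, dtype=float)
    return rows.mean(axis=0), np.cov(rows, rowvar=False)


def aggregate_parameters(transactions_by_customer):
    """Flatten each customer's fitted parameters into one parent-table row."""
    parent_rows = {}
    for customer_id, rows in transactions_by_customer.items():
        mean, cov = fit_gaussian(rows)
        upper = np.triu_indices(len(mean))  # covariance is symmetric
        parent_rows[customer_id] = np.concatenate([mean, cov[upper]])
    return parent_rows


def sample_transactions(param_row, n_fields, n_samples, rng):
    """Rebuild the Gaussian from a parameter row and draw synthetic rows."""
    mean = param_row[:n_fields]
    cov = np.zeros((n_fields, n_fields))
    cov[np.triu_indices(n_fields)] = param_row[n_fields:]
    cov = cov + cov.T - np.diag(np.diag(cov))  # mirror the upper triangle
    return rng.multivariate_normal(mean, cov, size=n_samples)


# Made-up toy data: one customer whose spend rises with the hour of day.
rng = np.random.default_rng(0)
hour = rng.uniform(8, 20, size=200)
amount = 5.0 * hour + rng.normal(0, 5, size=200)
real = {"customer_a": np.column_stack([amount, hour])}

params = aggregate_parameters(real)
synthetic = sample_transactions(
    params["customer_a"], n_fields=2, n_samples=1000, rng=rng
)

# The synthetic rows are new, but the amount/hour relationship carries over.
real_corr = np.corrcoef(real["customer_a"], rowvar=False)[0, 1]
synth_corr = np.corrcoef(synthetic, rowvar=False)[0, 1]
print(f"real correlation: {real_corr:.2f}, synthetic: {synth_corr:.2f}")
```

In this sketch each customer ends up as a single row of model parameters, so the customer table could in turn be modeled and sampled in the same way, which is roughly where the "recursive" part of the description comes in. A real system would also have to handle categorical fields, dates and missing values, none of which this example attempts.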
The team used the SDV to generate data from five publicly available datasets. They then set 39 data scientists loose on the data to develop predictive models. The aim was to see whether the researchers could spot a difference between the work produced by scientists working with real data and the work of those using synthetic data.
To the test
Each group of scientists used the datasets it had been given to solve a predictive modeling problem, with three tests conducted for each of the five datasets. When the solutions were compared, there appeared to be no difference at all in 11 of the 15 tests, which suggests that the approach has real potential.
"Using synthetic data gets rid of the 'privacy bottleneck'—so work can get started," the researchers say.
This opens up a significant range of opportunities for researchers who have previously had a shortage of data to work with.
"Companies can now take their data warehouses or databases and create synthetic versions of them," they continue. "So they can circumvent the problems currently faced by companies like Uber, and enable their data scientists to continue to design and test approaches without breaching the privacy of the real people—including their friends and family—who are using their services."
What's more, the model should be as effective on small datasets as it is on large ones. This would enable very rapid development cycles for data systems. It could also prove invaluable in the education of data science students.
In other words, it gives people all of the benefits of big data, with none of the liabilities associated with it.