Big data analysis has never been easy. Due to its complicated and often text-heavy form, big data presents researchers with major issues when they attempt to create computer models that can easily and accurately sort such a variety of inputs.

However, Massachusetts Institute of Technology graduate student Max Kanter and his thesis advisor, Kalyan Veeramachaneni, have come one step closer to cracking the code with their “Data Science Machine” (DSM).

The DSM aims to do part of a researcher’s job by deciding for itself which features of a large database to examine. It automates this step using techniques known as “feature engineering.”


Feature engineering can be defined as the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data (Tomasz Malisiewicz).

Although feature engineering is not the most popular of all IT subjects, many point out that it is crucial to AI and machine learning.

Jim McGregor, principal analyst at Tirias Research, says that the advancement of the DSM “…is really about deep learning – the ability of server platforms to analyze data and develop intelligent algorithms.”

Using feature engineering, the DSM is able to find structural relationships inherent in the database design by tracking correlations between data in different tables. The DSM then imports data from one table into a second, examines potential associations, and generates feature candidates. From there it can begin to discover values such as the minima of averages and the average of sums.
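As a rough illustration of this step, the sketch below builds "average of sums" and "minimum of averages" features from two related tables. It is a minimal example and not the DSM's actual code: the tables (`orders`, `items`), their columns, and the helper names are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy tables: each order belongs to a customer,
# and each item belongs to an order.
orders = [
    {"order_id": 1, "customer_id": "a"},
    {"order_id": 2, "customer_id": "a"},
    {"order_id": 3, "customer_id": "b"},
]
items = [
    {"order_id": 1, "price": 10.0},
    {"order_id": 1, "price": 20.0},
    {"order_id": 2, "price": 6.0},
    {"order_id": 3, "price": 8.0},
    {"order_id": 3, "price": 4.0},
]


def group_values(rows, key, value):
    """Collect the `value` column of `rows` into lists keyed by `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return groups


# Step 1: bring item data into the orders table as per-order aggregates.
prices_by_order = group_values(items, "order_id", "price")
for order in orders:
    prices = prices_by_order[order["order_id"]]
    order["sum_price"] = sum(prices)
    order["avg_price"] = mean(prices)


# Step 2: aggregate those aggregates again, one level up, per customer.
def customer_features(order_rows):
    sums = group_values(order_rows, "customer_id", "sum_price")
    avgs = group_values(order_rows, "customer_id", "avg_price")
    return {
        cust: {
            "avg_of_sums": mean(sums[cust]),
            "min_of_avgs": min(avgs[cust]),
        }
        for cust in sums
    }


features = customer_features(orders)
```

Each entry in `features` is one row of candidate features for a customer. Stacking aggregations along the links between tables in this way is what yields values such as an average of sums.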

The DSM is also able to search for and register categorical data.

After it has generated these candidates, the DSM weeds through them and keeps only those that show actual correlations.
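One simple way to picture this weeding-out step is a correlation filter. The sketch below assumes Pearson correlation against the value being predicted and a hypothetical cutoff of 0.5; the article does not say which statistic or threshold the DSM really uses, and the feature names and numbers are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_features(candidates, target, threshold=0.5):
    """Keep candidates whose absolute correlation with the target meets the threshold."""
    kept = {}
    for name, values in candidates.items():
        r = pearson(values, target)
        if abs(r) >= threshold:
            kept[name] = r
    return kept


# Hypothetical candidate features and the quantity we want to predict.
target = [1, 2, 3, 4, 5]
candidates = {
    "avg_of_sums": [2, 4, 5, 8, 10],  # tracks the target closely
    "min_of_avgs": [5, 1, 4, 2, 3],   # only weakly related
    "constant": [7, 7, 7, 7, 7],      # carries no information
}

selected = select_features(candidates, target)
```

Here only `avg_of_sums` survives the filter; the weakly related and constant candidates are dropped.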

When it has finally obtained a reduced set of potential features, the DSM then tests those features on sample data, mixing and matching them to optimize the accuracy of the resulting predictions.
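The mixing and matching can be sketched as a search over feature combinations, each scored on held-out sample data. The example below is only an illustration under stated assumptions: it swaps in a simple nearest-centroid classifier and an exhaustive search, and the features `f1` and `f2` and all data values are hypothetical.

```python
from itertools import combinations


def centroid(rows, cols):
    """Mean value of each selected column over the given rows."""
    return [sum(r[c] for r in rows) / len(rows) for c in cols]


def fit(train, cols):
    """Nearest-centroid model: one mean vector per class label."""
    labels = sorted({r["label"] for r in train})
    return {lab: centroid([r for r in train if r["label"] == lab], cols)
            for lab in labels}


def predict(model, row, cols):
    """Return the label whose centroid is closest to the row."""
    def dist(center):
        return sum((row[c] - m) ** 2 for c, m in zip(cols, center))
    return min(model, key=lambda lab: dist(model[lab]))


def accuracy(train, valid, cols):
    """Fraction of validation rows classified correctly using only `cols`."""
    model = fit(train, cols)
    hits = sum(predict(model, r, cols) == r["label"] for r in valid)
    return hits / len(valid)


def best_subset(train, valid, features):
    """Score every non-empty feature subset; prefer higher accuracy, then fewer features."""
    scored = []
    for k in range(1, len(features) + 1):
        for cols in combinations(features, k):
            scored.append((accuracy(train, valid, cols), cols))
    return max(scored, key=lambda s: (s[0], -len(s[1])))


# Hypothetical sample data: f1 separates the classes, f2 is mostly noise.
train = [
    {"f1": 1.0, "f2": 5.0, "label": 0},
    {"f1": 2.0, "f2": 1.0, "label": 0},
    {"f1": 8.0, "f2": 4.0, "label": 1},
    {"f1": 9.0, "f2": 3.0, "label": 1},
]
valid = [
    {"f1": 1.5, "f2": 0.0, "label": 0},
    {"f1": 8.5, "f2": 9.0, "label": 1},
    {"f1": 3.0, "f2": 6.0, "label": 0},
    {"f1": 7.0, "f2": 1.0, "label": 1},
]

best_score, best_cols = best_subset(train, valid, ["f1", "f2"])
```

An exhaustive search like this only works for a handful of features, since the number of subsets doubles with each one added; a system handling many candidates would need a cheaper search strategy.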


For time-sensitive projects, the DSM could make all the difference.

The DSM needs only 12 hours to produce prediction algorithms that would take teams of humans months to create. Although the prototype’s predictions have been only 87-96% as accurate as those of human teams, using the machine can at the very least cut the time a project requires.

“Think about what it would take to develop a drug for a supervirus,” McGregor reminds us. “You don’t have months, you have days before a pandemic breaks out… It’s not about finding the right answers, but finding the potential answers while eliminating many or most of the wrong ones.”

Although the DSM has great potential in the field of big data, some researchers worry about the loss of skill that often accompanies the development of automated methods.

Rob Enderle, principal analyst of the Enderle Group, had this to say: “A critical flaw in a future system could go unnoticed… and lead to catastrophic consequences.”

Hopefully Kanter and Veeramachaneni’s Data Science Machine will be a gift to the big data community, and not a curse.