Editor’s note: The practice of uncovering valuable insights from data has been around for centuries. In the modern enterprise, it’s called data science, which has evolved into a critical capability for companies to compete in the digital age.
In their book, “Data Science: Concepts and Practice,” authors Vijay Kotu and Bala Deshpande explain the core principles and applications of modern data science and performance analytics. Kotu is vice president of analytics at ServiceNow; Deshpande is a data scientist and consultant. This article, from the book’s introduction, has been adapted with permission.
Data science is a collection of techniques that extract value from data by finding useful patterns, connections, and relationships within it. It has become an essential tool for any organization that collects, stores, and processes data as part of its operations.
As a popular buzzword, data science spawns a wide variety of definitions and is also commonly known as knowledge discovery, machine learning, predictive analytics, and data mining. Each term, though, has a slightly different connotation, depending on the context.
Despite its growing recognition, data science’s underlying methods are decades if not centuries old. Engineers and scientists have been using predictive models since the beginning of the 19th century. Humans have always been forward-looking creatures, and predictive sciences are manifestations of this curiosity.
Almost every organization and business uses data science today. The science in data science lies in its use of evidence-based methods built on empirical knowledge and historical observations.
As the ability to collect, store, and process data has increased in line with Moore’s Law, which holds that computing hardware capabilities double every two years, data science is being applied in an increasing number of fields. In previous decades, building a production-quality regression model took several dozen hours. Today, sophisticated machine learning models involving hundreds of predictors with millions of records can run on a laptop computer in a matter of seconds.
The importance of data preparation
The process involved in data science, however, has not changed since those early days and is not likely to change much in the foreseeable future. To get meaningful results from any data, a major effort preparing, cleaning, scrubbing, or standardizing the data is still required before learning algorithms can begin to crunch them.
What may change is the automation available to do that work. Currently, this process is iterative and requires analysts to be aware of best practices, but smart automation may soon become common practice. That would shift the focus to the most important aspect of data science: interpreting the results of the analysis to make decisions. It would also extend the reach of data science to a wider audience.
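To make the preparation step concrete, here is a minimal sketch in Python of cleaning and standardizing a single numeric field before any learning algorithm sees it. The records, the `income` field, and the `prepare` helper are all hypothetical and invented for illustration; real preparation work usually involves many more checks than this.

```python
import math
from statistics import mean, pstdev


def prepare(records, field):
    """Clean one numeric field, then standardize it to z-scores.

    Hypothetical helper for illustration only, not taken from the book.
    """
    # Cleaning: keep only records whose field parses as a finite number.
    cleaned = []
    for rec in records:
        try:
            value = float(str(rec.get(field)).strip())
        except ValueError:
            continue  # missing or non-numeric entries such as None or "n/a"
        if math.isfinite(value):
            cleaned.append({**rec, field: value})

    if not cleaned:
        return []

    # Standardizing: rescale to zero mean and unit (population) standard deviation.
    values = [rec[field] for rec in cleaned]
    mu = mean(values)
    sigma = pstdev(values)
    for rec in cleaned:
        rec[field + "_z"] = 0.0 if sigma == 0 else (rec[field] - mu) / sigma
    return cleaned


# Toy records with the usual problems: stray whitespace, text placeholders, gaps.
raw_records = [
    {"id": 1, "income": " 52000 "},
    {"id": 2, "income": "n/a"},
    {"id": 3, "income": None},
    {"id": 4, "income": "61000"},
    {"id": 5, "income": 43000},
]

prepared = prepare(raw_records, "income")
kept_ids = [rec["id"] for rec in prepared]  # records 2 and 3 are dropped
```

Dropping unusable records is only one possible policy. Depending on the problem, an analyst might instead impute missing values or flag them for review, which is part of why this stage remains iterative.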
So, which data science techniques are the most important to master? The vast majority of contemporary data science practitioners use a handful of very powerful techniques, such as decision trees, regression models, deep learning, and clustering.
However, as with all 80/20 rules, the long tail, which is made up of a large number of specialized techniques, is where the value lies. Depending on what is needed, the best approach may require a relatively obscure technique or a combination of several not-so-commonly used procedures. Thus, learning data science and its methods systematically is a proven way to reap consistent rewards.
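As a small illustration of one of the common techniques named above, the sketch below fits a simple regression line by ordinary least squares using only the Python standard library. The data points and the `fit_line` and `predict` helpers are made up for this example and are not from the book; a practitioner would normally reach for an established library rather than hand-rolled code.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(slope, intercept, x):
    """Apply the fitted line to a new observation."""
    return intercept + slope * x


# Toy historical observations (invented): they lie exactly on y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)      # slope 2.0, intercept 1.0
forecast = predict(slope, intercept, 6)  # 13.0
```

The same pattern of learning from historical observations and then applying the result to a new case carries over to the other techniques mentioned, although their fitting procedures differ.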