Editor’s note: In their book, “Data Science: Concepts and Practice,” authors Vijay Kotu and Bala Deshpande explain the core principles and applications of modern data science. Kotu is vice president of analytics at ServiceNow; Deshpande is a data scientist and consultant. This article, which focuses on the key characteristics and features of data science, is adapted with permission.
In the past few decades, a massive accumulation of data has coincided with the advancement of information technology, connected networks, and the businesses they enable. This trend is coupled with a steep decline in data storage and data processing costs.
The applications built on these advancements, such as digital businesses, social networking, and mobile technologies, unleash a large amount of complex, heterogeneous data that is waiting to be analyzed. Traditional analysis techniques like dimensional slicing, hypothesis testing, and descriptive statistics can only go so far in information discovery.
A new paradigm is needed to manage the massive volume of data, explore the interrelationships among thousands of variables, and deploy machine learning algorithms to extract insights from datasets. A set of frameworks, tools, and techniques is needed to intelligently assist humans in processing all this data and extracting valuable information. Data science is one such paradigm: it can handle large volumes of data with multiple attributes and deploy complex algorithms to search for patterns.
The sheer volume of data captured by organizations is increasing exponentially. The rapid decline in storage costs and advancements in capturing every transaction and event, combined with the business need to extract as much value as possible from data, create a strong motivation to store more data than ever.
As data becomes more granular, the need to use large volumes of it to extract information increases. A rapid increase in data volume exposes the limitations of current analysis methodologies. In some applications, the time to build generalization models is critical, and data volume plays a major part in determining the time frame of development and deployment.
The three core characteristics of the Big Data phenomenon are high volume, high velocity, and high variety. The variety of data relates to the multiple types of values (numerical, categorical), formats of data (audio files, video files), and the application of the data (location coordinates, graph data).
Every single record or data point contains multiple attributes or variables that provide context for the record. For example, every user record of an ecommerce site can contain attributes such as products viewed, products purchased, user demographics, frequency of purchase, clickstream, etc.
Determining the most effective offer for an ecommerce user can involve computing information across these attributes. Each attribute can be thought of as a dimension in the data space. The user record has multiple attributes and can be visualized in multidimensional space. The addition of each dimension increases the complexity of analysis techniques.
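As a minimal sketch of this idea, the snippet below represents a single user record as a point in multidimensional space, one dimension per attribute. The attribute names and values are invented for illustration, not drawn from any real ecommerce dataset.

```python
# Hypothetical ecommerce user record; each attribute is one dimension
# of the data space the record lives in.
user_record = {
    "products_viewed": 42,
    "products_purchased": 5,
    "age": 34,
    "purchases_per_month": 1.8,
}

# Treating each attribute as a dimension turns the record into a point
# in a 4-dimensional space. Sorting the keys fixes the dimension order.
point = [user_record[name] for name in sorted(user_record)]
print(len(point))  # the number of dimensions equals the number of attributes
```

Every additional attribute captured about the user adds one more dimension to this space, which is why analysis complexity grows with the richness of each record.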
A simple linear regression model with one input dimension is relatively easy to build compared to a multiple linear regression model with many input dimensions. As the dimensional space of the data increases, a scalable framework that works well with multiple data types and multiple attributes is needed. In text mining, for instance, a document or article becomes a data point, with each unique word as a dimension; the resulting dataset can have anywhere from a few hundred to hundreds of thousands of attributes.
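The contrast between one and many input dimensions can be sketched with ordinary least squares, and the text-mining case with a tiny document-term matrix. This is an illustrative example using NumPy, with synthetic data and made-up coefficients; it is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# One input dimension: simple linear regression via least squares.
x = rng.normal(size=(n, 1))
y1 = 3.0 * x[:, 0] + rng.normal(scale=0.1, size=n)
X1 = np.column_stack([np.ones(n), x])        # intercept + 1 feature
w1, *_ = np.linalg.lstsq(X1, y1, rcond=None)

# Five input dimensions: the same machinery, but the model must now
# estimate one coefficient per dimension.
X5 = rng.normal(size=(n, 5))
true_w = np.array([1.0, -2.0, 0.5, 4.0, -1.5])  # invented coefficients
y5 = X5 @ true_w + rng.normal(scale=0.1, size=n)
Xd = np.column_stack([np.ones(n), X5])       # intercept + 5 features
w5, *_ = np.linalg.lstsq(Xd, y5, rcond=None)

print(w1.shape, w5.shape)  # (2,) vs (6,): intercept plus one weight per dimension

# Text mining pushes dimensionality much further: each unique word is a
# dimension, so even two toy documents yield a document-term matrix.
docs = ["data science finds patterns in data", "patterns emerge from data"]
vocab = sorted({word for doc in docs for word in doc.split()})
dtm = [[doc.split().count(word) for word in vocab] for doc in docs]
print(len(vocab))  # the number of dimensions equals the vocabulary size
```

With real corpora, the vocabulary (and hence the dimensionality) grows into the tens or hundreds of thousands, which is what motivates scalable frameworks for high-dimensional data.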