Data science problems can also be classified into tasks such as classification, regression, association analysis, clustering, anomaly detection, recommendation engines, feature selection, time series forecasting, deep learning, and text mining.
- Classification and regression techniques predict a target variable based on input variables. The prediction is based on a generalized model built from a previously known dataset. In regression tasks, the output variable is numeric (e.g., the mortgage interest rate on a loan). In classification tasks, the output variable is categorical, with two or more possible classes (e.g., the yes or no decision to approve a loan).
- Deep learning is a machine learning subfield that uses artificial neural networks with many layers. It is increasingly used for classification and regression problems.
- Clustering is the process of identifying the natural groupings in a dataset. For example, clustering is helpful in finding natural clusters in customer datasets, which can be used for market segmentation. Since this is an unsupervised technique, it is up to data scientists to investigate why these clusters form in the data and to characterize what makes each cluster distinct.
- Association analysis identifies items that frequently occur together. In retail analytics, it is common to identify pairs of items that are purchased together so that those items can be bundled or placed next to each other. This task, also called market basket analysis, is commonly used in cross-selling.
- Recommendation engines are systems that recommend items to users based on each user's individual preferences.
- Anomaly or outlier detection identifies the data points that are significantly different from other data points in a dataset. One common application is detecting credit card transaction fraud.
- Time series forecasting is the process of predicting the future value of a variable (e.g., temperature) based on its historical values, which may exhibit a trend and seasonality.
- Text mining is a data science application where the input data is text in the form of documents, messages, emails, or web pages. To make text usable by data science algorithms, the text files are first converted into document vectors, in which each unique word is an attribute. Once the text is converted to document vectors, standard data science tasks such as classification and clustering can be applied.
- Feature selection is a process in which the attributes in a dataset are reduced to the few attributes that are most relevant to the task at hand.
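The document-vector conversion mentioned under text mining can be sketched in a few lines. The following is a minimal illustration, not a production tokenizer: it assumes simple word-count vectors, and the helper names `tokenize` and `document_vectors` as well as the two sample sentences are hypothetical, introduced only for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def document_vectors(documents):
    """Return (vocabulary, vectors) with one word-count vector per document."""
    token_lists = [tokenize(doc) for doc in documents]
    # Each unique word across all documents becomes one attribute (column).
    vocabulary = sorted({tok for toks in token_lists for tok in toks})
    vectors = []
    for toks in token_lists:
        counts = Counter(toks)
        # A Counter returns 0 for words that do not occur in this document.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


# Hypothetical sample documents.
docs = [
    "Loan approved for the customer",
    "Loan denied: the customer missed payments",
]
vocab, vecs = document_vectors(docs)
print(vocab)
for vec in vecs:
    print(vec)
```

Each row of `vecs` is a numeric record with the same attributes, so it can be passed to a classification or clustering algorithm like any other tabular dataset. Real applications usually add steps such as stop word removal or term weighting, which this sketch omits.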
A complete data science application can contain elements of both supervised and unsupervised techniques. Unsupervised techniques provide an increased understanding of the dataset and hence are sometimes called descriptive data science.
As an example of how both unsupervised and supervised data science can be combined in an application, consider the following scenario.
In marketing analytics, clustering can be used to find the natural clusters in customer records. Each customer is assigned a cluster label at the end of the clustering process. The resulting labeled customer dataset can then be used with a supervised classification technique to develop a model that assigns a cluster label to any new customer record.
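The scenario above can be sketched with two small functions. This is a minimal illustration under stated assumptions: k-means stands in for the clustering step and a k-nearest-neighbor vote for the classification step (the text does not prescribe either algorithm), the centroids are seeded with the first k records for simplicity, and the customer records and their two attributes are made up for the example.

```python
import math
from collections import Counter


def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points seed the centroids (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: label each point with the index of its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its member points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


def knn_predict(train_points, train_labels, new_point, k=3):
    """Classify new_point by majority vote among its k nearest labeled points."""
    nearest = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], new_point),
    )[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical customer records: (annual spend in $1000s, store visits per month).
customers = [(2, 1), (3, 2), (2.5, 1.5), (20, 12), (22, 11), (21, 13)]

# Unsupervised step: discover the clusters and label every customer.
cluster_labels = kmeans(customers, k=2)

# Supervised step: use the labeled customers to classify a new record.
new_customer_label = knn_predict(customers, cluster_labels, (19, 10))
print(cluster_labels, new_customer_label)
```

The cluster labels act as the target variable for the classifier, so no manual labeling is needed. The attributes here are on similar scales; with real customer data, normalizing the attributes before computing distances is usually advisable.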