Overview of Statistical Learning
Statistical learning refers to a broad set of tools for understanding data. There are two main types of statistical learning: supervised and unsupervised.
Supervised Learning
Supervised learning involves building statistical models to predict outputs \( (Y) \) from inputs \( (X) \). For example, assume that we have a salary dataset for statisticians. The dataset consists of the experience level and salary for 10 different statisticians.
Years of Experience (X) | Salary (Y) |
---|---|
0.5 | 70000 |
1.0 | 74000 |
1.5 | 75000 |
2.0 | 75000 |
2.5 | 77000 |
3.0 | 80000 |
3.5 | 78000 |
4.0 | 79000 |
4.5 | 82000 |
5.0 | 85000 |
We could build a simple linear regression model to predict the salary of statisticians by using experience level as a predictor. This is an example of supervised learning, where we have supervising outputs (salary values) that guide us in developing a statistical model to determine the relationship between experience level and salary.
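As a rough illustration of the model described above, the sketch below fits a simple linear regression to the ten rows of the table using the ordinary least-squares formulas. The function and variable names are chosen here for illustration and do not come from any particular library.

```python
# Data from the table above: years of experience (X) and salary (Y).
experience = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
salary = [70000, 74000, 75000, 75000, 77000, 80000, 78000, 79000, 82000, 85000]


def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


intercept, slope = fit_simple_linear_regression(experience, salary)


def predict_salary(years):
    """Predict salary from years of experience using the fitted line."""
    return intercept + slope * years


print(f"salary ~ {intercept:.2f} + {slope:.2f} * experience")
print(f"predicted salary at 3 years: {predict_salary(3.0):.2f}")
```

On this data the fitted line has an intercept of roughly 70,067 and a slope of roughly 2,703, so each additional year of experience is associated with an increase of about 2,703 in predicted salary.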
In general, there are two main types of supervised learning: regression and classification.
Regression
Predicting a quantitative output is known as a regression problem. For example, predicting someone's salary is a regression problem.
Classification
Predicting a qualitative output is known as a classification problem. For example, predicting whether a stock will go up or down is a classification problem.
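To make the contrast with regression concrete, the sketch below turns the salary data from the earlier table into a classification problem. It rests on two assumptions that are not part of the original example: a hypothetical cutoff of 77,500 that labels each salary "high" or "low", and a very simple classifier (a single-threshold rule, sometimes called a decision stump) that predicts "high" when experience exceeds a learned threshold.

```python
# Data from the salary table: years of experience (X) and salary (Y).
experience = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
salary = [70000, 74000, 75000, 75000, 77000, 80000, 78000, 79000, 82000, 85000]

# Hypothetical cutoff (an assumption for illustration): converts the
# quantitative salary into a qualitative label.
CUTOFF = 77500
labels = ["high" if s >= CUTOFF else "low" for s in salary]


def fit_stump(xs, ys):
    """Find the threshold t that minimizes training errors for the rule
    'predict high if x > t, otherwise low'."""
    sorted_xs = sorted(xs)
    best_threshold, best_errors = None, len(xs) + 1
    # Candidate thresholds are midpoints between consecutive x values.
    for lower, upper in zip(sorted_xs, sorted_xs[1:]):
        t = (lower + upper) / 2
        errors = sum((x > t) != (y == "high") for x, y in zip(xs, ys))
        if errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold, best_errors


threshold, training_errors = fit_stump(experience, labels)


def predict_label(years):
    """Classify a statistician as 'high' or 'low' salary from experience."""
    return "high" if years > threshold else "low"


print(f"threshold: {threshold} years, training errors: {training_errors}")
print(predict_label(1.0), predict_label(4.0))
```

The regression model returns a number, while this classifier returns one of two categories. That difference in the type of output is what separates the two kinds of problem.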
Unsupervised Learning
Unsupervised learning involves building statistical models to find structure and relationships in the inputs \( (X) \) alone. There are no supervising outputs. For example, assume that we have a customer dataset. The dataset consists of the annual salary and annual spending on Amazon for 10 different individuals.
Customer ID | Salary | Spend |
---|---|---|
1 | 54425 | 1103 |
2 | 67953 | 1353 |
3 | 53135 | 3216 |
4 | 61452 | 4146 |
5 | 59374 | 2890 |
6 | 66544 | 975 |
7 | 55550 | 1240 |
8 | 58064 | 956 |
9 | 58918 | 6791 |
10 | 54162 | 4800 |
... | ... | ... |
We could use a statistical clustering algorithm to group customers by their purchasing behavior. This is an example of unsupervised learning, where we do not have supervising outputs that already tell us which customers are low spenders, average spenders, or high spenders. Instead, the groups have to be inferred from the data itself.
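One possible way to form such groups is k-means clustering, sketched below on the Spend column of the ten rows shown in the table. The choice of k-means, the choice of three clusters, and the starting centroids (the minimum, median, and maximum spend) are all assumptions made for this illustration. The text does not specify a particular algorithm.

```python
from statistics import mean, median

# Annual spend by customer ID, taken from the ten rows shown in the table.
spend = {1: 1103, 2: 1353, 3: 3216, 4: 4146, 5: 2890,
         6: 975, 7: 1240, 8: 956, 9: 6791, 10: 4800}


def nearest_index(value, centroids):
    """Index of the centroid closest to value."""
    return min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))


def kmeans_1d(values, initial_centroids, max_iter=100):
    """One-dimensional k-means: alternate between assigning each value to
    its nearest centroid and moving each centroid to the mean of its group."""
    centroids = list(initial_centroids)
    for _ in range(max_iter):
        groups = [[] for _ in centroids]
        for v in values:
            groups[nearest_index(v, centroids)].append(v)
        # Keep the old centroid if a group ends up empty.
        updated = [mean(g) if g else c for g, c in zip(groups, centroids)]
        if updated == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = updated
    return centroids


values = list(spend.values())
# Assumed starting points: smallest, median, and largest spend.
start = [min(values), median(values), max(values)]
centroids = sorted(kmeans_1d(values, start))

# Name the clusters by the order of their centroids.
names = ["low", "average", "high"]
segment = {cid: names[nearest_index(s, centroids)] for cid, s in spend.items()}

for cid, name in segment.items():
    print(cid, spend[cid], name)
```

No labels were supplied at any point. The "low", "average", and "high" names are attached only after the algorithm has found the groups, which is what distinguishes this from the supervised examples.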
In general, there are two main types of unsupervised learning: clustering and association.
Clustering
Determining groupings is known as a clustering problem. For example, grouping customers together based on purchasing behavior is a clustering problem.
Association
Determining rules that describe large portions of a dataset is known as an association problem. For example, determining that people who buy X also buy Y is an association problem. A modern real-world example of this is Amazon's "frequently bought together" product recommendations.
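A rule of the form "people who buy X also buy Y" is commonly evaluated with two quantities: support (the share of all baskets that contain both items) and confidence (the share of baskets containing X that also contain Y). The sketch below computes both for one rule. The baskets and item names are invented purely for illustration and are not real purchase data.

```python
# Hypothetical shopping baskets (made up for illustration only).
baskets = [
    {"phone", "case"},
    {"phone", "case", "charger"},
    {"phone", "charger"},
    {"case"},
    {"phone", "case"},
    {"charger"},
]


def count_containing(items, baskets):
    """Number of baskets that contain every item in `items`."""
    return sum(1 for basket in baskets if items <= basket)


def support(antecedent, consequent, baskets):
    """Fraction of all baskets containing both sides of the rule."""
    return count_containing(antecedent | consequent, baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Among baskets containing the antecedent, the fraction that also
    contain the consequent."""
    both = count_containing(antecedent | consequent, baskets)
    return both / count_containing(antecedent, baskets)


rule_support = support({"phone"}, {"case"}, baskets)
rule_confidence = confidence({"phone"}, {"case"}, baskets)

print(f"support(phone -> case) = {rule_support:.2f}")
print(f"confidence(phone -> case) = {rule_confidence:.2f}")
```

In these made-up baskets, half of all baskets contain both a phone and a case, and three of the four phone baskets also contain a case. A rule with high support and high confidence is a natural candidate for a "frequently bought together" suggestion.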