What is Statistical Learning?

Assume that we have an advertising dataset that consists of TV advertising spend, radio advertising spend, newspaper advertising spend, and product sales.

TV Spend    Radio Spend    Newspaper Spend    Product Sales
230,000     37,800         69,200             22,100
44,500      39,300         45,100             10,400
17,200      45,900         69,300             9,300
151,500     41,300         58,500             18,500
180,800     10,800         58,400             12,900
8,700       48,900         75,000             7,200
57,500      32,800         23,500             11,800
120,200     19,600         11,600             13,200
8,600       2,100          1,000              4,800
199,800     2,600          21,200             10,600

We could build a multiple linear regression model to predict sales by using the different types of advertising spend as predictors.

  • Product sales would be the response variable \( (Y) \)
  • The different advertising spends would be the predictors \( (X_{1}, X_{2}, X_{3}) \)

The model could be written in the following general form:

\[ Y = f(X) + \epsilon \]

The symbol \( f \) represents the systematic information that \( X \) provides about \( Y \), and \( \epsilon \) is a random error term that is independent of \( X \) and has a mean of zero. Statistical learning refers to a set of approaches for estimating \( f \).
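
As a rough sketch, the advertising data above could be placed in a dataframe and \( f \) could be estimated with a multiple linear regression model in R (the column and object names below are just illustrative):

# Advertising data from the table above
ads = data.frame(
  TV        = c(230000, 44500, 17200, 151500, 180800, 8700, 57500, 120200, 8600, 199800),
  Radio     = c(37800, 39300, 45900, 41300, 10800, 48900, 32800, 19600, 2100, 2600),
  Newspaper = c(69200, 45100, 69300, 58500, 58400, 75000, 23500, 11600, 1000, 21200),
  Sales     = c(22100, 10400, 9300, 18500, 12900, 7200, 11800, 13200, 4800, 10600)
)

# Estimate f by regressing the response (Sales) on the three predictors
ad_model = lm(Sales ~ TV + Radio + Newspaper, data=ads)

# Estimated coefficients (useful for inference) and fitted values (useful for prediction)
summary(ad_model)
predict(ad_model)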

Reasons for Estimating \( f \)

There are two main reasons for estimating \( f \): prediction and inference.

Prediction

Prediction refers to any scenario in which the goal is to produce an accurate estimate of the response variable \( (Y) \) for a given set of predictor values, without necessarily caring about the exact form of \( f \).

Inference

Inference involves understanding the relationship between \( X \) and \( Y \), as opposed to predicting \( Y \). For example, we may ask the following questions:

  • Which predictors \( (X) \) are associated with the response variable \( (Y) \)?
  • What is the relationship between the predictor and response?
  • What type of model best explains the relationship?

Methods of Estimating \( f \)

There are two main methods of estimating \( f \): parametric and nonparametric.

Parametric

Parametric methods involve the following two-step model-based approach:

  1. Make an assumption about the functional form (linear, lognormal, etc.) of \( f \)
  2. After selecting a model, use training data to fit or train the model (a short sketch follows below)
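
A minimal sketch of these two steps on simulated data (the linear assumption and variable names here are just for illustration):

# Step 1: assume a functional form for f, e.g. linear: Y = b0 + b1*X + error
# Step 2: fit (train) the assumed model using training data

set.seed(1)
x = runif(100, 0, 10)              # simulated predictor
y = 2 + 0.5*x + rnorm(100)         # simulated response with random noise
train_data = data.frame(X=x, Y=y)

# Fitting reduces the problem to estimating the two parameters b0 and b1
parametric_fit = lm(Y ~ X, data=train_data)
coef(parametric_fit)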

Nonparametric

On the other hand, nonparametric methods do not make explicit assumptions about the functional form of \( f \).

Instead, they attempt to get as close to the data points as possible, without being too rough or too smooth. They have the potential to accurately fit a wider range of possible shapes for \( f \). Spline models are common examples of nonparametric methods.
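
As a minimal sketch, R's built-in smoothing spline can fit a nonlinear relationship without assuming a functional form up front (the data here is simulated for illustration):

set.seed(2)
x = runif(200, 0, 10)
y = sin(x) + rnorm(200, sd=0.3)    # a nonlinear relationship with noise

# Fit a smoothing spline; the degree of smoothness is chosen by cross-validation
spline_fit = smooth.spline(x, y, cv=TRUE)

# Predict the fitted curve at new values of x
predict(spline_fit, x=seq(0, 10, 0.5))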

Parametric vs Nonparametric

Nonparametric methods have the potential to be more accurate due to being able to fit a wider range of shapes. However, they are more likely to suffer from the issue of overfitting to the data.

Additionally, parametric methods are much more interpretable than nonparametric methods. For example, it is much easier to explain the results of a linear regression model than it is to explain the results of a spline model.

Assessing Model Accuracy

In statistical learning, we're interested not only in the type of model to fit to our data, but also in how well the model fits the data.

One way to determine how well a model fits is by comparing the predictions to the actual observed data.

Regression Accuracy

In regression, the most commonly used method to assess model accuracy is the mean squared error (MSE):

\[ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_{i}-\hat{y}_{i})^2 \]

  • \( n \) - represents the number of observations in the data
  • \( y_{i} \) - represents the actual response value in the data
  • \( \hat{y}_{i} \) - represents the predicted response value
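
For example, the MSE can be computed directly from vectors of observed and predicted values (the numbers below are made up):

actual    = c(22.1, 10.4, 9.3, 18.5, 12.9)   # observed response values
predicted = c(20.5, 11.2, 8.7, 19.0, 14.1)   # predicted response values

# MSE: the average squared difference between observed and predicted values
mse = mean((actual - predicted)^2)
mse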

Bias-Variance Tradeoff

However, we should usually be interested in the error of the test dataset instead of the training dataset. This is because we usually want the model that best predicts the future, and not the past. Additionally, more flexible models will reduce training error, but may not necessarily reduce test error. Therefore, the test error should be of higher concern.

In general, to minimize the expected test error, a model that has low variance and low bias should be chosen.

Bias refers to the error that is introduced by approximating a complicated real-world relationship with a much simpler model. More flexible models have less bias because they can represent more complex relationships.

Variance refers to how much the estimate of \( f \) would change if it were fit on a different training dataset. More flexible models have higher variance because they follow the training data more closely.

The relationship among bias, variance, and the test error is known as the bias-variance tradeoff. The challenge lies in finding the method in which both bias and variance are low.
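
One way to see the tradeoff in practice is to fit models of increasing flexibility to a training set and compare the training MSE to the test MSE; a sketch using polynomial regression on simulated data:

set.seed(3)
x = runif(200, 0, 10)
y = sin(x) + rnorm(200, sd=0.3)
all_data = data.frame(X=x, Y=y)

# Split the data into training and test sets
train_rows = sample(1:200, 100)
train_data = all_data[train_rows, ]
test_data  = all_data[-train_rows, ]

# Fit polynomials of increasing flexibility and compare training vs test MSE
for (degree in c(1, 3, 5, 10)) {
  fit = lm(Y ~ poly(X, degree), data=train_data)
  train_mse = mean((train_data$Y - predict(fit, train_data))^2)
  test_mse  = mean((test_data$Y - predict(fit, test_data))^2)
  print(c(degree=degree, train_mse=train_mse, test_mse=test_mse))
}

Training MSE typically keeps falling as the degree increases, while test MSE falls at first and then rises once the model starts overfitting.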

Classification Accuracy

In classification, the most commonly used method to assess model accuracy is the error rate (the proportion of mistakes made):

\[ \text{Error Rate} = \frac{1}{n} \sum_{i=1}^{n} I(y_{i} \neq \hat{y}_{i}) \]

  • \( I(y_{i} \neq \hat{y}_{i}) \) - an indicator variable that equals 1 when the prediction is wrong and 0 when it is correct

However, again, we should usually be interested in the error rate of the test dataset instead of the training dataset.
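
In R, the error rate is simply the proportion of predicted classes that do not match the observed classes (the labels below are made up):

actual_class    = c("STEM", "STEM", "Non-STEM", "STEM", "Non-STEM")
predicted_class = c("STEM", "Non-STEM", "Non-STEM", "STEM", "STEM")

# Error rate: the proportion of predictions that are wrong
error_rate = mean(actual_class != predicted_class)
error_rate
## 0.4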

The test error rate is minimized, on average, by a very simple classifier known as the Bayes' classifier. The Bayes' classifier assigns each observation to its most likely class, given its predictor values.

For example, assume that we knew for a fact that 70% of all people who make more than $100,000 per year were STEM graduates.

Salary Category    % STEM Graduates
> $100,000         70%
< $100,000         30%

In a two-class setting, the Bayes' classifier predicts one class if the probability of that class, given the predictor values, is greater than 50%; otherwise, it predicts the other class. The Bayes' decision boundary is the set of points at which the probability is exactly 50%, and the classifier's predictions are determined by this boundary.

For example, if we were given a test dataset of just salary values, we'd simply assign any salaries greater than $100,000 as STEM graduates, and salary values less than $100,000 as non-STEM graduates.
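
A sketch of this toy decision rule in R (the salary values are made up):

salaries = c(125000, 87000, 40000, 250000, 99000, 101000)

# Toy Bayes' classifier: predict STEM whenever the known probability of
# STEM given the salary exceeds 50% (i.e., whenever salary > $100,000)
predicted_class = ifelse(salaries > 100000, "STEM", "Non-STEM")
predicted_class
## "STEM" "Non-STEM" "Non-STEM" "STEM" "Non-STEM" "STEM"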

The Bayes' classifier produces the lowest possible test error rate, called the Bayes' error rate.

In theory, we would always like to predict classifications using the Bayes' classifier. However, we do not always know the probability of a class, given some predictor value. We have to estimate this probability, and then classify the data based on the estimated probability.

K-Nearest Neighbors (KNN)

The K-Nearest Neighbors (KNN) classifier is a popular method of estimating this conditional probability. Given a positive integer \( K \) and some test observation, the KNN classifier identifies the \( K \) points in the training data that are closest to the test observation. These closest \( K \) points are represented by \( N_{0} \). Then, it estimates the conditional probability for a class as the fraction of points in \( N_{0} \) that belong to that specific class. Lastly, KNN applies Bayes' rule and classifies the test observation to the class with the largest estimated probability.

However, the choice of the \( K \) value is very important. Lower values of \( K \) produce more flexible fits with low bias but high variance, whereas higher values produce less flexible fits with more bias but less variance. Similar to the regression setting, a bias-variance tradeoff exists.
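
A minimal KNN sketch using the knn() function from the class package, with simulated two-class data; changing the k argument controls the flexibility described above:

library(class)

set.seed(4)
# Simulated training data: two numeric predictors and a two-level class label
train_X = matrix(rnorm(200), ncol=2)
train_y = factor(ifelse(train_X[, 1] + train_X[, 2] > 0, "A", "B"))

# Simulated test observations
test_X = matrix(rnorm(20), ncol=2)

# Classify each test observation using its 5 nearest training points
knn_pred = knn(train=train_X, test=test_X, cl=train_y, k=5)
knn_pred

With k=1 the decision boundary follows the training data very closely (high variance); with a large k it becomes much smoother (high bias).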


ISLR Chapter 2 - R Code

Basic R Functions

# Assign a vector to a new object named "data_vector"
data_vector = c(1, 2, 3, 4)

# List of all objects
ls()

# Remove an object
rm(data_vector)

# Create two vectors of 50 random numbers
# with mean of 0 and standard deviation of 1
random_1 = rnorm(50, mean=0, sd=1)
random_2 = rnorm(50, mean=0, sd=1)

# Create a dataframe from the vectors
# with columns labeled as X and Y
data_frame = data.frame(X=random_1, Y=random_2)

# Get the dimensions (rows, columns) of a dataframe
dim(data_frame)
## 50 2

# Get the class types of columns in a dataframe
sapply(data_frame, class)
##         X         Y 
## "numeric" "numeric"

# Easily omit any NA values from a dataframe
data_frame = na.omit(data_frame)

# See the first 5 rows of the dataframe
head(data_frame, 5)
##            X          Y
## 1  1.3402318 -0.2318012
## 2 -1.8688186  1.0121503
## 3  2.9939211 -1.7843108
## 4 -0.9833264 -1.0518947
## 5 -1.2800747 -0.4674771

# Get a specific column (column X) from the dataframe
data_frame$X
## 1.3402318 -1.8688186 2.9939211 -0.9833264 -1.2800747

# Get the mean, variance, or standard deviation
mean(random_1)
## -0.1205189

var(random_1)
## 1.096008

sd(random_1)
## 1.046904

Plotting with ggplot

# Load the ggplot2 package
library(ggplot2)

# Create a scatter plot using the ggplot2 package
plot_1 = ggplot(data_frame, aes(x=X, y=Y)) + # Specify the dataframe and the columns to plot
  geom_point(aes(colour=X)) + # Specify that this is a scatter plot, with point color varying by the value of X
  coord_cartesian(xlim=c(-3, 3), ylim=c(-3, 3)) + # Specify the min and max axes values
  labs(title="Random X and Random Y", subtitle="Random Numbers", y="Y Random", x="X Random") +
  scale_x_continuous(breaks=seq(-3, 3, 0.5)) + # Specify the X axis tick marks
  scale_y_continuous(breaks=seq(-3, 3, 0.5)) # Specify the Y axis tick marks

# Visualize the ggplot
plot(plot_1)