Chapter 10

Simple Linear Regression

Master the fundamentals of simple linear regression, learn to model relationships between variables, and apply these techniques to financial analysis and forecasting.

1

Introduction to Linear Regression

Simple linear regression is a statistical method used to model the relationship between two continuous variables. In finance, it's widely used for predicting returns, analyzing risk factors, and understanding relationships between economic variables.

Key Concepts

  • Dependent Variable (Y): The variable we want to predict or explain
  • Independent Variable (X): The variable used to make predictions
  • Linear Relationship: A relationship that can be represented by a straight line
  • Regression Line: The best-fitting line through the data points

Applications in Finance

  • Modeling stock returns against market returns (Beta calculation)
  • Analyzing the relationship between interest rates and bond prices
  • Predicting company earnings based on economic indicators
  • Portfolio performance attribution

Financial Example

A portfolio manager might use regression to analyze how a stock's returns relate to market returns, helping to calculate the stock's beta and systematic risk.

2

The Simple Linear Regression Model

The simple linear regression model describes the relationship between a dependent variable Y and a single independent variable X using a linear equation.

Population Regression Model

$$Y_i = \alpha + \beta X_i + \epsilon_i$$

Where:

  • Y_i = Value of dependent variable for observation i
  • α (alpha) = Population intercept (Y-intercept)
  • β (beta) = Population slope coefficient
  • X_i = Value of independent variable for observation i
  • ε_i (epsilon) = Random error term for observation i

Sample Regression Model

$$\hat{Y}_i = a + b X_i$$

Where:

  • Ŷ_i = Predicted value of Y for observation i
  • a = Sample intercept (estimate of α)
  • b = Sample slope coefficient (estimate of β)

Interpretation of Coefficients

  • Intercept (a): Expected value of Y when X = 0
  • Slope (b): Change in Y for a one-unit increase in X

Financial Interpretation Example

If we regress a stock's returns against market returns and get:

$$\text{Stock Return} = 0.02 + 1.3 \times \text{Market Return}$$

  • Intercept (0.02): Expected stock return when the market return is 0%
  • Slope (1.3): The stock's beta; for every 1% increase in market return, the stock's return increases by 1.3%
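As a quick sketch, the fitted equation above turns any market-return scenario into a predicted stock return (the coefficients 0.02 and 1.3 are the hypothetical values from the example):

```python
# Hypothetical fitted coefficients from the example regression
alpha, beta = 0.02, 1.3

def predicted_return(market_return):
    """Predicted stock return given a market return (both as decimals)."""
    return alpha + beta * market_return

print(predicted_return(0.00))  # market flat: prediction equals the intercept, 0.02
print(predicted_return(0.05))  # market up 5%: 0.02 + 1.3 * 0.05
```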

3

Least Squares Method

The least squares method finds the regression line that minimizes the sum of squared differences between actual and predicted values. This provides the "best fit" line through the data points.

Objective Function

Minimize the sum of squared residuals:

$$\text{Minimize: } \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - a - bX_i)^2$$

Normal Equations

Taking partial derivatives and setting them to zero gives us:

$$b = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} = \frac{S_{XY}}{S_{XX}}$$
$$a = \bar{Y} - b\bar{X}$$

Where:

  • X̄ = Sample mean of X values
  • Ȳ = Sample mean of Y values
  • S_XY = Sample covariance between X and Y
  • S_XX = Sample variance of X

Alternative Formula for Slope

$$b = r_{XY} \times \frac{s_Y}{s_X}$$

Where r_XY is the correlation coefficient between X and Y.

Calculation Example

Given data for 5 companies' advertising spending (X) and sales revenue (Y):

Company   Advertising ($000)   Sales ($000)
A         10                   120
B         15                   150
C         20                   180
D         25                   210
E         30                   240

Calculations:

  • X̄ = 20, Ȳ = 180
  • S_XX = 250, S_XY = 1500
  • b = 1500/250 = 6
  • a = 180 - 6(20) = 60

Regression equation: Ŷ = 60 + 6X
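The least-squares formulas can be checked directly in code; a minimal sketch applying them to the advertising/sales data above:

```python
# Advertising (X) and sales (Y) data from the example, in $000
X = [10, 15, 20, 25, 30]
Y = [120, 150, 180, 210, 240]
n = len(X)

x_bar = sum(X) / n
y_bar = sum(Y) / n

# S_XY: sum of cross-products of deviations; S_XX: sum of squared deviations
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
s_xx = sum((x - x_bar) ** 2 for x in X)

b = s_xy / s_xx          # slope
a = y_bar - b * x_bar    # intercept

print(f"b = {b}, a = {a}")
```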

4

Properties of Regression Coefficients

Understanding the statistical properties of regression coefficients is crucial for proper interpretation and inference in financial analysis.

Properties of the Slope Coefficient (b)

  • Unbiased: E[b] = β (expected value equals true population parameter)
  • Consistent: Converges to true value as sample size increases
  • Normally distributed: When the error terms are normal (or approximately so in large samples)

Standard Error of the Slope

$$SE(b) = \frac{s_e}{\sqrt{S_{XX}}} = \frac{s_e}{\sqrt{\sum(X_i - \bar{X})^2}}$$

Where s_e is the standard error of the estimate:

$$s_e = \sqrt{\frac{\sum(Y_i - \hat{Y}_i)^2}{n-2}} = \sqrt{\frac{RSS}{n-2}}$$

Standard Error of the Intercept

$$SE(a) = s_e \sqrt{\frac{1}{n} + \frac{\bar{X}^2}{S_{XX}}}$$

Confidence Intervals

For the slope coefficient:

$$b \pm t_{\alpha/2, n-2} \times SE(b)$$

For the intercept:

$$a \pm t_{\alpha/2, n-2} \times SE(a)$$

Degrees of Freedom

In simple linear regression, we use (n-2) degrees of freedom because we estimate two parameters (intercept and slope) from the data.
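These standard-error and confidence-interval formulas can be sketched in a few lines. The dataset below is hypothetical (the chapter's advertising/sales data fits exactly, giving s_e = 0, so it cannot illustrate sampling error); the critical value 3.182 is t for 95% confidence with df = 3, from a t-table:

```python
import math

# Hypothetical data for illustration (n = 5, so df = n - 2 = 3)
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)

x_bar, y_bar = sum(X) / n, sum(Y) / n
s_xx = sum((x - x_bar) ** 2 for x in X)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / s_xx
a = y_bar - b * x_bar

# Standard error of the estimate: s_e = sqrt(RSS / (n - 2))
residuals = [y - (a + b * x) for x, y in zip(X, Y)]
s_e = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

se_b = s_e / math.sqrt(s_xx)                       # SE(b)
se_a = s_e * math.sqrt(1 / n + x_bar ** 2 / s_xx)  # SE(a)

t_crit = 3.182  # t_{0.025, 3}, from a t-table
ci_b = (b - t_crit * se_b, b + t_crit * se_b)
print(f"b = {b:.2f}, SE(b) = {se_b:.4f}, 95% CI = {ci_b}")
```

Note that the interval here contains zero, so with only five noisy observations the slope would not be judged statistically significant.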

5

Regression Model Assumptions

Linear regression relies on several key assumptions. Violations of these assumptions can lead to biased estimates and invalid statistical inferences.

The Five Key Assumptions

1. Linearity

  • The relationship between X and Y is linear
  • E[Y|X] = α + βX
  • Check: Scatter plot of Y vs X should show linear pattern

2. Independence

  • Observations are independent of each other
  • Error terms are uncorrelated: Cov(εᵢ, εⱼ) = 0 for i ≠ j
  • Common violation: Time series data with autocorrelation

3. Homoscedasticity (Constant Variance)

  • Error terms have constant variance: Var(εᵢ) = σ² for all i
  • Check: Plot residuals vs fitted values
  • Violation: Heteroscedasticity (changing variance)

4. Normality

  • Error terms are normally distributed: εᵢ ~ N(0, σ²)
  • Required for hypothesis testing and confidence intervals
  • Check: Q-Q plot of residuals

5. No Perfect Multicollinearity

  • Not directly applicable to simple regression
  • X values must show variation (not all identical)
  • Becomes critical in multiple regression

Assumption Violations in Finance

  • Time Series Data: Often violates the independence assumption due to autocorrelation
  • Heteroscedasticity: Common in financial data, where volatility changes over time
  • Non-linearity: Option prices vs underlying asset prices show non-linear relationships

6

Measures of Goodness of Fit

Goodness of fit measures help evaluate how well the regression model explains the variation in the dependent variable.

Sum of Squares Decomposition

$$TSS = ESS + RSS$$

Where:

  • TSS (Total Sum of Squares): $\sum(Y_i - \bar{Y})^2$
  • ESS (Explained Sum of Squares): $\sum(\hat{Y}_i - \bar{Y})^2$
  • RSS (Residual Sum of Squares): $\sum(Y_i - \hat{Y}_i)^2$

Coefficient of Determination (R²)

$$R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}$$

Properties of R²

  • Range: 0 ≤ R² ≤ 1
  • R² = 0: X explains none of the variation in Y
  • R² = 1: X perfectly explains all variation in Y
  • Higher R² indicates better fit

Alternative R² Formula

$$R^2 = r_{XY}^2$$

Where r_XY is the correlation coefficient between X and Y.

Standard Error of Estimate

$$s_e = \sqrt{\frac{RSS}{n-2}} = \sqrt{\frac{\sum(Y_i - \hat{Y}_i)^2}{n-2}}$$

Interpretation: The typical (root-mean-square) distance of the data points from the regression line, in units of Y.
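The decomposition and R² formulas above can be verified numerically; a minimal sketch using a small hypothetical dataset:

```python
# Hypothetical data for illustration
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)

x_bar, y_bar = sum(X) / n, sum(Y) / n
s_xx = sum((x - x_bar) ** 2 for x in X)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / s_xx
a = y_bar - b * x_bar
Y_hat = [a + b * x for x in X]

tss = sum((y - y_bar) ** 2 for y in Y)            # total sum of squares
ess = sum((yh - y_bar) ** 2 for yh in Y_hat)      # explained sum of squares
rss = sum((y - yh) ** 2 for y, yh in zip(Y, Y_hat))  # residual sum of squares

r2 = ess / tss
print(f"TSS = {tss}, ESS = {ess}, RSS = {rss}, R^2 = {r2}")
```

Here TSS = ESS + RSS holds up to rounding, and R² = ESS/TSS agrees with 1 − RSS/TSS.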

R² Interpretation in Finance

If a regression of stock returns on market returns yields R² = 0.64:

  • 64% of the stock's return variation is explained by market returns
  • 36% is due to company-specific (idiosyncratic) factors
  • This suggests the stock has substantial systematic risk

Caution with R²

High R² doesn't necessarily mean the model is good. Consider also the economic significance of relationships and whether assumptions are met.

7

Hypothesis Testing in Regression

Hypothesis testing allows us to make statistical inferences about the population parameters and assess the significance of relationships.

Testing the Slope Coefficient

Most common test: Is there a significant linear relationship?

$$H_0: \beta = 0 \quad \text{vs} \quad H_1: \beta \neq 0$$

Test Statistic

$$t = \frac{b - 0}{SE(b)} = \frac{b}{SE(b)}$$

Under H₀, this follows a t-distribution with (n-2) degrees of freedom.

Decision Rule

  • Critical Value Approach: Reject H₀ if |t| > t_(α/2, n-2)
  • P-value Approach: Reject H₀ if p-value < α

Testing Specific Values of β

$$H_0: \beta = \beta_0 \quad \text{vs} \quad H_1: \beta \neq \beta_0$$
$$t = \frac{b - \beta_0}{SE(b)}$$

F-Test for Overall Model Significance

Tests whether the regression model is significant:

$$F = \frac{MSR}{MSE} = \frac{ESS/1}{RSS/(n-2)}$$

Where:

  • MSR = Mean Square Regression
  • MSE = Mean Square Error

In simple regression: F = t² (equivalent tests)

Beta Testing Example

Testing if a stock's beta equals 1 (market risk):

H₀: β = 1 vs H₁: β ≠ 1

If b = 1.3, SE(b) = 0.15, n = 52

$$t = \frac{1.3 - 1}{0.15} = \frac{0.3}{0.15} = 2.0$$

With df = 50, t₀.₀₂₅ = 2.009

Since |2.0| < 2.009, fail to reject H₀

Conclusion: Beta is not significantly different from 1
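The beta test above can be reproduced in a few lines (the critical value 2.009 is t_{0.025, 50} from a t-table):

```python
# Estimate, standard error, and hypothesized value from the example
b, se_b, beta_0 = 1.3, 0.15, 1.0

t_stat = (b - beta_0) / se_b   # (1.3 - 1) / 0.15 = 2.0
t_crit = 2.009                 # t_{0.025, 50} from a t-table

reject = abs(t_stat) > t_crit  # 2.0 < 2.009, so we fail to reject H0
print(f"t = {t_stat:.2f}, reject H0: {reject}")
```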

8

Prediction and Forecasting

Regression models can be used to make predictions for new observations. Different types of predictions have different levels of uncertainty.

Point Prediction

For a new observation with X = X₀:

$$\hat{Y}_0 = a + bX_0$$

Confidence Interval for Mean Response

Interval for the expected value of Y when X = X₀:

$$\hat{Y}_0 \pm t_{\alpha/2, n-2} \times s_e \sqrt{\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{S_{XX}}}$$

Prediction Interval for Individual Response

Interval for a single new observation when X = X₀:

$$\hat{Y}_0 \pm t_{\alpha/2, n-2} \times s_e \sqrt{1 + \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{S_{XX}}}$$

Key Differences

  • Confidence Interval: Narrower; uncertainty about the mean
  • Prediction Interval: Wider; includes individual variation
  • Both intervals are narrowest at X = X̄ and widen as X₀ moves away from X̄
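Both interval formulas differ only by the extra "1 +" under the square root. A sketch comparing their half-widths, using hypothetical regression summary statistics (n, s_e, and the t critical value below are illustrative assumptions, not from the chapter's data):

```python
import math

# Hypothetical regression summary statistics for illustration
n, s_e, x_bar, s_xx = 50, 10.0, 20.0, 250.0
t_crit = 2.011                 # approx. t_{0.025, 48} from a t-table
x0 = 25.0                      # new X value to predict at

common = 1 / n + (x0 - x_bar) ** 2 / s_xx
ci_half = t_crit * s_e * math.sqrt(common)       # mean response (confidence interval)
pi_half = t_crit * s_e * math.sqrt(1 + common)   # individual response (prediction interval)

print(f"CI half-width = {ci_half:.2f}, PI half-width = {pi_half:.2f}")
```

The prediction interval is always the wider of the two, because it carries the extra variance of a single new observation on top of the uncertainty about the mean.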

Investment Prediction Example

Using our sales prediction model Ŷ = 60 + 6X:

To predict sales when advertising = $25k:

Ŷ = 60 + 6(25) = 210 (thousand dollars)

If we want a prediction interval for an individual company (assuming s_e = 10, n = 50, etc.), the interval might be:

210 ± 20.1, or [$189.9k, $230.1k]

Extrapolation Warning

Predictions are most reliable within the range of observed X values. Extrapolating beyond this range can be very unreliable.

9

Residual Analysis

Residual analysis involves examining the differences between actual and predicted values to check model assumptions and identify potential problems.

Definition of Residuals

$$e_i = Y_i - \hat{Y}_i = Y_i - (a + bX_i)$$

Properties of Residuals

  • Sum of residuals equals zero: Σeᵢ = 0
  • Residuals are uncorrelated with X: Σeᵢ Xᵢ = 0
  • Residuals are uncorrelated with Ŷ: Σeᵢ Ŷᵢ = 0
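These three properties hold exactly (up to rounding) for any least-squares fit with an intercept; a quick numerical check on a small hypothetical dataset:

```python
# Hypothetical data for illustration
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)

x_bar, y_bar = sum(X) / n, sum(Y) / n
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
     / sum((x - x_bar) ** 2 for x in X))
a = y_bar - b * x_bar

y_hat = [a + b * x for x in X]
e = [y - yh for y, yh in zip(Y, y_hat)]   # residuals

print(sum(e))                                     # sum of residuals: ~0
print(sum(ei * x for ei, x in zip(e, X)))         # residuals vs X: ~0
print(sum(ei * yh for ei, yh in zip(e, y_hat)))   # residuals vs fitted: ~0
```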

Standardized Residuals

$$\text{Standardized Residual} = \frac{e_i}{s_e}$$

Should be approximately N(0,1) if assumptions are met.

Common Residual Plots

1. Residuals vs Fitted Values

  • Good pattern: Random scatter around zero
  • Problems: Curved patterns (non-linearity), funnel shapes (heteroscedasticity)

2. Residuals vs Independent Variable

  • Similar interpretation as residuals vs fitted values
  • Helps identify if transformation of X is needed

3. Normal Q-Q Plot of Residuals

  • Tests normality assumption
  • Points should lie approximately on a straight line

4. Residuals vs Order (Time Series)

  • Checks for autocorrelation in time series data
  • Look for patterns or trends over time
Pattern in Residual Plot       Indicates                     Potential Solution
Random scatter                 Model assumptions satisfied   No action needed
Curved pattern                 Non-linearity                 Transform variables or use polynomial terms
Funnel shape                   Heteroscedasticity            Transform Y or use weighted regression
Systematic pattern over time   Autocorrelation               Use time series methods

Outlier Detection

Standardized residuals with |value| > 2 or 3 may indicate outliers. In finance, these could represent unusual market events or data errors.

10

Limitations and Common Pitfalls

Understanding the limitations of simple linear regression is crucial for proper application and interpretation in financial analysis.

Key Limitations

1. Correlation vs Causation

  • Regression shows association, not causation
  • High R² doesn't imply one variable causes the other
  • Consider confounding variables and reverse causality

2. Linear Relationship Assumption

  • Many financial relationships are non-linear
  • Option pricing, volatility relationships often curved
  • May need transformations or polynomial terms

3. Sensitivity to Outliers

  • Single extreme observations can dramatically affect results
  • Common in financial data due to market crashes, bubbles
  • Consider robust regression methods

4. Extrapolation Risks

  • Predictions outside the data range may be unreliable
  • Economic relationships may break down in extreme conditions
  • Structural breaks may occur during financial crises

Common Pitfalls in Financial Applications

Spurious Regression

When both variables follow trends over time, regression may show significant relationships even when none exist. Common with:

  • Stock prices over time (both trending up)
  • GDP and stock market indices
  • Solution: Use returns or differenced data
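Converting trending price levels into returns before regressing is the standard remedy; a minimal sketch with hypothetical price series:

```python
# Hypothetical price series for two trending assets
prices_a = [100, 102, 101, 105, 108]
prices_b = [50, 51, 53, 52, 55]

def simple_returns(prices):
    """Period-over-period simple returns: (P_t - P_{t-1}) / P_{t-1}."""
    return [(p1 - p0) / p0 for p0, p1 in zip(prices, prices[1:])]

ra = simple_returns(prices_a)
rb = simple_returns(prices_b)

# Regress ra on rb instead of the raw price levels, which both trend upward
print(ra)
print(rb)
```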

Time-Varying Relationships

  • Beta may change during different market regimes
  • Relationships may be unstable across time periods
  • Consider rolling regressions or regime-switching models

Heteroscedasticity in Financial Data

  • Volatility clustering in financial time series
  • Higher volatility during market stress
  • Standard errors may be incorrect

Best Practices

  • Always plot the data before fitting regression
  • Examine residuals carefully
  • Test for structural breaks in long time series
  • Consider economic theory when interpreting results
  • Use robust standard errors when appropriate
  • Validate models on out-of-sample data

11

Chapter Summary

Key Learning Points

  • Simple Linear Regression: Models linear relationship between two continuous variables
  • Least Squares: Method to find best-fitting line by minimizing sum of squared residuals
  • Model Assumptions: Linearity, independence, homoscedasticity, normality
  • Coefficient Interpretation: Slope shows change in Y per unit change in X
  • R²: Proportion of variance in Y explained by X

Financial Applications

  • Calculating beta (systematic risk) of securities
  • Modeling relationships between economic variables
  • Forecasting based on leading indicators
  • Performance attribution analysis
  • Risk factor analysis

Critical Formulas

$$\hat{Y} = a + bX$$

Regression Equation

$$b = \frac{\sum(X_i - \bar{X})(Y_i - \bar{Y})}{\sum(X_i - \bar{X})^2}$$

Slope Coefficient

$$R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}$$

Coefficient of Determination

$$t = \frac{b}{SE(b)}$$

t-Statistic for Slope

Key Insights for CFA Candidates

  • Understand the difference between correlation and regression
  • Know how to interpret regression output and test significance
  • Recognize when regression assumptions are violated
  • Apply regression analysis to portfolio management problems
  • Understand limitations and potential pitfalls

Next Steps

The final chapter will explore Big Data Techniques, covering modern approaches to handling large datasets and advanced analytical methods increasingly important in finance.