Chapter 10

Simple Linear Regression

Master the fundamentals of simple linear regression, learn to model relationships between variables, and apply these techniques to financial analysis and forecasting.

1

Introduction to Linear Regression

Simple linear regression is a statistical method used to model the relationship between two continuous variables. In finance, it's widely used for predicting returns, analyzing risk factors, and understanding relationships between economic variables.

Key Concepts

  • Dependent Variable (Y): The variable we want to predict or explain
  • Independent Variable (X): The variable used to make predictions
  • Linear Relationship: A relationship that can be represented by a straight line
  • Regression Line: The best-fitting line through the data points

Applications in Finance

  • Modeling stock returns against market returns (Beta calculation)
  • Analyzing the relationship between interest rates and bond prices
  • Predicting company earnings based on economic indicators
  • Portfolio performance attribution

Financial Example

A portfolio manager might use regression to analyze how a stock's returns relate to market returns, helping to calculate the stock's beta and systematic risk.

2

The Simple Linear Regression Model

The simple linear regression model describes the relationship between a dependent variable Y and a single independent variable X using a linear equation.

Population Regression Model

$$Y_i = \alpha + \beta X_i + \epsilon_i$$

Where:

  • Y_i = Value of dependent variable for observation i
  • α (alpha) = Population intercept (Y-intercept)
  • β (beta) = Population slope coefficient
  • X_i = Value of independent variable for observation i
  • ε_i (epsilon) = Random error term for observation i

Sample Regression Model

$$\hat{Y}_i = a + b X_i$$

Where:

  • Ŷ_i = Predicted value of Y for observation i
  • a = Sample intercept (estimate of α)
  • b = Sample slope coefficient (estimate of β)

Interpretation of Coefficients

  • Intercept (a): Expected value of Y when X = 0
  • Slope (b): Change in Y for a one-unit increase in X

Financial Interpretation Example

If we regress a stock's returns against market returns and get:

$$\text{Stock Return} = 0.02 + 1.3 \times \text{Market Return}$$

  • Intercept (0.02): Expected stock return when the market return is 0%
  • Slope (1.3): The stock's beta; for every 1% increase in market return, the stock's return increases by 1.3%
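As a quick sketch, the fitted equation above turns any market-return scenario into a predicted stock return (the coefficients 0.02 and 1.3 are the hypothetical values from the example):

```python
# Hypothetical fitted coefficients from the example regression
alpha, beta = 0.02, 1.3

def predicted_return(market_return):
    """Predicted stock return given a market return (both as decimals)."""
    return alpha + beta * market_return

print(predicted_return(0.00))  # market flat: prediction equals the intercept, 0.02
print(predicted_return(0.05))  # market up 5%: 0.02 + 1.3 * 0.05
```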

3

Least Squares Method

The least squares method finds the regression line that minimizes the sum of squared differences between actual and predicted values. This provides the "best fit" line through the data points.

Objective Function

Minimize the sum of squared residuals:

$$\text{Minimize: } \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - a - bX_i)^2$$

Normal Equations

Taking partial derivatives and setting them to zero gives us:

$$b = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} = \frac{S_{XY}}{S_{XX}}$$
$$a = \bar{Y} - b\bar{X}$$

Where:

  • X̄ = Sample mean of X values
  • Ȳ = Sample mean of Y values
  • S_XY = Sample covariance between X and Y
  • S_XX = Sample variance of X

Alternative Formula for Slope

$$b = r_{XY} \times \frac{s_Y}{s_X}$$

Where r_XY is the correlation coefficient between X and Y.

Calculation Example

Given data for 5 companies' advertising spending (X) and sales revenue (Y):

Company   Advertising ($000)   Sales ($000)
A         10                   120
B         15                   150
C         20                   180
D         25                   210
E         30                   240

Calculations:

  • X̄ = 20, Ȳ = 180
  • S_XX = 250, S_XY = 1500
  • b = 1500/250 = 6
  • a = 180 - 6(20) = 60

Regression equation: Ŷ = 60 + 6X
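The least-squares formulas can be checked directly in code; a minimal sketch applying them to the advertising/sales data above:

```python
# Advertising (X) and sales (Y) data from the example, in $000
X = [10, 15, 20, 25, 30]
Y = [120, 150, 180, 210, 240]
n = len(X)

x_bar = sum(X) / n
y_bar = sum(Y) / n

# S_XY: sum of cross-products of deviations; S_XX: sum of squared deviations
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
s_xx = sum((x - x_bar) ** 2 for x in X)

b = s_xy / s_xx          # slope
a = y_bar - b * x_bar    # intercept

print(f"b = {b}, a = {a}")
```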

4

Properties of Regression Coefficients

Understanding the statistical properties of regression coefficients is crucial for proper interpretation and inference in financial analysis.

Properties of the Slope Coefficient (b)

  • Unbiased: E[b] = β (expected value equals true population parameter)
  • Consistent: Converges to true value as sample size increases
  • Normally distributed: When the error terms are normal (or approximately so in large samples)

Standard Error of the Slope

$$SE(b) = \frac{s_e}{\sqrt{S_{XX}}} = \frac{s_e}{\sqrt{\sum(X_i - \bar{X})^2}}$$

Where s_e is the standard error of the estimate:

$$s_e = \sqrt{\frac{\sum(Y_i - \hat{Y}_i)^2}{n-2}} = \sqrt{\frac{RSS}{n-2}}$$

Standard Error of the Intercept

$$SE(a) = s_e \sqrt{\frac{1}{n} + \frac{\bar{X}^2}{S_{XX}}}$$

Confidence Intervals

For the slope coefficient:

$$b \pm t_{\alpha/2, n-2} \times SE(b)$$

For the intercept:

$$a \pm t_{\alpha/2, n-2} \times SE(a)$$

Degrees of Freedom

In simple linear regression, we use (n-2) degrees of freedom because we estimate two parameters (intercept and slope) from the data.
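These standard-error and confidence-interval formulas can be sketched in a few lines. The dataset below is hypothetical (the chapter's advertising/sales data fits exactly, giving s_e = 0, so it cannot illustrate sampling error); the critical value 3.182 is t for 95% confidence with df = 3, from a t-table:

```python
import math

# Hypothetical data for illustration (n = 5, so df = n - 2 = 3)
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)

x_bar, y_bar = sum(X) / n, sum(Y) / n
s_xx = sum((x - x_bar) ** 2 for x in X)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / s_xx
a = y_bar - b * x_bar

# Standard error of the estimate: s_e = sqrt(RSS / (n - 2))
residuals = [y - (a + b * x) for x, y in zip(X, Y)]
s_e = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

se_b = s_e / math.sqrt(s_xx)                       # SE(b)
se_a = s_e * math.sqrt(1 / n + x_bar ** 2 / s_xx)  # SE(a)

t_crit = 3.182  # t_{0.025, 3}, from a t-table
ci_b = (b - t_crit * se_b, b + t_crit * se_b)
print(f"b = {b:.2f}, SE(b) = {se_b:.4f}, 95% CI = {ci_b}")
```

Note that the interval here contains zero, so with only five noisy observations the slope would not be judged statistically significant.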

5

Regression Model Assumptions

Linear regression relies on several key assumptions. Violations of these assumptions can lead to biased estimates and invalid statistical inferences.

The Five Key Assumptions

1. Linearity

  • The relationship between X and Y is linear
  • E[Y|X] = α + βX
  • Check: Scatter plot of Y vs X should show linear pattern

2. Independence

  • Observations are independent of each other
  • Error terms are uncorrelated: Cov(εᵢ, εⱼ) = 0 for i ≠ j
  • Common violation: Time series data with autocorrelation

3. Homoscedasticity (Constant Variance)

  • Error terms have constant variance: Var(εᵢ) = σ² for all i
  • Check: Plot residuals vs fitted values
  • Violation: Heteroscedasticity (changing variance)

4. Normality

  • Error terms are normally distributed: εᵢ ~ N(0, σ²)
  • Required for hypothesis testing and confidence intervals
  • Check: Q-Q plot of residuals

5. No Perfect Multicollinearity

  • Not directly applicable to simple regression
  • X values must show variation (not all identical)
  • Becomes critical in multiple regression

Assumption Violations in Finance

  • Time Series Data: Often violates the independence assumption due to autocorrelation
  • Heteroscedasticity: Common in financial data, where volatility changes over time
  • Non-linearity: Option prices vs underlying asset prices show non-linear relationships

6

Measures of Goodness of Fit

Goodness of fit measures help evaluate how well the regression model explains the variation in the dependent variable.

Sum of Squares Decomposition

$$TSS = ESS + RSS$$

Where:

  • TSS (Total Sum of Squares): $\sum(Y_i - \bar{Y})^2$
  • ESS (Explained Sum of Squares): $\sum(\hat{Y}_i - \bar{Y})^2$
  • RSS (Residual Sum of Squares): $\sum(Y_i - \hat{Y}_i)^2$

Coefficient of Determination (R²)

$$R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}$$

Properties of R²

  • Range: 0 ≤ R² ≤ 1
  • R² = 0: X explains none of the variation in Y
  • R² = 1: X perfectly explains all variation in Y
  • Higher R² indicates better fit

Alternative R² Formula

$$R^2 = r_{XY}^2$$

Where r_XY is the correlation coefficient between X and Y.

Standard Error of Estimate

$$s_e = \sqrt{\frac{RSS}{n-2}} = \sqrt{\frac{\sum(Y_i - \hat{Y}_i)^2}{n-2}}$$

Interpretation: The typical (root-mean-square) distance of the data points from the regression line, in units of Y.
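The decomposition and R² formulas above can be verified numerically; a minimal sketch using a small hypothetical dataset:

```python
# Hypothetical data for illustration
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)

x_bar, y_bar = sum(X) / n, sum(Y) / n
s_xx = sum((x - x_bar) ** 2 for x in X)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / s_xx
a = y_bar - b * x_bar
Y_hat = [a + b * x for x in X]

tss = sum((y - y_bar) ** 2 for y in Y)            # total sum of squares
ess = sum((yh - y_bar) ** 2 for yh in Y_hat)      # explained sum of squares
rss = sum((y - yh) ** 2 for y, yh in zip(Y, Y_hat))  # residual sum of squares

r2 = ess / tss
print(f"TSS = {tss}, ESS = {ess}, RSS = {rss}, R^2 = {r2}")
```

Here TSS = ESS + RSS holds up to rounding, and R² = ESS/TSS agrees with 1 − RSS/TSS.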

R² Interpretation in Finance

If a regression of stock returns on market returns yields R² = 0.64:

  • 64% of the stock's return variation is explained by market returns
  • 36% is due to company-specific (idiosyncratic) factors
  • This suggests the stock has substantial systematic risk

Caution with R²

High R² doesn't necessarily mean the model is good. Consider also the economic significance of relationships and whether assumptions are met.

7

Hypothesis Testing in Regression

Hypothesis testing allows us to make statistical inferences about the population parameters and assess the significance of relationships.

Testing the Slope Coefficient

Most common test: Is there a significant linear relationship?

$$H_0: \beta = 0 \quad \text{vs} \quad H_1: \beta \neq 0$$

Test Statistic

$$t = \frac{b - 0}{SE(b)} = \frac{b}{SE(b)}$$

Under H₀, this follows a t-distribution with (n-2) degrees of freedom.

Decision Rule

  • Critical Value Approach: Reject H₀ if |t| > t_(α/2, n-2)
  • P-value Approach: Reject H₀ if p-value < α

Testing Specific Values of β

$$H_0: \beta = \beta_0 \quad \text{vs} \quad H_1: \beta \neq \beta_0$$
$$t = \frac{b - \beta_0}{SE(b)}$$

F-Test for Overall Model Significance

Tests whether the regression model is significant:

$$F = \frac{MSR}{MSE} = \frac{ESS/1}{RSS/(n-2)}$$

Where:

  • MSR = Mean Square Regression
  • MSE = Mean Square Error

In simple regression: F = t² (equivalent tests)

Beta Testing Example

Testing if a stock's beta equals 1 (market risk):

H₀: β = 1 vs H₁: β ≠ 1

If b = 1.3, SE(b) = 0.15, n = 52

$$t = \frac{1.3 - 1}{0.15} = \frac{0.3}{0.15} = 2.0$$

With df = 50, t₀.₀₂₅ = 2.009

Since |2.0| < 2.009, fail to reject H₀

Conclusion: Beta is not significantly different from 1
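The beta test above can be reproduced in a few lines (the critical value 2.009 is t_{0.025, 50} from a t-table):

```python
# Estimate, standard error, and hypothesized value from the example
b, se_b, beta_0 = 1.3, 0.15, 1.0

t_stat = (b - beta_0) / se_b   # (1.3 - 1) / 0.15 = 2.0
t_crit = 2.009                 # t_{0.025, 50} from a t-table

reject = abs(t_stat) > t_crit  # 2.0 < 2.009, so we fail to reject H0
print(f"t = {t_stat:.2f}, reject H0: {reject}")
```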

8

Prediction and Forecasting

Regression models can be used to make predictions for new observations. Different types of predictions have different levels of uncertainty.

Point Prediction

For a new observation with X = X₀:

$$\hat{Y}_0 = a + bX_0$$

Confidence Interval for Mean Response

Interval for the expected value of Y when X = X₀:

$$\hat{Y}_0 \pm t_{\alpha/2, n-2} \times s_e \sqrt{\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{S_{XX}}}$$

Prediction Interval for Individual Response

Interval for a single new observation when X = X₀:

$$\hat{Y}_0 \pm t_{\alpha/2, n-2} \times s_e \sqrt{1 + \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{S_{XX}}}$$

Key Differences

  • Confidence Interval: Narrower; uncertainty about the mean
  • Prediction Interval: Wider; includes individual variation
  • Both intervals are narrowest at X = X̄ and widen as X₀ moves away from X̄
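Both interval formulas differ only by the extra "1 +" under the square root. A sketch comparing their half-widths, using hypothetical regression summary statistics (n, s_e, and the t critical value below are illustrative assumptions, not from the chapter's data):

```python
import math

# Hypothetical regression summary statistics for illustration
n, s_e, x_bar, s_xx = 50, 10.0, 20.0, 250.0
t_crit = 2.011                 # approx. t_{0.025, 48} from a t-table
x0 = 25.0                      # new X value to predict at

common = 1 / n + (x0 - x_bar) ** 2 / s_xx
ci_half = t_crit * s_e * math.sqrt(common)       # mean response (confidence interval)
pi_half = t_crit * s_e * math.sqrt(1 + common)   # individual response (prediction interval)

print(f"CI half-width = {ci_half:.2f}, PI half-width = {pi_half:.2f}")
```

The prediction interval is always the wider of the two, because it carries the extra variance of a single new observation on top of the uncertainty about the mean.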

Investment Prediction Example

Using our sales prediction model Ŷ = 60 + 6X:

To predict sales when advertising = $25k:

Ŷ = 60 + 6(25) = 210 (thousand dollars)

If we want a prediction interval for an individual company (assuming s_e = 10, n = 50, etc.), the interval might be:

210 ± 20.1, or [$189.9k, $230.1k]

Extrapolation Warning

Predictions are most reliable within the range of observed X values. Extrapolating beyond this range can be very unreliable.

9

Residual Analysis

Residual analysis involves examining the differences between actual and predicted values to check model assumptions and identify potential problems.

Definition of Residuals

$$e_i = Y_i - \hat{Y}_i = Y_i - (a + bX_i)$$

Properties of Residuals

  • Sum of residuals equals zero: Σeᵢ = 0
  • Residuals are uncorrelated with X: Σeᵢ Xᵢ = 0
  • Residuals are uncorrelated with Ŷ: Σeᵢ Ŷᵢ = 0
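These three properties hold exactly (up to rounding) for any least-squares fit with an intercept; a quick numerical check on a small hypothetical dataset:

```python
# Hypothetical data for illustration
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)

x_bar, y_bar = sum(X) / n, sum(Y) / n
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
     / sum((x - x_bar) ** 2 for x in X))
a = y_bar - b * x_bar

y_hat = [a + b * x for x in X]
e = [y - yh for y, yh in zip(Y, y_hat)]   # residuals

print(sum(e))                                     # sum of residuals: ~0
print(sum(ei * x for ei, x in zip(e, X)))         # residuals vs X: ~0
print(sum(ei * yh for ei, yh in zip(e, y_hat)))   # residuals vs fitted: ~0
```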

Standardized Residuals

$$\text{Standardized Residual} = \frac{e_i}{s_e}$$

Should be approximately N(0,1) if assumptions are met.

Common Residual Plots

1. Residuals vs Fitted Values

  • Good pattern: Random scatter around zero
  • Problems: Curved patterns (non-linearity), funnel shapes (heteroscedasticity)

2. Residuals vs Independent Variable

  • Similar interpretation as residuals vs fitted values
  • Helps identify if transformation of X is needed

3. Normal Q-Q Plot of Residuals

  • Tests normality assumption
  • Points should lie approximately on a straight line

4. Residuals vs Order (Time Series)

  • Checks for autocorrelation in time series data
  • Look for patterns or trends over time
Pattern in Residual Plot       Indicates                     Potential Solution
Random scatter                 Model assumptions satisfied   No action needed
Curved pattern                 Non-linearity                 Transform variables or use polynomial terms
Funnel shape                   Heteroscedasticity            Transform Y or use weighted regression
Systematic pattern over time   Autocorrelation               Use time series methods

Outlier Detection

Standardized residuals with |value| > 2 or 3 may indicate outliers. In finance, these could represent unusual market events or data errors.

10

Limitations and Common Pitfalls

Understanding the limitations of simple linear regression is crucial for proper application and interpretation in financial analysis.

Key Limitations

1. Correlation vs Causation

  • Regression shows association, not causation
  • High R² doesn't imply one variable causes the other
  • Consider confounding variables and reverse causality

2. Linear Relationship Assumption

  • Many financial relationships are non-linear
  • Option pricing, volatility relationships often curved
  • May need transformations or polynomial terms

3. Sensitivity to Outliers

  • Single extreme observations can dramatically affect results
  • Common in financial data due to market crashes, bubbles
  • Consider robust regression methods

4. Extrapolation Risks

  • Predictions outside the data range may be unreliable
  • Economic relationships may break down in extreme conditions
  • Structural breaks may occur during financial crises

Common Pitfalls in Financial Applications

Spurious Regression

When both variables follow trends over time, regression may show significant relationships even when none exist. Common with:

  • Stock prices over time (both trending up)
  • GDP and stock market indices
  • Solution: Use returns or differenced data
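Converting trending price levels into returns before regressing is the standard remedy; a minimal sketch with hypothetical price series:

```python
# Hypothetical price series for two trending assets
prices_a = [100, 102, 101, 105, 108]
prices_b = [50, 51, 53, 52, 55]

def simple_returns(prices):
    """Period-over-period simple returns: (P_t - P_{t-1}) / P_{t-1}."""
    return [(p1 - p0) / p0 for p0, p1 in zip(prices, prices[1:])]

ra = simple_returns(prices_a)
rb = simple_returns(prices_b)

# Regress ra on rb instead of the raw price levels, which both trend upward
print(ra)
print(rb)
```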

Time-Varying Relationships

  • Beta may change during different market regimes
  • Relationships may be unstable across time periods
  • Consider rolling regressions or regime-switching models

Heteroscedasticity in Financial Data

  • Volatility clustering in financial time series
  • Higher volatility during market stress
  • Standard errors may be incorrect

Best Practices

  • Always plot the data before fitting regression
  • Examine residuals carefully
  • Test for structural breaks in long time series
  • Consider economic theory when interpreting results
  • Use robust standard errors when appropriate
  • Validate models on out-of-sample data

11

Chapter Summary

Key Learning Points

  • Simple Linear Regression: Models linear relationship between two continuous variables
  • Least Squares: Method to find best-fitting line by minimizing sum of squared residuals
  • Model Assumptions: Linearity, independence, homoscedasticity, normality
  • Coefficient Interpretation: Slope shows change in Y per unit change in X
  • R²: Proportion of variance in Y explained by X

Financial Applications

  • Calculating beta (systematic risk) of securities
  • Modeling relationships between economic variables
  • Forecasting based on leading indicators
  • Performance attribution analysis
  • Risk factor analysis

Critical Formulas

$$\hat{Y} = a + bX$$

Regression Equation

$$b = \frac{\sum(X_i - \bar{X})(Y_i - \bar{Y})}{\sum(X_i - \bar{X})^2}$$

Slope Coefficient

$$R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}$$

Coefficient of Determination

$$t = \frac{b}{SE(b)}$$

t-Statistic for Slope

Key Insights for CFA Candidates

  • Understand the difference between correlation and regression
  • Know how to interpret regression output and test significance
  • Recognize when regression assumptions are violated
  • Apply regression analysis to portfolio management problems
  • Understand limitations and potential pitfalls

Next Steps

The final chapter will explore Big Data Techniques, covering modern approaches to handling large datasets and advanced analytical methods increasingly important in finance.