Regression
1. What is Linear Regression
Linear regression: Technique to model a linear relationship between two quantitative variables.
Goal: Predict values of a response variable (dependent) using an explanatory variable (independent/predictor).
Equations:
- Statistics notation: $\hat{y} = b_0 + b_1 x$
  - $b_1$: Slope (change in $\hat{y}$ per unit change in $x$). $b_0$: Intercept (predicted $y$ when $x = 0$).
- Example: Predicting son's height ($y$) from father's height ($x$): $\widehat{\text{son's height}} = b_0 + b_1 \cdot (\text{father's height})$.
2. Key Calculations
- Slope ($b_1$): $b_1 = r \cdot \dfrac{s_y}{s_x}$
  - $s_x$, $s_y$: Standard deviations of $x$ and $y$. $r$: Pearson's correlation coefficient.
  - Example: For the father-son heights, $b_1 \approx 0.514$.
- Intercept ($b_0$): $b_0 = \bar{y} - b_1 \bar{x}$
  - Example: For the father-son heights, $b_0 \approx 33.89$.
  - Example: Putting slope and intercept together, the fitted line is $\hat{y} \approx 33.89 + 0.514x$ (see the sketch below).
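A minimal Python sketch of these two formulas (assuming NumPy is available); the father/son values below are made up for illustration, not the actual dataset:

```python
# Slope and intercept from summary statistics, cross-checked with a direct fit.
import numpy as np

father = np.array([65.0, 66.5, 67.0, 68.5, 70.0, 71.5])  # x (inches), made-up values
son    = np.array([66.0, 67.5, 67.0, 69.0, 70.5, 70.0])  # y (inches), made-up values

r  = np.corrcoef(father, son)[0, 1]      # Pearson's correlation coefficient
sx = father.std(ddof=1)                  # sample standard deviation of x
sy = son.std(ddof=1)                     # sample standard deviation of y

b1 = r * sy / sx                         # slope: b1 = r * sy / sx
b0 = son.mean() - b1 * father.mean()     # intercept: b0 = y-bar - b1 * x-bar
print(f"fitted line: y-hat = {b0:.2f} + {b1:.3f} x")

# The summary-statistic formulas agree with NumPy's least-squares fit.
slope, intercept = np.polyfit(father, son, 1)
assert np.allclose([b1, b0], [slope, intercept])
```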
3. Residuals & Model Fit
- Residual ($e$): $e = y - \hat{y}$ (observed minus predicted).
  - Interpretation:
    - Positive $e$: Underpredicted value.
    - Negative $e$: Overpredicted value.
  - Example: Father's height = 65 inches → Predicted son's height = 67.30 inches. Observed son's height = 59.8 inches: $e = 59.8 - 67.30 = -7.5$.
    Interpretation: Overpredicted by 7.5 inches.
- Quality of Fit: $r^2$ (Coefficient of Determination):
  - Interpretation: Percentage of variation in $y$ explained by $x$.
  - Example: $r = 0.501 \rightarrow r^2 = 0.251$, so 25.1% of son's height variation is explained by father's height.
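A quick numeric check of the residual and $r^2$ example above, using only the figures given in this section:

```python
# Residual and r^2 using the figures quoted above.
y_hat = 67.30                 # predicted son's height for a 65-inch father
y_obs = 59.8                  # observed son's height

residual = y_obs - y_hat      # e = y - y-hat
print(residual)               # -7.5 -> the model overpredicted by 7.5 inches

r = 0.501                     # correlation between father and son heights
print(r ** 2)                 # ~0.251 -> 25.1% of variation explained
```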
4. Assumptions & Pitfalls
- Linearity: Relationship must be approximately linear (verify with scatterplot).
- Extrapolation: Avoid predictions outside the observed data range.
- Example: Predicting son's height for a father's height of 0 inches is nonsensical.
- Correlation ≠ Causation: High $r^2$ does not imply $x$ causes $y$.
- Outliers: Can disproportionately affect the slope and intercept (see the sketch below).
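A small sketch of the outlier pitfall, assuming NumPy; the data are invented so that one added point visibly drags the slope:

```python
# One influential point can change the fitted slope noticeably (invented data).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.0, 2.9, 4.2, 5.0])
slope_clean, _ = np.polyfit(x, y, 1)

x_out = np.append(x, 10.0)    # add a single far-out observation
y_out = np.append(y, 2.0)
slope_out, _ = np.polyfit(x_out, y_out, 1)

print(f"slope without outlier: {slope_clean:.2f}")   # close to 1
print(f"slope with outlier:    {slope_out:.2f}")     # pulled far below 1
```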
5. Comparison with Correlation
Aspect | Correlation ($r$) | Regression
---|---|---
Symmetry | Symmetric ($r_{xy} = r_{yx}$) | Asymmetric (switching $x$ and $y$ changes the fitted line)
Units | Unitless | Slope has units (units of $y$ per unit of $x$)
Purpose | Measures association | Predicts $y$ from $x$
6. Complementary Tools
- Scatterplots: Visualize linearity, strength, direction, and outliers.
- Robust Measures:
  - Spearman's rank correlation ($r_s$): Use for monotonic but non-linear relationships (see the sketch below).
  - Order Statistics (Median, IQR): Describe skewed data or outliers.
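A short sketch contrasting Pearson's $r$ with Spearman's $r_s$ on a monotonic but non-linear relationship, assuming SciPy; the data are illustrative:

```python
# Pearson vs Spearman on a monotonic but non-linear relationship.
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.arange(1, 11, dtype=float)
y = x ** 3                          # monotonic, strongly curved

r_p, _ = pearsonr(x, y)             # understates the strength of the relationship
r_s, _ = spearmanr(x, y)            # exactly 1: the ranks line up perfectly

print(f"Pearson r:  {r_p:.3f}")
print(f"Spearman r: {r_s:.3f}")
```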
Multiple Linear Regression (MLR)
MLR extends simple linear regression by incorporating multiple independent variables to predict a single dependent variable.
Key Components:
- Model equation: $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$
- Partial regression coefficients: Effect of each $x_i$ on $y$ while holding the other variables constant
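A minimal MLR sketch, assuming statsmodels; the variable names (area, age, price) and the simulated data are hypothetical:

```python
# MLR: price predicted from two variables, with partial regression coefficients.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 50
area = rng.uniform(50, 200, n)                                 # x1 (hypothetical)
age  = rng.uniform(0, 40, n)                                   # x2 (hypothetical)
price = 30 + 1.5 * area - 0.8 * age + rng.normal(0, 10, n)     # y

X = sm.add_constant(np.column_stack([area, age]))              # intercept + predictors
model = sm.OLS(price, X).fit()

print(model.params)    # [b0, b1, b2]; each b_i holds the other predictor fixed
```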
Key Measures:
- Multiple $R^2$: Total variance explained by all predictors
- Adjusted $R^2$: Accounts for the number of predictors (penalizes overfitting)
- F-statistic: Tests overall significance of the model
- t-statistics: Test significance of individual predictors
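A sketch of where these measures live on a fitted statsmodels OLS result (simulated data, hypothetical coefficients):

```python
# Reading the fit measures off a fitted OLS result.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x1, x2 = rng.normal(size=(2, 100))
y = 2.0 + 1.0 * x1 - 0.5 * x2 + rng.normal(scale=0.8, size=100)

res = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

print(res.rsquared)                # multiple R^2
print(res.rsquared_adj)            # adjusted R^2 (penalizes extra predictors)
print(res.fvalue, res.f_pvalue)    # F-statistic: overall model significance
print(res.tvalues, res.pvalues)    # t-statistics: individual predictors
```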
Advanced Concepts:
- Multicollinearity: Correlation among predictors
- VIF (Variance Inflation Factor): Detects multicollinearity
- Standardized coefficients (Beta): Compare predictors with different scales
- Interaction effects: When the effect of one predictor depends on another
- Polynomial regression: Adding higher-order terms ($x^2$, $x^3$, etc.)
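A minimal VIF check for multicollinearity, assuming statsmodels; the second predictor is simulated to be nearly redundant with the first:

```python
# Variance inflation factors; x2 is built to be nearly redundant with x1.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)        # highly correlated with x1
X = sm.add_constant(np.column_stack([x1, x2]))

for i in (1, 2):                                   # skip the constant column
    print(f"VIF for x{i}: {variance_inflation_factor(X, i):.1f}")   # large values flag trouble
```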
Regression Error Analysis
Regression error refers to the discrepancy between observed values and those predicted by the regression model.
Key Error Metrics:
- Residuals: $e_i = y_i - \hat{y}_i$ (observed minus predicted values)
- Mean Squared Error (MSE): Average of squared residuals
- Root Mean Squared Error (RMSE): Square root of MSE, in the same units as $y$
- Mean Absolute Error (MAE): Average of absolute residuals
- Residual Sum of Squares (RSS or SSE): $\sum_i (y_i - \hat{y}_i)^2$
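These metrics computed directly from residuals, assuming NumPy; observed and predicted values are made up:

```python
# Error metrics computed directly from residuals (made-up values).
import numpy as np

y     = np.array([3.0, 5.0, 7.5, 9.0, 11.0])   # observed
y_hat = np.array([2.8, 5.4, 7.0, 9.5, 10.6])   # predicted

resid = y - y_hat                     # residuals e_i = y_i - y_hat_i
rss   = np.sum(resid ** 2)            # residual sum of squares (RSS / SSE)
mse   = np.mean(resid ** 2)           # mean squared error
rmse  = np.sqrt(mse)                  # root mean squared error, in units of y
mae   = np.mean(np.abs(resid))        # mean absolute error

print(rss, mse, rmse, mae)
```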
Diagnostic Tools:
- Residual plots: Visual check for patterns in residuals
- Q-Q plots: Assess normality of residuals
- Cook's distance: Identify influential observations
- Leverage: Measure of how far an observation is from others in X-space
- DFFITS and DFBETAS: Measure influence of observations on model
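A sketch of how these diagnostics can be pulled from a fitted statsmodels OLS result (simulated data):

```python
# Influence and residual diagnostics from a fitted OLS model (simulated data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.normal(size=60)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=60)

res = sm.OLS(y, sm.add_constant(x)).fit()
infl = res.get_influence()

resid    = res.resid                  # raw residuals (for residual and Q-Q plots)
leverage = infl.hat_matrix_diag       # leverage: distance in X-space
cooks_d  = infl.cooks_distance[0]     # Cook's distance per observation
dffits   = infl.dffits[0]             # DFFITS per observation
dfbetas  = infl.dfbetas               # DFBETAS: influence on each coefficient

print(cooks_d.max(), leverage.max())
```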
Common Error Issues:
- Heteroscedasticity: Non-constant variance of errors
- Autocorrelation: Correlation among residuals
- Non-normality: When residuals don't follow normal distribution
- Specification error: Incorrect functional form of model
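A sketch of formal checks for the first three issues, assuming statsmodels and SciPy; the data are simulated and well-behaved, so the tests should not flag anything:

```python
# Formal residual checks: heteroscedasticity, autocorrelation, normality.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson
from scipy.stats import shapiro

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 80)
y = 1.0 + 0.5 * x + rng.normal(scale=0.3, size=80)

X = sm.add_constant(x)
res = sm.OLS(y, X).fit()

bp_stat, bp_pvalue, _, _ = het_breuschpagan(res.resid, X)   # Breusch-Pagan test
dw = durbin_watson(res.resid)                               # ~2 means little autocorrelation
_, shapiro_p = shapiro(res.resid)                           # Shapiro-Wilk normality test

print(bp_pvalue, dw, shapiro_p)
```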
Model Validation:
- Cross-validation: Testing model performance on unseen data
- Train-test split: Divide the dataset into separate sets for model building and validation
- k-fold cross-validation: Repeatedly partition data for robust validation
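A minimal validation sketch with scikit-learn; the data, the 25% test fraction, and the choice of 5 folds are illustrative:

```python
# Train-test split and 5-fold cross-validation with a linear model.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(120, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.4, size=120)

# Hold out 25% of the data, fit on the rest, score on the unseen part.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("held-out R^2:", model.score(X_test, y_test))

# k-fold cross-validation: average R^2 over 5 repeated partitions.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("5-fold mean R^2:", scores.mean())
```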
Connections Between ANOVA and Regression
- ANOVA as a special case of regression: Categorical predictors in regression are equivalent to ANOVA
- Regression ANOVA table: Partitions variance similar to traditional ANOVA
- F-test in regression: Tests overall model significance using ANOVA principles
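A sketch of the ANOVA-regression equivalence, assuming statsmodels, pandas, and SciPy; the groups and values are made up. The F statistic in the regression ANOVA table should match the classic one-way ANOVA F:

```python
# One-way ANOVA as a regression on a categorical predictor (made-up groups).
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import f_oneway

df = pd.DataFrame({
    "group": ["A"] * 5 + ["B"] * 5 + ["C"] * 5,
    "y": [4.1, 3.8, 4.5, 4.0, 4.2,
          5.0, 5.3, 4.9, 5.1, 5.4,
          6.2, 5.9, 6.1, 6.4, 6.0],
})

# Regression with a dummy-coded categorical predictor ...
model = smf.ols("y ~ C(group)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))                 # regression ANOVA table

# ... gives the same F statistic as the classic one-way ANOVA.
groups = [g["y"].to_numpy() for _, g in df.groupby("group")]
print(f_oneway(*groups))
```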