623 — Linear regression

Step 1 of 14

Eastern Suburbs: Scatter plot

Each dot on this scatter plot shows one recent house sale in the Eastern Suburbs. The horizontal axis is floor area in square metres and the vertical axis is sale price in millions of dollars. Even before drawing a line, you can already see a gentle upward trend: larger homes generally sell for more.

Check understanding

Looking at the cloud of points for the Eastern Suburbs, what overall pattern do you see?

  1. Larger floor area → higher price
  2. Larger floor area → lower price
  3. No relationship

Step 2 of 14

Fit regression line

A linear regression model fits a single straight line through the scatter of points. This line is chosen to best summarise the overall trend in the data, so that for each floor area the line gives our model’s best guess for the sale price.
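
The lesson does not say how the line is chosen. A minimal sketch, assuming the usual ordinary least squares method (the line that minimises the sum of squared vertical distances to the points), might look like the following. The function name `fit_line` and the five sales are made up for illustration and are not the lesson's data.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of x and y, and variation of x, around their means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up illustrative sales (assumption): floor area in square metres,
# sale price in millions of dollars.
floor_area_m2 = [80, 100, 120, 150, 200]
price_millions = [1.0, 1.3, 1.4, 1.9, 2.4]

slope, intercept = fit_line(floor_area_m2, price_millions)
print(f"price = {intercept:.3f} + {slope:.4f} * floor_area")
```

The slope here reads as "extra millions of dollars per additional square metre", and the intercept is where the line crosses the price axis.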

Step 3 of 14

Make predictions

Once we have the regression equation, we can use it to make predictions. For any floor area on the x-axis we move up to the line and then across to read off the predicted sale price on the y-axis. Every prediction comes from the line, not from guessing individual points.
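
Reading a prediction off the line is the same as plugging a floor area into the regression equation. The sketch below assumes hypothetical values for the intercept and slope, since the lesson does not state the fitted coefficients.

```python
# Hypothetical coefficients for illustration only (assumption).
INTERCEPT = 0.1   # predicted price in millions of dollars at zero floor area
SLOPE = 0.012     # extra millions of dollars per additional square metre


def predict_price(floor_area_m2):
    """Predicted sale price (millions of dollars) read off the regression line."""
    return INTERCEPT + SLOPE * floor_area_m2


predicted = predict_price(150)
print(f"Predicted price for 150 square metres: {predicted:.2f} million dollars")
```

Every floor area gets exactly one predicted price this way, whether or not any house of that size appears in the data.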

Check understanding

When the model predicts a price for a given floor area, where does that prediction come from?

  1. Predictions come from the line
  2. Predictions come from the average
  3. Predictions come from the median

Step 4 of 14

Examine residuals

For each house, the residual is the vertical distance between the actual sale price and the predicted price on the regression line. A positive residual means the house sold for more than the model expected; a negative residual means it sold for less.
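
In code, a residual is simply the actual price minus the predicted price. The sketch below reuses a hypothetical line and three made-up sales to show one positive and one negative residual.

```python
# Hypothetical line and sales for illustration only (assumption).
INTERCEPT, SLOPE = 0.1, 0.012

# (floor area in square metres, actual sale price in millions of dollars)
sales = [(100, 1.5), (150, 1.7), (200, 2.6)]


def residual(floor_area_m2, actual_price):
    """Actual price minus the price predicted by the line."""
    predicted_price = INTERCEPT + SLOPE * floor_area_m2
    return actual_price - predicted_price


residuals = [residual(area, price) for area, price in sales]
for (area, price), r in zip(sales, residuals):
    direction = "more" if r > 0 else "less"
    print(f"{area} m2 sold for {abs(r):.2f} million {direction} than predicted")
```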

Check understanding

On a regression plot, how is a residual measured for a single data point?

  1. As a horizontal distance
  2. As a vertical distance
  3. Along the x-axis

Step 5 of 14

Interpret residuals

Large residuals indicate houses where the model’s prediction is quite far from the actual sale price. A cluster of large residuals may reveal parts of the market where a straight-line model is not capturing important factors, such as luxury properties or heavily renovated homes.

Step 6 of 14

MSE (Eastern)

Mean squared error (MSE) takes every residual, squares it so that positive and negative errors cannot cancel out and large errors count more, and then averages the squared values. A smaller MSE means the model’s predictions are closer to the actual prices overall for the Eastern Suburbs dataset.
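
As a rough sketch, the calculation can be written in a few lines. The prices below are made up for illustration and are not the Eastern Suburbs data.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared residuals."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared) / len(squared)


# Made-up prices in millions of dollars (assumption).
actual_prices = [1.5, 1.7, 2.6]
predicted_prices = [1.3, 1.9, 2.5]

mse = mean_squared_error(actual_prices, predicted_prices)
print(f"MSE: {mse:.3f}")
```

Because the residuals are squared, the MSE is expressed in squared price units, which is why it is mainly useful for comparing models rather than as a price in itself.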

Check understanding

For the Eastern Suburbs data, what does a smaller MSE value tell you about the regression model?

  1. Smaller MSE = better
  2. Larger MSE = better
  3. MSE does not relate to error

Step 7 of 14

R² (Eastern)

The coefficient of determination, R², tells us what proportion of the variation in sale price can be explained by the regression line. An R² value closer to 1 means that differences in floor area explain most of the differences in price for the Eastern Suburbs data.
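
One common way to compute R² is to compare the model's squared errors with the squared errors of always predicting the mean price. The sketch below follows that definition with made-up prices, not the Eastern Suburbs data.

```python
def r_squared(actual, predicted):
    """1 minus (squared error of the model / squared error of the mean)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up prices in millions of dollars (assumption).
actual_prices = [1.5, 1.7, 2.6]
predicted_prices = [1.3, 1.9, 2.5]

r2 = r_squared(actual_prices, predicted_prices)
print(f"R squared: {r2:.2f}")
```

Unlike MSE, R² has no units, so it reads the same whether prices are recorded in dollars or in millions of dollars.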

Check understanding

When interpreting R² for the Eastern Suburbs model, what does a higher value mean?

  1. Higher R² = stronger model
  2. R² shows prediction speed
  3. R² shows number of samples

Step 8 of 14

Western Suburbs: Scatter plot

This scatter plot shows the same variables for houses in Western Sydney. The trend is weaker and the points are more spread out, suggesting that floor area alone does not explain price as strongly as in the Eastern Suburbs.

Check understanding

Comparing the overall pattern of points, how does the Western Sydney trend look compared to the Eastern Suburbs?

  1. Stronger trend
  2. Weaker trend
  3. No trend

Step 9 of 14

Fit regression line (Western)

We now fit a separate regression line for the Western Sydney dataset. Here the line is shallower and the points sit further away from it. The wider scatter around the line, rather than the slope itself, is what indicates a less reliable linear relationship.

Check understanding

Looking at the fitted line for Western Sydney, how would you describe the strength of the fit?

  1. Line fits strongly
  2. Line fits weakly
  3. No line required

Step 10 of 14

Predictions (Western)

We can still make predictions from the Western Sydney regression line, but because the points are more scattered, these predictions are less reliable. Two houses with similar floor areas can end up with quite different sale prices.

Check understanding

Compared with the Eastern model, how reliable are price predictions from the Western Sydney regression line?

  1. Predictions are more reliable
  2. Predictions are less reliable
  3. Predictions are identical

Step 11 of 14

Residuals (Western)

In Western Sydney the residuals form a funnel shape: errors are relatively small for cheaper houses but become much larger for expensive properties. This increasing spread of residuals is called heteroscedasticity and it reduces our confidence in the model for high-price homes.
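
A funnel can be spotted roughly by comparing the typical size of the residuals for cheaper and for more expensive homes. The sketch below is an informal check rather than a formal statistical test; the helper name `spread_by_half` and the numbers are made up for illustration.

```python
def spread_by_half(predicted, residuals):
    """Mean absolute residual for the cheaper and the dearer half of homes."""
    pairs = sorted(zip(predicted, residuals))  # cheapest predicted price first
    mid = len(pairs) // 2
    lower = [abs(r) for _, r in pairs[:mid]]
    upper = [abs(r) for _, r in pairs[mid:]]
    return sum(lower) / len(lower), sum(upper) / len(upper)


# Made-up values in millions of dollars (assumption).
predicted_prices = [0.6, 0.7, 0.8, 1.2, 1.4, 1.6]
residuals = [0.02, -0.03, 0.04, -0.20, 0.30, -0.35]

low_spread, high_spread = spread_by_half(predicted_prices, residuals)
print(f"Typical error, cheaper half: {low_spread:.2f}")
print(f"Typical error, dearer half:  {high_spread:.2f}")
```

A much larger figure for the dearer half is the numerical counterpart of the widening funnel on the residual plot.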

Check understanding

What does the funnel-shaped pattern of residuals in Western Sydney tell you about the model?

  1. Increasing variance reduces reliability
  2. Funnel means perfect fit
  3. Funnel means small errors

Step 12 of 14

MSE Comparison

By comparing the mean squared error of the Eastern and Western models, we can see which regression line has the smaller average squared error. Because both models predict price in the same units, the region with the lower MSE has, overall, more accurate predictions on its own dataset.

Check understanding

If the Eastern model has a lower MSE than the Western model, what does that mean?

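  1. The Eastern model's predictions are more accurate overall
  2. The Western model's predictions are more accurate overall
  3. Both models are equally accurate

Step 13 of 14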

R² Comparison

We can also compare the R² values for the two models. A higher R² indicates that floor area explains more of the variation in price for that region. The model with higher R² captures a clearer linear relationship between size and price.
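
To make the comparison concrete, the sketch below fits a least-squares line to two small made-up datasets, one tight and one scattered, and reports MSE and R² for each. The data, the region labels attached to it, and the helper name `fit_and_score` are assumptions for illustration, not the lesson's figures.

```python
def fit_and_score(xs, ys):
    """Fit a least-squares line and return its MSE and R squared on the same data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    predicted = [intercept + slope * x for x in xs]
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return {"mse": ss_res / n, "r2": 1 - ss_res / ss_tot}


# Made-up data (assumption): floor area in square metres, price in millions.
floor_area_m2 = [80, 100, 120, 150, 200]
eastern_prices = [1.0, 1.3, 1.4, 1.9, 2.4]   # points close to a line
western_prices = [0.6, 0.9, 0.7, 1.2, 1.0]   # points more scattered

eastern = fit_and_score(floor_area_m2, eastern_prices)
western = fit_and_score(floor_area_m2, western_prices)
for name, scores in [("Eastern", eastern), ("Western", western)]:
    print(f"{name}: MSE = {scores['mse']:.4f}, R squared = {scores['r2']:.2f}")
```

With these invented numbers the tighter dataset has both the lower MSE and the higher R², which is the pattern the lesson describes for the two regions.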

Check understanding

If the Eastern model has a higher R² than the Western model, what conclusion can you draw?

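  1. Floor area explains more of the price variation in the Eastern Suburbs
  2. Floor area explains more of the price variation in Western Sydney
  3. Floor area explains price equally well in both regions

Step 14 of 14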

Summary

Linear regression lets us build simple predictive models from numerical data, but their usefulness depends on how well the straight line captures the real pattern. By examining residuals, MSE, and R² across different suburbs, we can judge where the model works well and where we should be cautious.