Given a scatterplot, my students could pencil in what they thought was a line of best fit, but they had difficulty describing how they did it, or why their 'best' line was better than any other. Today I let my students decide how to define the line of best fit. What characteristics should it have? Armed with graph paper and access to the board, they came up with a list of possible qualifications and carefully considered each suggestion until we reached a satisfying answer.
First ideas
We started by saying that it should pass through the most points, but students were able to quickly create a few counterexamples, some scatterplots where they knew their idea of a line of best fit wouldn't follow that rule. One counterexample was three collinear points and one point not on that line. When (4,4) is added to the points (2,1), (4,2), and (6,3), students do not think that the line of best fit should be $$\hat{y} = {1 \over 2} x.$$
Another idea: maybe an equal number of points should be above and below the line. The same counterexample as above worked here too; students don't accept a line of best fit being $$x=4.$$
I intervened here and simply gave the definition of the residual so that we'd have something to calculate.
Meatier suggestions
Our next guess was that the sum of residuals should be zero, or at the least, minimal. There was general agreement with this rule; keep reading.
At the same time, another student thought that since we were measuring distance, we should use the absolute values of the residuals. A counterexample here was four collinear points together with a copy of each point two units higher, forming two parallel rows of points. In this case, any line that stays between the two rows gives the same sum of absolute values, so this rule can't single out one line.
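Here is a small Python sketch of that counterexample. The specific coordinates are made up for illustration (four points on the line y = x and four more two units higher); any two parallel rows would behave the same way. Exact fractions are used so the totals can be compared without rounding.

```python
from fractions import Fraction

# Hypothetical data: four collinear points on y = x, plus a copy of each
# point shifted up by two units, giving two parallel rows.
bottom_row = [(x, x) for x in range(1, 5)]
top_row = [(x, x + 2) for x in range(1, 5)]
points = bottom_row + top_row


def sum_abs_residuals(slope, intercept, pts):
    """Sum of |observed y - predicted y| over all points."""
    return sum(abs(y - (slope * x + intercept)) for x, y in pts)


# Three different lines that all stay between the rows for 1 <= x <= 4.
candidate_lines = [
    (Fraction(1), Fraction(1)),        # parallel, exactly halfway
    (Fraction(1), Fraction(1, 2)),     # parallel, closer to the bottom row
    (Fraction(6, 5), Fraction(1, 2)),  # tilted, but still between the rows
]

for slope, intercept in candidate_lines:
    total = sum_abs_residuals(slope, intercept, points)
    print(f"slope={slope}, intercept={intercept}, sum of |residuals|={total}")
```

All three lines give a total of 8: at each x-value the distances to the upper and lower point add up to the two-unit gap between the rows.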
We spent some time deciding if minimizing the sum of residuals would lead to the line of best fit. But with just enough time left in the class, a few students figured it out! They noticed that you could pick any slope, then find a y-intercept which yields a line where the sum of the residuals is equal to zero. Therefore, we couldn't have THE line of best fit if there were infinitely many of them, one for each possible slope.
Here's one way to see this fact:
For any slope, there is a line such that the sum of the residuals is zero.
Drag the purple slider to pick your own slope.
The green line is the least-squares regression line.
There's probably an algebraic way to see this too, considering that all of these zero-sum lines intersect at $$(\bar{x}, \bar{y}).$$
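One algebraic sketch, with notation chosen here for illustration: for a line $$\hat{y} = mx + b,$$ the residuals sum to $$\sum_{i=1}^{n} (y_i - m x_i - b) = n\bar{y} - m n \bar{x} - n b,$$ which is zero exactly when $$b = \bar{y} - m\bar{x}.$$ So each slope has exactly one zero-sum line, and that line passes through $$(\bar{x}, \bar{y}).$$ A quick numeric check in Python, reusing the four points from the first counterexample (the function names are hypothetical helpers, not part of any library):

```python
def zero_sum_intercept(slope, pts):
    """Intercept b = y_bar - slope * x_bar, which makes the residuals sum to zero."""
    n = len(pts)
    x_bar = sum(x for x, _ in pts) / n
    y_bar = sum(y for _, y in pts) / n
    return y_bar - slope * x_bar


def sum_residuals(slope, intercept, pts):
    """Sum of (observed y - predicted y) over all points, signs included."""
    return sum(y - (slope * x + intercept) for x, y in pts)


points = [(2, 1), (4, 2), (6, 3), (4, 4)]

# Try several very different slopes; each gets its own zero-sum line.
for slope in [-3, 0, 0.5, 2, 10]:
    intercept = zero_sum_intercept(slope, points)
    total = sum_residuals(slope, intercept, points)
    print(f"slope={slope}, intercept={intercept}, sum of residuals={total}")
```

Every slope in the list produces a residual sum of zero (up to floating-point rounding), which is why "sum of residuals is zero" cannot pick out a single line.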
Well, we were back to the drawing board, but we revisited why we had rejected the sum of the absolute values of the residuals. Since taking absolute values is only one way of making values nonnegative, we might as well try another: squaring. My students discovered 'least squares.' Next class, they'll start by trying out their newly discovered rule:
Slide to the next stage, then move the blue points to find the line of best fit for the points shown.
Click the button for a new set of points.
The object is to estimate the 'least squares' line.
Stage 1 shows you the points.
Stage 2 gives you the opportunity to estimate the line of best fit.
Stage 3 plots the residuals.
Stage 4 gives you the 'square' of the residuals.
Stage 5 reveals the true least squares regression line.
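For readers without the applet, here is a minimal Python sketch of the same comparison: score a guessed line by its sum of squared residuals, then compare it with the least-squares line. It reuses the four points from the first counterexample, and the helper names are hypothetical. The slope and intercept come from the standard least-squares formulas rather than from anything specific to the applet.

```python
def least_squares_line(pts):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(pts)
    x_bar = sum(x for x, _ in pts) / n
    y_bar = sum(y for _, y in pts) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in pts)
    s_xx = sum((x - x_bar) ** 2 for x, _ in pts)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar  # line passes through (x_bar, y_bar)
    return slope, intercept


def sum_squared_residuals(slope, intercept, pts):
    """Sum of (observed y - predicted y)^2 over all points."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in pts)


points = [(2, 1), (4, 2), (6, 3), (4, 4)]

guess = (0.5, 0.0)  # the line through the three collinear points
best = least_squares_line(points)

print("guess:        ", guess, sum_squared_residuals(*guess, points))
print("least squares:", best, sum_squared_residuals(*best, points))
```

For these points the guess has a sum of squares of 4, while the least-squares line (slope 0.5, intercept 0.5) brings it down to 3.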
Reflection
This lesson gives students some more visual evidence that a sum of residuals equal to zero isn't enough to make a line the line of best fit. While I might not prove that the slope of such a line should be $$r {s_y \over s_x},$$ at least students now think 'least-squares.'
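Short of a proof, the slope formula can at least be checked numerically. The sketch below is only a check on one made-up-for-class data set (the four points from the first counterexample), not a derivation. It computes r as the average product of z-scores with an n - 1 divisor, which is an assumption about which definition of r is in use.

```python
import math
import statistics

points = [(2, 1), (4, 2), (6, 3), (4, 4)]
xs = [x for x, _ in points]
ys = [y for _, y in points]
n = len(points)

x_bar = statistics.mean(xs)
y_bar = statistics.mean(ys)
s_x = statistics.stdev(xs)  # sample standard deviation (n - 1 divisor)
s_y = statistics.stdev(ys)

# Correlation as the average product of z-scores, with an n - 1 divisor.
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in points) / (n - 1)

slope_from_r = r * s_y / s_x

# Least-squares slope computed directly from the deviations.
slope_direct = (
    sum((x - x_bar) * (y - y_bar) for x, y in points)
    / sum((x - x_bar) ** 2 for x in xs)
)

print("r * s_y / s_x:", slope_from_r)
print("direct slope: ", slope_direct)
print("agree:", math.isclose(slope_from_r, slope_direct))
```

Both calculations give a slope of 0.5 for these points, up to floating-point rounding.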
Some of my students were shocked to see me write four or five 'rules' on the board only to discuss counterexamples and discard them, but some day I hope they appreciate the fact that this lesson was more about the process and the level of thinking about proof required to come up with a good rule.
An earlier version of this post appeared on the AP Statistics group and received some great feedback.