Sample essay from Teaching Statistics Using Baseball by Jim Albert

CASE STUDY 4-5:  How Important is a Run?

TOPICS COVERED:  Nonlinear regression, transformation, residuals.

In this chapter, we have discussed how important it is for a baseball team to score runs, and we evaluated different batting measures by how strongly they relate to runs scored.  But of course the objective of a baseball team is not solely to score runs; a team wins games by scoring more runs than its opponent.

That raises an interesting question.  How much does a single run contribute toward the goal of winning a baseball game?  If a player is responsible for, say, 20 runs, then how many wins has he contributed to his team?

Bill James discovered a special relationship between the number of wins (W) and losses (L) for a baseball team and the number of runs scored (R) and runs allowed (RA).  He called this relationship “The Pythagorean Method”.  This result says that the ratio between a team’s wins and losses is approximately equal to the square of the ratio of runs scored to runs allowed.  That is, approximately,

W/L = (R/RA)^2.

If we take logs of both sides, we get the equivalent relationship

log(W/L) = 2 log(R/RA).

(We take logs to convert a nonlinear equation in the runs ratio to a linear equation.)  Can we demonstrate that the Pythagorean Method gives a good description of the relationship between the win/loss pattern and the runs scored/allowed for current Major League Baseball teams?
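As a rough illustration of the question posed above, the Pythagorean relationship can be turned into an expected number of wins, since W/(W + L) = (W/L) / (1 + W/L).  The Python sketch below assumes a 162-game season and a hypothetical team that scores and allows 800 runs; the function name and the example numbers are illustrative and do not come from the text.

```python
def pythagorean_wins(runs_scored, runs_allowed, games=162, exponent=2.0):
    """Expected wins implied by W/L = (R/RA)**exponent over `games` games."""
    ratio = (runs_scored / runs_allowed) ** exponent  # predicted W/L
    win_fraction = ratio / (1.0 + ratio)              # W / (W + L)
    return games * win_fraction


# Hypothetical team (not from the text): 800 runs scored and allowed,
# then the same team with 20 additional runs scored.
baseline = pythagorean_wins(800, 800)
improved = pythagorean_wins(820, 800)
print(round(baseline, 1), round(improved, 1))
```

Under these assumptions, the 20 extra runs move the team from about 81 to about 83 expected wins.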

To answer this question, we look at the relevant team statistics (wins, losses, runs scored, and runs allowed) for the 30 Major League teams in the 2000 season.  Looking at Table 4-21, we see that a team with a winning record generally scores more runs than it allows.  The most notable exception is Toronto, which had a win/loss record of 83-79 but actually allowed 47 more runs this season than it scored.  (We suppose that Toronto won a lot of close games in 2000.)  Anaheim, at 82-80, was also narrowly outscored (864 runs to 869).

Table 4-21:  Team statistics for the Major League teams in the 2000 season.

Team            W    L     R    RA     Team            W    L     R    RA
Anaheim        82   80   864   869     Milwaukee      73   89   740   826
Arizona        85   77   792   754     Minnesota      69   93   748   880
Atlanta        95   67   810   714     Montreal       67   95   738   902
Baltimore      74   88   794   913     New York_AL    87   74   871   814
Boston         85   77   792   745     New York_NL    94   68   807   738
Chicago_AL     95   67   978   839     Oakland        91   70   947   813
Chicago_NL     65   97   764   904     Philadelphia   65   97   708   830
Cincinnati     85   77   825   765     Pittsburgh     69   93   793   888
Cleveland      90   72   950   816     San Diego      76   86   752   815
Colorado       82   80   968   897     San Francisco  97   65   925   747
Detroit        79   83   823   827     Seattle        91   71   907   780
Florida        79   82   731   797     St. Louis      95   67   887   771
Houston        72   90   938   944     Tampa Bay      69   92   733   842
Kansas City    77   85   879   930     Texas          71   91   848   974
Los Angeles    86   76   798   729     Toronto        83   79   861   908

To look for the Pythagorean relationship, we compute log(W/L) and log(R/RA) for all teams and plot the two quantities against each other in the scatterplot in Figure 4-8.
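The transformation itself is a one-line computation per team.  The sketch below applies it to three teams copied from Table 4-21; it assumes natural logarithms (the text does not state the base, and the base does not change the fitted slope), and the helper name `log_ratios` is made up for this illustration.

```python
import math

# (W, L, R, RA) for three teams copied from Table 4-21.
teams = {
    "San Francisco": (97, 65, 925, 747),
    "Toronto": (83, 79, 861, 908),
    "Philadelphia": (65, 97, 708, 830),
}


def log_ratios(w, l, r, ra):
    """Return the point (log(R/RA), log(W/L)) for one team."""
    return math.log(r / ra), math.log(w / l)


points = {name: log_ratios(*stats) for name, stats in teams.items()}
for name, (x, y) in points.items():
    print(f"{name:14s} log(R/RA) = {x:+.3f}   log(W/L) = {y:+.3f}")
```

A team that is outscored but still has a winning record, such as Toronto, shows up with a negative log(R/RA) and a positive log(W/L).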

Figure 4-8:  Scatterplot of log runs ratio against log of ratio of wins to losses for Major League team data from the 2000 season.

 

The points in this graph follow a positive linear pattern, which supports a linear relationship between log(W/L) and log(R/RA).

Next we want to fit a “best line” to this graph.  It seems natural to restrict this line to pass through one point.  If a team scores the same number of runs as its opponents (R = RA), then we expect the team to win half of its games (W = L).  In other words, the point (log(R/RA), log(W/L)) = (0, 0) should fall on the line.  With this restriction, we look at line fits of the form

log(W/L) = k log(R/RA).

We choose k by using a least-squares criterion.  It turns out that the sum of squared residuals is minimized when k = 1.91.  Figure 4-9 shows this best line on the scatterplot, together with a plot of the corresponding residuals.  We do not see any linear trend or any other pattern in the residual plot, so our fit appears satisfactory.
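For a line through the origin, the least-squares slope has the closed form k = sum(x*y) / sum(x*x), where x = log(R/RA) and y = log(W/L).  The following sketch applies that formula to the data in Table 4-21.  It assumes natural logarithms, and the variable names are illustrative; the result can be compared with the value of k quoted above, allowing for small differences due to rounding.

```python
import math

# (W, L, R, RA) for the 30 teams, copied from Table 4-21.
TEAMS = {
    "Anaheim": (82, 80, 864, 869), "Milwaukee": (73, 89, 740, 826),
    "Arizona": (85, 77, 792, 754), "Minnesota": (69, 93, 748, 880),
    "Atlanta": (95, 67, 810, 714), "Montreal": (67, 95, 738, 902),
    "Baltimore": (74, 88, 794, 913), "New York_AL": (87, 74, 871, 814),
    "Boston": (85, 77, 792, 745), "New York_NL": (94, 68, 807, 738),
    "Chicago_AL": (95, 67, 978, 839), "Oakland": (91, 70, 947, 813),
    "Chicago_NL": (65, 97, 764, 904), "Philadelphia": (65, 97, 708, 830),
    "Cincinnati": (85, 77, 825, 765), "Pittsburgh": (69, 93, 793, 888),
    "Cleveland": (90, 72, 950, 816), "San Diego": (76, 86, 752, 815),
    "Colorado": (82, 80, 968, 897), "San Francisco": (97, 65, 925, 747),
    "Detroit": (79, 83, 823, 827), "Seattle": (91, 71, 907, 780),
    "Florida": (79, 82, 731, 797), "St. Louis": (95, 67, 887, 771),
    "Houston": (72, 90, 938, 944), "Tampa Bay": (69, 92, 733, 842),
    "Kansas City": (77, 85, 879, 930), "Texas": (71, 91, 848, 974),
    "Los Angeles": (86, 76, 798, 729), "Toronto": (83, 79, 861, 908),
}

x = [math.log(r / ra) for (_, _, r, ra) in TEAMS.values()]
y = [math.log(w / l) for (w, l, _, _) in TEAMS.values()]

# Least squares for y = k * x with no intercept: k = sum(x*y) / sum(x*x).
k = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

# Residuals on the log scale, as in the bottom panel of a residual plot.
residuals = [yi - k * xi for xi, yi in zip(x, y)]
print(f"k = {k:.2f}")
```

Forcing the line through (0, 0) leaves a single parameter to estimate, which is why no intercept term appears in the formula.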

Figure 4-9:  Least-squares fit (top) and residual plot (bottom) for (R/RA, W/L) data.

So based on our analysis, we arrive at the relationship

W/L = (R/RA)^1.91,

which is pretty close to James’ Pythagorean relationship, which uses a power of 2.  How useful is this rule in predicting a team’s number of wins?  To check the accuracy of this relationship in prediction, Table 4-22 gives the actual number of wins, the predicted number of wins (using the above model), and the residual (actual – predicted).  Figure 4-10 displays a stemplot of the absolute residuals.
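A predicted win total follows from the fitted relationship by converting the predicted W/L ratio into a winning fraction and multiplying by the number of games.  The sketch below assumes that the number of games is W + L for each team (a few teams played 161 games) and uses the exponent 1.91 from the fit; the function name is illustrative.

```python
K = 1.91  # exponent from the least-squares fit described in the text


def predicted_wins(w, l, r, ra, k=K):
    """Predicted wins from W/L = (R/RA)**k, assuming W + L games played."""
    ratio = (r / ra) ** k              # predicted W/L
    return (w + l) * ratio / (1.0 + ratio)


# (W, L, R, RA) for three teams copied from Table 4-21.
teams = {
    "Toronto": (83, 79, 861, 908),
    "Houston": (72, 90, 938, 944),
    "San Francisco": (97, 65, 925, 747),
}

for name, (w, l, r, ra) in teams.items():
    pred = predicted_wins(w, l, r, ra)
    print(f"{name:14s} W = {w}  predicted = {pred:.1f}  residual = {w - pred:+.1f}")
```

The printed values can be checked against the corresponding rows of Table 4-22.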

Table 4-22:  Number of wins, predicted number of wins, and residuals using James’ Pythagorean relationship.

Team            W  predicted  residual     Team            W  predicted  residual
Anaheim        82       80.6       1.4     Milwaukee      73       72.5       0.5
Arizona        85       84.8       0.2     Minnesota      69       68.5       0.5
Atlanta        95       90.7       4.3     Montreal       67       65.6       1.4
Baltimore      74       70.2       3.8     New York_AL    87       85.7       1.3
Boston         85       85.7      -0.7     New York_NL    94       87.9       6.1
Chicago_AL     95       92.8       2.2     Oakland        91       92.2      -1.2
Chicago_NL     65       68.1      -3.1     Philadelphia   65       68.8      -3.8
Cincinnati     85       86.8      -1.8     Pittsburgh     69       72.3      -3.3
Cleveland      90       92.7      -2.7     San Diego      76       74.8       1.2
Colorado       82       86.9      -4.9     San Francisco  97       97.3      -0.3
Detroit        79       80.6      -1.6     Seattle        91       92.6      -1.6
Florida        79       73.9       5.1     St. Louis      95       91.8       3.2
Houston        72       80.5      -8.5     Tampa Bay      69       69.9      -0.9
Kansas City    77       76.6       0.4     Texas          71       70.3       0.7
Los Angeles    86       88.0      -2.0     Toronto        83       76.9       6.1

                    ABSOLUTE RESIDUALS  

                 0 | 23455779
                 1 | 22344668
                 2 | 027
                 3 | 12388
                 4 | 39
                 5 | 1
                 6 | 11
                 7 | 
                 8 | 5

Figure 4-10:  Stemplot of the absolute residuals from the fit using James’ Pythagorean relationship.

We see from the stemplot that 24 of the 30 absolute residuals are smaller than 4.  This indicates that for 80% of the teams, this formula predicts the number of wins to within 4 games.
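The count can be reproduced directly from the residual column of Table 4-22.  The sketch below tallies the absolute residuals smaller than 4 and rebuilds the stem-and-leaf display (stem = whole wins, leaf = tenths); the variable names are illustrative.

```python
from collections import defaultdict

# Residuals (actual - predicted) copied from Table 4-22,
# left-hand teams first, then right-hand teams.
residuals = [
    1.4, 0.2, 4.3, 3.8, -0.7, 2.2, -3.1, -1.8, -2.7, -4.9,
    -1.6, 5.1, -8.5, 0.4, -2.0,
    0.5, 0.5, 1.4, 1.3, 6.1, -1.2, -3.8, -3.3, 1.2, -0.3,
    -1.6, 3.2, -0.9, 0.7, 6.1,
]

# Share of teams predicted to within 4 wins.
within_four = sum(1 for r in residuals if abs(r) < 4)
share = within_four / len(residuals)
print(within_four, len(residuals), f"{share:.0%}")

# Stem-and-leaf display of the absolute residuals.
stems = defaultdict(list)
for r in residuals:
    tenths = round(abs(r) * 10)       # e.g. 4.3 -> 43
    stems[tenths // 10].append(tenths % 10)

for stem in range(max(stems) + 1):
    leaves = "".join(str(d) for d in sorted(stems[stem]))
    print(f"{stem} | {leaves}")
```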