Thursday, May 3, 2012

Simple Linear Regression Illustration

Given a set of coordinate points, how do we find the linear regression using least square method? Imagine the set of 10 points like this:

Xi Yi
215 30.8
201 32.5
196 35.4
226 28.1
226 24.4
348 24.1
226 28.5
348 24.2
148 32.8
226 28.0

The regression line will take on the form of y = B0 + B1*x, with variance σ^2. The equations for the variables B0, B1, and σ^2 are:
  • B1 = (Σxi*yi - Σxi*Σyi/n) / (Σxi^2-(Σxi)^2/n)
  • B0 = ӯ(n) - B1*x̄(n)
  • σ^2 = (Σyi^2-n*ӯ(n)^2-B1*(Σxi*yi-Σxi*Σyi/n)) / (n-2)
Now create columns for the square of the x- and y-values, as well as the products between them. Sum each column as well:

Xi Yi
Xi^2 Yi^2 Xi*Yi

215 30.8
46225 948.64 6622

201 32.5
40401 1056.25 6532.5

196 35.4
38416 1253.16 6938.4

226 28.1
51076 789.61 6350.6

226 24.4
51076 595.36 5514.4

348 24.1
121104 580.81 8386.8

226 28.5
51076 812.25 6441

348 24.2
121104 585.64 8421.6

148 32.8
21904 1075.84 4854.4

226 28.0
51076 784 6328







Sum 2360 288.8
593458 8481.56 66389.7
  • B1 = (66389.7-2360*288.8/10) / (593458-2360^2/10) = -0.0484
  • B0 = (288.8/10) - (-0.0484)(2360/10) = 40.3024
  • σ^2 = (8481.56-10*(288.8/10)^2-(-0.0484)*(66389.7-2360*288.8/10)) / 8 = 6.9360
At last, let's compare the results to Excel:


Although the value of the coefficient of determination is quite low, the equation of the least-square best-fit line corresponds with the numbers we obtained. Also, using the function =STEYX(C2:C11,B2:B11) in Excel, the sample standard deviation was calculated to be 2.632951. Squaring that number, we get 6.932, nearly identical as the value we calculated above as well.