Thursday, January 5, 2012

Least-Square Linear Regression of Data Using C++



Question: implement the least-square method to determine the linear function that best fits the data. This method also needs to find the coefficient of determination (R^2) and standard error of estimation (E). Input to this method is a collection of data points (x, y) and the collection's size, a.k.a. number of data points.

Solution: the answer is straight forward. We basically just have to apply the statistics formulas for finding the least-square linear function to the data. If you are not familiar with the formulas and where they come from here is the link for you. Now, let's take a look at the implementation below:


#include<iostream>
#include<cmath>
using namespace std;

struct Point
{
   double x;
   double y;
};

void leastSqrRegression(struct Point* xyCollection, int dataSize)
{
   if (xyCollection == NULL || dataSize == 0)
   {
      printf("Empty data set!\n");
      return;
   }

   double SUMx = 0;     //sum of x values
   double SUMy = 0;     //sum of y values
   double SUMxy = 0;    //sum of x * y
   double SUMxx = 0;    //sum of x^2
   double SUMres = 0;   //sum of squared residue
   double res = 0;      //residue squared
   double slope = 0;    //slope of regression line
   double y_intercept = 0; //y intercept of regression line
   double SUM_Yres = 0; //sum of squared of the discrepancies
   double AVGy = 0;     //mean of y
   double AVGx = 0;     //mean of x
   double Yres = 0;     //squared of the discrepancies
   double Rsqr = 0;     //coefficient of determination

   //calculate various sums 
   for (int i = 0; i < dataSize; i++)
   {
      //sum of x
      SUMx = SUMx + (xyCollection + i)->x;
      //sum of y
      SUMy = SUMy + (xyCollection + i)->y;
      //sum of squared x*y
      SUMxy = SUMxy + (xyCollection + i)->x * (xyCollection + i)->y;
      //sum of squared x
      SUMxx = SUMxx + (xyCollection + i)->x * (xyCollection + i)->x;
   }

   //calculate the means of x and y
   AVGy = SUMy / dataSize;
   AVGx = SUMx / dataSize;

   //slope or a1
   slope = (dataSize * SUMxy - SUMx * SUMy) / (dataSize * SUMxx - SUMx*SUMx);

   //y itercept or a0
   y_intercept = AVGy - slope * AVGx;
   
   printf("x mean(AVGx) = %0.5E\n", AVGx);
   printf("y mean(AVGy) = %0.5E\n", AVGy);

   printf ("\n");
   printf ("The linear equation that best fits the given data:\n");
   printf ("       y = %2.8lfx + %2.8f\n", slope, y_intercept);
   printf ("------------------------------------------------------------\n");
   printf ("   Original (x,y)   (y_i - y_avg)^2     (y_i - a_o - a_1*x_i)^2\n");
   printf ("------------------------------------------------------------\n");

   //calculate squared residues, their sum etc.
   for (int i = 0; i < dataSize; i++) 
   {
      //current (y_i - a0 - a1 * x_i)^2
      Yres = pow(((xyCollection + i)->y - y_intercept - (slope * (xyCollection + i)->x)), 2);

      //sum of (y_i - a0 - a1 * x_i)^2
      SUM_Yres += Yres;

      //current residue squared (y_i - AVGy)^2
      res = pow((xyCollection + i)->y - AVGy, 2);

      //sum of squared residues
      SUMres += res;
      
      printf ("   (%0.2f %0.2f)      %0.5E         %0.5E\n", 
       (xyCollection + i)->x, (xyCollection + i)->y, res, Yres);
   }

   //calculate r^2 coefficient of determination
   Rsqr = (SUMres - SUM_Yres) / SUMres;
   
   printf("--------------------------------------------------\n");
   printf("Sum of (y_i - y_avg)^2 = %0.5E\t\n", SUMres);
   printf("Sum of (y_i - a_o - a_1*x_i)^2 = %0.5E\t\n", SUM_Yres);
   printf("Standard deviation(St) = %0.5E\n", sqrt(SUMres / (dataSize - 1)));
   printf("Standard error of the estimate(Sr) = %0.5E\t\n", sqrt(SUM_Yres / (dataSize-2)));
   printf("Coefficent of determination(r^2) = %0.5E\t\n", (SUMres - SUM_Yres)/SUMres);
   printf("Correlation coefficient(r) = %0.5E\t\n", sqrt(Rsqr));

}
 
 
Explanation: assuming that each data point is structured similar to our Point struct, then our method takes an array of Points and the array's size as its parameters.
To advance any further, we must first find these sums:
  • SUMx: the sum of all x values of the points
  • SUMy: the sum of all y values of the points
  • SUMxx: the sum of the square of x values, meaning that we square x values individually and add them togethers
  • SUMxy: for each point we multiply its x value with its y value, then add all the results together to find this sum.
The first for loop is used to calculate those sums simultaneously.
After that, we calculate the means of x and y values which equal sum of x values divided by the number of points and sum of y values divided by the number of points respectively.
slope and y_intercept of the best-fit function can then be determined using the means and the sums we found. Thus, the linear function is y = slope*x + y_intercept.

Finally, to calculate the coefficient of determination and standard error of estimation, we need to find the sum of squared standard deviation (SUM_Yres) of each point from the best-fit linear function and sum of the squared residues (SUMres). That's what the second for loop does.
Once, we know sum of squared residues and sum of squared standard deviation, we just apply formulas to find the coefficient and the standard error.
As you can see, the challenge is not writing the code to compute the least-square regression but being able to understand the logic behind that. Why slope, y_intercept, coefficient of determination and standard error of estimation are calculated that way? Where do the formulas come from? Etc..

 Source

0 comments:

Post a Comment