Question: implement the least-squares method to determine the linear function that best fits the data. The method must also compute the coefficient of determination (r^2) and the standard error of the estimate. Its input is a collection of data points (x, y) and the collection's size, i.e. the number of data points.
Solution: the answer is straightforward; we basically just apply the standard statistics formulas for the least-squares line to the data. If you are not familiar with the formulas and where they come from, here is the link for you. Now, let's take a look at the implementation below:
#include <cstdio>   // printf
#include <cmath>    // pow, sqrt

struct Point {
    double x;
    double y;
};

void leastSqrRegression(struct Point* xyCollection, int dataSize)
{
    if (xyCollection == NULL || dataSize == 0) {
        printf("Empty data set!\n");
        return;
    }

    double SUMx = 0;        // sum of x values
    double SUMy = 0;        // sum of y values
    double SUMxy = 0;       // sum of x * y
    double SUMxx = 0;       // sum of x^2
    double SUMres = 0;      // sum of squared deviations from the mean, St
    double res = 0;         // current squared deviation from the mean
    double slope = 0;       // slope of regression line
    double y_intercept = 0; // y-intercept of regression line
    double SUM_Yres = 0;    // sum of squared residuals about the line, Sr
    double AVGy = 0;        // mean of y
    double AVGx = 0;        // mean of x
    double Yres = 0;        // current squared residual about the line
    double Rsqr = 0;        // coefficient of determination

    // calculate various sums
    for (int i = 0; i < dataSize; i++) {
        // sum of x
        SUMx = SUMx + (xyCollection + i)->x;
        // sum of y
        SUMy = SUMy + (xyCollection + i)->y;
        // sum of x * y
        SUMxy = SUMxy + (xyCollection + i)->x * (xyCollection + i)->y;
        // sum of squared x
        SUMxx = SUMxx + (xyCollection + i)->x * (xyCollection + i)->x;
    }

    // calculate the means of x and y
    AVGy = SUMy / dataSize;
    AVGx = SUMx / dataSize;

    // slope or a1
    slope = (dataSize * SUMxy - SUMx * SUMy) / (dataSize * SUMxx - SUMx * SUMx);

    // y-intercept or a0
    y_intercept = AVGy - slope * AVGx;

    printf("x mean(AVGx) = %0.5E\n", AVGx);
    printf("y mean(AVGy) = %0.5E\n", AVGy);
    printf("\n");
    printf("The linear equation that best fits the given data:\n");
    printf("       y = %2.8lfx + %2.8f\n", slope, y_intercept);
    printf("------------------------------------------------------------\n");
    printf("   Original (x,y)   (y_i - y_avg)^2   (y_i - a_o - a_1*x_i)^2\n");
    printf("------------------------------------------------------------\n");

    // calculate squared residuals, their sums, etc.
    for (int i = 0; i < dataSize; i++) {
        // current (y_i - a0 - a1 * x_i)^2
        Yres = pow(((xyCollection + i)->y - y_intercept - (slope * (xyCollection + i)->x)), 2);
        // sum of (y_i - a0 - a1 * x_i)^2
        SUM_Yres += Yres;
        // current (y_i - AVGy)^2
        res = pow((xyCollection + i)->y - AVGy, 2);
        // sum of (y_i - AVGy)^2
        SUMres += res;
        printf("   (%0.2f %0.2f)    %0.5E    %0.5E\n",
               (xyCollection + i)->x, (xyCollection + i)->y, res, Yres);
    }

    // calculate r^2, the coefficient of determination
    Rsqr = (SUMres - SUM_Yres) / SUMres;

    printf("--------------------------------------------------\n");
    printf("Sum of (y_i - y_avg)^2 = %0.5E\n", SUMres);
    printf("Sum of (y_i - a_o - a_1*x_i)^2 = %0.5E\n", SUM_Yres);
    printf("Standard deviation(St) = %0.5E\n", sqrt(SUMres / (dataSize - 1)));
    printf("Standard error of the estimate(Sr) = %0.5E\n", sqrt(SUM_Yres / (dataSize - 2)));
    printf("Coefficient of determination(r^2) = %0.5E\n", Rsqr);
    printf("Correlation coefficient(r) = %0.5E\n", sqrt(Rsqr));
}
Explanation: assuming that each data point is structured like our Point struct, the method takes an array of Points and the array's size as its parameters.
To advance any further, we must first find these sums:
- SUMx: the sum of all x values of the points
- SUMy: the sum of all y values of the points
- SUMxx: the sum of the squares of the x values, meaning that we square each x value individually and add the results together
- SUMxy: for each point, we multiply its x value by its y value, then add all the products together to find this sum.
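The four sums above can be sketched as a small standalone helper. This is a minimal illustration; the Sums struct and the computeSums name are my own, not part of the solution code, which accumulates the same sums inline:

```cpp
#include <cstddef>

// Hypothetical helper: accumulates the four sums described above
// over n points given as parallel x and y arrays.
struct Sums {
    double SUMx;   // sum of x values
    double SUMy;   // sum of y values
    double SUMxy;  // sum of x * y products
    double SUMxx;  // sum of squared x values
};

Sums computeSums(const double* x, const double* y, std::size_t n) {
    Sums s = {0, 0, 0, 0};
    for (std::size_t i = 0; i < n; ++i) {
        s.SUMx  += x[i];
        s.SUMy  += y[i];
        s.SUMxy += x[i] * y[i];
        s.SUMxx += x[i] * x[i];
    }
    return s;
}
```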
After that, we calculate the means of x and y, which equal the sum of the x values divided by the number of points and the sum of the y values divided by the number of points, respectively.
The slope and y_intercept of the best-fit function can then be determined from the means and the sums we found: slope = (n*SUMxy - SUMx*SUMy) / (n*SUMxx - SUMx^2), and y_intercept = AVGy - slope*AVGx. Thus, the linear function is y = slope*x + y_intercept.
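These two formulas can be sketched in isolation, taking the precomputed sums as plain parameters. The function names here are hypothetical, chosen only for this illustration; the assumptions are n >= 2 and x values that are not all identical (otherwise the denominator n*SUMxx - SUMx*SUMx is zero):

```cpp
// Hypothetical helpers: compute the slope (a1) and y-intercept (a0)
// of the least-squares line directly from the four sums over n points.
// Assumes n >= 2 and not all x values equal (nonzero denominator).
double slopeFromSums(double SUMx, double SUMy, double SUMxy,
                     double SUMxx, int n) {
    return (n * SUMxy - SUMx * SUMy) / (n * SUMxx - SUMx * SUMx);
}

double interceptFromSums(double SUMx, double SUMy, double slope, int n) {
    // AVGy - slope * AVGx
    return SUMy / n - slope * (SUMx / n);
}
```

For points lying exactly on y = 2x + 1, e.g. (0,1), (1,3), (2,5), (3,7), the sums are SUMx = 6, SUMy = 16, SUMxy = 34, SUMxx = 14, and these helpers recover slope 2 and intercept 1.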
Finally, to calculate the coefficient of determination and the standard error of the estimate, we need two more quantities: the sum of the squared residuals of each point about the best-fit line (SUM_Yres) and the sum of the squared deviations of the y values from their mean (SUMres). That's what the second for loop does.
Once we know these two sums, we just apply the formulas: r^2 = (SUMres - SUM_Yres) / SUMres, and the standard error of the estimate is sqrt(SUM_Yres / (n - 2)).
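Those last two formulas can be sketched as standalone helpers as well. Again, the function names are my own for illustration; the assumptions are n > 2 and y values that are not all identical (so the denominator St is nonzero):

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper: Sr = sum of (y_i - a0 - a1*x_i)^2, the squared
// residuals of the points about the fitted line y = slope*x + y_intercept.
double sumSquaredResiduals(const double* x, const double* y, std::size_t n,
                           double slope, double y_intercept) {
    double Sr = 0;
    for (std::size_t i = 0; i < n; ++i) {
        double e = y[i] - y_intercept - slope * x[i];
        Sr += e * e;
    }
    return Sr;
}

// r^2 = (St - Sr) / St, where St = sum of (y_i - y_avg)^2.
// Assumes not all y values are equal, so St != 0.
double rSquared(const double* x, const double* y, std::size_t n,
                double slope, double y_intercept) {
    double SUMy = 0;
    for (std::size_t i = 0; i < n; ++i) SUMy += y[i];
    double AVGy = SUMy / n;
    double St = 0;
    for (std::size_t i = 0; i < n; ++i)
        St += (y[i] - AVGy) * (y[i] - AVGy);
    double Sr = sumSquaredResiduals(x, y, n, slope, y_intercept);
    return (St - Sr) / St;
}

// Standard error of the estimate: sqrt(Sr / (n - 2)). Assumes n > 2.
double standardError(const double* x, const double* y, std::size_t n,
                     double slope, double y_intercept) {
    return std::sqrt(sumSquaredResiduals(x, y, n, slope, y_intercept) / (n - 2));
}
```

For a perfect fit, such as the points (0,1), (1,3), (2,5), (3,7) with slope 2 and intercept 1, Sr is zero, so r^2 is exactly 1 and the standard error is 0.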
As you can see, the challenge is not writing the code to compute the least-squares regression but understanding the logic behind it. Why are the slope, y-intercept, coefficient of determination, and standard error of the estimate calculated this way? Where do the formulas come from?