Sunday, June 7, 2009

Multiple Regression Analysis Leveraging Excel in C#

Linear Regression via Wikipedia:

In statistics, linear regression is used for two things;

  • to construct a simple formula that will predict a value or values for a variable given the value of another variable.
  • to test whether and how a given variable is related to another variable or variables.

Linear regression is a form of regression analysis in which the relationship between one or more independent variables and another variable, called the dependent variable, is modelled by a least squares function, called a linear regression equation. This function is a linear combination of one or more model parameters, called regression coefficients. A linear regression equation with one independent variable represents a straight line when the predicted value (i.e. the dependent variable from the regression equation) is plotted against the independent variable: this is called a simple linear regression. However, note that "linear" does not refer to this straight line, but rather to the way in which the regression coefficients occur in the regression equation. The results are subject to statistical analysis.

In other words, if you had a hypothesis that a particular variable (denoted by y) was linearly related to one or more other variables (denoted by xi), you could use linear regression to test your hypothesis.  For example, I could test whether property prices in Singapore (my y or dependent variable) is dependent on GDP (my x or independent variable).  Furthermore, I can use “multiple” regression analysis to test the relationship between a dependent variable and multiple independent variables.  Perhaps Singapore property prices, for instance, are a function of both GDP and interest rates.  In this particular example, I have two independent variables that taken together may affect my dependent variable.  I’ll leave it at that for now as I want to jump into code, but there’s a lot more information regarding linear regression scattered across the Web.

Fortunately for many of us, Excel has a built-in function called LINEST that does all the hard work for us.  The LINEST function takes a set of known y and x points and calculates an equation for the best-fit line given the data using the least squares method.   Check out this article for a good example on using Excel to do this.  The results of this function show us the relationship between the dependent and one or more independent variables.  Interpreting the data, of course, is really a whole new topic and outside the scope of this blog post.

Okay, so how can we leverage this built-in functionality in Excel in our own .NET applications?  It’s actually quite simple:

  1. Add a reference the Excel PIA.
  2. Create an instance of the IExcel.Application interface.
  3. Declare a WorksheetFunction variable reference the Excel application’s WorksheetFunction property.
  4. Prepare the parameters for the LINEST function.
  5. Use the Workbook functions class to call the LINEST function.
  6. Parse the results.

To add a reference to the Excel PIA, right-click on the References folder in your project, select Add Reference, and in the .NET tab select Microsoft.Office.Interop.Excel.  For my particular case, I used the v12.0.0.0 version as I had Office 2007 installed.  Once you hit OK, you should see the assembly appear in your References folder.

Next, we declare an instance of the IExcel.Application interface.  This gives us a reference to an Excel instance.  We then can use the WorksheetFunction property to gain access to Excel’s functions such as LINEST.  We also need to declare a “missing” variable as we need to call the LINEST function with a missing parameter since the third parameter is optional – i.e. in Excel, we may use =LINEST(rangeY, rangeY, , true).  Here, I’ve declared them as member variables:

private IExcel.Application xl = new IExcel.Application();
private object missing = System.Type.Missing;

You can then get a reference to the WorksheetFunction interface by using the Worksheet Function property:



IExcel.WorksheetFunction wsf = xl.WorksheetFunction;

And you’re set.  To call the LINEST function, simply call xl.LinEst().  The LinEst function takes 4 arguments, just like in Excel.  For instance:



object[,] result = (object[,])wsf.LinEst(yDataPoints.ToArray(), xAll, missing, true);

For the first two parameters, you need to pass the y and x data sets as double arrays.  In my example, yDataPoints is of type List<double> and I call ToArray() to convert it to a double[] array.  I could pass another double[] array for the x variable data set, but in my in the case above, xAll represents a multi-dimensioned array to contain data points for multiple x variables.  In other words, if I’m regressing y against more than one x variable, as in a multiple regression analysis, I need to pass the data set of x variables as a multi-dimensioned array.


In my particular case, I had a List of a List of doubles.  Each List<double> represented a single data set for a particular x variable.



List<List<double>> xDataPoints = new List<List<double>>();

To convert this into a multi-dimensioned array useful for the LinEst function, one could use the following code:



double[,] xAll = new double[xDataPoints.Count, xDataPoints[0].Count];
for (int i = 0; i < xDataPoints.Count; i++)
{
    for (int j = 0; j < xDataPoints[0].Count; j++)
    {
        xAll[i, j] = xDataPoints[i][j];
    }
}

Now that we have our y data points and x data points ready, you can call the LinEst function as done above.



object[,] result = (object[,])wsf.LinEst(yDataPoints.ToArray(), xAll, missing, true);

The LinEst function returns a multi-dimensioned array that matches the format returned in Excel.  Here’s a screenshot of what the results would look like from the Microsoft article linked above.



One important note for C# developers is that the resulting array starts from index 1 and not 0.  Hence, to get the y-intercept and r-squared values, for instance, you could use:



double yIntercept = (double)result[1, numSecurities];
double rSquared = (double)result[3, 1];

Where numSecurities is the total number of variables involved (y and x variables).  To get the beta co-efficients of each x variable, you would parse the first row in backwards order (as in the image above mn is in first column and m1 is in the second to last column).


Simple, eh?  By leveraging Excel, multiple regression analysis in your C# apps couldn’t be easier…

4 comments:

  1. Thanks for a well written post!

    But this works only when xAll is a squared matrix?

    I want to do a 3nd order polynomial regression of a dataset with more than three values, and I cannot figure out how to set up the second parameter for LINEST(). Excel seems to handle non-squared matrices without problems, but doing the same through .Net Interop, I just receive a COMException saying "LinEst method of WorksheetFunction class failed".

    ReplyDelete
  2. I'm facing the same problem of not being able to provide non-square matrices as input to known x's.

    ReplyDelete
  3. I'm facing the same problem, receiving ComException saying "LinEst method of WorksheetFunction class failed". I want to implement this code,please give any solution to this. It will really be a great help.

    ReplyDelete
  4. Good post! Thanks for give a guide of how to create linear object in c#.net, multiple regression is an extension of simple linear regression. It is used when we want to predict the value of a variable based on the value of two or more other variables. The variable we want to predict is called the dependent variable (or sometimes, the outcome, target or criterion variable). The variables we are using to predict the value of the dependent variable are called the independent variables (or sometimes, the predictor, explanatory or regressor variables).

    ReplyDelete