Using Formulas to Define Statistical Models

In R, and the S language from which it was derived, model formulas (or model formulae as the official documentation calls them) are the standard way of specifying the variables in a model. The patsy package brought the same convenient syntax to Python. With an update we released earlier today, you can now use formulas to specify models in the Extreme Optimization Numerical Libraries for .NET.

Formula basics

Formulas work for almost all types of models. Here, we will use regression as an example. A regression model is defined by one dependent variable and one or more independent variables. To specify a model with dependent variable y and independent variables x1 and x2, we can use the formula:

y ~ x1 + x2

In this formula, the ~ operator separates the dependent variable from the independent variables, and the + operator indicates that x1 and x2 should both be included in the model.

It is useful to think of formulas as operating on sets of variables. So the terms y, x1, and x2 each stand for a set of one variable. The + operator stands for the union of the set of terms on the left and the set of terms on the right. It follows that including the same variable twice results in the same formula:

y ~ x1 + x1 + x2 + x1 + x2

is exactly the same as

y ~ x1 + x2

Other operators exist that specify different combinations of variables; they are all really just set operations. We will get to those in a minute.

In many multivariate models like clustering models, factor analysis or principal component analysis (PCA), there are no dependent variables, only features. In this case, the first part of the formula is omitted.

The Intercept

The special term 1 stands for the intercept (constant) term in a model. Because nearly all linear models include an intercept term, one is included by default. So, again for a regression, the model

y ~ x1 + x2

is equivalent to

y ~ 1 + x1 + x2

To exclude the intercept term, there are two options. The - operator can be used. This operator corresponds to the set difference between terms: it returns the set of terms in its left operand with the terms on the right removed:

y ~ x1 + x2 - 1

Alternatively, the special no-intercept term, 0, can be added as the first term in the model:

y ~ 0 + x1 + x2

Interactions

Models may include not just main effects but also interactions between terms. The interaction between two numerical variables is the element-wise product of the two variables. Interactions are specified by the : operator. For example:

y ~ x1 + x2 + x1:x2
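
The interaction column for two numerical variables is simply formed by multiplying the variables element by element. To make that concrete, here is a small hand-rolled sketch in plain C# with made-up data; the library builds this column for you automatically when the formula contains x1:x2.

// Hypothetical data, for illustration only.
double[] x1 = { 1.0, 2.0, 3.0 };
double[] x2 = { 4.0, 5.0, 6.0 };

// The x1:x2 term corresponds to the element-wise product of the two columns.
double[] interaction = new double[x1.Length];
for (int i = 0; i < x1.Length; i++)
    interaction[i] = x1[i] * x2[i];   // 4, 10, 18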

It is very common to include both the main effects and the interactions in a model. It is so common that there is a special operator for this purpose: the product operator *. So, the above model can also be written more compactly as:

y ~ x1*x2

There is another shortcut operator: ** or ^ (both forms are equivalent). The right operand has to be an integer. It specifies how many times the left operand should be repeated in a product, which amounts to including all terms in the left operand together with their interactions up to that order. So, the model

y ~ (x1 + x2)**3

is equivalent to

y ~ (x1 + x2)*(x1 + x2)*(x1 + x2)

Categorical variables

Most models require variables to be numerical. To include categorical variables in a model, they have to be encoded as numerical variables. This is done automatically. In a formula like

y ~ x + a

where a is a categorical variable, the term a really means “the set of variables needed to fully represent the variable a in the model.”

The standard way to encode categorical variables is called dummy encoding or one hot encoding. For each level of the categorical variable, a corresponding indicator variable is generated that has a 1 where the variable’s value equals that level, and a 0 otherwise. That’s not the full story, however.

The sum of all indicator variables for a categorical variable is a variable with ones everywhere. This is exactly the same as an intercept term. So, if an intercept term is present in a model, then for a categorical variable with 2 levels, you only need 1 indicator variable in the model. For a variable with 3 levels, you only need 2. Adding the last indicator variable does not add any more information. It is redundant.

Handling redundancies, especially for models that include interactions between variables, can get somewhat complicated. Complicated enough for R to get it wrong sometimes. No worries: we’ve done it right.
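
To make the encoding concrete, here is a small hand-rolled sketch in plain C# with made-up data. The library performs this encoding automatically, so this is only an illustration of the idea: one indicator column is built per level, and one of them becomes redundant when an intercept is present.

// Hypothetical categorical variable with three levels.
string[] a = { "red", "green", "blue", "green", "red" };
string[] levels = { "red", "green", "blue" };

// One indicator column per level (full dummy / one-hot encoding).
double[][] indicators = new double[levels.Length][];
for (int j = 0; j < levels.Length; j++)
{
    indicators[j] = new double[a.Length];
    for (int i = 0; i < a.Length; i++)
        indicators[j][i] = a[i] == levels[j] ? 1.0 : 0.0;
}
// indicators[0] = { 1, 0, 0, 0, 1 }   ("red")
// indicators[1] = { 0, 1, 0, 1, 0 }   ("green")
// indicators[2] = { 0, 0, 1, 0, 0 }   ("blue")
// The three columns always sum to a column of ones, i.e. the intercept.
// With an intercept in the model, one indicator (say, the first) is dropped.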

The Catch-All Term

Often, the data set used to fit a model contains just the variables that are used in the model. In this case, it is convenient to use the special . term. This term captures all the variables in the data set that have not yet appeared in the model. So, for example, if the data set contains the variables y, x1, x2, …, x17, but nothing else, then instead of writing out the full formula:

y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + x12 + x13 + x14 + x15 + x16 + x17

you can simply write:

y ~ .

A code example

This example is adapted from the linear regression QuickStart sample. We’ll use C#, but the code is also available in Visual Basic and F#.

The example uses information about the test scores of 200 high school students. We’ll try to predict the science score in terms of some other variables: the scores in reading, math, and social studies, and the gender of the student.

var data = DataFrame.ReadCsv(@"..\..\..\..\Data\hsb2.csv");
var formula = "science ~ math + female + socst + read";
var model = new LinearRegressionModel(data, formula);
model.Compute();
Console.WriteLine(model.Summarize());

We start by reading the data from a CSV file into a data frame. We can then go straight to setting up the model using a formula, and computing it. We then print a summary of the results. This gives us the following output:

# observations: 200
R Squared:            0.4892        Log-likelihood:      -674.63
Adj. R Squared:       0.4788        AIC:                 1359.25
Standard error:       7.1482        BIC:                 1375.75
===================================================================
Source of variation   Sum.Sq.   D.o.f. Var.Est.  F-stat.  prob.
Regression            9543.72     4.00  2385.93  46.6948   0.0000
Residual              9963.78   195.00    51.10
Total                19507.50   199.00    98.03
===================================================================
Parameter                Value   St.Err.       t   p(>t)
Constant              12.32529   3.19356    3.86  0.0002
math                   0.38931   0.07412    5.25  0.0000
female                -2.00976   1.02272   -1.97  0.0508
socst                  0.04984   0.06223    0.80  0.4241
read                   0.33530   0.07278    4.61  0.0000

We see that the math and reading scores were the most significant predictors. In this group, girls scored about 2 points lower than boys, on average.

Limitations

This is just our initial implementation of model formulas. Not everything works entirely as we would like, and there are some limitations.

Expressions are not supported.

This is the biggest limitation. Formulas in R and in patsy can contain expressions of variables. So, for example, you can have

log(y) ~ x1 + x2

In this example, the dependent variable is the natural logarithm of the variable y. When the expression contains operators that could be confused with formula operators, the special I() function is used to indicate that a term is to be evaluated as a mathematical expression:

y ~ x1 + I(x1+x2)

This formula specifies a model with 2 independent variables: x1 and the sum of x1 and x2.  Such expressions are not supported in this release.

Special purpose variables are not supported.

Some models have variables that play a special role. For example, Poisson regression models sometimes have an offset variable that is associated with a parameter fixed at 1. In R, the offset parameter is indicated by the expression offset(variableName). This is currently not supported.

The ‘nested in’ operators / and %in% are not supported.

These operators are not very common and a little confusing, so we decided to leave them out, at least for now.

The no-intercept term must appear first.

Adding the no-intercept term, 0, anywhere other than at the start of the independent variables or features has undefined behavior.

Conclusion

In only a few lines of code, we've loaded a data set, set up a model and printed a summary of the results. We believe that specifying models using formulas is a very useful feature. You can read more in the documentation. The QuickStart samples have also been updated.

Of course, there's nothing like trying it out yourself. Get the update here!

Data Frame Library Preview

One of our core objectives in creating our Numerical Libraries for .NET was to try to close the gap between the exploratory phase of development and the implementation phase, when the results are incorporated into applications. We wanted to address the needs of one of the fastest growing categories of users: data scientists.

Several tools are available already, but few come close to fulfilling our objectives. R is the most widely used platform for data analysis. Unfortunately the interaction with .NET is painful at best. For Python users, pandas is an excellent choice. It has been around for a few years and in that time has really matured in terms of API and performance. But of course, it’s Python only.

Blue Mountain Capital recently released Deedle, an open source library for exploratory data analysis written in F# by the brilliant Tomas Petricek and others. The library is fairly complete and has a nice API, especially for F# users. Using F#’s interactive REPL with Deedle, it is truly possible to do interactive exploratory data analysis with .NET.

Although a good effort has been made to make the library easy to use from other languages like C# and VB.NET, it is still more awkward than it could be. However, as we’ll see, the biggest problem with Deedle is performance.

Our own library for exploratory data analysis, which is well on its way to completion, aims to be just as convenient while maintaining a level of performance comparable to, and sometimes exceeding, that of pandas. I wanted to give you a preview of what’s coming.

Data Frame basics

The basic object in data analysis is a data frame. A data frame looks a lot like a database table. It is a two-dimensional table where the values in each column must be of the same type, but different columns can have different types. It also has a row index and a column index. The column index is used to select specific columns, usually with strings as keys.

Similarly, the row index can be used to select rows, but it also enables automatic data alignment. Imagine two time series, where some dates that are present in one series may not be present in another. If you want to compute the difference between the values in the two series, you want to make sure that you are comparing values for the same date. Here’s a quick example:

var s1 = Vector<double>.Create(
    new double[] { 1, 2, 3, 4 },
    new string[] { "a", "b", "c", "d" });
var s2 = Vector<double>.Create(
    new double[] { 40, 50, 10, 30 },
    new string[] { "d", "e", "a", "c" });
Console.WriteLine(s1);
Console.WriteLine(s2);
Console.WriteLine(s1 + s2);

We first create two vectors indexed using some strings. You’ll notice that the second vector has an extra key (e), and another key (b) is missing. If we add the two vectors, here is what we get:

a: 11
b: -
c: 33
d: 44
e: -

The result has a value for all five keys. It has added values with corresponding keys correctly. Where a key was present in only one vector, a missing value was returned.

All vectors in the Extreme Optimization Numerical Libraries for .NET may have an index, and all matrices may have a row and a column index. Any operations involving indexed arrays will automatically align the data on the indexes.

Indexes on vectors and matrices are weakly typed in that the type of the keys is not part of the type definition. The keys are stored as strongly typed collections, of course.

A data frame is different from an indexed matrix in two ways. First, not all elements must be of the same type. Different columns can have different element types. Second, the indexes on a data frame are strongly typed, while the columns are weakly typed. You can easily access rows and columns by key, but if you want a strongly typed column, you need to do a tiny bit of extra work:

var df = DataFrame.Create<string, string>(
    new string[] { "First", "Second" },
    new Vector<double>[] { s1, s2 });
Console.WriteLine(df);

var s3 = df["First"].As<double>() + 10.0;
Console.WriteLine(s3);

This creates a data frame from the two vectors we used earlier. We then get the first column as a strongly typed vector of doubles and add 10 to it. The output:

df:                 s3:
   First Second
a: 1     10         a: 11
b: 2     -          b: 12
c: 3     30         c: 13
d: 4     40         d: 14
e: -     50         e: -

Notice that the row index on the data frame once again contains all keys in the two indexes, and missing values have been inserted in a vector where the key was not present.

Titanic survivor analysis in 20 lines of C#

Let’s do something more complicated. The Deedle home page starts with an example that computes the percentages of people who survived the Titanic disaster, grouped by gender. The code shown there is currently 20 lines. Using our library, the C# code is also 20 lines, and would look like this:

var titanic = DataFrame.ReadCsv(filename);
// Compute pivot table of counts, grouped by Sex and Survived:
var counts = titanic
    .PivotBy<string, int>("Sex", "Survived")
    .Counts()
    .WithColumnIndex(new string[] { "Died", "Survived" });

// Add a column that contains the total count:
counts["Total"] = counts["Died"].As<double>() + counts["Survived"].As<double>();

// Create the data frame that contains the percentages:
var percentages = DataFrame.Create<string, string>(
    new Dictionary<string, object>() {
        { "Died (%)", 100 * Vector<double>.ElementwiseDivide(
             counts["Died"].As<double>(),
             counts["Total"].As<double>()) },
        { "Survived (%)", 100 * Vector<double>.ElementwiseDivide(
             counts["Survived"].As<double>(),
             counts["Total"].As<double>()) }
    });

After reading the data into a DataFrame from a CSV file, we can compute a pivot table that shows the count for each combination of Sex and Survived. We then add a column that computes the total number of people for each gender. Finally, we construct a new data frame by dividing the first two columns by the total column.

F# has somewhat more succinct syntax, thanks mainly to its better type inference and support for custom operators. We can do the same thing in just 14 lines:

let titanic = DataFrame.ReadCsv(filename)
// Compute pivot table of counts, grouped by Sex and Survived:
let counts =
    titanic.PivotBy<string, int>("Sex", "Survived")
         .Counts()
         .WithColumnIndex([| "Died"; "Survived" |]);

// Add a column that contains the total count:
counts?Total <- counts?Died + counts?Survived

// Create the data frame that contains the percentages:
DataFrame.Create<string, string>([
    ("Died (%)" , 100.0 * counts?Died ./ counts?Total ) ;
    ("Survived (%)" , 100.0 * counts?Survived ./ counts?Total )])

Down the road we will be adding some more F# extras.

Comparing Deedle and Extreme Optimization Performance

So, how does this code perform? Well, we added some timing code, making sure to ‘warm up’ the code so we’re only testing the actual running time and not time spent JIT’ing the code. Here is what we found on an older i7 930:

Platform: x86
Example from Deedle home page (100 iterations)
Extreme Optimization:  11.137 ms (0.1114 ms/iteration)
Deedle:              5395.866 ms (53.9587 ms/iteration)
Ratio: 484.49

Platform: x64
Example from Deedle home page (100 iterations)
Extreme Optimization:  11.049 ms (0.1105 ms/iteration)
Deedle:              8208.071 ms (82.0807 ms/iteration)
Ratio: 742.91

That’s right: our implementation is around 500 times faster than Deedle. We were as surprised as anyone by these results. You just don’t get these kinds of performance differences very often. So we looked a bit further. We summarized our findings in a chart:

[Chart: running times (in ms, log scale) of 9 benchmarks for the Extreme Optimization library and Deedle, x86 and x64]

The chart compares running times (in ms) of 9 benchmarks using the Extreme Optimization library and Deedle running as both x86 and x64. Note that we had to use a logarithmic scale for the running time!

The conclusion: the Extreme Optimization library is 1 to 3 orders of magnitude faster than Deedle.

A couple of points on why these results aren’t entirely fair. First, the two versions don’t always do the same thing. For example: in the time series benchmark, we take advantage of the fact that the indexes are sorted to speed up the alignment. Verifying that the index is sorted is only done once. There are no doubt other cases where our code incurs a cost only once. Second, we use some specific optimizations that could be implemented in Deedle with relative ease. For example: we use an incremental method to compute the moving average, whereas Deedle re-computes the average for each window from scratch.
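
To give an idea of the kind of optimization meant here, the sketch below shows the incremental approach for a simple moving average. This is a plain C# illustration of the general technique, not the library’s actual implementation: by keeping a running sum, each window costs a constant amount of work instead of a full re-summation.

// Simple moving average computed incrementally: add the value entering the
// window and subtract the value leaving it, instead of re-summing each window.
static double[] MovingAverage(double[] values, int window)
{
    var result = new double[values.Length];
    double sum = 0.0;
    for (int i = 0; i < values.Length; i++)
    {
        sum += values[i];
        if (i >= window)
            sum -= values[i - window];   // value that just left the window
        result[i] = i >= window - 1
            ? sum / window               // full window available
            : double.NaN;                // not enough observations yet
    }
    return result;
}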

Still, a difference of 1-3 orders of magnitude is huge, and it begs for an explanation. I believe it’s rather simple, actually:

  • Performance was not a priority in the development of Deedle, especially in this initial release. (We used the first official release version, 0.9.12.) We expect performance to improve as the library matures. Still, 3 orders of magnitude will be hard to make up. Which brings me to my other point:
  • Sometimes, the F# language makes it a little too easy to write compact, elegant code that is very inefficient. For example: Types like tuples and discriminated unions are very convenient, but they are reference types, and so they have consequences for memory locality, cache efficiency, garbage collections…

Summary

Our upcoming data frame library will provide data scientists with a convenient way of interactively working with data sets with millions of rows. Compared to the only other library of its kind, Deedle, performance is currently 1 to 3 orders of magnitude better. A beta version of the library will be published soon. Contact me if you want to participate.

New feature: Wilcoxon-Mann-Whitney and Kruskal-Wallis Tests

We just released an update to our Extreme Optimization Numerical Libraries for .NET that adds some new classes. We’ve added two non-parametric tests: the Mann-Whitney and Kruskal-Wallis tests. These are used to test the hypothesis that two or more samples were drawn from the same distribution. The Mann-Whitney test is used for two samples. The Kruskal-Wallis test is used when there are more than two samples.

For both tests, the test statistic only depends on the ranks of the observations in the combined sample, and no assumption about the distribution of the populations is made. This is the meaning of the term non-parametric in this context.
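
As a concrete illustration of what “depends only on the ranks” means, here is a small hand-rolled sketch in plain C# with made-up data (not the library’s API or implementation) that computes the Mann-Whitney rank-sum statistic for two samples without ties:

// Two hypothetical samples (no tied values, for simplicity).
double[] sample1 = { 1.83, 0.50, 1.62, 2.48, 1.68, 1.88, 1.55 };
double[] sample2 = { 0.88, 0.65, 0.60, 2.05, 1.06, 1.29, 1.11 };

// Pool the observations and remember which sample each one came from.
int n1 = sample1.Length, n2 = sample2.Length;
var pooled = new (double value, int group)[n1 + n2];
for (int i = 0; i < n1; i++) pooled[i] = (sample1[i], 1);
for (int i = 0; i < n2; i++) pooled[n1 + i] = (sample2[i], 2);
Array.Sort(pooled, (a, b) => a.value.CompareTo(b.value));

// The statistic uses only the ranks (the 1-based positions after sorting),
// not the observed values themselves.
double rankSum1 = 0.0;
for (int rank = 1; rank <= pooled.Length; rank++)
    if (pooled[rank - 1].group == 1)
        rankSum1 += rank;

double u1 = rankSum1 - n1 * (n1 + 1) / 2.0;   // Mann-Whitney U for the first sample
Console.WriteLine($"W = {rankSum1}, U = {u1}");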

The Mann-Whitney test, sometimes also called the Wilcoxon-Mann-Whitney test or the Wilcoxon Rank Sum test, is often interpreted to test whether the medians of the distributions are the same. Although a difference in medians is the dominant differentiator if it is present, other factors such as the shape or the spread of the distributions may also be significant.

For relatively small sample sizes, and if no ties are present, we return an exact result for the Mann-Whitney test. For larger samples or when some observations have the same value, the common normal approximation is used.

The Kruskal-Wallis test is an extension of the Mann-Whitney test to more than two samples. We always use an approximation for the distribution. The most common approximation is through a Chi-square distribution. We chose to go with an approximation in terms of the beta distribution that is generally more reliable, especially for smaller sample sizes. For comparison with other statistical software, the chi-square p-value is also available.

We created some QuickStart samples that illustrate how to use the new functionality.

You can also view the documentation on non-parametric tests, or download the trial version.