Extreme Optimization™: Complexity made simple.

Numerical Components
for .NET

Testing Goodness-Of-Fit

It is often necessary to verify whether the distribution of a variable fits a certain theoretical distribution. Goodness-of-fit tests perform this verification. They require all sample values and cannot be performed using summary statistics alone.

The Chi-Square Test for Goodness-of-Fit

The chi-square goodness-of-fit test compares observed cell frequencies from a sample with the cell frequencies expected from the proposed underlying distribution. The test is based on the assumption that the variable is categorical in nature. When the variable is continuous, the chi-square test cannot be used directly. Instead, the data can be grouped into cells and the categorized data used in the test. The outcome then depends on how the continuous data is grouped, so the test may be less reliable.

Two other assumptions are made, namely that the sample is randomly selected from the population, and that the expected frequency of each cell is at least 5. If either of these assumptions is violated, the reliability of the chi-square test may be compromised.

The test statistic is calculated from the differences between the expected and actual cell frequencies. The distribution of the statistic is approximated by the chi-square distribution. The approximation improves as the expected cell frequencies increase.
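As a sketch of the calculation described above, the statistic is the standard Pearson sum. The symbols O_i (observed frequency of cell i), E_i (expected frequency of cell i), and k (number of cells) are notation introduced here for illustration and do not come from the library:

```latex
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}
```

When the proposed distribution is completely specified, so that no parameters are estimated from the sample, the statistic is compared against a chi-square distribution with k - 1 degrees of freedom.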

The null hypothesis is that the observed cell frequencies are equal to the expected frequencies. The alternative hypothesis is that at least one cell frequency is different from its expected value.

This test should not be confused with the chi-square test for the variance of a distribution.

The chi-square goodness-of-fit test is implemented by the ChiSquareGoodnessOfFitTest class.

Example 1 - Fitting a Discrete Distribution

In a gambling game, the payout is directly proportional to the number of sixes that are thrown. A very successful customer has the following results:

# sixes    # throws
0          51
1          35
2          12
3          2

The casino management suspects that the customer may be using weighted dice. The significance level for this test is 0.01.

With three dice per throw, the number of sixes thrown follows a binomial distribution with n = 3 and p = 1/6. The expected values for the 100 throws can be calculated easily using the GetExpectedHistogram(Int32, Int32, Double) method of the BinomialDistribution. We then compare the results to the actual frequencies:

C#
// Number of sixes in a throw of three fair dice.
BinomialDistribution sixesDistribution = new BinomialDistribution(3, 1/6.0);
// Expected cell frequencies for 100 throws.
Histogram expected = sixesDistribution.GetExpectedHistogram(100);
// Observed cell frequencies for 0, 1, 2, and 3 sixes.
Histogram actual = new Histogram(0, 4, new double[] {51, 35, 12, 2});
ChiSquareGoodnessOfFitTest chiSquare = new ChiSquareGoodnessOfFitTest(actual, expected);
chiSquare.SignificanceLevel = 0.01;
Console.WriteLine("Test statistic: {0:F4}", chiSquare.Statistic);
Console.WriteLine("P-value:        {0:F4}", chiSquare.PValue);
Console.WriteLine("Reject null hypothesis? {0}", 
    chiSquare.Reject() ? "yes" : "no");
Visual Basic
Dim sixesDistribution As BinomialDistribution = _
    New BinomialDistribution(3, 1 / 6.0)
Dim expected As Histogram = sixesDistribution.GetExpectedHistogram(100)
Dim actual As Histogram = New Histogram(0, 4, New Double() {51, 35, 12, 2})
Dim chiSquare As ChiSquareGoodnessOfFitTest = _
    New ChiSquareGoodnessOfFitTest(actual, expected)
chiSquare.SignificanceLevel = 0.01
Console.WriteLine("Test statistic: {0:F4}", chiSquare.Statistic)
Console.WriteLine("P-value:        {0:F4}", chiSquare.PValue)
Console.WriteLine("Reject null hypothesis? {0}", _
    IIf(chiSquare.Reject(), "yes", "no"))

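The value of the chi-square statistic is 9.6013, giving a p-value of 0.0223. Because the p-value is greater than the significance level of 0.01, the null hypothesis that the dice are fair cannot be rejected at the 0.01 level. The results do not provide sufficient evidence that the dice are weighted.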
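The statistic reported for this example can be checked by hand. The following is a minimal sketch that uses no library types: the class ChiSquareSketch and its methods are hypothetical names introduced for illustration, and the sketch assumes the observed frequencies 51, 35, 12, 2 from the sample code.

```csharp
using System;

// Hypothetical helper class for illustration; it is not part of the library.
public static class ChiSquareSketch
{
    // Binomial probability P(X = k) for n trials with success probability p.
    public static double BinomialProbability(int n, int k, double p)
    {
        double coefficient = 1.0;
        for (int i = 1; i <= k; i++)
            coefficient *= (double)(n - k + i) / i;
        return coefficient * Math.Pow(p, k) * Math.Pow(1 - p, n - k);
    }

    // Pearson statistic: sum over all cells of (observed - expected)^2 / expected.
    public static double Statistic(double[] observed, double[] expected)
    {
        double sum = 0.0;
        for (int i = 0; i < observed.Length; i++)
        {
            double difference = observed[i] - expected[i];
            sum += difference * difference / expected[i];
        }
        return sum;
    }

    public static void Main()
    {
        double[] observed = { 51, 35, 12, 2 };
        int throws = 100;

        // Expected frequencies for 0..3 sixes with three fair dice.
        double[] expected = new double[observed.Length];
        for (int k = 0; k < expected.Length; k++)
            expected[k] = throws * BinomialProbability(3, k, 1 / 6.0);

        Console.WriteLine("Test statistic: {0:F4}", Statistic(observed, expected));
    }
}
```

The expected frequencies come out to roughly 57.87, 34.72, 6.94, and 0.46, and the sum evaluates to approximately 9.6013, in agreement with the value above. The expected frequency of the last cell is well below 5, so the approximation by the chi-square distribution is rough for this data.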

The One Sample Kolmogorov-Smirnov Test

The one sample Kolmogorov-Smirnov test (KS test) is used to test the hypothesis that a given sample was taken from a proposed continuous distribution. The test statistic is based on a comparison of the empirical distribution of the sample to the proposed distribution.

One of the advantages of the KS test is that it can be applied to any continuous distribution. On the other hand, it can't be applied to discrete distributions, and is more sensitive near the center of the distribution than at the tails.

The biggest drawback is that the distribution must be completely specified. If one or more of the distribution's parameters are estimated from the sample, the test statistic no longer follows the Kolmogorov-Smirnov distribution.

The null hypothesis is always that the population underlying the sample has the proposed distribution. The alternative hypothesis is that the population does not have the proposed distribution.

There is also a two sample Kolmogorov-Smirnov test, which is used to test whether two samples were taken from the same, unknown distribution.

The one sample Kolmogorov-Smirnov test is implemented by the OneSampleKolmogorovSmirnovTest class. It has three constructors. The second constructor has two parameters. The first is a NumericalVariable object that specifies the sample. The second is a Func<Double, Double> delegate that specifies the cumulative distribution function of the distribution being tested. The third constructor also takes two parameters. The first parameter is once again a NumericalVariable object. The second parameter must be of a type derived from ContinuousDistribution.
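To illustrate what the test compares, the following minimal sketch computes the KS statistic, the largest vertical distance between the empirical distribution function of the sample and the proposed cumulative distribution function. It uses no library types. The class KolmogorovSmirnovSketch is a hypothetical name introduced here, and the uniform distribution on [0, 1] is chosen only because its cumulative distribution function is easy to write down.

```csharp
using System;

// Hypothetical helper class for illustration; it is not part of the library.
public static class KolmogorovSmirnovSketch
{
    // Largest distance between the empirical CDF of the sample and the given CDF.
    public static double Statistic(double[] sample, Func<double, double> cdf)
    {
        double[] sorted = (double[])sample.Clone();
        Array.Sort(sorted);
        int n = sorted.Length;
        double d = 0.0;
        for (int i = 0; i < n; i++)
        {
            double f = cdf(sorted[i]);
            // The empirical CDF jumps from i/n to (i+1)/n at sorted[i],
            // so both sides of the jump must be checked.
            double above = (i + 1.0) / n - f;
            double below = f - (double)i / n;
            d = Math.Max(d, Math.Max(above, below));
        }
        return d;
    }

    public static void Main()
    {
        double[] sample = { 0.7, 0.1, 0.4 };
        // CDF of the uniform distribution on [0, 1].
        Func<double, double> uniformCdf = x => Math.Min(1.0, Math.Max(0.0, x));
        Console.WriteLine("Test statistic: {0:F4}", Statistic(sample, uniformCdf));
    }
}
```

For this three-point sample the largest distance occurs at 0.7, where the empirical distribution function reaches 1 while the uniform distribution function is 0.7, so the statistic is 0.3. Computing the p-value requires the Kolmogorov-Smirnov distribution and is not attempted in this sketch.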

Example

In this example, we draw a sample from a lognormal distribution and test whether it could have come from a similar-looking Weibull distribution.

C#
WeibullDistribution weibull = new WeibullDistribution(2, 1);
LognormalDistribution logNormal = new LognormalDistribution(0, 1);
DenseVector logNormalData = Vector.Create(25);
logNormal.GetRandomVariates(new System.Random(), logNormalData);
NumericalVariable logNormalSample = new NumericalVariable(logNormalData);
OneSampleKolmogorovSmirnovTest ksTest = 
    new OneSampleKolmogorovSmirnovTest(logNormalSample, weibull);
Console.WriteLine("Test statistic: {0:F4}", ksTest.Statistic);
Console.WriteLine("P-value:        {0:F4}", ksTest.PValue);
Console.WriteLine("Reject null hypothesis? {0}", 
    ksTest.Reject() ? "yes" : "no");
Visual Basic
Dim weibull As WeibullDistribution = New WeibullDistribution(2, 1)
Dim logNormal As LognormalDistribution = New LognormalDistribution(0, 1)
Dim logNormalData As DenseVector = Vector.Create(25)
logNormal.GetRandomVariates(New System.Random, logNormalData)
Dim logNormalSample As NumericalVariable = New NumericalVariable(logNormalData)
Dim ksTest As OneSampleKolmogorovSmirnovTest = _
    New OneSampleKolmogorovSmirnovTest(logNormalSample, weibull)
Console.WriteLine("Test statistic: {0:F4}", ksTest.Statistic)
Console.WriteLine("P-value:        {0:F4}", ksTest.PValue)
Console.WriteLine("Reject null hypothesis? {0}", _
    IIf(ksTest.Reject(), "yes", "no"))

First we create a Weibull and a lognormal distribution. We then create a DenseVector and fill it with random variates from the lognormal distribution using its GetRandomVariates method. We then create a NumericalVariable from the vector.

Because we use random samples, the results of the test are different on each run. The p-value typically falls anywhere from 0.03 to 0.3. We can conclude from this that a lognormal distribution cannot be reliably distinguished from a Weibull distribution using only 25 sample points.

The Two Sample Kolmogorov-Smirnov Test

The two sample Kolmogorov-Smirnov test is used to test the hypothesis that two samples come from a population with the same, unknown distribution.

The null hypothesis is always that the two samples come from the same underlying distribution. The alternative hypothesis is always that the two samples come from different distributions.

The two sample Kolmogorov-Smirnov test is implemented by the TwoSampleKolmogorovSmirnovTest class. It has two constructors. The first constructor takes no arguments. The second constructor has two parameters. Both are NumericalVariable objects that specify the two samples that are being compared.
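The two sample statistic is the largest vertical distance between the empirical distribution functions of the two samples. The following minimal sketch computes it without any library types. The class TwoSampleKsSketch and the small data sets are hypothetical and serve only as an illustration.

```csharp
using System;

// Hypothetical helper class for illustration; it is not part of the library.
public static class TwoSampleKsSketch
{
    // Largest distance between the empirical CDFs of two samples.
    public static double Statistic(double[] first, double[] second)
    {
        double[] a = (double[])first.Clone();
        double[] b = (double[])second.Clone();
        Array.Sort(a);
        Array.Sort(b);

        int i = 0, j = 0;
        double d = 0.0;
        // Walk through both sorted samples in increasing order of value.
        // Once one sample is exhausted, the distance can only shrink.
        while (i < a.Length && j < b.Length)
        {
            double x = Math.Min(a[i], b[j]);
            // Advance past every observation equal to x so ties are handled together.
            while (i < a.Length && a[i] <= x) i++;
            while (j < b.Length && b[j] <= x) j++;
            double difference = Math.Abs((double)i / a.Length - (double)j / b.Length);
            d = Math.Max(d, difference);
        }
        return d;
    }

    public static void Main()
    {
        double[] first = { 1.0, 2.0, 3.0 };
        double[] second = { 2.5, 3.5, 4.5 };
        Console.WriteLine("Test statistic: {0:F4}", Statistic(first, second));
    }
}
```

For these two samples the distance reaches 2/3, for instance at x = 2, where two thirds of the first sample but none of the second sample lie at or below x.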

Example

We investigate whether we can distinguish a sample taken from a lognormal distribution from a sample taken from a similar looking Weibull distribution. We use the lognormal samples we created in the previous section.

C#
DenseVector weibullData = Vector.Create(25);
weibull.GetRandomVariates(new System.Random(), weibullData);
NumericalVariable weibullSample = new NumericalVariable(weibullData);
TwoSampleKolmogorovSmirnovTest ksTest2 = 
    new TwoSampleKolmogorovSmirnovTest(logNormalSample, weibullSample);
Console.WriteLine("Test statistic: {0:F4}", ksTest2.Statistic);
Console.WriteLine("P-value:        {0:F4}", ksTest2.PValue);
Console.WriteLine("Reject null hypothesis? {0}", 
    ksTest2.Reject() ? "yes" : "no");
Visual Basic
Dim weibullData As DenseVector = Vector.Create(25)
weibull.GetRandomVariates(New System.Random, weibullData)
Dim weibullSample As NumericalVariable = New NumericalVariable(weibullData)
Dim ksTest2 As TwoSampleKolmogorovSmirnovTest = _
    New TwoSampleKolmogorovSmirnovTest(logNormalSample, weibullSample)
Console.WriteLine("Test statistic: {0:F4}", ksTest2.Statistic)
Console.WriteLine("P-value:        {0:F4}", ksTest2.PValue)
Console.WriteLine("Reject null hypothesis? {0}", _
    IIf(ksTest2.Reject(), "yes", "no"))

The Anderson-Darling Test for Normality

The Anderson-Darling test is a one sample test of normality. It is a variation of the Kolmogorov-Smirnov test that assigns more weight to the tails of the distribution. Unlike the Kolmogorov-Smirnov test, the distribution of the test statistic depends on the distribution being tested. By default, the parameters of the distribution are estimated from the sample.
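The extra weight on the tails can be seen in the standard definition of the statistic. In the sketch below, the notation is introduced here for illustration and does not come from the library: n is the sample size, F_n is the empirical distribution function, F is the cumulative distribution function of the normal distribution being tested, and x_(1) ≤ ... ≤ x_(n) are the sorted sample values.

```latex
A^2 = n \int_{-\infty}^{\infty} \frac{\bigl(F_n(x) - F(x)\bigr)^2}{F(x)\,\bigl(1 - F(x)\bigr)} \, dF(x)
    = -n - \frac{1}{n} \sum_{i=1}^{n} (2i - 1)
      \Bigl[ \ln F\bigl(x_{(i)}\bigr) + \ln\Bigl(1 - F\bigl(x_{(n+1-i)}\bigr)\Bigr) \Bigr]
```

The weight 1 / [F(x)(1 - F(x))] grows without bound as F(x) approaches 0 or 1, which is why deviations in the tails count more heavily than deviations near the center.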

The null hypothesis is always that the population underlying the sample follows a normal distribution. The alternative hypothesis is always that the underlying population does not follow a normal distribution.

The Anderson-Darling test is implemented by the AndersonDarlingTest class. It has three constructors. The first constructor has no parameters. The second constructor has one parameter: a NumericalVariable object that specifies the sample to be tested. The third constructor has three parameters. The first is once again a NumericalVariable that specifies the sample. The second and third parameters are the mean and standard deviation of the normal distribution being tested. If no values are provided, the values are estimated from the sample.

Example

We investigate the strength of polished airplane windows. We want to verify that the measured strengths follow a normal distribution. We have a total of 31 measurements.

C#
NumericalVariable strength = new NumericalVariable(new double[] 
    {18.830, 20.800, 21.657, 23.030, 23.230, 24.050, 
        24.321, 25.500, 25.520, 25.800, 26.690, 26.770, 
        26.780, 27.050, 27.670, 29.900, 31.110, 33.200, 
        33.730, 33.760, 33.890, 34.760, 35.750, 35.910, 
        36.980, 37.080, 37.090, 39.580, 44.045, 45.290,
        45.381});
AndersonDarlingTest adTest = new AndersonDarlingTest(strength, 30.81, 7.38);
Console.WriteLine("Test statistic: {0:F4}", adTest.Statistic);
Console.WriteLine("P-value:        {0:F4}", adTest.PValue);
Console.WriteLine("Reject null hypothesis? {0}", 
    adTest.Reject() ? "yes" : "no");
Visual Basic
Dim strength As NumericalVariable = New NumericalVariable(New Double() _
    {18.83, 20.8, 21.657, 23.03, 23.23, 24.05, _
        24.321, 25.5, 25.52, 25.8, 26.69, 26.77, _
        26.78, 27.05, 27.67, 29.9, 31.11, 33.2, _
        33.73, 33.76, 33.89, 34.76, 35.75, 35.91, _
        36.98, 37.08, 37.09, 39.58, 44.045, 45.29, _
        45.381})
Dim adTest As AndersonDarlingTest = New AndersonDarlingTest(strength, 30.81, 7.38)
Console.WriteLine("Test statistic: {0:F4}", adTest.Statistic)
Console.WriteLine("P-value:        {0:F4}", adTest.PValue)
Console.WriteLine("Reject null hypothesis? {0}", _
    IIf(adTest.Reject(), "yes", "no"))

The value of the Anderson-Darling statistic is 0.5322, corresponding to a p-value of 0.8263. We cannot reject the hypothesis that the window strengths follow a normal distribution.

Copyright (c) 2004-2011 ExoAnalytics Inc.