Extreme Optimization™: Complexity made simple.

Math and Statistics
Libraries for .NET

Constructing data frames

Extreme Optimization Numerical Libraries for .NET Professional

A data frame is a collection of columns that may have different element types and that has indexed access to rows and columns. The minimal functionality of a data frame is captured by the IDataFrame interface. The main implementation of a data frame is the DataFrame<R, C> class, where R is the type of the row keys and C is the type of the column keys. Vectors and matrices also implement IDataFrame.

Constructing data frames

The DataFrame<R, C> class itself has no public constructors. Instead, data frames are created by performing operations on existing data frames, or by calling one of the factory methods of the static DataFrame class. All these methods take the types of the row and column keys as generic type arguments. In most cases these types can be inferred from the arguments, so they can be omitted.

The simplest method, CreateEmpty<R, C>, takes no arguments. It creates an empty data frame. Because there are no arguments to infer them from, the type of the row keys and the type of the column keys must be specified as generic type arguments. You can add and remove columns using the methods in the next section. The first column that is added determines the row index.
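
As a minimal sketch of the call described above, the following creates an empty data frame with DateTime row keys and string column keys. The using directive for Extreme.DataAnalysis is an assumption about where the DataFrame class lives; adjust it to match your installed version.

```csharp
using System;
using Extreme.DataAnalysis; // assumed namespace for DataFrame

// Both key types are given explicitly: row keys are DateTime values,
// column keys are strings.
var df0 = DataFrame.CreateEmpty<DateTime, string>();
```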

There are several ways to create a data frame from a set of vectors. The method is called FromColumns and has multiple overloads. There are two basic mechanisms: you can specify a dictionary that maps column keys to the column values, or you can use separate collections of column keys and columns. When using a dictionary, it is the first argument:

C#
var data = new Dictionary<string, object>() {
        { "state", new string[] { "Ohio", "Ohio", "Ohio", "Nevada", "Nevada" } },
        { "year", new int[] { 2000, 2001, 2002, 2001, 2002 } },
        { "pop", new double[] { 1.5, 1.7, 3.6, 2.4, 2.9 } }
    };
var df1 = DataFrame.FromColumns(data);

VB
Dim data = New Dictionary(Of String, Object)() From {
        {"state", {"Ohio", "Ohio", "Ohio", "Nevada", "Nevada"}},
        {"year", {2000, 2001, 2002, 2001, 2002}},
        {"pop", {1.5, 1.7, 3.6, 2.4, 2.9}}}
Dim df1 = DataFrame.FromColumns(data)

F#
let (=>) a b = (a, box b)
let ofDict x =
    let d = Dictionary<_, obj>()
    Seq.iter (fun kvp -> d.Add(fst kvp, snd kvp)) x
    d
let data = ofDict [ 
            "state" => [| "Ohio"; "Ohio"; "Ohio"; "Nevada"; "Nevada" |]
            "year" => [| 2000; 2001; 2002; 2001; 2002 |]
            "pop" => [| 1.5; 1.7; 3.6; 2.4; 2.9 |] ]
let df1 = DataFrame.FromColumns<string>(data)

The second argument is optional and specifies the row index. If no row index is provided, the first index of the correct type found in one of the columns is used. If no row index key type is specified, row numbers are used as keys.

C#
var df2 = DataFrame.FromColumns(new Dictionary<string, object>() {
    { "first", new double[] { 11, 14, 17, 93, 55 } },
    { "second", new double[] { 22, 33, 43, 51, 69 } } },
    Index.CreateDateRange(new DateTime(2015, 4, 1), 5));

VB
Dim df2 = DataFrame.FromColumns(New Dictionary(Of String, Object)() From {
    {"first", {11, 14, 17, 93, 55}},
    {"second", {22, 33, 43, 51, 69}}},
    Index.CreateDateRange(New DateTime(2015, 4, 1), 5))

F#
let df2 = DataFrame.FromColumns(ofDict
            [
              "first" => [| 11.0; 14.0; 17.0; 93.0; 55.0 |]
              "second" => [| 22.0; 33.0; 43.0; 51.0; 69.0 |]
            ], 
            Index.CreateDateRange(new DateTime(2015, 4, 1), 5))

It is also possible to supply a column index for the new data frame. In this case, only columns that are present in the column index are included in the new data frame. If a key in the column index cannot be found in the dictionary, the corresponding column is still included, but it will consist entirely of missing values, as in the following example where the key 'debt' is not in the dictionary:

C#
var df2a = DataFrame.FromColumns(data,
    Index.Create(new[] { "one", "two", "three", "four", "five" }),
    Index.Create(new[] { "year", "state", "pop", "debt" }));

VB
Dim df2a = DataFrame.FromColumns(data,
    Index.Create({"one", "two", "three", "four", "five"}),
    Index.Create({"year", "state", "pop", "debt"}))

F#
let df2a = DataFrame.FromColumns(data,
            Index.Create([| "one"; "two"; "three"; "four"; "five"  |]),
            Index.Create([| "year"; "state"; "pop"; "debt"  |]))

A data frame may also be created from a sequence of columns and a separate sequence of keys. The relevant overloads take two arguments. The first is a sequence of vectors. This may be a sequence of strongly typed vectors or, if the columns have different element types, a sequence of IVector objects. The second argument is a sequence of column keys:

C#
var df3 = DataFrame.FromColumns(new Vector<double>[] {
    Vector.Create(11.0, 14.0, 17.0, 93.0, 55.0),
    Vector.Create(22.0, 33.0, 43.0, 51.0, 69.0) },
    Index.Create(new[] { "First", "Second" }));

VB
Dim df3 = DataFrame.FromColumns(New Vector(Of Double)() {
    Vector.Create(11.0, 14.0, 17.0, 93.0, 55.0),
    Vector.Create(22.0, 33.0, 43.0, 51.0, 69.0)},
    Index.Create({"First", "Second"}))

F#
let columns1 : IVector[] = 
    [|
        Vector.Create(11.0, 14.0, 17.0, 93.0, 55.0) ;
        Vector.Create(22.0, 33.0, 43.0, 51.0, 69.0) 
    |]
let df3 = DataFrame.FromColumns(columns1,
            Index.Create([| "First"; "Second" |]))

The next option is to supply a set of tuples where the first item is the column key and the second item is a list of values. This overload simply takes a (parameter) array of tuples as its only argument:

C#
var df5 = DataFrame.FromColumns(
    ("state", new string[] { "Ohio", "Ohio", "Ohio", "Nevada", "Nevada" }),
    ("year", new int[] { 2000, 2001, 2002, 2001, 2002 }),
    ("pop", new double[] { 1.5, 1.7, 3.6, 2.4, 2.9 }));

VB
Dim df5 = DataFrame.FromColumns(
    ("state", {"Ohio", "Ohio", "Ohio", "Nevada", "Nevada"}),
    ("year", {2000, 2001, 2002, 2001, 2002}),
    ("pop", {1.5, 1.7, 3.6, 2.4, 2.9}))

F#
let df5 = DataFrame.FromColumns(
            struct("state", box [| "Ohio"; "Ohio"; "Ohio"; "Nevada"; "Nevada" |]),
            struct("year", box [| 2000; 2001; 2002; 2001; 2002 |]),
            struct("pop", box [| 1.5; 1.7; 3.6; 2.4; 2.9 |]))

Another way to construct a data frame is from a matrix. This can be done in two ways. The simplest is to call the ToDataFrame method on the matrix. The type of the row keys and the column keys must be specified as generic type arguments and must match the type of the existing indexes of the matrix.

C#
var a = Matrix.CreateRandom(100, 5);
a.RowIndex = Index.CreateDateRange(new DateTime(2016, 1, 1), 100);
a.ColumnIndex = Index.Create(new[] { "a", "b", "c", "d", "e" });
var df7 = a.ToDataFrame<DateTime, string>();

VB
Dim a = Matrix.CreateRandom(100, 5)
a.RowIndex = Index.CreateDateRange(New DateTime(2016, 1, 1), 100)
a.ColumnIndex = Index.Create({"a", "b", "c", "d", "e"})
Dim df7 = a.ToDataFrame(Of DateTime, String)()

F#
let a = Matrix.CreateRandom(100, 5)
a.RowIndex <- Index.CreateDateRange(new DateTime(2016, 1, 1), 100)
a.ColumnIndex <- Index.Create([| "a"; "b"; "c"; "d"; "e" |])
let df7 = a.ToDataFrame<DateTime, string>()

You can also supply the row and column indexes as arguments to an overload of this method. In this case, the generic type arguments can be inferred:

C#
var b = Matrix.CreateRandom(100, 5);
var rowIndex = Index.CreateDateRange(new DateTime(2016, 1, 1), 100);
var columnIndex = Index.Create(new[] { "a", "b", "c", "d", "e" });
var df8 = b.ToDataFrame(rowIndex, columnIndex);

VB
Dim b = Matrix.CreateRandom(100, 5)
Dim rowIndex = Index.CreateDateRange(New DateTime(2016, 1, 1), 100)
Dim columnIndex = Index.Create({"a", "b", "c", "d", "e"})
Dim df8 = b.ToDataFrame(rowIndex, columnIndex)

F#
let b = Matrix.CreateRandom(100, 5)
let rowIndex = Index.CreateDateRange(new DateTime(2016, 1, 1), 100)
let columnIndex = Index.Create([|"a"; "b"; "c"; "d"; "e" |])
let df8 = b.ToDataFrame(rowIndex, columnIndex)

Alternatively, you can call the FromMatrix method. This method takes three arguments: the matrix, the row index and the column index. If the row index or the column index is null, the corresponding index from the matrix is used. If the matrix does not have an index of the right type, an InvalidOperationException is thrown.
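
The call described above might look as follows. This is a sketch under stated assumptions: the namespaces in the using directives are assumed, and it is assumed that the generic type arguments can be inferred from the two indexes, as they can for ToDataFrame.

```csharp
using System;
using Extreme.DataAnalysis;   // assumed namespace for DataFrame and Index
using Extreme.Mathematics;    // assumed namespace for Matrix
// On recent .NET versions System.Index also exists, so pin the name explicitly.
using Index = Extreme.DataAnalysis.Index;

var m = Matrix.CreateRandom(100, 5);
var rows = Index.CreateDateRange(new DateTime(2016, 1, 1), 100);
var cols = Index.Create(new[] { "a", "b", "c", "d", "e" });

// Arguments in order: the matrix, the row index, the column index.
var df9 = DataFrame.FromMatrix(m, rows, cols);
```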

A data frame can be created from a sequence or list of .NET objects. The FromObjects method takes one generic type argument, the type of the objects, which can usually be inferred. There are two overloads. The first takes one argument: the sequence of objects of the specified type. It returns a data frame with one row for each object in the sequence and one column for each public property. The column keys correspond to the names of the properties. The second overload takes an additional argument: a list of the properties that should be included in the data frame. The order in which the properties are listed is preserved in the data frame.
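
A minimal sketch of the two FromObjects overloads follows. The StateRecord class is hypothetical and exists only for this illustration; its values are taken from the earlier FromColumns example. Passing the property list as an array of property names is an assumption, as is the Extreme.DataAnalysis namespace.

```csharp
using System.Collections.Generic;
using Extreme.DataAnalysis; // assumed namespace for DataFrame

var records = new List<StateRecord> {
    new StateRecord { State = "Ohio", Year = 2000, Pop = 1.5 },
    new StateRecord { State = "Ohio", Year = 2001, Pop = 1.7 },
    new StateRecord { State = "Nevada", Year = 2001, Pop = 2.4 }
};

// First overload: one row per object, one column per public property.
var dfAll = DataFrame.FromObjects(records);

// Second overload (assumption: properties are identified by name).
// The columns appear in the order listed here.
var dfSome = DataFrame.FromObjects(records, new[] { "Pop", "State" });

// Hypothetical type, used only for this illustration.
public class StateRecord
{
    public string State { get; set; }
    public int Year { get; set; }
    public double Pop { get; set; }
}
```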

Finally, a data frame can be created from a data source like a DataTable or a text file. This is discussed in the next section.

Importing and Exporting

Most data starts out in an external data source. This section outlines how to load data from external data sources into a data frame and how to save a data frame to an external data source.

Importing and exporting text files

Data frames can be read from a text file using the ReadCsv method. This method takes as its first argument the path to the file to be read or a stream to read from. This constructs a data frame containing the data in the file with the columns indexed by the headers from the file. Optionally, you can specify the column that should be used as the row index. In this case you must also provide the data type of the column as a generic type argument.

The WriteCsv<R, C> method lets you export a data frame to CSV format. It is defined as an extension method in the DataFrame class.

C#
var df2a = df2.WithRowIndex<string,int>("state", "year");

VB
Dim df2a = df2.WithRowIndex(Of String, Integer)("state", "year")

F#
let df2a = df2.WithRowIndex<string,int>("state", "year")
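
As a rough sketch of the ReadCsv and WriteCsv calls described in this section: the file names are placeholders, the Extreme.DataAnalysis namespace is assumed, and the text does not state which arguments WriteCsv takes, so passing a target path is an assumption as well.

```csharp
using Extreme.DataAnalysis; // assumed namespace for DataFrame

// "input.csv" is a placeholder path. The headers in the file become
// the column keys of the resulting data frame.
var frame = DataFrame.ReadCsv("input.csv");

// Assumption: WriteCsv accepts the path of the file to write.
// It is an extension method, so it is called on the data frame itself.
frame.WriteCsv("output.csv");
```
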
Importing from data tables

A data frame can be created from a DataTable.

The FromDataTable method has four overloads, which come in two pairs. The first argument in all overloads is a DataTable that specifies the source of the data. In the first pair, an optional second argument is a sequence of strings that contains the names of the columns to retain in the data frame. These two overloads take no generic type arguments and return a data frame with the column names as column keys and row numbers as row keys.

The second pair of overloads takes one generic type argument: the element type of the row keys. The first argument is once again the data table. The second argument is the name of the column that contains the row index. The values in this column must be convertible to the type specified by the generic type argument. An optional third argument once again specifies the names of the columns that should be included in the data frame.
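
The four overloads described above could be called as in the following sketch. The table contents reuse values from the earlier examples, the "id" column is added only for illustration, and passing the column names as a string array is an assumption, as is the Extreme.DataAnalysis namespace.

```csharp
using System.Data;
using Extreme.DataAnalysis; // assumed namespace for DataFrame

// Build a small DataTable to import from.
var table = new DataTable("population");
table.Columns.Add("id", typeof(int));
table.Columns.Add("state", typeof(string));
table.Columns.Add("pop", typeof(double));
table.Rows.Add(1, "Ohio", 1.5);
table.Rows.Add(2, "Nevada", 2.4);

// First pair: row numbers as row keys; all columns, or a named subset.
var t1 = DataFrame.FromDataTable(table);
var t2 = DataFrame.FromDataTable(table, new[] { "state", "pop" });

// Second pair: the "id" column supplies the row keys, of type int.
var t3 = DataFrame.FromDataTable<int>(table, "id");
var t4 = DataFrame.FromDataTable<int>(table, "id", new[] { "state", "pop" });
```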

Importing from other file formats

Several other common file formats are supported, either directly or in separate assemblies. Supported formats (will) include: R, Stata, SAS, Excel, HDF5.

Copyright (c) 2004-2021 ExoAnalytics Inc.
