A data frame is a collection of columns that may have different element types
and that has indexed access to rows and columns.
The minimal functionality of a data frame is captured by the
IDataFrame
interface. The main implementation of a data frame is the
DataFrame<R, C>
class. Vectors and matrices also implement
IDataFrame.
The DataFrame<R, C> class
itself has no constructors. Instead, data frames are created
by performing operations on existing data frames,
or by calling one of the factory methods of the static
DataFrame
class.
All these methods take the type of the row and column keys as
generic type arguments. However, in most cases they can be
inferred from the arguments so they can be omitted.
The simplest method,
CreateEmpty<R, C>,
takes no arguments. It creates an empty data frame.
The type of the row keys and the column keys must be specified
as generic type arguments.
You can add and remove columns using the methods in the next section.
The first column that is added determines the row index.
There are several ways to create a data frame from a set of vectors.
The method is called
FromColumns
and has multiple overloads.
There are two basic mechanisms:
you can specify a dictionary that maps column keys to the column values,
or you can use separate collections of column keys and columns.
When using a dictionary, it is the first argument:
var data = new Dictionary<string, object>() {
{ "state", new string[] { "Ohio", "Ohio", "Ohio", "Nevada", "Nevada" } },
{ "year", new int[] { 2000, 2001, 2002, 2001, 2002 } },
{ "pop", new double[] { 1.5, 1.7, 3.6, 2.4, 2.9 } }
};
var df1 = DataFrame.FromColumns(data);
Dim data = New Dictionary(Of String, Object)() From {
{"state", {"Ohio", "Ohio", "Ohio", "Nevada", "Nevada"}},
{"year", {2000, 2001, 2002, 2001, 2002}},
{"pop", {1.5, 1.7, 3.6, 2.4, 2.9}}}
Dim df1 = DataFrame.FromColumns(data)
let (=>) a b = (a, box b)
let ofDict x =
let d = Dictionary<_, obj>()
Seq.iter (fun kvp -> d.Add(fst kvp, snd kvp)) x
d
let data = ofDict [
"state" => [| "Ohio"; "Ohio"; "Ohio"; "Nevada"; "Nevada" |]
"year" => [| 2000; 2001; 2002; 2001; 2002 |]
"pop" => [| 1.5; 1.7; 3.6; 2.4; 2.9 |] ]
let df1 = DataFrame.FromColumns<string>(data)
The second argument is optional and specifies the row index.
If no row index is provided, the first index of the correct type
found in one of the columns is used. If no row index key type is specified,
row numbers are used as keys.
var df2 = DataFrame.FromColumns(new Dictionary<string, object>() {
{ "first", new double[] { 11, 14, 17, 93, 55 } },
{ "second", new double[] { 22, 33, 43, 51, 69 } } },
Index.CreateDateRange(new DateTime(2015, 4, 1), 5));
Dim df2 = DataFrame.FromColumns(New Dictionary(Of String, Object)() From {
{"first", {11, 14, 17, 93, 55}},
{"second", {22, 33, 43, 51, 69}}},
Index.CreateDateRange(New DateTime(2015, 4, 1), 5))
let df2 = DataFrame.FromColumns(ofDict
[
"first" => [| 11.0; 14.0; 17.0; 93.0; 55.0 |],
"second" => [| 22.0; 33.0; 43.0; 51.0; 69.0 |]
],
Index.CreateDateRange(new DateTime(2015, 4, 1), 5))
It is also possible to supply a column index for the new data frame.
In this case, only columns that are present in the column index
are included in the new data frame.
If a key in the column index cannot be found in the dictionary,
the corresponding column is still included,
but it will consist entirely of missing values,
as in the following example where the key 'debt' is not
in the dictionary:
var df2a = DataFrame.FromColumns(data,
Index.Create(new[] { "one", "two", "three", "four", "five" }),
Index.Create(new[] { "year", "state", "pop", "debt" }));
Dim df2a = DataFrame.FromColumns(data,
Index.Create({"one", "two", "three", "four", "five"}),
Index.Create({"year", "state", "pop", "debt"}))
let df2a = DataFrame.FromColumns(data,
Index.Create([| "one"; "two"; "three"; "four"; "five" |]),
Index.Create([| "year"; "state"; "pop"; "debt" |]))
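The selection rule for a column index can also be pictured without the library. The following is a minimal sketch in plain C#, not the library's implementation: the helper SelectColumns and the "extra" column are hypothetical, and double.NaN stands in for a missing value, which is an assumption about how missing data is represented.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Source columns keyed by name (numeric columns only, to keep the sketch short).
// The "extra" column is illustrative and not part of the original example.
var source = new Dictionary<string, double[]>
{
    { "year", new double[] { 2000, 2001, 2002, 2001, 2002 } },
    { "pop", new double[] { 1.5, 1.7, 3.6, 2.4, 2.9 } },
    { "extra", new double[] { 1, 2, 3, 4, 5 } }
};

// Requested column index: "extra" is left out, "debt" is not in the source.
var columnIndex = new[] { "pop", "year", "debt" };

var selected = SelectColumns(source, columnIndex, rowCount: 5);

foreach (var (key, values) in selected)
    Console.WriteLine($"{key}: {string.Join(", ", values)}");

// Hypothetical helper: keeps only the requested keys, in the requested order,
// and fills columns that are absent from the source with NaN.
static List<(string Key, double[] Values)> SelectColumns(
    Dictionary<string, double[]> columns, IEnumerable<string> keys, int rowCount)
{
    var result = new List<(string Key, double[] Values)>();
    foreach (var key in keys)
    {
        if (!columns.TryGetValue(key, out var values))
            values = Enumerable.Repeat(double.NaN, rowCount).ToArray();
        result.Add((key, values));
    }
    return result;
}
```

In this sketch the "extra" column is dropped because it is not in the column index, while "debt" appears as a column that contains only missing values.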
A data frame may be created from a sequence of columns and a sequence of keys
separately. The relevant overloads take two arguments. The first is a
sequence of vectors. This may be a sequence of strongly typed vectors or, if the columns
have different element types, a sequence of IVector
objects. The second argument is a sequence of column keys:
var df3 = DataFrame.FromColumns(new Vector<double>[] {
Vector.Create(11.0, 14.0, 17.0, 93.0, 55.0),
Vector.Create(22.0, 33.0, 43.0, 51.0, 69.0) },
Index.Create(new[] { "First", "Second" }));
Dim df3 = DataFrame.FromColumns(New Vector(Of Double)() {
Vector.Create(11.0, 14.0, 17.0, 93.0, 55.0),
Vector.Create(22.0, 33.0, 43.0, 51.0, 69.0)},
Index.Create({"First", "Second"}))
let columns1 : IVector[] =
[|
Vector.Create(11.0, 14.0, 17.0, 93.0, 55.0) ;
Vector.Create(22.0, 33.0, 43.0, 51.0, 69.0)
|]
let df3 = DataFrame.FromColumns(columns1,
Index.Create([| "First"; "Second" |]))
The next option is to supply a set of tuples where the first item
is the column key and the second item is a list of values.
This overload simply takes a (parameter) array of tuples as its only argument:
var df5 = DataFrame.FromColumns(
("state", new string[] { "Ohio", "Ohio", "Ohio", "Nevada", "Nevada" }),
("year", new int[] { 2000, 2001, 2002, 2001, 2002 }),
("pop", new double[] { 1.5, 1.7, 3.6, 2.4, 2.9 }));
Dim CreateTuple As Func(Of String, Object, Tuple(Of String, Object)) =
Function(x, y) Tuple.Create(x, y)
Dim df5 = DataFrame.FromColumns(
("state", {"Ohio", "Ohio", "Ohio", "Nevada", "Nevada"}),
("year", {2000, 2001, 2002, 2001, 2002}),
("pop", {1.5, 1.7, 3.6, 2.4, 2.9}))
let df5 = DataFrame.FromColumns(
struct("state", box [| "Ohio"; "Ohio"; "Ohio"; "Nevada"; "Nevada" |]),
struct("year", box [| 2000; 2001; 2002; 2001; 2002 |]),
struct("pop", box [| 1.5; 1.7; 3.6; 2.4; 2.9 |]))
Another way to construct a data frame is from a matrix.
This can be done in two ways. The simplest is to call the
ToDataFrame
method on the matrix.
The type of the row keys and the column keys must be specified
as generic type arguments and must match the type
of the existing indexes of the matrix.
var a = Matrix.CreateRandom(100, 5);
a.RowIndex = Index.CreateDateRange(new DateTime(2016, 1, 1), 100);
a.ColumnIndex = Index.Create(new[] { "a", "b", "c", "d", "e" });
var df7 = a.ToDataFrame<DateTime, string>();
Dim a = Matrix.CreateRandom(100, 5)
a.RowIndex = Index.CreateDateRange(New DateTime(2016, 1, 1), 100)
a.ColumnIndex = Index.Create({"a", "b", "c", "d", "e"})
Dim df7 = a.ToDataFrame(Of DateTime, String)()
let a = Matrix.CreateRandom(100, 5)
a.RowIndex <- Index.CreateDateRange(new DateTime(2016, 1, 1), 100)
a.ColumnIndex <- Index.Create([| "a"; "b"; "c"; "d"; "e" |])
let df7 = a.ToDataFrame<DateTime, string>()
You can also supply the row and column indexes as arguments
to an overload of this method. In this case,
the generic type arguments can be inferred:
var b = Matrix.CreateRandom(100, 5);
var rowIndex = Index.CreateDateRange(new DateTime(2016, 1, 1), 100);
var columnIndex = Index.Create(new[] { "a", "b", "c", "d", "e" });
var df8 = b.ToDataFrame(rowIndex, columnIndex);
Dim b = Matrix.CreateRandom(100, 5)
Dim rowIndex = Index.CreateDateRange(New DateTime(2016, 1, 1), 100)
Dim columnIndex = Index.Create({"a", "b", "c", "d", "e"})
Dim df8 = b.ToDataFrame(rowIndex, columnIndex)
let b = Matrix.CreateRandom(100, 5)
let rowIndex = Index.CreateDateRange(new DateTime(2016, 1, 1), 100)
let columnIndex = Index.Create([|"a"; "b"; "c"; "d"; "e" |])
let df8 = b.ToDataFrame(rowIndex, columnIndex)
Alternatively, you can call the
FromMatrix method.
This method takes three arguments: the matrix, the row index
and the column index. If the row or column index is
null, the corresponding index from the matrix is used.
If the matrix does not have an index of the right type,
an InvalidOperationException is thrown.
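The fallback rule for a null index can be summarized in a few lines. The sketch below is plain C# and does not call the library: ResolveIndex is a hypothetical helper, and ordinary arrays stand in for the index objects attached to a matrix.

```csharp
using System;
using System.Collections.Generic;

// Stand-ins for the indexes attached to a matrix: string keys on the
// columns, and no index at all on the rows.
object attachedColumnIndex = new[] { "a", "b", "c" };
object attachedRowIndex = null;

// An explicitly supplied index always wins.
var explicitKeys = ResolveIndex<string>(new[] { "x", "y", "z" }, attachedColumnIndex);

// A null argument falls back to the attached index of the right type.
var fallbackKeys = ResolveIndex<string>(null, attachedColumnIndex);

// With no attached index of the requested key type, the fallback fails.
var failed = false;
try { ResolveIndex<DateTime>(null, attachedRowIndex); }
catch (InvalidOperationException) { failed = true; }

// prints: x,y,z | a,b,c | failed: True
Console.WriteLine(
    $"{string.Join(",", explicitKeys)} | {string.Join(",", fallbackKeys)} | failed: {failed}");

// Hypothetical helper that mirrors the rule described in the text.
static IReadOnlyList<TKey> ResolveIndex<TKey>(IReadOnlyList<TKey> supplied, object attached)
{
    if (supplied != null) return supplied;
    if (attached is IReadOnlyList<TKey> typed) return typed;
    throw new InvalidOperationException(
        $"No index with keys of type {typeof(TKey).Name} is available.");
}
```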
A data frame can be created from a sequence or list of .NET objects.
The FromObjects
method takes one generic type argument, the type of the objects,
which can usually be inferred. There are two overloads.
The first overload takes one argument: the sequence of objects
of the specified type. This method returns a data frame
with one row for each object in the sequence and one column
for each public property. The column keys correspond to the names
of the properties. The second overload takes a second argument:
a list of the properties that should be included in the data frame.
The order in which the properties are listed is preserved in the data frame.
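The mapping from objects to columns can be illustrated with reflection. This is a minimal sketch in plain C#, not the library's implementation: ToColumns is a hypothetical helper, and the assumption that properties are named by strings is made only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative objects; an anonymous type exposes its members as public properties.
var points = new[]
{
    new { X = 1.0, Y = 2.0, Label = "p" },
    new { X = 3.0, Y = 4.0, Label = "q" }
};

// One column per public property, one value per object.
var allColumns = ToColumns(points);

// Only the listed properties, in the order in which they are listed.
var someColumns = ToColumns(points, "Label", "X");

// prints: Label, X
Console.WriteLine(string.Join(", ", someColumns.Select(c => c.Key)));

// Hypothetical helper: the property name becomes the column key.
static List<(string Key, object[] Values)> ToColumns<T>(
    IEnumerable<T> items, params string[] propertyNames)
{
    var rows = items.ToList();
    var properties = propertyNames.Length == 0
        ? typeof(T).GetProperties()
        : propertyNames
            .Select(name => typeof(T).GetProperty(name)
                ?? throw new ArgumentException($"No public property named '{name}'."))
            .ToArray();

    var result = new List<(string Key, object[] Values)>();
    foreach (var property in properties)
        result.Add((property.Name, rows.Select(row => property.GetValue(row)).ToArray()));
    return result;
}
```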
Finally, a data frame can be created from a data source like
a DataTable
or a text file. This is discussed in the next section.
Most data starts out in an external data source.
This section outlines how to load data from external data sources
into a data frame and how to save a data frame to an external data source.
Importing and exporting text files
Data frames can be read from a text file using the
ReadCsv method.
This method takes as its first argument the path to the file to be read or a stream
to read from. This constructs a data frame containing the data in the file
with the columns indexed by the headers from the file.
Optionally, you can specify the column that should be used as the row index. In this case
you must also provide the data type of the column as a generic type argument.
The WriteCsv<R, C>
method lets you export a data frame to CSV format. It is defined as an extension method
in the DataFrame class.
The code below illustrates all these methods.
var df2a = df2.WithRowIndex<string,int>("state", "year");
Dim df2a = df2.WithRowIndex(Of String, Integer)("state", "year")
let df2a = df2.WithRowIndex<string,int>("state", "year")
Importing from data tables
A data frame can be created from a
DataTable.
The FromDataTable
method has four overloads, which come in two pairs.
The first argument in all overloads is a
DataTable
that specifies the source of the data.
An optional second argument is a sequence of strings
that contain the names of the columns to retain in the data frame.
These two overloads take no generic type arguments and return a data frame
with the column names as column keys and row numbers as row keys.
The second pair of overloads takes one generic type argument:
the element type of the row keys. The first argument is once again the data table.
The second argument is the name of the column that contains the row index.
The values in this column must be convertible to the type specified
by the generic type argument.
An optional third argument once again specifies the names of the
columns that should be included in the data frame.
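The requirement that the row key column be convertible to the key type can be checked before any data frame is built. The sketch below uses only System.Data: the table contents are illustrative, and the FromDataTable call in the final comment is an unverified guess at the call shape based on the description above.

```csharp
using System;
using System.Data;
using System.Globalization;
using System.Linq;

// A small in-memory table; the "year" column stores its values as strings.
var table = new DataTable("population");
table.Columns.Add("year", typeof(string));
table.Columns.Add("state", typeof(string));
table.Columns.Add("pop", typeof(double));
table.Rows.Add("2000", "Ohio", 1.5);
table.Rows.Add("2001", "Ohio", 1.7);
table.Rows.Add("2002", "Ohio", 3.6);

// Row keys of type int taken from the "year" column. The conversion throws
// if a value cannot be converted to the requested key type.
int[] rowKeys = table.Rows.Cast<DataRow>()
    .Select(row => (int)Convert.ChangeType(
        row["year"], typeof(int), CultureInfo.InvariantCulture))
    .ToArray();

// Names of the columns to retain, as in the optional last argument.
string[] columnsToKeep = { "state", "pop" };

Console.WriteLine(string.Join(", ", rowKeys));

// With the library referenced, the second pair of overloads would presumably
// be called along these lines (assumed shape, not verified):
//   var df = DataFrame.FromDataTable<int>(table, "year", columnsToKeep);
```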
Importing from other file formats
Several more common file formats are supported, either directly
or in separate assemblies. Supported formats (will) include:
R, Stata, SAS, Excel, HDF5.