Data frames are column based data structures.
Manipulating columns is an essential operation
when working with data frames.
While columns in a data frame are immutable,
the list of columns itself is not.
Adding and removing columns changes the
existing data frame. Adding and removing rows
always constructs a new data frame.
Adding and removing columns
The AddColumn
method adds a column to the end of a data frame. The method takes two arguments.
The first is the key of the new column. If a column with the same key
already exists in the data frame, an exception is thrown.
The second argument is the actual column, either as a list or a vector.
In the example below, we construct a data frame with a couple of columns
and then add two more columns:
var dates = Index.CreateDateRange(new DateTime(2015, 11, 11), 5, Recurrence.Daily);
var df1 = DataFrame.FromColumns(new Dictionary<string, object>() {
{ "values1", Vector.CreateRandom(5) },
{ "values2", Vector.CreateRandom(5) },
}, dates);
df1.AddColumn("vzlues3", Vector.CreateRandom(5));
df1.AddColumn("values4", Vector.CreateRandom(5));
Dim dates = Index.CreateDateRange(New DateTime(2015, 11, 11), 5, Recurrence.Daily)
Dim df1 = DataFrame.FromColumns(New Dictionary(Of String, Object)() From {
{"values1", Vector.CreateRandom(5)},
{"values2", Vector.CreateRandom(5)}}, dates)
df1.AddColumn("vzlues3", Vector.CreateRandom(5))
df1.AddColumn("values4", Vector.CreateRandom(5))
No code example is currently available or this language may not be supported.
let dates = Index.CreateDateRange(DateTime(2015, 11, 11), 5, Recurrence.Daily)
let df1 = DataFrame.FromColumns(ofDict
[
"values1" => Vector.CreateRandom(5)
"values2" => Vector.CreateRandom(5)
], dates)
df1.AddColumn("vzlues3", Vector.CreateRandom(5)) |> ignore
df1.AddColumn("values4", Vector.CreateRandom(5)) |> ignore
The exact behavior depends on whether the column vector has itself an index.
If it does, and the element type is the same as the data frame's row index,
then the values in the vector are inserted in the row with the matching index.
This corresponds to a left join. Otherwise, the column is added
as is.
The RenameColumn
method allows you to change the key of an existing column or set of columns.
This method has three overloads, which all take two arguments.
In the first overload, the first argument is the original key,
and the second argument is the new key. In the second overload,
the first argument is a sequence of keys to replace, and the second overload
is a sequence of keys to replace them with. In the third overload,
the first argument is a predicate that returns whether a key should be
replaced. The second argument is a function that transforms the current key
into a new key.
In the example below, we first change the key of one column. We then
replace all keys that start with a misspelled word, and replace it with
the correct word:
df1.RenameColumn("values4", "vzlues5");
df1.RenameColumns(s => s.StartsWith("vzlues"), s => "values" + s.Substring(6));
df1.RenameColumn("values4", "vzlues5")
df1.RenameColumns(
Function(s) s.StartsWith("vzlues"),
Function(s) "values" + s.Substring(6))
No code example is currently available or this language may not be supported.
df1.RenameColumn("values4", "vzlues5") |> ignore
df1.RenameColumns(
(fun s -> s.StartsWith("vzlues")),
(fun s -> "values" + s.Substring(6))) |> ignore
The RemoveColumn
method removes a column from the data frame. The method takes as its only
argument the key that should be removed.
The RemoveColumnAt
method removes the column at the specified position.
Finally, theRemoveColumnsWithMissingValues
method removes any columns that contain missing values.
In the next example, we remove two of the columns, one by key and one
by position:
df1.RemoveColumn("values5");
df1.RemoveColumnAt(2);
df1.RemoveColumn("values5")
df1.RemoveColumnAt(2)
No code example is currently available or this language may not be supported.
df1.RemoveColumn("values5") |> ignore
df1.RemoveColumnAt(2) |> ignore
New columns are often derived from existing columns. Several methods
make this process more streamlined. The methods exist in four variations:
one where a new vector is returned without changing the data frame,
one where the column is replaced, one where the column is inserted after
the column it is derived from, and one where the new column is appended
as the last column in the data frame. Each variation can be applied to
individual columns or several columns at once.
The MapAndReplace
method replaces a column with the result of applying a function to it.
This method has three arguments. The first is the key of the column
that is to be replaced. The second is a delegate that represents the mapping.
The third argument is the key for the replacement column.
The method replaces the column in the current data frame
with the transformed version. The element type of the source column
must be supplied as a generic type argument if it cannot be inferred
from the transformation function.
The MapAndAppend
method transforms the column, but leaves the original column and appends the result
as the last column in the data frame. This method
takes the same three arguments as the previous method in the same order.
The MapAndInsertAfter
method again transforms the column, but instead of replacing the original,
it inserts the result after the original column. This method
takes the same three arguments as the two previous methods in the same order.
In the example below, we use each of the three methods on columns
in the data frame defined earlier:
df1.MapAndAppend<double>("values1", x => Vector.Cos(x), "cosValues1");
df1.MapAndInsertAfter<double>("values1", x => Vector.Sin(x), "sinValues1");
df1.MapAndReplace<double>("values2", x => Vector.Exp(x), "expValues2");
df1.MapAndAppend(Of Double)("values1", Function(x) Vector.Cos(x), "cosValues1")
df1.MapAndInsertAfter(Of Double)("values1", Function(x) Vector.Sin(x), "sinValues1")
df1.MapAndReplace(Of Double)("values2", Function(x) Vector.Exp(x), "expValues2")
No code example is currently available or this language may not be supported.
df1.MapAndAppend<float>("values1", (fun x -> Vector.Cos(x) :> IVector), "cosValues1") |> ignore
df1.MapAndInsertAfter<float>("values1", (fun x -> Vector.Sin(x) :> IVector), "sinValues1") |> ignore
df1.MapAndReplace<float>("values2", (fun x -> Vector.Exp(x) :> IVector), "expValues2") |> ignore
The same three methods discussed in the previous section have
two overloads that operate on several columns at once.
Both overloads take three arguments. The first is a list of the
keys of the columns that should be transformed. The second argument
is a delegate that represents the transformation.
The third argument can be either a list of new column keys,
or a delegate that maps old column keys to new column keys.
In the example below, we illustrate both methods of specifying
the keys of the new columns:
var columns = new[] { "values1", "values2" };
var negColumns = new[] { "-values1", "-values2" };
df1.MapAndAppend<double>(columns, x => -x, negColumns);
df1.MapAndInsertAfter<double>(columns, x => 2.0 * x, s => "2*" + s);
Dim columns = {"values1", "values2"}
Dim negColumns = {"-values1", "-values2"}
df1.MapAndAppend(Of Double)(columns, Function(x) -x, negColumns)
df1.MapAndInsertAfter(Of Double)(columns, Function(x) 2.0 * x, Function(s) "2*" + s)
No code example is currently available or this language may not be supported.
let columns = [| "values1"; "values2" |]
let negColumns = [| "-values1"; "-values2" |]
df1.MapAndAppend<float>(columns, (fun x -> (-x) :> IVector), negColumns) |> ignore
df1.MapAndInsertAfter<float>(columns,
(fun (x : Vector<float>) -> (2.0 * x) :> IVector),
(fun s -> "2*" + s)) |> ignore
In addition, the
Map
method applies a function to multiple columns and returns the result
as a new data frame. It takes one or three arguments.
In the three argument overload, the arguments have the same meaning
as before. If only one argument is supplied, it is the transformation
function. The element type of the source columns must be provided
as a generic type argument. It applies the function to every column
of the specified type and returns the result in a new data frame.
Since data frames are column oriented data structures, operations that operate
on columns directly are much more efficient than operations that operate on
rows. The Rows
property is made available for the sake of convenience,
but should be avoided in most instances. In particular,
it is very inefficient to iterate through the rows to perform a calculation
on some variables in each row. Nearly all such
operations can be performed much more efficiently by operating directly on the columns.
For example, let's say we have a DataFrameR, C
with three numerical columns: value1,
value2 and value3.
We want to calculate a fourth column that contains the average
of each of the three variables for each observation. A naive implementation would
be as follows:
var avg = Vector.Create<double>(df1.RowCount);
int i = 0;
foreach (var row in df1.Rows)
{
avg[i] = (row.Get<double>("value1")
+ row.Get<double>("value2")
+ row.Get<double>("value3")) / 3;
i++;
}
Dim avg = Vector.Create(Of Double)(df1.RowCount)
Dim i = 0
For Each row In df1.Rows
avg(i) = (row.Get(Of Double)("value1") _
+ row.Get(Of Double)("value2") _
+ row.Get(Of Double)("value3")) / 3
i = i + 1
Next
End Sub
No code example is currently available or this language may not be supported.
let avg = Vector.Create<float>(df1.RowCount)
let mutable i = 0
for row in df1.Rows do
avg.[i] <- (row?value1 + row?value2 + row?value3) / 3.0
i <- i + 1
However, the most compact and efficient expression is simply:
var avg = (df1.GetColumn("value1")
+ df1.GetColumn("value1")
+ df1.GetColumn("value3")) / 3.0;
df1.AddColumn("Average", avg);
Dim avg = (df1.GetColumn("value1") _
+ df1.GetColumn("value1") _
+ df1.GetColumn("value3")) / 3.0
df1.AddColumn("Average", avg)
No code example is currently available or this language may not be supported.
let avg = (df1?value1 + df1?value1 + df1?value3) / 3.0
df1.AddColumn("Average", avg) |> ignore