Extreme Optimization™: Complexity made simple.

Math and Statistics
Libraries for .NET

  • Home
  • Features
    • Math Library
    • Vector and Matrix Library
    • Statistics Library
    • Performance
    • Usability
  • Documentation
    • Introduction
    • Math Library User's Guide
    • Vector and Matrix Library User's Guide
    • Data Analysis Library User's Guide
    • Statistics Library User's Guide
    • Reference
  • Resources
    • Downloads
    • QuickStart Samples
    • Sample Applications
    • Frequently Asked Questions
    • Technical Support
  • Blog
  • Order
  • Company
    • About us
    • Testimonials
    • Customers
    • Press Releases
    • Careers
    • Partners
    • Contact us
Introduction
Deployment Guide
Nuget packages
Configuration
Using Parallelism
Expand Mathematics Library User's GuideMathematics Library User's Guide
Expand Vector and Matrix Library User's GuideVector and Matrix Library User's Guide
Expand Data Analysis Library User's GuideData Analysis Library User's Guide
Expand Statistics Library User's GuideStatistics Library User's Guide
Expand Data Access Library User's GuideData Access Library User's Guide
Expand ReferenceReference
  • Extreme Optimization
    • Features
    • Solutions
    • Documentation
    • QuickStart Samples
    • Sample Applications
    • Downloads
    • Technical Support
    • Download trial
    • How to buy
    • Blog
    • Company
    • Resources
  • Documentation
    • Introduction
    • Deployment Guide
    • Nuget packages
    • Configuration
    • Using Parallelism
    • Mathematics Library User's Guide
    • Vector and Matrix Library User's Guide
    • Data Analysis Library User's Guide
    • Statistics Library User's Guide
    • Data Access Library User's Guide
    • Reference
  • Data Analysis Library User's Guide
    • Indexes
    • Data Frames
    • Data wrangling
    • Grouping and Aggregation
    • Working with Categorical Data
    • Working with Time Series Data
  • Data wrangling
    • Joining and reshaping data frames
    • Transforming columns
    • Sorting And Filtering
  • Transforming columns

Transforming columns

Extreme Optimization Numerical Libraries for .NET Professional

Data frames are column based data structures. Manipulating columns is an essential operation when working with data frames.

While columns in a data frame are immutable, the list of columns itself is not. Adding and removing columns changes the existing data frame. Adding and removing rows always constructs a new data frame.

Adding and removing columns

The AddColumn method adds a column to the end of a data frame. The method takes two arguments. The first is the key of the new column. If a column with the same key already exists in the data frame, an exception is thrown. The second argument is the actual column, either as a list or a vector. In the example below, we construct a data frame with a couple of columns and then add two more columns:

C#
VB
C++
F#
Copy
var dates = Index.CreateDateRange(new DateTime(2015, 11, 11), 5, Recurrence.Daily);
var df1 = DataFrame.FromColumns(new Dictionary<string, object>() {
        { "values1", Vector.CreateRandom(5) },
        { "values2", Vector.CreateRandom(5) },
        }, dates);

df1.AddColumn("vzlues3", Vector.CreateRandom(5));
df1.AddColumn("values4", Vector.CreateRandom(5));
Dim dates = Index.CreateDateRange(New DateTime(2015, 11, 11), 5, Recurrence.Daily)
Dim df1 = DataFrame.FromColumns(New Dictionary(Of String, Object)() From {
    {"values1", Vector.CreateRandom(5)},
    {"values2", Vector.CreateRandom(5)}}, dates)
df1.AddColumn("vzlues3", Vector.CreateRandom(5))
df1.AddColumn("values4", Vector.CreateRandom(5))

No code example is currently available or this language may not be supported.

let dates = Index.CreateDateRange(DateTime(2015, 11, 11), 5, Recurrence.Daily)
let df1 = DataFrame.FromColumns(ofDict
            [
             "values1" => Vector.CreateRandom(5)
             "values2" => Vector.CreateRandom(5)
            ], dates)
df1.AddColumn("vzlues3", Vector.CreateRandom(5)) |> ignore
df1.AddColumn("values4", Vector.CreateRandom(5)) |> ignore

The exact behavior depends on whether the column vector has itself an index. If it does, and the element type is the same as the data frame's row index, then the values in the vector are inserted in the row with the matching index. This corresponds to a left join. Otherwise, the column is added as is.

The RenameColumn method allows you to change the key of an existing column or set of columns. This method has three overloads, which all take two arguments. In the first overload, the first argument is the original key, and the second argument is the new key. In the second overload, the first argument is a sequence of keys to replace, and the second overload is a sequence of keys to replace them with. In the third overload, the first argument is a predicate that returns whether a key should be replaced. The second argument is a function that transforms the current key into a new key.

In the example below, we first change the key of one column. We then replace all keys that start with a misspelled word, and replace it with the correct word:

C#
VB
C++
F#
Copy
df1.RenameColumn("values4", "vzlues5");
df1.RenameColumns(s => s.StartsWith("vzlues"), s => "values" + s.Substring(6));
df1.RenameColumn("values4", "vzlues5")
df1.RenameColumns(
    Function(s) s.StartsWith("vzlues"),
    Function(s) "values" + s.Substring(6))

No code example is currently available or this language may not be supported.

df1.RenameColumn("values4", "vzlues5") |> ignore
df1.RenameColumns(
    (fun s -> s.StartsWith("vzlues")), 
    (fun s -> "values" + s.Substring(6))) |> ignore

The RemoveColumn method removes a column from the data frame. The method takes as its only argument the key that should be removed. The RemoveColumnAt method removes the column at the specified position. Finally, theRemoveColumnsWithMissingValues method removes any columns that contain missing values. In the next example, we remove two of the columns, one by key and one by position:

C#
VB
C++
F#
Copy
df1.RemoveColumn("values5");
df1.RemoveColumnAt(2);
df1.RemoveColumn("values5")
df1.RemoveColumnAt(2)

No code example is currently available or this language may not be supported.

df1.RemoveColumn("values5") |> ignore
df1.RemoveColumnAt(2) |> ignore
Transforming columns

New columns are often derived from existing columns. Several methods make this process more streamlined. The methods exist in four variations: one where a new vector is returned without changing the data frame, one where the column is replaced, one where the column is inserted after the column it is derived from, and one where the new column is appended as the last column in the data frame. Each variation can be applied to individual columns or several columns at once.

Mapping single columns

The MapAndReplace method replaces a column with the result of applying a function to it. This method has three arguments. The first is the key of the column that is to be replaced. The second is a delegate that represents the mapping. The third argument is the key for the replacement column. The method replaces the column in the current data frame with the transformed version. The element type of the source column must be supplied as a generic type argument if it cannot be inferred from the transformation function.

The MapAndAppend method transforms the column, but leaves the original column and appends the result as the last column in the data frame. This method takes the same three arguments as the previous method in the same order.

The MapAndInsertAfter method again transforms the column, but instead of replacing the original, it inserts the result after the original column. This method takes the same three arguments as the two previous methods in the same order.

In the example below, we use each of the three methods on columns in the data frame defined earlier:

C#
VB
C++
F#
Copy
df1.MapAndAppend<double>("values1", x => Vector.Cos(x), "cosValues1");
df1.MapAndInsertAfter<double>("values1", x => Vector.Sin(x), "sinValues1");
df1.MapAndReplace<double>("values2", x => Vector.Exp(x), "expValues2");
df1.MapAndAppend(Of Double)("values1", Function(x) Vector.Cos(x), "cosValues1")
df1.MapAndInsertAfter(Of Double)("values1", Function(x) Vector.Sin(x), "sinValues1")
df1.MapAndReplace(Of Double)("values2", Function(x) Vector.Exp(x), "expValues2")

No code example is currently available or this language may not be supported.

df1.MapAndAppend<float>("values1", (fun x -> Vector.Cos(x) :> IVector), "cosValues1") |> ignore
df1.MapAndInsertAfter<float>("values1", (fun x -> Vector.Sin(x) :> IVector), "sinValues1") |> ignore
df1.MapAndReplace<float>("values2", (fun x -> Vector.Exp(x) :> IVector), "expValues2") |> ignore
Mapping multiple columns

The same three methods discussed in the previous section have two overloads that operate on several columns at once. Both overloads take three arguments. The first is a list of the keys of the columns that should be transformed. The second argument is a delegate that represents the transformation. The third argument can be either a list of new column keys, or a delegate that maps old column keys to new column keys. In the example below, we illustrate both methods of specifying the keys of the new columns:

C#
VB
C++
F#
Copy
var columns = new[] { "values1", "values2" };
var negColumns = new[] { "-values1", "-values2" };
df1.MapAndAppend<double>(columns, x => -x, negColumns);
df1.MapAndInsertAfter<double>(columns, x => 2.0 * x, s => "2*" + s);
Dim columns = {"values1", "values2"}
Dim negColumns = {"-values1", "-values2"}
df1.MapAndAppend(Of Double)(columns, Function(x) -x, negColumns)
df1.MapAndInsertAfter(Of Double)(columns, Function(x) 2.0 * x, Function(s) "2*" + s)

No code example is currently available or this language may not be supported.

let columns = [| "values1"; "values2" |]
let negColumns = [| "-values1"; "-values2" |]
df1.MapAndAppend<float>(columns, (fun x -> (-x) :> IVector), negColumns) |> ignore
df1.MapAndInsertAfter<float>(columns, 
        (fun (x : Vector<float>) -> (2.0 * x) :> IVector), 
        (fun s -> "2*" + s)) |> ignore

In addition, the Map method applies a function to multiple columns and returns the result as a new data frame. It takes one or three arguments. In the three argument overload, the arguments have the same meaning as before. If only one argument is supplied, it is the transformation function. The element type of the source columns must be provided as a generic type argument. It applies the function to every column of the specified type and returns the result in a new data frame.

A note on performance

Since data frames are column oriented data structures, operations that operate on columns directly are much more efficient than operations that operate on rows. The Rows property is made available for the sake of convenience, but should be avoided in most instances. In particular, it is very inefficient to iterate through the rows to perform a calculation on some variables in each row. Nearly all such operations can be performed much more efficiently by operating directly on the columns.

For example, let's say we have a DataFrameR, C with three numerical columns: value1, value2 and value3. We want to calculate a fourth column that contains the average of each of the three variables for each observation. A naive implementation would be as follows:

C#
VB
C++
F#
Copy
var avg = Vector.Create<double>(df1.RowCount);
int i = 0;
foreach (var row in df1.Rows)
{
    avg[i] = (row.Get<double>("value1")
            + row.Get<double>("value2")
            + row.Get<double>("value3")) / 3;
    i++;
}
    Dim avg = Vector.Create(Of Double)(df1.RowCount)
    Dim i = 0
    For Each row In df1.Rows
        avg(i) = (row.Get(Of Double)("value1") _
                + row.Get(Of Double)("value2") _
                + row.Get(Of Double)("value3")) / 3
        i = i + 1
    Next
End Sub

No code example is currently available or this language may not be supported.

let avg = Vector.Create<float>(df1.RowCount)
let mutable i = 0
for row in df1.Rows do
    avg.[i] <- (row?value1 + row?value2 + row?value3) / 3.0
    i <- i + 1

However, the most compact and efficient expression is simply:

C#
VB
C++
F#
Copy
var avg = (df1.GetColumn("value1")
         + df1.GetColumn("value1")
         + df1.GetColumn("value3")) / 3.0;
df1.AddColumn("Average", avg);
Dim avg = (df1.GetColumn("value1") _
         + df1.GetColumn("value1") _
         + df1.GetColumn("value3")) / 3.0
df1.AddColumn("Average", avg)

No code example is currently available or this language may not be supported.

let avg = (df1?value1 + df1?value1 + df1?value3) / 3.0
df1.AddColumn("Average", avg) |> ignore

Copyright (c) 2004-2021 ExoAnalytics Inc.

Send comments on this topic to support@extremeoptimization.com

Copyright © 2004-2021, Extreme Optimization. All rights reserved.
Extreme Optimization, Complexity made simple, M#, and M Sharp are trademarks of ExoAnalytics Inc.
Microsoft, Visual C#, Visual Basic, Visual Studio, Visual Studio.NET, and the Optimized for Visual Studio logo
are registered trademarks of Microsoft Corporation.