Text Files

Plain text remains one of the most common formats for data exchange. This section covers delimited text files (including CSV and TSV files), fixed-width text files, and files in Matrix Market format. The classes for reading and writing text files live in the Extreme.Data.Text namespace.

JSON is also a text-based format. It is much more flexible than the other text formats and is covered in its own section.

Options when reading and writing text files

Binary files specify the format of each value exactly. Text files do not. Metadata, such as the data type of columns, is usually not stored, and dates and numbers are written differently depending on the culture.

The TextOptions class defines a number of properties that help ensure that data is read or written correctly. Specialized classes that inherit from TextOptions define additional options for delimited and fixed-width files, respectively.

The FormatProvider property specifies the IFormatProvider that is used in the conversion between text and other data types. The default value is InvariantCulture. The DecimalType property specifies which type should be used to store numbers that contain decimals. The default value is Double, but Decimal and Single are allowed as well.
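
To see why the format provider matters, consider the same kind of text parsed under two conventions. The sketch below is written in Python, not .NET, and does not use the library. The helper parse_number and its separator parameters are hypothetical stand-ins for what an IFormatProvider supplies. The last lines illustrate the trade-off behind the DecimalType choice between a binary floating-point type and a decimal type.

```python
from decimal import Decimal

def parse_number(text, decimal_sep=".", group_sep=",", number_type=float):
    """Parse text using the given separators (hypothetical helper)."""
    # Remove group separators first, then normalize the decimal separator.
    normalized = text.replace(group_sep, "").replace(decimal_sep, ".")
    return number_type(normalized)

# Invariant-style convention: '.' is the decimal separator.
invariant = parse_number("1,234.5")

# German-style convention: ',' is the decimal separator.
german = parse_number("1.234,5", decimal_sep=",", group_sep=".")

# Reading German-style text with the invariant convention does not fail.
# It silently produces a different value.
misread = parse_number("1.234,5")

# A decimal type keeps the written digits exactly; a binary float may not.
exact = parse_number("0.1", number_type=Decimal) + parse_number("0.2", number_type=Decimal)
approx = parse_number("0.1") + parse_number("0.2")
```

The misread case is the reason to set the format provider explicitly whenever a file was produced under a known culture.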

The RowHeaders and ColumnHeaders properties are boolean values that specify whether the file includes row headers or column headers. If ColumnHeaders is true (the default), then the first row in the file is assumed to contain the names of the columns. These names are then used as the column index for the data frame or matrix that is being read. If RowHeaders is true, then the first field in each row is used as a row key. The row keys are then used as the row index of the data frame or matrix.
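
The effect of the two header flags can be illustrated with a small Python sketch that uses only the standard csv module, not the library. The function read_table and its parameters are hypothetical. Dropping the corner cell of the header row when both flags are set is a choice made for this sketch, not documented library behavior.

```python
import csv
import io

text = "id,height,weight\na,1.5,60\nb,1.8,75\n"

def read_table(source, column_headers=True, row_headers=False):
    """Return (column_names, row_keys, rows) from delimited text (illustrative only)."""
    rows = list(csv.reader(io.StringIO(source)))
    # With column headers, the first row holds names rather than data.
    column_names = rows.pop(0) if column_headers else None
    row_keys = None
    if row_headers:
        # The first field of each data row becomes the row key.
        row_keys = [row[0] for row in rows]
        rows = [row[1:] for row in rows]
        if column_names is not None:
            # Assumption: drop the key column's name so names line up with the data.
            column_names = column_names[1:]
    return column_names, row_keys, rows

columns, keys, data = read_table(text, column_headers=True, row_headers=True)
```

Here columns holds the column index, keys holds the row index, and data holds the remaining fields as unparsed strings.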

The StartRow property specifies the zero-based index of the first row that contains data (including the header row, if present). It can be thought of as the number of rows to skip at the beginning of the file. The default is zero.
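
As a rough illustration of the skipping behavior, the Python sketch below drops a fixed number of leading lines before treating the next line as the header. It does not use the library. The function rows_from and the sample lines are made up for the example.

```python
import itertools

lines = [
    "# comment line 1",
    "# comment line 2",
    "name,length",
    "a,1.5",
]

def rows_from(lines, start_row=0):
    """Skip start_row lines; the header row, if any, is the first line kept."""
    return list(itertools.islice(lines, start_row, None))

# Two leading comment lines, so the header row sits at zero-based index 2.
remaining = rows_from(lines, start_row=2)
header, data = remaining[0], remaining[1:]
```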

Finally, the InferenceRows property specifies the number of rows that are used to infer the data type of each column. The default is 100. The FormatProvider is used to attempt to parse up to the specified number of values of each field as an integer, a decimal number or a date. The type that is most common is selected.
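
The inference step described above can be sketched in Python as a majority vote over the first few values of a field. This is a minimal sketch under stated assumptions, not the library's implementation: the names classify and infer_column_type are hypothetical, a fixed ISO date format stands in for the format provider, and the "text" fallback for unparseable values is an assumption.

```python
from collections import Counter
from datetime import datetime

def classify(value):
    """Return the first of integer, decimal, date that parses the value."""
    try:
        int(value)
        return "integer"
    except ValueError:
        pass
    try:
        float(value)
        return "decimal"
    except ValueError:
        pass
    try:
        # Assumption: ISO dates only; a real format provider is more flexible.
        datetime.strptime(value, "%Y-%m-%d")
        return "date"
    except ValueError:
        return "text"

def infer_column_type(values, inference_rows=100):
    """Pick the most common classification among the first inference_rows values."""
    counts = Counter(classify(v) for v in values[:inference_rows])
    if not counts:
        return "text"
    return counts.most_common(1)[0][0]
```

A small inference_rows value makes reading faster but can misclassify a column whose first few values are not representative.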

Options objects are read-only. To change a property, you must create a new object. To make this easier, a method is defined for each property that lets you specify only the property that is being changed. For example, the WithFormatProvider<T>(T, IFormatProvider) method returns a new options object with the same property values as the current one, except for the FormatProvider, which is set to the supplied value.
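
The read-only pattern with one "with" method per property can be sketched in Python using a frozen dataclass. The Options class below is a hypothetical stand-in, not the library's TextOptions, and its three fields are a simplified subset chosen for illustration.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Options:
    """Hypothetical stand-in for a read-only options object."""
    format_provider: str = "invariant"
    column_headers: bool = True
    start_row: int = 0

    def with_format_provider(self, format_provider):
        # Return a copy that differs only in format_provider.
        return replace(self, format_provider=format_provider)

    def with_start_row(self, start_row):
        # Return a copy that differs only in start_row.
        return replace(self, start_row=start_row)

defaults = Options()
# Calls can be chained; each one returns a new object and leaves the old one unchanged.
custom = defaults.with_format_provider("de-DE").with_start_row(2)
```

Because no object is ever modified, a set of options can be shared between readers without one reader's changes affecting another.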

Complex vectors and matrices

There is no standard format for complex numbers. In text-based formats, complex numbers are usually represented by two columns, one column for the real part, and one column for the imaginary part.
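
The two-column representation can be illustrated with a short Python sketch that uses only the standard library, not the library's data streams. The column names "re" and "im" are assumptions made for the example.

```python
import csv
import io

text = "re,im\n1.0,2.0\n0.5,-1.5\n"

# Reading: combine the real and imaginary columns into one complex value per row.
reader = csv.DictReader(io.StringIO(text))
vector = [complex(float(row["re"]), float(row["im"])) for row in reader]

# Writing: split each complex value back into two columns.
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["re", "im"])
for z in vector:
    writer.writerow([z.real, z.imag])
```

A complex matrix follows the same idea, with two text columns for every matrix column.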

Data streams for text-based formats have some additional methods for reading complex vectors and matrices, and some extra overloads to write complex vectors and matrices.