Plain text remains one of the most common formats for data exchange.
This section covers delimited text files (including CSV and TSV files),
fixed-width text files, and files in Matrix Market format.
The classes that read and write text files live in the
Extreme.Data.Text
namespace.
JSON is also a text-based format. It is much more flexible than
the other text formats and is covered in its own section.
Options when reading and writing text files
In binary files, the format of each value is specified exactly.
Text files offer no such guarantee. Metadata, such as the data type
of columns, is usually not stored, and dates and numbers are written differently
depending on the culture.
The TextOptions
class defines a number of properties that help ensure that data is read
or written correctly. Specialized classes that inherit from
TextOptions
define additional options for delimited and fixed-width files, respectively.
The FormatProvider
property specifies the IFormatProvider
that is used in the conversion between text and other data types.
The default value is InvariantCulture.
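To see why the format provider matters, consider how the same characters can denote different numbers under different conventions. The following is a minimal, language-neutral sketch in Python; it does not use the library's API, and the helper `parse_number` and its parameters are hypothetical names chosen for illustration.

```python
def parse_number(text, decimal_sep=".", group_sep=","):
    """Parse a number written with the given decimal and grouping separators."""
    # Remove grouping separators first, then normalize the decimal separator.
    cleaned = text.replace(group_sep, "").replace(decimal_sep, ".")
    return float(cleaned)

# Invariant-style convention: "." is the decimal separator.
invariant_value = parse_number("1.234")

# German-style convention: "," is the decimal separator, "." groups thousands.
german_value = parse_number("1.234", decimal_sep=",", group_sep=".")

print(invariant_value)  # 1.234
print(german_value)     # 1234.0
```

The two results differ by a factor of a thousand, which is why a file should be read with the same conventions that were used to write it.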
The DecimalType
property specifies which type should be used to store numbers that contain decimals.
The default value is Double,
but Decimal and
Single are allowed as well.
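The choice of storage type affects how faithfully a decimal value from the file is represented. The sketch below uses Python analogues only as an illustration: `float` stands in for Double, a 32-bit round trip through `struct` stands in for Single, and `decimal.Decimal` stands in for Decimal. Python's `Decimal` is not identical to the .NET type, so treat this as an approximation of the trade-off rather than a description of the library's behavior.

```python
from decimal import Decimal
import struct

text = "0.1"

as_double = float(text)  # 64-bit binary floating point
# Round the value to 32 bits and back to show the precision lost in a single.
as_single = struct.unpack("f", struct.pack("f", as_double))[0]
as_decimal = Decimal(text)  # exact base-10 value

# Binary floating point cannot represent 0.1, 0.2 and 0.3 exactly.
double_sum_exact = float("0.1") + float("0.2") == float("0.3")
# Base-10 arithmetic keeps these values exact.
decimal_sum_exact = Decimal("0.1") + Decimal("0.2") == Decimal("0.3")

print(double_sum_exact)   # False
print(decimal_sum_exact)  # True
```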
The RowHeaders
and ColumnHeaders
properties are boolean values that specify whether the file includes row headers
or column headers. If ColumnHeaders
is true (the default), then the first row in the file is assumed to contain
the names of the columns. These names are then used as the column index
for the data frame or matrix that is being read.
If RowHeaders
is true, then the first field in each row is used
as a row key. The row keys are then used as the row index of the data frame or matrix.
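The effect of the two header options can be sketched as follows. This is a Python illustration of the idea, not the library's implementation; the function `read_table` is hypothetical, and it assumes that when both options are on, the first cell of the header row merely labels the key column and is dropped.

```python
import csv
import io

def read_table(text, column_headers=True, row_headers=False):
    """Split delimited text into column names, row keys and data cells."""
    rows = list(csv.reader(io.StringIO(text)))
    column_names = None
    if column_headers:
        # The first row holds the column names.
        column_names, rows = rows[0], rows[1:]
    row_keys = None
    if row_headers:
        # The first field of every remaining row is the row key.
        row_keys = [row[0] for row in rows]
        rows = [row[1:] for row in rows]
        if column_names is not None:
            # Assumption: drop the label that sits above the key column.
            column_names = column_names[1:]
    return column_names, row_keys, rows

sample = "id,height,weight\na,1.80,75\nb,1.65,60\n"
columns, keys, cells = read_table(sample, column_headers=True, row_headers=True)
print(columns)  # ['height', 'weight']
print(keys)     # ['a', 'b']
print(cells)    # [['1.80', '75'], ['1.65', '60']]
```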
The StartRow
property specifies the zero-based index of the first row that contains data
(including the header row, if present). It can be thought of as the number of rows
to skip at the beginning of the file. The default is zero.
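As a small illustration of this counting, suppose a file starts with two lines of preamble before the header row. In the Python sketch below (sample data and variable names are made up), a start row of 2 skips the preamble, so the header row is the first row that is read.

```python
lines = [
    "# comment line",
    "# another comment line",
    "x,y",
    "1,2",
    "3,4",
]

start_row = 2  # zero-based index of the first row to read
remaining = lines[start_row:]

# With column headers enabled, the first remaining row is the header.
header, data = remaining[0], remaining[1:]
print(header)  # x,y
print(data)    # ['1,2', '3,4']
```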
Finally, the InferenceRows
property specifies the number of rows that are used to attempt to infer the data type
of each column. The default is 100. The FormatProvider
is used to attempt to parse up to the specified number of values of each field as an integer,
a decimal number or a date. The type that is most common is selected.
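The majority-vote idea behind this inference can be sketched in a few lines of Python. This is not the library's algorithm: the functions `classify` and `infer_type` are hypothetical, parsing uses Python's own rules instead of a format provider, only ISO-style dates are recognized, and falling back to "string" is an assumption.

```python
from collections import Counter
from datetime import datetime

def classify(text):
    """Return the narrowest type that the text parses as."""
    try:
        int(text)
        return "integer"
    except ValueError:
        pass
    try:
        float(text)
        return "decimal"
    except ValueError:
        pass
    try:
        datetime.strptime(text, "%Y-%m-%d")  # assumption: ISO dates only
        return "date"
    except ValueError:
        return "string"

def infer_type(values, inference_rows=100):
    """Pick the most common type among the first inference_rows values."""
    sample = values[:inference_rows]
    if not sample:
        return "string"
    counts = Counter(classify(value) for value in sample)
    return counts.most_common(1)[0][0]

print(infer_type(["1.5", "2", "3.25"]))                # decimal
print(infer_type(["2020-01-01", "2020-02-01", "7"]))   # date
print(infer_type(["n/a", "1", "2", "3"], inference_rows=1))  # string
```

The last call shows the cost of a small sample: with only one row inspected, a stray first value decides the type.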
Options objects are read-only. To change a property, a new object needs to be created.
To make this easier, a number of methods have been defined, one for each property,
that allow you to specify only the property that is being changed. For example,
the WithFormatProvider<T>(T, IFormatProvider)
method returns a new options object with the same property values as the current one,
except for the FormatProvider
which is set to the supplied value.
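The pattern of read-only options with one copy-and-change method per property can be sketched as below. The Python class `Options` and its method names are hypothetical stand-ins, not the library's types; only the default values for the format provider, column headers, start row and inference rows are taken from the text above.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class Options:
    format_provider: str = "invariant"
    column_headers: bool = True
    start_row: int = 0
    inference_rows: int = 100

    def with_format_provider(self, format_provider):
        """Return a copy that differs only in format_provider."""
        return replace(self, format_provider=format_provider)

    def with_start_row(self, start_row):
        """Return a copy that differs only in start_row."""
        return replace(self, start_row=start_row)

defaults = Options()
custom = defaults.with_start_row(2).with_format_provider("de-DE")

print(defaults.start_row)      # 0
print(custom.start_row)        # 2
print(custom.inference_rows)   # 100
```

Because every call returns a new object, calls can be chained, and an options object that is shared between readers can never change underneath them.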
Complex vectors and matrices
There is no standard format for complex numbers. In text-based formats, complex
numbers are usually represented by two columns, one column for the real part,
and one column for the imaginary part.
Data streams for text-based formats have some additional methods for reading
complex vectors and matrices, and some extra overloads to write complex vectors
and matrices.
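The two-column representation of a complex vector can be illustrated with a short Python sketch. The functions `read_complex_vector` and `write_complex_vector` are hypothetical and are not the library's methods; the sketch assumes comma-delimited text with the real part in the first column and the imaginary part in the second.

```python
import csv
import io

def read_complex_vector(text):
    """Combine the real and imaginary columns into complex values."""
    reader = csv.reader(io.StringIO(text))
    return [complex(float(real), float(imag)) for real, imag in reader]

def write_complex_vector(values):
    """Write each complex value as a (real, imaginary) pair of fields."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for z in values:
        writer.writerow([z.real, z.imag])
    return buffer.getvalue()

text = "1.0,2.0\n-0.5,0.25\n3.0,0.0\n"
vector = read_complex_vector(text)
print(vector)  # [(1+2j), (-0.5+0.25j), (3+0j)]
```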