CSV.jl Documentation

High-level interface

# CSV.read — Function.

CSV.read(fullpath::Union{AbstractString,IO}, sink::Type{T}=DataFrame, args...; kwargs...) => typeof(sink)
CSV.read(fullpath::Union{AbstractString,IO}, sink::Data.Sink; kwargs...) => Data.Sink

parses a delimited file into a Julia structure (a DataFrame by default, but any Data.Sink may be given).

Positional arguments:

  • fullpath; can be a file name (string) or an IO instance
  • sink::Type{T}; DataFrame by default, but may also be another Data.Sink type that supports streaming via the Data.Field interface; note that the sink argument can be the type of a Data.Sink, plus any required arguments the sink may need (args...), or an already-constructed sink may be passed (2nd method above)

Keyword Arguments:

  • delim::Union{Char,UInt8}; a single character or ascii-compatible byte that indicates how fields in the file are delimited; default is UInt8(',')
  • quotechar::Union{Char,UInt8}; the character that indicates a quoted field that may contain the delim or newlines; default is UInt8('"')
  • escapechar::Union{Char,UInt8}; the character that escapes a quotechar in a quoted field; default is UInt8('\')
  • null::String; an ascii string that indicates how NULL values are represented in the dataset; default is the empty string, ""
  • header; column names can be provided manually as a complete Vector{String}, or as an Int/Range which indicates the row/rows that contain the column names
  • datarow::Int; specifies the row on which the actual data starts in the file; by default, the data is expected on the next row after the header row(s); for a file without column names (header), specify datarow=1
  • types; column types can be provided manually as a complete Vector{DataType}, or in a Dict to reference individual columns by name or number
  • nullable::Bool; indicates whether values can be nullable or not; true by default. If set to false and missing values are encountered, a NullException will be thrown
  • dateformat::Union{AbstractString,Dates.DateFormat}; how all dates/datetimes in the dataset are formatted
  • footerskip::Int; indicates the number of rows to skip at the end of the file
  • rows_for_type_detect::Int=100; indicates how many rows should be read to infer the types of columns
  • rows::Int; indicates the total number of rows to read from the file; by default the file is pre-parsed to count the # of rows; -1 can be passed to skip a full-file scan, but the Data.Sink must be set up to account for a potentially unknown # of rows
  • use_mmap::Bool=true; whether the underlying file will be mmapped or not while parsing

Note that by default, "string" or text columns will be parsed as the WeakRefString type. This is a custom type that only stores a pointer to the actual byte data plus the number of bytes. To convert a WeakRefString to a standard Julia String, just call string(::WeakRefString); this also works on an entire column. Oftentimes, however, it can be convenient to work with WeakRefStrings depending on the ultimate use, such as transferring the data directly to another system and avoiding all the intermediate copying.
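As a sketch of the conversion (the sample file and its columns are hypothetical, created inline for illustration), a WeakRefString column can be converted to standard Strings by broadcasting string() over it:

```julia
using CSV, DataFrames

# create a small hypothetical sample file for the sketch
open("data.csv", "w") do f
    write(f, "id,name\n1,alice\n2,bob\n")
end

df = CSV.read("data.csv")

# convert the text column from WeakRefString to standard String
df[:name] = string.(df[:name])
```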

Example usage:

julia> dt = CSV.read("bids.csv")
7656334×9 DataFrames.DataFrame
│ Row     │ bid_id  │ bidder_id                               │ auction │ merchandise      │ device      │
├─────────┼─────────┼─────────────────────────────────────────┼─────────┼──────────────────┼─────────────┤
│ 1       │ 0       │ "8dac2b259fd1c6d1120e519fb1ac14fbqvax8" │ "ewmzr" │ "jewelry"        │ "phone0"    │
│ 2       │ 1       │ "668d393e858e8126275433046bbd35c6tywop" │ "aeqok" │ "furniture"      │ "phone1"    │
│ 3       │ 2       │ "aa5f360084278b35d746fa6af3a7a1a5ra3xe" │ "wa00e" │ "home goods"     │ "phone2"    │
...


# CSV.write — Function.

CSV.write(fullpath::Union{AbstractString,IO}, source::Type{T}, args...; kwargs...) => CSV.Sink
CSV.write(fullpath::Union{AbstractString,IO}, source::Data.Source; kwargs...) => CSV.Sink

write a Data.Source out to a CSV.Sink.

Positional Arguments:

  • fullpath; can be a file name (string) or an IO instance
  • source; can be the type of a Data.Source, plus any required args..., or an already-constructed Data.Source can be passed in directly (2nd method)

Keyword Arguments:

  • delim::Union{Char,UInt8}; how fields in the file will be delimited; default is UInt8(',')
  • quotechar::Union{Char,UInt8}; the character that indicates a quoted field that may contain the delim or newlines; default is UInt8('"')
  • escapechar::Union{Char,UInt8}; the character that escapes a quotechar in a quoted field; default is UInt8('\')
  • null::String; the ascii string that indicates how NULL values will be represented in the dataset; default is the empty string ""
  • dateformat; how dates/datetimes will be represented in the dataset; default is ISO-8601 yyyy-mm-ddTHH:MM:SS.s
  • header::Bool; whether to write out the column names from source
  • append::Bool; start writing data at the end of io; by default, io will be reset to its beginning before writing
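A minimal sketch of these keyword arguments (the file name out.csv and the DataFrame contents are illustrative):

```julia
using CSV, DataFrames

df = DataFrame(a = [1, 2, 3], b = ["x", "y", "z"])

# write with an explicit delimiter and null representation
CSV.write("out.csv", df; delim = ',', null = "NA")

# append more rows to the same file without repeating the header
CSV.write("out.csv", df; append = true, header = false)
```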


Lower-level utilities

# CSV.Source — Type.

constructs a CSV.Source file ready to start parsing data from

implements the Data.Source interface for providing convenient Data.stream! methods for various Data.Sink types


# CSV.Sink — Type.

constructs a CSV.Sink file ready to start writing data to

implements the Data.Sink interface for providing convenient Data.stream! methods for various Data.Source types


# CSV.Options — Type.

Represents the various configuration settings for csv file parsing.

Keyword Arguments:

  • delim::Union{Char,UInt8} = how fields in the file are delimited
  • quotechar::Union{Char,UInt8} = the character that indicates a quoted field that may contain the delim or newlines
  • escapechar::Union{Char,UInt8} = the character that escapes a quotechar in a quoted field
  • null::String = indicates how NULL values are represented in the dataset
  • dateformat::Union{AbstractString,Dates.DateFormat} = how dates/datetimes are represented in the dataset


# CSV.parsefield — Function.

CSV.parsefield{T}(io::IO, ::Type{T}, opt::CSV.Options=CSV.Options(), row=0, col=0) => Nullable{T}

io is an IO type that is positioned at the first byte/character of a delimited-file field (i.e. a single cell). Leading whitespace is ignored for Integer and Float types. Returns a Nullable{T} indicating whether the field contains a null value (empty field, missing value); a field is null if the next delimiter or newline is encountered before any other characters. Specialized methods exist for Integer, Float, String, Date, and DateTime; for other types T, a generic fallback requires parse(T, str::String) to be defined. The field value may also be wrapped in opt.quotechar; two consecutive opt.quotechar characters result in a null field. opt.null is also checked if a custom value is provided (e.g. "NA", "\N", etc.). For numeric fields, if the field is non-null and non-digit characters are encountered at any point before a delimiter or newline, an error is thrown.
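For illustration, parsing successive fields from an in-memory buffer might look like the following sketch (a Nullable is returned per the description above; the sample data is made up):

```julia
using CSV

# a buffer positioned at the start of the first field;
# "NA" is configured as the custom null string
io = IOBuffer("1,2.5,NA,\n")
opt = CSV.Options(null = "NA")

i = CSV.parsefield(io, Int, opt)      # Nullable{Int} holding 1
f = CSV.parsefield(io, Float64, opt)  # Nullable{Float64} holding 2.5
s = CSV.parsefield(io, String, opt)   # null field: matches opt.null ("NA")
```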


# CSV.readline — Function.

CSV.readline(io::IO, q='"', e='\', buf::IOBuffer=IOBuffer()) => String
CSV.readline(source::CSV.Source) => String

read a single line from io (any IO type) or a CSV.Source as a string, accounting for potentially embedded newlines in quoted fields (e.g. value1, value2, "value3 with embedded newlines"). Can optionally provide a buf::IOBuffer type for buffer reuse
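A sketch of the embedded-newline behavior, using made-up in-memory data:

```julia
using CSV

# the quoted third field of the first line contains an embedded newline
io = IOBuffer("a,b,\"c\nd\"\ne,f,g\n")

line1 = CSV.readline(io)  # whole first logical line, quoted newline included
line2 = CSV.readline(io)  # second line: e,f,g
```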


# CSV.readsplitline — Function.

CSV.readsplitline(io, d=',', q='"', e='\', buf::IOBuffer=IOBuffer()) => Vector{String}
CSV.readsplitline(source::CSV.Source) => Vector{String}

read a single line from io (any IO type) as a Vector{String} with elements being delimited fields (separated by a delimiter d). Can optionally provide a buf::IOBuffer type for buffer reuse
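A sketch with made-up data, showing that a delimiter inside a quoted field does not split the field:

```julia
using CSV

io = IOBuffer("name,\"addr, with comma\",age\n")

fields = CSV.readsplitline(io)  # the quoted comma is not a split point
```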


# CSV.countlines — Function.

CSV.countlines(io::IO, quotechar, escapechar) => Int
CSV.countlines(source::CSV.Source) => Int

count the number of lines in a file, accounting for potentially embedded newlines in quoted fields
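A sketch with made-up in-memory data; the newline embedded in the quoted field should not be counted as a line break:

```julia
using CSV

# 3 logical lines; the second contains a quoted, embedded newline
io = IOBuffer("a,b\n1,\"two\nlines\"\n3,4\n")

n = CSV.countlines(io, '"', '\\')
```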
