CSV.jl Documentation
CSV.jl is built to be a fast and flexible pure-Julia library for handling delimited text files.
High-level interface
CSV.read — Function.
CSV.read(fullpath::Union{AbstractString,IO}, sink::Type{T}=DataFrame, args...; kwargs...) => typeof(sink)
CSV.read(fullpath::Union{AbstractString,IO}, sink::Data.Sink; kwargs...) => Data.Sink
Parses a delimited file into a Julia structure (a `DataFrame` by default, but any valid `Data.Sink` may be requested).
Minimal error reporting happens with `CSV.read` for performance reasons; for problematic CSV files, try `CSV.validate`, which takes the exact same arguments as `CSV.read` and provides much more information on why reading the file failed.
Positional arguments:
- `fullpath`: can be a file name (string) or other `IO` instance
- `sink::Type{T}`: `DataFrame` by default, but may also be other `Data.Sink` types that support streaming via the `Data.Field` interface; note that the method argument can be the type of a `Data.Sink`, plus any required arguments the sink may need (`args...`), or an already-constructed `sink` may be passed (2nd method above)
Keyword Arguments:
- `delim::Union{Char,UInt8}`: how fields in the file are delimited; default `','`
- `quotechar::Union{Char,UInt8}`: the character that indicates a quoted field that may contain the `delim` or newlines; default `'"'`
- `escapechar::Union{Char,UInt8}`: the character that escapes a `quotechar` in a quoted field; default `'\\'`
- `missingstring::String`: indicates how missing values are represented in the dataset; default `""`
- `dateformat::Union{AbstractString,Dates.DateFormat}`: how dates/datetimes are represented in the dataset; default `Base.Dates.ISODateTimeFormat`
- `decimal::Union{Char,UInt8}`: character to recognize as the decimal point in a float number, e.g. `3.14` or `3,14`; default `'.'`
- `truestring`: string to represent `true::Bool` values in a csv file; default `"true"`. Note that `truestring` and `falsestring` cannot start with the same character.
- `falsestring`: string to represent `false::Bool` values in a csv file; default `"false"`
- `header`: column names can be provided manually as a complete `Vector{String}`, or as an `Int`/`AbstractRange` which indicates the row/rows that contain the column names
- `datarow::Int`: specifies the row on which the actual data starts in the file; by default, the data is expected on the next row after the header row(s); for a file without column names (header), specify `datarow=1`
- `types`: column types can be provided manually as a complete `Vector{Type}`, or in a `Dict` to reference individual columns by name or number
- `allowmissing::Symbol=:all`: indicates whether columns should allow for missing values or not, i.e. whether their element type should be `Union{T,Missing}`; by default, all columns are allowed to contain missing values. If set to `:none`, no column can contain missing values, and if set to `:auto`, only columns which contain missing values in the first `rows_for_type_detect` rows are allowed to contain missing values. Column types specified via `types` are not affected by this argument.
- `footerskip::Int`: indicates the number of rows to skip at the end of the file
- `rows_for_type_detect::Int=100`: indicates how many rows should be read to infer the types of columns
- `rows::Int`: indicates the total number of rows to read from the file; by default the file is pre-parsed to count the number of rows; `-1` can be passed to skip a full-file scan, but the `Data.Sink` must be set up to account for a potentially unknown number of rows
- `use_mmap::Bool=true`: whether the underlying file will be mmapped while parsing; note that on Windows machines, the underlying file will not be "deletable" until the Julia GC has run (can be triggered manually via `gc()`) due to the use of a finalizer when reading the file
- `append::Bool=false`: if the `sink` argument provided is an existing table, `append=true` will append the source's data to the existing data instead of doing a full replace
- `transforms::Dict{Union{String,Int},Function}`: a `Dict` of transforms to apply to values as they are parsed; note that a column can be specified by either number or column name
- `transpose::Bool=false`: when reading the underlying csv data, rows are treated as columns and columns as rows, so the resulting dataset will be the "transpose" of the actual csv data
- `categorical::Bool=true`: read string columns as `CategoricalArray`s, as long as the percentage of unique values seen during type detection is less than 67%; this will dramatically reduce memory use in cases where the number of unique values is small
- `weakrefstrings::Bool=true`: whether to use the `WeakRefStrings` package to speed up file parsing; can only be `true` for `Sink` objects that support `WeakRefStringArray` columns; note that a `WeakRefStringArray` still returns regular `String` elements
Example usage:
julia> dt = CSV.read("bids.csv")
7656334×9 DataFrames.DataFrame
│ Row │ bid_id │ bidder_id │ auction │ merchandise │ device │
├─────────┼─────────┼─────────────────────────────────────────┼─────────┼──────────────────┼─────────────┤
│ 1 │ 0 │ "8dac2b259fd1c6d1120e519fb1ac14fbqvax8" │ "ewmzr" │ "jewelry" │ "phone0" │
│ 2 │ 1 │ "668d393e858e8126275433046bbd35c6tywop" │ "aeqok" │ "furniture" │ "phone1" │
│ 3 │ 2 │ "aa5f360084278b35d746fa6af3a7a1a5ra3xe" │ "wa00e" │ "home goods" │ "phone2" │
...
Other example invocations may include:
# read in a tab-delimited file
CSV.read(file; delim='\t')
# read in a comma-delimited file with missing values represented as '\N', such as a MySQL export
CSV.read(file; missingstring="\\N")
# read a csv file that happens to have column names in the first column, and grouped data in rows instead of columns
CSV.read(file; transpose=true)
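# parse dates/datetimes with a custom format
# (a sketch; the format string is hypothetical, any `Dates.DateFormat`-compatible string works)
CSV.read(file; dateformat="mm/dd/yyyy")
# parse floats that use a comma as the decimal separator, e.g. 3,14
CSV.read(file; decimal=',')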
# manually provided column names; must match # of columns of data in file
# this assumes there is no header row in the file itself, so data parsing will start at the very beginning of the file
CSV.read(file; header=["col1", "col2", "col3"])
# manually provided column names, even though the file itself has column names on the first row
# `datarow` is specified to ensure data parsing occurs at correct location
CSV.read(file; header=["col1", "col2", "col3"], datarow=2)
# types provided manually; as a vector, must match length of columns in actual data
CSV.read(file; types=[Int, Int, Float64])
# types provided manually; as a Dict, can specify columns by # or column name
CSV.read(file; types=Dict(3=>Float64, 6=>String))
CSV.read(file; types=Dict("col3"=>Float64, "col6"=>String))
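# apply a transform to a column's values as they are parsed
# (a sketch; assumes column 2 holds numeric data that we want doubled)
CSV.read(file; transforms=Dict(2 => x -> 2x))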
# manually provided # of rows; if known beforehand, this will improve parsing speed
# this is also a way to limit the # of rows to be read in a file if only a sample is needed
CSV.read(file; rows=10000)
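# disallow missing values entirely; no column will have a Union{T,Missing} element type
CSV.read(file; allowmissing=:none)
# detect missingness per-column, using a larger sample of rows for type detection
# (a sketch; 200 is an arbitrary illustrative value)
CSV.read(file; allowmissing=:auto, rows_for_type_detect=200)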
# for data files, `file` and `file2`, with the same structure, read both into a single DataFrame
# note that `df` is used as a 2nd argument in the 2nd call to `CSV.read` and the keyword argument
# `append=true` is passed
df = CSV.read(file)
df = CSV.read(file2, df; append=true)
# manually construct a `CSV.Source` once, then stream its data to both a DataFrame
# and SQLite table `sqlite_table` in the SQLite database `db`
# note the use of `CSV.reset!` to ensure the `source` can be streamed from again
source = CSV.Source(file)
df1 = CSV.read(source, DataFrame)
CSV.reset!(source)
db = SQLite.DB()
sq1 = CSV.read(source, SQLite.Sink, db, "sqlite_table")
CSV.validate — Function.
CSV.validate(fullpath::Union{AbstractString,IO}, sink::Type{T}=DataFrame, args...; kwargs...) => typeof(sink)
CSV.validate(fullpath::Union{AbstractString,IO}, sink::Data.Sink; kwargs...) => Data.Sink
Takes the same positional & keyword arguments as CSV.read, but provides detailed information as to why reading a CSV file failed. Useful for cases where reading fails and it's not clear whether it's due to a row having too many columns, a wrong type, or something else.
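For example, if `CSV.read` fails on a file, the same invocation can be re-run through `CSV.validate` to get a detailed report (a minimal sketch; "data.csv" and the keyword arguments are hypothetical):
CSV.validate("data.csv")
# or with the same keyword arguments as the failing CSV.read call:
CSV.validate("data.csv"; delim=';', rows_for_type_detect=500)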
CSV.write — Function.
CSV.write(file_or_io::Union{AbstractString,IO}, source::Type{T}, args...; kwargs...) => CSV.Sink
CSV.write(file_or_io::Union{AbstractString,IO}, source::Data.Source; kwargs...) => CSV.Sink
Writes a `Data.Source` out to `file_or_io`.
Positional Arguments:
- `file_or_io`: can be a file name (string) or other `IO` instance
- `source`: can be the type of a `Data.Source`, plus any required `args...`, or an already-constructed `Data.Source` can be passed in directly (2nd method)
Keyword Arguments:
- `delim::Union{Char,UInt8}`: how fields in the file will be delimited; default is `UInt8(',')`
- `quotechar::Union{Char,UInt8}`: the character that indicates a quoted field that may contain the `delim` or newlines; default is `UInt8('"')`
- `escapechar::Union{Char,UInt8}`: the character that escapes a `quotechar` in a quoted field; default is `UInt8('\\')`
- `missingstring::String`: the ASCII string that indicates how missing values will be represented in the dataset; default is the empty string `""`
- `dateformat`: how dates/datetimes will be represented in the dataset; default is ISO-8601 `yyyy-mm-ddTHH:MM:SS.s`
- `header::Bool`: whether to write out the column names from `source`
- `colnames::Vector{String}`: a vector of string column names to be used when writing the header row
- `append::Bool`: start writing data at the end of `io`; by default, `io` will be reset to the beginning or overwritten before writing
- `transforms::Dict{Union{String,Int},Function}`: a `Dict` of transforms to apply to values as they are parsed; note that a column can be specified by either number or column name
A few example invocations include:
# write out a DataFrame `df` to a file name "out.csv" with all defaults, including comma as delimiter
CSV.write("out.csv", df)
# write out a DataFrame, this time as a tab-delimited file
CSV.write("out.csv", df; delim=' ')
# write out a DataFrame, with missing values represented by the string "NA"
CSV.write("out.csv", df; missingstring="NA")
# write out a "header-less" file, with actual data starting on row 1
CSV.write("out.csv", df; header=false)
# write out a DataFrame `df` twice to a file; the resulting file will have twice the # of rows as the DataFrame
# note the usage of the keyword argument `append=true` in the 2nd call
CSV.write("out.csv", df)
CSV.write("out.csv", df; append=true)
# write a DataFrame out to an IOBuffer instead of a file
io = IOBuffer()
CSV.write(io, df)
# write the result of an SQLite query out to a comma-delimited file
db = SQLite.DB()
sqlite_source = SQLite.Source(db, "select * from sqlite_table")
CSV.write("sqlite_table.csv", sqlite_source)Lower-level utilities
CSV.Source
CSV.Sink
CSV.Options
CSV.parsefield
CSV.readline
CSV.readsplitline
CSV.countlines
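These are the lower-level building blocks used by `CSV.read` and `CSV.write`. As a minimal sketch (reusing only calls already shown above; "data.csv" is a hypothetical file name), a `CSV.Source` can be constructed once and streamed to multiple sinks:
source = CSV.Source("data.csv")   # parse the file's header and options up front
df1 = CSV.read(source, DataFrame) # stream the parsed data to a DataFrame
CSV.reset!(source)                # rewind so the source can be streamed again
df2 = CSV.read(source, DataFrame)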