CSV.jl Documentation
CSV.jl is built to be a fast and flexible pure-Julia library for handling delimited text files.
High-level interface
CSV.read — Function.

CSV.read(fullpath::Union{AbstractString,IO}, sink::Type{T}=DataFrame, args...; kwargs...) => typeof(sink)

CSV.read(fullpath::Union{AbstractString,IO}, sink::Data.Sink; kwargs...) => Data.Sink
Parses a delimited file into a Julia structure (a DataFrame by default, but any valid Data.Sink may be requested). Minimal error reporting happens with CSV.read for performance reasons; for problematic CSV files, try CSV.validate, which takes the exact same arguments as CSV.read and provides much more information on why reading the file failed.
Positional arguments:
- `fullpath`: can be a file name (string) or other `IO` instance
- `sink::Type{T}`: `DataFrame` by default, but may also be other `Data.Sink` types that support streaming via the `Data.Field` interface; note that the method argument can be the type of `Data.Sink`, plus any required arguments the sink may need (`args...`), or an already constructed `sink` may be passed (2nd method above)
Keyword Arguments:
- `delim::Union{Char,UInt8}`: how fields in the file are delimited; default `','`
- `quotechar::Union{Char,UInt8}`: the character that indicates a quoted field that may contain the `delim` or newlines; default `'"'`
- `escapechar::Union{Char,UInt8}`: the character that escapes a `quotechar` in a quoted field; default `'\\'`
- `missingstring::String`: indicates how missing values are represented in the dataset; default `""`
- `dateformat::Union{AbstractString,Dates.DateFormat}`: how dates/datetimes are represented in the dataset; default `Base.Dates.ISODateTimeFormat`
- `decimal::Union{Char,UInt8}`: character to recognize as the decimal point in a float number, e.g. `3.14` or `3,14`; default `'.'`
- `truestring`: string to represent `true::Bool` values in a csv file; default `"true"`. Note that `truestring` and `falsestring` cannot start with the same character.
- `falsestring`: string to represent `false::Bool` values in a csv file; default `"false"`
- `header`: column names can be provided manually as a complete `Vector{String}`, or as an `Int`/`AbstractRange` which indicates the row/rows that contain the column names
- `datarow::Int`: specifies the row on which the actual data starts in the file; by default, the data is expected on the next row after the header row(s); for a file without column names (header), specify `datarow=1`
- `types`: column types can be provided manually as a complete `Vector{Type}`, or in a `Dict` to reference individual columns by name or number
- `allowmissing::Symbol=:all`: indicates whether columns should allow for missing values, i.e. whether their element type should be `Union{T,Missing}`; by default, all columns are allowed to contain missing values. If set to `:none`, no column can contain missing values, and if set to `:auto`, only columns which contain missing values in the first `rows_for_type_detect` rows are allowed to contain missing values. Column types specified via `types` are not affected by this argument.
- `footerskip::Int`: indicates the number of rows to skip at the end of the file
- `rows_for_type_detect::Int=100`: indicates how many rows should be read to infer the types of columns
- `rows::Int`: indicates the total number of rows to read from the file; by default the file is pre-parsed to count the number of rows; `-1` can be passed to skip a full-file scan, but the `Data.Sink` must be set up to account for a potentially unknown number of rows
- `use_mmap::Bool=true`: whether the underlying file will be mmapped while parsing; note that on Windows machines, the underlying file will not be "deletable" until Julia's GC has run (it can be run manually via `gc()`) due to the use of a finalizer when reading the file
- `append::Bool=false`: if the `sink` argument provided is an existing table, `append=true` will append the source's data to the existing data instead of doing a full replace
- `transforms::Dict{Union{String,Int},Function}`: a `Dict` of transforms to apply to values as they are parsed; note that a column can be specified by either number or column name
- `transpose::Bool=false`: when reading the underlying csv data, rows should be treated as columns and columns as rows, so the resulting dataset will be the "transpose" of the actual csv data
- `categorical::Bool=true`: read string columns as a `CategoricalArray`, as long as the percentage of unique values seen during type detection is less than 67%; this can dramatically reduce memory use in cases where the number of unique values is small
- `weakrefstrings::Bool=true`: whether to use the WeakRefStrings package to speed up file parsing; can only be `true` for `Sink` objects that support `WeakRefStringArray` columns. Note that a `WeakRefStringArray` still returns regular `String` elements.
Example usage:
julia> dt = CSV.read("bids.csv")
7656334×9 DataFrames.DataFrame
│ Row │ bid_id │ bidder_id │ auction │ merchandise │ device │
├─────────┼─────────┼─────────────────────────────────────────┼─────────┼──────────────────┼─────────────┤
│ 1 │ 0 │ "8dac2b259fd1c6d1120e519fb1ac14fbqvax8" │ "ewmzr" │ "jewelry" │ "phone0" │
│ 2 │ 1 │ "668d393e858e8126275433046bbd35c6tywop" │ "aeqok" │ "furniture" │ "phone1" │
│ 3 │ 2 │ "aa5f360084278b35d746fa6af3a7a1a5ra3xe" │ "wa00e" │ "home goods" │ "phone2" │
...
Other example invocations may include:
# read in a tab-delimited file
CSV.read(file; delim='\t')
# read in a comma-delimited file with missing values represented as '\N', such as a MySQL export
CSV.read(file; missingstring="\\N")
# read a csv file that happens to have column names in the first column, and grouped data in rows instead of columns
CSV.read(file; transpose=true)
# manually provided column names; must match # of columns of data in file
# this assumes there is no header row in the file itself, so data parsing will start at the very beginning of the file
CSV.read(file; header=["col1", "col2", "col3"])
# manually provided column names, even though the file itself has column names on the first row
# `datarow` is specified to ensure data parsing occurs at correct location
CSV.read(file; header=["col1", "col2", "col3"], datarow=2)
# types provided manually; as a vector, must match length of columns in actual data
CSV.read(file; types=[Int, Int, Float64])
# types provided manually; as a Dict, can specify columns by # or column name
CSV.read(file; types=Dict(3=>Float64, 6=>String))
CSV.read(file; types=Dict("col3"=>Float64, "col6"=>String))
# manually provided # of rows; if known beforehand, this will improve parsing speed
# this is also a way to limit the # of rows to be read in a file if only a sample is needed
CSV.read(file; rows=10000)
# for data files, `file` and `file2`, with the same structure, read both into a single DataFrame
# note that `df` is used as a 2nd argument in the 2nd call to `CSV.read` and the keyword argument
# `append=true` is passed
df = CSV.read(file)
df = CSV.read(file2, df; append=true)
# manually construct a `CSV.Source` once, then stream its data to both a DataFrame
# and SQLite table `sqlite_table` in the SQLite database `db`
# note the use of `CSV.reset!` to ensure the `source` can be streamed from again
source = CSV.Source(file)
df1 = CSV.read(source, DataFrame)
CSV.reset!(source)
db = SQLite.DB()
sq1 = CSV.read(source, SQLite.Sink, db, "sqlite_table")
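A few of the keyword arguments documented above don't appear in these examples; the following sketch (with a hypothetical `file` whose contents match each option) illustrates `dateformat`, `decimal`, `truestring`/`falsestring`, and `transforms`:
# hedged sketch: parse dates written as mm/dd/yyyy, floats written with a comma
# as the decimal point (e.g. 3,14), and Bool columns written as "T"/"F"
CSV.read(file; dateformat="mm/dd/yyyy", decimal=',', truestring="T", falsestring="F")
# hedged sketch: apply a transform to values of the 2nd column as they are parsed
CSV.read(file; transforms=Dict(2 => uppercase))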
CSV.validate — Function.

CSV.validate(fullpath::Union{AbstractString,IO}, sink::Type{T}=DataFrame, args...; kwargs...) => typeof(sink)

CSV.validate(fullpath::Union{AbstractString,IO}, sink::Data.Sink; kwargs...) => Data.Sink
Takes the same positional & keyword arguments as CSV.read, but provides detailed information as to why reading a CSV file failed. Useful for cases where reading fails and it's not clear whether it's due to a row having too many columns, wrong types, or something else.
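For example, if a read fails, the same call can be re-run through CSV.validate for a diagnosis (the file name and types here are hypothetical):
# hedged sketch: re-run a failing `CSV.read` call through `CSV.validate`,
# passing the exact same arguments, to see why parsing failed
CSV.validate("problem_file.csv"; types=[Int, Int, Float64])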
CSV.write — Function.

CSV.write(file_or_io::Union{AbstractString,IO}, source::Type{T}, args...; kwargs...) => CSV.Sink

CSV.write(file_or_io::Union{AbstractString,IO}, source::Data.Source; kwargs...) => CSV.Sink
Writes a `Data.Source` out to a `file_or_io`.
Positional Arguments:
- `file_or_io`: can be a file name (string) or other `IO` instance
- `source`: can be the type of a `Data.Source`, plus any required `args...`, or an already constructed `Data.Source` can be passed in directly (2nd method)
Keyword Arguments:
- `delim::Union{Char,UInt8}`: how fields in the file will be delimited; default is `UInt8(',')`
- `quotechar::Union{Char,UInt8}`: the character that indicates a quoted field that may contain the `delim` or newlines; default is `UInt8('"')`
- `escapechar::Union{Char,UInt8}`: the character that escapes a `quotechar` in a quoted field; default is `UInt8('\\')`
- `missingstring::String`: the ASCII string that indicates how missing values will be represented in the dataset; default is the empty string `""`
- `dateformat`: how dates/datetimes will be represented in the dataset; default is ISO-8601 `yyyy-mm-ddTHH:MM:SS.s`
- `header::Bool`: whether to write out the column names from `source`
- `colnames::Vector{String}`: a vector of string column names to be used when writing the header row
- `append::Bool`: start writing data at the end of `io`; by default, `io` will be reset to the beginning or overwritten before writing
- `transforms::Dict{Union{String,Int},Function}`: a `Dict` of transforms to apply to values as they are written; note that a column can be specified by either number or column name
A few example invocations include:
# write out a DataFrame `df` to a file name "out.csv" with all defaults, including comma as delimiter
CSV.write("out.csv", df)
# write out a DataFrame, this time as a tab-delimited file
CSV.write("out.csv", df; delim=' ')
# write out a DataFrame, with missing values represented by the string "NA"
CSV.write("out.csv", df; missingstring="NA")
# write out a "header-less" file, with actual data starting on row 1
CSV.write("out.csv", df; header=false)
# write out a DataFrame `df` twice to a file, the resulting file with have twice the # of rows as the DataFrame
# note the usage of the keyword argument `append=true` in the 2nd call
CSV.write("out.csv", df)
CSV.write("out.csv", df; append=true)
# write a DataFrame out to an IOBuffer instead of a file
io = IOBuffer()
CSV.write(io, df)
# write the result of an SQLite query out to a comma-delimited file
db = SQLite.DB()
sqlite_source = SQLite.Source(db, "select * from sqlite_table")
CSV.write("sqlite_table.csv", sqlite_source)
Lower-level utilities
CSV.Source
CSV.Sink
CSV.Options
CSV.parsefield
CSV.readline
CSV.readsplitline
CSV.countlines