CSV.jl Documentation
CSV.jl is built to be a fast and flexible pure-Julia library for handling delimited text files.
High-level interface
CSV.read
— Function.CSV.read(fullpath::Union{AbstractString,IO}, sink::Type{T}=DataFrame, args...; kwargs...)
=> typeof(sink)
CSV.read(fullpath::Union{AbstractString,IO}, sink::Data.Sink; kwargs...)
=> Data.Sink
parses a delimited file into a Julia structure (a DataFrame by default, but any valid Data.Sink
may be requested).
Positional arguments:
fullpath
; can be a file name (string) or otherIO
instancesink::Type{T}
;DataFrame
by default, but may also be otherData.Sink
types that support streaming viaData.Field
interface; note that the method argument can be the type ofData.Sink
, plus any required arguments the sink may need (args...
). or an already constructedsink
may be passed (2nd method above)
Keyword Arguments:
delim::Union{Char,UInt8}
; a single character or ascii-compatible byte that indicates how fields in the file are delimited; default isUInt8(',')
quotechar::Union{Char,UInt8}
; the character that indicates a quoted field that may contain thedelim
or newlines; default isUInt8('"')
escapechar::Union{Char,UInt8}
; the character that escapes aquotechar
in a quoted field; default isUInt8('\')
null::String
; an ascii string that indicates how NULL values are represented in the dataset; default is the empty string,""
header
; column names can be provided manually as a complete Vector{String}, or as an Int/Range which indicates the row/rows that contain the column namesdatarow::Int
; specifies the row on which the actual data starts in the file; by default, the data is expected on the next row after the header row(s); for a file without column names (header), specifydatarow=1
types
; column types can be provided manually as a complete Vector{DataType}, or in a Dict to reference individual columns by name or numbernullable::Bool
; indicates whether values can be nullable or not;true
by default. If set tofalse
and missing values are encountered, aNullException
will be throwndateformat::Union{AbstractString,Dates.DateFormat}
; how all dates/datetimes in the dataset are formattedfooterskip::Int
; indicates the number of rows to skip at the end of the filerows_for_type_detect::Int=100
; indicates how many rows should be read to infer the types of columnsrows::Int
; indicates the total number of rows to read from the file; by default the file is pre-parsed to count the # of rows;-1
can be passed to skip a full-file scan, but theData.Sink
must be setup account for a potentially unknown # of rowsuse_mmap::Bool=true
; whether the underlying file will be mmapped or not while parsingappend::Bool=false
; if thesink
argument provided is an existing table,append=true
will append the source's data to the existing data instead of doing a full replacetransforms::Dict{Union{String,Int},Function}
; a Dict of transforms to apply to values as they are parsed. Note that a column can be specified by either number or column name.
Note by default, "string" or text columns will be parsed as the WeakRefString
type. This is a custom type that only stores a pointer to the actual byte data + the number of bytes. To convert a String
to a standard Julia string type, just call string(::WeakRefString)
, this also works on an entire column. Oftentimes, however, it can be convenient to work with WeakRefStrings
depending on the ultimate use, such as transfering the data directly to another system and avoiding all the intermediate copying.
Example usage:
julia> dt = CSV.read("bids.csv")
7656334×9 DataFrames.DataFrame
│ Row │ bid_id │ bidder_id │ auction │ merchandise │ device │
├─────────┼─────────┼─────────────────────────────────────────┼─────────┼──────────────────┼─────────────┤
│ 1 │ 0 │ "8dac2b259fd1c6d1120e519fb1ac14fbqvax8" │ "ewmzr" │ "jewelry" │ "phone0" │
│ 2 │ 1 │ "668d393e858e8126275433046bbd35c6tywop" │ "aeqok" │ "furniture" │ "phone1" │
│ 3 │ 2 │ "aa5f360084278b35d746fa6af3a7a1a5ra3xe" │ "wa00e" │ "home goods" │ "phone2" │
...
Other example invocations may include:
# read in a tab-delimited file
CSV.read(file; delim=' ')
# read in a comma-delimited file with null values represented as 'N', such as a MySQL export
CSV.read(file; null="\N")
# manually provided column names; must match # of columns of data in file
# this assumes there is no header row in the file itself, so data parsing will start at the very beginning of the file
CSV.read(file; header=["col1", "col2", "col3"])
# manually provided column names, even though the file itself has column names on the first row
# `datarow` is specified to ensure data parsing occurs at correct location
CSV.read(file; header=["col1", "col2", "col3"], datarow=2)
# types provided manually; as a vector, must match length of columns in actual data
CSV.read(file; types=[Int, Int, Float64])
# types provided manually; as a Dict, can specify columns by # or column name
CSV.read(file; types=Dict(3=>Float64, 6=>String))
CSV.read(file; types=Dict("col3"=>Float64, "col6"=>String))
# manually provided # of rows; if known beforehand, this will improve parsing speed
# this is also a way to limit the # of rows to be read in a file if only a sample is needed
CSV.read(file; rows=10000)
# for data files, `file` and `file2`, with the same structure, read both into a single DataFrame
# note that `df` is used as a 2nd argument in the 2nd call to `CSV.read` and the keyword argument
# `append=true` is passed
df = CSV.read(file)
df = CSV.read(file2, df; append=true)
# manually construct a `CSV.Source` once, then stream its data to both a DataFrame
# and SQLite table `sqlite_table` in the SQLite database `db`
# note the use of `CSV.reset!` to ensure the `source` can be streamed from again
source = CSV.Source(file)
df1 = CSV.read(source, DataFrame)
CSV.reset!(source)
db = SQLite.DB()
sq1 = CSV.read(source, SQLite.Sink, db, "sqlite_table")
CSV.write
— Function.CSV.write(file_or_io::Union{AbstractString,IO}, source::Type{T}, args...; kwargs...)
=> CSV.Sink
CSV.write(file_or_io::Union{AbstractString,IO}, source::Data.Source; kwargs...)
=> CSV.Sink
write a Data.Source
out to a file_or_io
.
Positional Arguments:
file_or_io
; can be a file name (string) or otherIO
instancesource
can be the type ofData.Source
, plus any requiredargs...
, or an already constructedData.Source
can be passsed in directly (2nd method)
Keyword Arguments:
delim::Union{Char,UInt8}
; how fields in the file will be delimited; default isUInt8(',')
quotechar::Union{Char,UInt8}
; the character that indicates a quoted field that may contain thedelim
or newlines; default isUInt8('"')
escapechar::Union{Char,UInt8}
; the character that escapes aquotechar
in a quoted field; default isUInt8('\')
null::String
; the ascii string that indicates how NULL values will be represented in the dataset; default is the emtpy string""
dateformat
; how dates/datetimes will be represented in the dataset; default is ISO-8601yyyy-mm-ddTHH:MM:SS.s
header::Bool
; whether to write out the column names fromsource
colnames::Vector{String}
; a vector of string column names to be used when writing the header rowappend::Bool
; start writing data at the end ofio
; by default,io
will be reset to the beginning or overwritten before writingtransforms::Dict{Union{String,Int},Function}
; a Dict of transforms to apply to values as they are parsed. Note that a column can be specified by either number or column name.
A few example invocations include:
# write out a DataFrame `df` to a file name "out.csv" with all defaults, including comma as delimiter
CSV.write("out.csv", df)
# write out a DataFrame, this time as a tab-delimited file
CSV.write("out.csv", df; delim=' ')
# write out a DataFrame, with null values represented by the string "NA"
CSV.write("out.csv", df; null="NA")
# write out a "header-less" file, with actual data starting on row 1
CSV.write("out.csv", df; header=false)
# write out a DataFrame `df` twice to a file, the resulting file with have twice the # of rows as the DataFrame
# note the usage of the keyword argument `append=true` in the 2nd call
CSV.write("out.csv", df)
CSV.write("out.csv", df; append=true)
# write a DataFrame out to an IOBuffer instead of a file
io = IOBuffer
CSV.write(io, df)
# write the result of an SQLite query out to a comma-delimited file
db = SQLite.DB()
sqlite_source = SQLite.Source(db, "select * from sqlite_table")
CSV.write("sqlite_table.csv", sqlite_source)
Lower-level utilities
CSV.Source
— Type.A type that satisfies the Data.Source
interface in the DataStreams.jl
package.
A CSV.Source
can be manually constructed in order to be re-used multiple times.
CSV.Source(file_or_io; kwargs...) => CSV.Source
Note that a filename string can be provided or any IO
type. For the full list of supported keyword arguments, see the docs for CSV.read
or type ?CSV.read
at the repl
An example of re-using a CSV.Source
is:
# manually construct a `CSV.Source` once, then stream its data to both a DataFrame
# and SQLite table `sqlite_table` in the SQLite database `db`
# note the use of `CSV.reset!` to ensure the `source` can be streamed from again
source = CSV.Source(file)
df1 = CSV.read(source, DataFrame)
CSV.reset!(source)
sq1 = CSV.read(source, SQLite.Sink, db, "sqlite_table")
CSV.Sink
— Type.A type that satisfies the Data.Sink
interface in the DataStreams.jl
package.
A CSV.Sink
can be manually constructed in order to be re-used multiple times.
CSV.Sink(file_or_io; kwargs...) => CSV.Sink
Note that a filename string can be provided or any IO
type. For the full list of supported keyword arguments, see the docs for CSV.write
or type ?CSV.write
at the repl
An example of re-using a CSV.Sink
is:
# manually construct a `CSV.Source` once, then stream its data to both a DataFrame
# and SQLite table `sqlite_table` in the SQLite database `db`
# note the use of `CSV.reset!` to ensure the `source` can be streamed from again
source = CSV.Source(file)
df1 = CSV.read(source, DataFrame)
CSV.reset!(source)
sq1 = CSV.read(source, SQLite.Sink, db, "sqlite_table")
CSV.Options
— Type.Represents the various configuration settings for delimited text file parsing.
Keyword Arguments:
delim::Union{Char,UInt8}
; how fields in the file are delimitedquotechar::Union{Char,UInt8}
; the character that indicates a quoted field that may contain thedelim
or newlinesescapechar::Union{Char,UInt8}
; the character that escapes aquotechar
in a quoted fieldnull::String
; indicates how NULL values are represented in the datasetdateformat::Union{AbstractString,Dates.DateFormat}
; how dates/datetimes are represented in the dataset
CSV.parsefield
— Function.CSV.parsefield{T}(io::IO, ::Type{T}, opt::CSV.Options=CSV.Options(), row=0, col=0)
=> Nullable{T}
CSV.parsefield{T}(s::CSV.Source, ::Type{T}, row=0, col=0)
=> Nullable{T}
`
io
is an IO
type that is positioned at the first byte/character of an delimited-file field (i.e. a single cell) leading whitespace is ignored for Integer and Float types. returns a Nullable{T}
saying whether the field contains a null value or not (empty field, missing value) field is null if the next delimiter or newline is encountered before any other characters. Specialized methods exist for Integer, Float, String, Date, and DateTime. For other types T
, a generic fallback requires parse(T, str::String)
to be defined. the field value may also be wrapped in opt.quotechar
; two consecutive opt.quotechar
results in a null field opt.null
is also checked if there is a custom value provided (i.e. "NA", "\N", etc.) For numeric fields, if field is non-null and non-digit characters are encountered at any point before a delimiter or newline, an error is thrown
The second method of CSV.parsefield
operates on a CSV.Source
directly allowing for easy usage when writing custom parsing routines. Do note, however, that the row
and col
arguments are for error-reporting purposes only. A CSV.Source
maintains internal state with regards to the underlying data buffer and can only parse fields sequentially. This means that CSV.parsefield
needs to be called somewhat like:
source = CSV.Source(file)
types = Data.types(source)
for col = 1:length(types)
println(get(CSV.parsefield(source, types[col]), """"))
end
CSV.readline
— Function.CSV.readline(io::IO, q='"', e='\', buf::IOBuffer=IOBuffer())
=> String
CSV.readline(source::CSV.Source)
=> String
read a single line from io
(any IO
type) or a CSV.Source
as a string, accounting for potentially embedded newlines in quoted fields (e.g. value1, value2, "value3 with embedded newlines"). Can optionally provide a buf::IOBuffer
type for buffer reuse
This function basically mirrors Base.readline
except it can account for quoted newlines to not as the true end of a line.
CSV.readsplitline
— Function.CSV.readsplitline(io, d=',', q='"', e='\', buf::IOBuffer=IOBuffer())
=> Vector{String}
CSV.readsplitline(source::CSV.Source)
=> Vector{String}
read a single, delimited line from io
(any IO
type) or a CSV.Source
as a Vector{String}
delimited fields are separated by an ascii character d
). Can optionally provide a buf::IOBuffer
type for buffer reuse
CSV.countlines
— Function.CSV.countlines(io::IO, quotechar, escapechar)
=> Int
CSV.countlines(source::CSV.Source)
=> Int
count the number of lines in a file, accounting for potentially embedded newlines in quoted fields