
CSV.jl Documentation

CSV.jl is built to be a fast and flexible pure-Julia library for handling delimited text files.

High-level interface

CSV.read - Function.

CSV.read(fullpath::Union{AbstractString,IO}, sink::Type{T}=DataFrame, args...; kwargs...) => typeof(sink)

CSV.read(fullpath::Union{AbstractString,IO}, sink::Data.Sink; kwargs...) => Data.Sink

parses a delimited file into a Julia structure (a DataFrame by default, but any valid Data.Sink may be requested).

Minimal error reporting happens with CSV.read for performance reasons; for problematic csv files, try CSV.validate, which takes the exact same arguments as CSV.read and provides much more information about why reading the file failed.

Positional arguments:

  • fullpath; can be a file name (string) or other IO instance

  • sink::Type{T}; DataFrame by default, but may also be any other Data.Sink type that supports streaming via the Data.Field interface; note that the argument can be the type of a Data.Sink, plus any required arguments the sink may need (args...), or an already-constructed sink may be passed (2nd method above)

Keyword Arguments:

  • delim::Union{Char,UInt8}: how fields in the file are delimited; default ','

  • quotechar::Union{Char,UInt8}: the character that indicates a quoted field that may contain the delim or newlines; default '"'

  • escapechar::Union{Char,UInt8}: the character that escapes a quotechar in a quoted field; default '\\'

  • missingstring::String: indicates how missing values are represented in the dataset; default ""

  • dateformat::Union{AbstractString,Dates.DateFormat}: how dates/datetimes are represented in the dataset; default Base.Dates.ISODateTimeFormat

  • decimal::Union{Char,UInt8}: character to recognize as the decimal point in a float number, e.g. 3.14 or 3,14; default '.'

  • truestring: string to represent true::Bool values in a csv file; default "true". Note that truestring and falsestring cannot start with the same character.

  • falsestring: string to represent false::Bool values in a csv file; default "false"

  • header: column names can be provided manually as a complete Vector{String}, or as an Int/AbstractRange which indicates the row/rows that contain the column names

  • datarow::Int: specifies the row on which the actual data starts in the file; by default, the data is expected on the next row after the header row(s); for a file without column names (header), specify datarow=1

  • types: column types can be provided manually as a complete Vector{Type}, or in a Dict to reference individual columns by name or number

  • allowmissing::Symbol=:all: indicates whether columns should allow for missing values or not, that is whether their element type should be Union{T,Missing}; by default, all columns are allowed to contain missing values. If set to :none, no column can contain missing values, and if set to :auto, only columns which contain missing values in the first rows_for_type_detect rows are allowed to contain missing values. Column types specified via types are not affected by this argument.

  • footerskip::Int: indicates the number of rows to skip at the end of the file

  • rows_for_type_detect::Int=100: indicates how many rows should be read to infer the types of columns

  • rows::Int: indicates the total number of rows to read from the file; by default the file is pre-parsed to count the # of rows; -1 can be passed to skip a full-file scan, but the Data.Sink must be set up to account for a potentially unknown # of rows

  • use_mmap::Bool=true: whether the underlying file will be mmapped or not while parsing; note that on Windows machines, the underlying file will not be "deletable" until Julia GC has run (can be run manually via gc()) due to the use of a finalizer when reading the file.

  • append::Bool=false: if the sink argument provided is an existing table, append=true will append the source's data to the existing data instead of doing a full replace

  • transforms::Dict{Union{String,Int},Function}: a Dict of transforms to apply to values as they are parsed. Note that a column can be specified by either number or column name.

  • transpose::Bool=false: if true, rows are treated as columns and columns as rows while reading, so the resulting dataset is the "transpose" of the actual csv data.

  • categorical::Bool=true: read string columns as a CategoricalArray, as long as the % of unique values seen during type detection is less than 67%. This will dramatically reduce memory use in cases where the number of unique values is small.

  • strings::Symbol=:intern: indicates how to treat strings. By default, strings are interned, i.e. a global pool of already encountered strings is used to avoid allocating a new String object for each field. If strings=:raw, string interning is disabled, which can be faster when most strings are unique. If strings=:weakref, the WeakRefStrings package is used to speed up file parsing by avoiding copies; this can only be used with Sink objects that support WeakRefStringArray columns (note that a WeakRefStringArray still returns regular String elements).

Example usage:

julia> dt = CSV.read("bids.csv")
7656334×9 DataFrames.DataFrame
│ Row     │ bid_id  │ bidder_id                               │ auction │ merchandise      │ device      │
├─────────┼─────────┼─────────────────────────────────────────┼─────────┼──────────────────┼─────────────┤
│ 1       │ 0       │ "8dac2b259fd1c6d1120e519fb1ac14fbqvax8" │ "ewmzr" │ "jewelry"        │ "phone0"    │
│ 2       │ 1       │ "668d393e858e8126275433046bbd35c6tywop" │ "aeqok" │ "furniture"      │ "phone1"    │
│ 3       │ 2       │ "aa5f360084278b35d746fa6af3a7a1a5ra3xe" │ "wa00e" │ "home goods"     │ "phone2"    │
...

Other example invocations may include:

# read in a tab-delimited file
CSV.read(file; delim='\t')

# read in a comma-delimited file with missing values represented as '\N', such as a MySQL export
CSV.read(file; missingstring="\\N")

# read a csv file that happens to have column names in the first column, and grouped data in rows instead of columns
CSV.read(file; transpose=true)

# manually provided column names; must match # of columns of data in file
# this assumes there is no header row in the file itself, so data parsing will start at the very beginning of the file
CSV.read(file; header=["col1", "col2", "col3"])

# manually provided column names, even though the file itself has column names on the first row
# `datarow` is specified to ensure data parsing occurs at correct location
CSV.read(file; header=["col1", "col2", "col3"], datarow=2)

# types provided manually; as a vector, must match length of columns in actual data
CSV.read(file; types=[Int, Int, Float64])

# types provided manually; as a Dict, can specify columns by # or column name
CSV.read(file; types=Dict(3=>Float64, 6=>String))
CSV.read(file; types=Dict("col3"=>Float64, "col6"=>String))

# manually provided # of rows; if known beforehand, this will improve parsing speed
# this is also a way to limit the # of rows to be read in a file if only a sample is needed
CSV.read(file; rows=10000)
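
# a few more keyword arguments, shown with illustrative (not default) values;
# the date format, column number, and transform function here are only examples
CSV.read(file; dateformat="mm/dd/yyyy")             # dates in the file look like 12/31/2017
CSV.read(file; footerskip=3)                        # ignore the last 3 rows of the file
CSV.read(file; allowmissing=:none)                  # no column may contain missing values
CSV.read(file; categorical=false)                   # never read string columns as CategoricalArray
CSV.read(file; transforms=Dict(2 => (x -> x * 2)))  # double every value parsed in the 2nd column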

# for data files, `file` and `file2`, with the same structure, read both into a single DataFrame
# note that `df` is used as a 2nd argument in the 2nd call to `CSV.read` and the keyword argument
# `append=true` is passed
df = CSV.read(file)
df = CSV.read(file2, df; append=true)

# manually construct a `CSV.Source` once, then stream its data to both a DataFrame
# and SQLite table `sqlite_table` in the SQLite database `db`
# note the use of `CSV.reset!` to ensure the `source` can be streamed from again
source = CSV.Source(file)
df1 = CSV.read(source, DataFrame)
CSV.reset!(source)
db = SQLite.DB()
sq1 = CSV.read(source, SQLite.Sink, db, "sqlite_table")

CSV.validate - Function.

CSV.validate(fullpath::Union{AbstractString,IO}, sink::Type{T}=DataFrame, args...; kwargs...) => typeof(sink)

CSV.validate(fullpath::Union{AbstractString,IO}, sink::Data.Sink; kwargs...) => Data.Sink

Takes the same positional & keyword arguments as CSV.read, but provides detailed information as to why reading a csv file failed. Useful for cases where reading fails and it's not clear whether it's due to a row having too many columns, wrong types, or some other issue.
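
A minimal sketch of the intended workflow, re-using whatever arguments failed with CSV.read (the types below are purely illustrative):

# if this call fails with a terse error...
CSV.read(file; types=[Int, Float64, String])
# ...re-run it through CSV.validate with the same arguments for a detailed report
CSV.validate(file; types=[Int, Float64, String])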

CSV.write - Function.

CSV.write(file_or_io::Union{AbstractString,IO}, source::Type{T}, args...; kwargs...) => CSV.Sink

CSV.write(file_or_io::Union{AbstractString,IO}, source::Data.Source; kwargs...) => CSV.Sink

Writes a Data.Source out to file_or_io.

Positional Arguments:

  • file_or_io; can be a file name (string) or other IO instance

  • source can be the type of Data.Source, plus any required args..., or an already constructed Data.Source can be passed in directly (2nd method)

Keyword Arguments:

  • delim::Union{Char,UInt8}; how fields in the file will be delimited; default is UInt8(',')

  • quotechar::Union{Char,UInt8}; the character that indicates a quoted field that may contain the delim or newlines; default is UInt8('"')

  • escapechar::Union{Char,UInt8}; the character that escapes a quotechar in a quoted field; default is UInt8('\\')

  • missingstring::String; the ASCII string that indicates how missing values will be represented in the dataset; default is the empty string ""

  • dateformat; how dates/datetimes will be represented in the dataset; default is ISO-8601 yyyy-mm-ddTHH:MM:SS.s

  • header::Bool; whether to write out the column names from source

  • colnames::Vector{String}; a vector of string column names to be used when writing the header row

  • append::Bool; start writing data at the end of io; by default, io will be reset to the beginning or overwritten before writing

  • transforms::Dict{Union{String,Int},Function}; a Dict of transforms to apply to values as they are parsed. Note that a column can be specified by either number or column name.

A few example invocations include:

# write out a DataFrame `df` to a file name "out.csv" with all defaults, including comma as delimiter
CSV.write("out.csv", df)

# write out a DataFrame, this time as a tab-delimited file
CSV.write("out.csv", df; delim='\t')

# write out a DataFrame, with missing values represented by the string "NA"
CSV.write("out.csv", df; missingstring="NA")

# write out a "header-less" file, with actual data starting on row 1
CSV.write("out.csv", df; header=false)
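
# a couple more keyword arguments, shown with illustrative values
# write out a DataFrame with custom column names in the header row (must match the # of columns)
CSV.write("out.csv", df; colnames=["id", "name", "value"])

# write out a DataFrame with dates formatted as year-month-day only
CSV.write("out.csv", df; dateformat="yyyy-mm-dd")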

# write out a DataFrame `df` twice to a file, the resulting file with have twice the # of rows as the DataFrame
# note the usage of the keyword argument `append=true` in the 2nd call
CSV.write("out.csv", df)
CSV.write("out.csv", df; append=true)

# write a DataFrame out to an IOBuffer instead of a file
io = IOBuffer()
CSV.write(io, df)

# write the result of an SQLite query out to a comma-delimited file
db = SQLite.DB()
sqlite_source = SQLite.Source(db, "select * from sqlite_table")
CSV.write("sqlite_table.csv", sqlite_source)

Lower-level utilities

CSV.Source - Type.

A type that satisfies the Data.Source interface in the DataStreams.jl package.

A CSV.Source can be manually constructed in order to be re-used multiple times.

CSV.Source(file_or_io; kwargs...) => CSV.Source

Note that a filename string can be provided or any IO type. For the full list of supported keyword arguments, see the docs for CSV.read or type ?CSV.read at the REPL.

An example of re-using a CSV.Source is:

# manually construct a `CSV.Source` once, then stream its data to both a DataFrame
# and SQLite table `sqlite_table` in the SQLite database `db`
# note the use of `CSV.reset!` to ensure the `source` can be streamed from again
source = CSV.Source(file)
df1 = CSV.read(source, DataFrame)
CSV.reset!(source)
sq1 = CSV.read(source, SQLite.Sink, db, "sqlite_table")

CSV.Sink - Type.

A type that satisfies the Data.Sink interface in the DataStreams.jl package.

A CSV.Sink can be manually constructed in order to be re-used multiple times.

CSV.Sink(file_or_io; kwargs...) => CSV.Sink

Note that a filename string can be provided or any IO type. For the full list of supported keyword arguments, see the docs for CSV.write or type ?CSV.write at the REPL.

An example of re-using a CSV.Sink is:

# manually construct a `CSV.Source` once, then stream its data to both a DataFrame
# and SQLite table `sqlite_table` in the SQLite database `db`
# note the use of `CSV.reset!` to ensure the `source` can be streamed from again
source = CSV.Source(file)
df1 = CSV.read(source, DataFrame)
CSV.reset!(source)
sq1 = CSV.read(source, SQLite.Sink, db, "sqlite_table")

CSV.Options - Type.

Represents the various configuration settings for delimited text file parsing.

Keyword Arguments:

  • delim::Union{Char,UInt8}: how fields in the file are delimited; default ','

  • quotechar::Union{Char,UInt8}: the character that indicates a quoted field that may contain the delim or newlines; default '"'

  • escapechar::Union{Char,UInt8}: the character that escapes a quotechar in a quoted field; default '\\'

  • missingstring::String: indicates how missing values are represented in the dataset; default ""

  • dateformat::Union{AbstractString,Dates.DateFormat}: how dates/datetimes are represented in the dataset; default Base.Dates.ISODateTimeFormat

  • decimal::Union{Char,UInt8}: character to recognize as the decimal point in a float number, e.g. 3.14 or 3,14; default '.'

  • truestring: string to represent true::Bool values in a csv file; default "true". Note that truestring and falsestring cannot start with the same character.

  • falsestring: string to represent false::Bool values in a csv file; default "false"

  • internstrings: whether strings should be interned, rather than creating a new object for each string field; default true
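
A CSV.Options can be constructed with these keywords and passed to lower-level parsing functions such as CSV.parsefield; a minimal sketch, with illustrative values:

# options for a semicolon-delimited file that uses "NA" for missing values
opt = CSV.Options(delim=';', missingstring="NA")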

CSV.parsefield - Function.

CSV.parsefield{T}(io::IO, ::Type{T}, opt::CSV.Options=CSV.Options(), row=0, col=0) => Union{T, Missing}

CSV.parsefield{T}(s::CSV.Source, ::Type{T}, row=0, col=0) => Union{T, Missing}

io is an IO type that is positioned at the first byte/character of a delimited-file field (i.e. a single cell). Leading whitespace is ignored for Integer and Float types. Specialized methods exist for Integer, Float, String, Date, and DateTime; for other types T, a generic fallback requires parse(T, str::String) to be defined. The field value may also be wrapped in opt.quotechar; two consecutive opt.quotechar characters result in a missing field. opt.missingstring is also checked if a custom value is provided (e.g. "NA", "\N", etc.). For numeric fields, if the field is non-missing and non-digit characters are encountered at any point before a delimiter or newline, an error is thrown.
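
For example, parsing a single integer field straight from an IOBuffer might look like this (a sketch, relying on the default CSV.Options):

io = IOBuffer("123,456\n")
v = CSV.parsefield(io, Int)   # consumes the field up to the delimiter and returns 123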

The second method of CSV.parsefield operates on a CSV.Source directly allowing for easy usage when writing custom parsing routines. Do note, however, that the row and col arguments are for error-reporting purposes only. A CSV.Source maintains internal state with regards to the underlying data buffer and can only parse fields sequentially. This means that CSV.parsefield needs to be called somewhat like:

source = CSV.Source(file)
types = Data.types(source)

for col = 1:length(types)
    # parsefield returns Union{T, Missing}; print an empty string for missing fields
    println(coalesce(CSV.parsefield(source, types[col]), ""))
end

CSV.readline - Function.
CSV.readline(io::IO, q='"', e='\\', buf::IOBuffer=IOBuffer()) => String
CSV.readline(source::CSV.Source) => String

Read a single line from io (any IO type) or a CSV.Source as a String object. This function mirrors Base.readline except that the newlines within quoted fields are ignored (e.g. value1, value2, "value3 with embedded newlines"). Uses buf::IOBuffer for intermediate IO operations, if specified.
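
For example, with a quoted field containing an embedded newline (a sketch):

io = IOBuffer("a,b,\"c\nd\"\ne,f,g\n")
CSV.readline(io)   # returns the entire first line; the newline inside the quotes does not end it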

CSV.readsplitline! - Function.
CSV.readsplitline!(vals::Vector{RawField}, io, d=',', q='"', e='\\', buf::IOBuffer=IOBuffer())
CSV.readsplitline!(vals::Vector{RawField}, source::CSV.Source)

Read a single, delimited line from io (any IO type) or a CSV.Source as a Vector{String} and store the values in vals. Delimited fields are separated by d, quoted by q and escaped by e ASCII characters. The contents of vals are replaced. Uses buf::IOBuffer for intermediate IO operations, if specified.

CSV.countlines - Function.
CSV.countlines(io::IO, quotechar, escapechar) => Int
CSV.countlines(source::CSV.Source) => Int

Count the number of lines in a file, accounting for potentially embedded newlines in quoted fields.
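
For example, with the default quote and escape characters (a sketch):

io = IOBuffer("a,b\n1,\"2\n3\"\n")
CSV.countlines(io, '"', '\\')   # counts 2 lines; the newline inside the quoted field is ignored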
