CSV.jl Documentation

CSV.jl is built to be a fast and flexible pure-Julia library for handling delimited text files.

Getting Started

CSV.jl provides a number of utilities for working with delimited files. CSV.File provides a way to read files into columns of data, detecting column types. CSV.Rows provides a row iterator for looping over rows in a file. Inputs to either should be filenames as Strings or FilePaths, or byte vectors (AbstractVector{UInt8}). To read other IO inputs, just call read(io) and pass the bytes directly to CSV.File or CSV.Rows.
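
For a quick orientation, here is a minimal sketch of both entry points (the file name data.csv is hypothetical):

using CSV

# read a whole file into typed columns
f = CSV.File("data.csv")

# stream over rows one at a time
for row in CSV.Rows("data.csv")
    # access values by column name, e.g. row.col1
end

# for other IO (or Cmd) inputs, read the bytes first and pass them along
io = IOBuffer("a,b\n1,2\n")
f2 = CSV.File(read(io))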

If julia is started with multiple threads (i.e. julia -t 4, or with JULIA_NUM_THREADS environment variable set), CSV.File will use those threads by default to parse large enough files. There are a few keyword arguments to control multithreaded parsing, including:

  • ntasks=1: turn off multithreaded parsing; the file will be read sequentially on a single thread
  • ntasks=N: control how many tasks/chunks are used to break up a file; by default, Threads.nthreads() will be used
  • rows_to_check=M: when a file is split into chunks, the parser must then find valid starts/ends to rows; this keyword argument controls how many rows are checked to ensure valid rows are found; for files with very large quoted text fields, a higher number may be required
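
For example, assuming Julia was started with multiple threads (julia -t 4) and large_file.csv is a hypothetical large input:

using CSV

# large enough files are parsed using the available threads automatically
f = CSV.File("large_file.csv")

# force single-threaded parsing
f1 = CSV.File("large_file.csv"; ntasks=1)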

Key Functions

CSV.File (Type)
CSV.File(source; kwargs...) => CSV.File

Read a UTF-8 CSV input and return a CSV.File object.

The source argument can be one of:

  • filename given as a string or FilePaths.jl type
  • an AbstractVector{UInt8} like a byte buffer or codeunits(string)
  • an IOBuffer

To read a csv file from a url, use the HTTP.jl package, where the HTTP.Response body can be passed like:

using HTTP, CSV
f = CSV.File(HTTP.get(url).body)

For other IO or Cmd inputs, you can pass them like: f = CSV.File(read(obj)).

Opens the file and uses passed arguments to detect the number of columns and column types, unless column types are provided manually via the types keyword argument. Note that passing column types manually can slightly increase performance for each column type provided (column types can be given as a Vector for all columns, or specified per column via name or index in a Dict).

For text encodings other than UTF-8, load the StringEncodings.jl package and call e.g. CSV.File(open(read, source, enc"ISO-8859-1")).

The returned CSV.File object supports the Tables.jl interface and can be iterated to produce CSV.Row values. CSV.Row supports propertynames and getproperty to access individual row values. CSV.File also supports entire column access like a DataFrame via direct property access on the file object, like f = CSV.File(file); f.col1. Note that duplicate column names will be detected and adjusted to ensure uniqueness (duplicate column name a will become a_1). For example, one could iterate over a csv file with column names a, b, and c by doing:

for row in CSV.File(file)
    println("a=$(row.a), b=$(row.b), c=$(row.c)")
end

By supporting the Tables.jl interface, a CSV.File can also be passed as a table input to any other table sink function, for example:

# materialize a csv file as a DataFrame, copying columns from CSV.File
df = CSV.File(file) |> DataFrame

# to avoid making a copy of parsed columns, use CSV.read
df = CSV.read(file, DataFrame)

# load a csv file directly into an sqlite database table
db = SQLite.DB()
tbl = CSV.File(file) |> SQLite.load!(db, "sqlite_table")

Arguments

File layout options:

  • header=1: how column names should be determined; if given as an Integer, indicates the row to parse for column names; as an AbstractVector{<:Integer}, indicates a set of rows to be concatenated together as column names; Vector{Symbol} or Vector{String} give column names explicitly (should match # of columns in dataset); if a dataset doesn't have column names, either provide them as a Vector, or set header=0 or header=false and column names will be auto-generated (Column1, Column2, etc.). Note that if a row number header and comment or ignoreemptyrows are provided, the header row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header row will actually be the next non-commented row.
  • normalizenames::Bool=false: whether column names should be "normalized" into valid Julia identifier symbols; useful when using the tbl.col1 getproperty syntax or iterating rows and accessing column values of a row via getproperty (e.g. row.col1)
  • skipto::Integer: specifies the row where the data starts in the csv file; by default, the next row after the header row(s) is used. If header=0, then the 1st row is assumed to be the start of data; providing a skipto argument does not affect the header argument. Note that if a row number skipto and comment or ignoreemptyrows are provided, the data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the data row will actually be the next non-commented row.
  • footerskip::Integer: number of rows at the end of a file to skip parsing. Do note that commented rows (see the comment keyword argument) do not count towards the row number provided for footerskip, they are completely ignored by the parser
  • transpose::Bool: read a csv file "transposed", i.e. each column is parsed as a row
  • comment::String: string that will cause rows that begin with it to be skipped while parsing. Note that if a row number header or skipto and comment are provided, the header/data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header/data row will actually be the next non-commented row.
  • ignoreemptyrows::Bool=true: whether empty rows in a file should be ignored (if false, each column will be assigned missing for that empty row)
  • select: an AbstractVector of Integer, Symbol, String, or Bool, or a "selector" function of the form (i, name) -> keep::Bool; only columns in the collection or for which the selector function returns true will be parsed and accessible in the resulting CSV.File. Invalid values in select are ignored.
  • drop: inverse of select; an AbstractVector of Integer, Symbol, String, or Bool, or a "drop" function of the form (i, name) -> drop::Bool; columns in the collection or for which the drop function returns true will be ignored in the resulting CSV.File. Invalid values in drop are ignored.
  • limit: an Integer to indicate a limited number of rows to parse in a csv file; use in combination with skipto to read a specific, contiguous chunk within a file; note for large files when multiple threads are used for parsing, the limit argument may not result in an exact # of rows parsed; use ntasks=1 to ensure an exact limit if necessary
  • ntasks::Integer=Threads.nthreads(): [not applicable to CSV.Rows] for multithreaded parsed files, this controls the number of tasks spawned to read a file in concurrent chunks; defaults to the # of threads Julia was started with (i.e. JULIA_NUM_THREADS environment variable or julia -t N); setting ntasks=1 will avoid any calls to Threads.@spawn and just read the file serially on the main thread; a single thread will also be used for smaller files by default (< 5_000 cells)
  • rows_to_check::Integer=30: [not applicable to CSV.Rows] a multithreaded parsed file will be split up into ntasks # of equal chunks; rows_to_check controls the # of rows checked to ensure parsing correctly found valid rows; for certain files with very large quoted text fields, rows_to_check may need to be higher to ensure parsing correctly finds these rows

Parsing options:

  • missingstring: either a String, or Vector{String} to use as sentinel values that will be parsed as missing; by default, only an empty field (two consecutive delimiters) is considered missing
  • delim=',': a Char or String that indicates how columns are delimited in a file; if no argument is provided, parsing will try to detect the most consistent delimiter on the first 10 rows of the file
  • ignorerepeated::Bool=false: whether repeated (consecutive/sequential) delimiters should be ignored while parsing; useful for fixed-width files with delimiter padding between cells
  • quoted::Bool=true: whether parsing should check for quotechar at the start/end of cells
  • quotechar='"', openquotechar, closequotechar: a Char (or different start and end characters) that indicate a quoted field which may contain textual delimiters or newline characters
  • escapechar='"': the Char used to escape quote characters in a quoted field
  • dateformat::Union{String, Dates.DateFormat, Nothing, AbstractDict}: a date format string to indicate how Date/DateTime columns are formatted for the entire file; if given as an AbstractDict, date format strings to indicate how the Date/DateTime columns corresponding to the keys are formatted. The Dict can map column index Int, or name Symbol or String to the format string for that column.
  • decimal='.': a Char indicating how decimals are separated in floats, i.e. 3.14 uses '.', or 3,14 uses a comma ','
  • truestrings, falsestrings: Vector{String}s that indicate how true or false values are represented; by default "true", "True", "TRUE", "T", "1" are used to detect true and "false", "False", "FALSE", "F", "0" are used to detect false; note that columns with only 1 and 0 values will default to Int64 column type unless explicitly requested to be Bool via types keyword argument

Column Type Options:

  • types: a single Type, AbstractVector or AbstractDict of types to be used for column types; if a single Type is provided, all columns will be parsed with that single type; an AbstractDict can map column index Integer, or name Symbol or String to type for a column, i.e. Dict(1=>Float64) will set the first column as a Float64, Dict(:column1=>Float64) will set the column named column1 to Float64, and Dict("column1"=>Float64) will likewise set the column named column1 to Float64; if a Vector is provided, it must match the # of columns provided or detected in header
  • typemap::Dict{Type, Type}: a mapping of a type that should be replaced in every instance with another type, i.e. Dict(Float64=>String) would change every detected Float64 column to be parsed as String; only "standard" types are allowed to be mapped to another type, i.e. Int64, Float64, Date, DateTime, Time, and Bool. If a column of one of those types is "detected", it will be mapped to the specified type.
  • pool::Union{Bool, Real, AbstractVector, AbstractDict}=0.25: [not supported by CSV.Rows] controls whether columns will be built as PooledArray; if true, all columns detected as String will be pooled; alternatively, the proportion of unique values below which String columns should be pooled (by default 0.25, meaning that if the # of unique strings in a column is under 25.0%, it will be pooled); if an AbstractVector, each element should be Bool or Real and the # of elements should match the # of columns in the dataset; if an AbstractDict, a Bool or Real value can be provided for individual columns where the dict key is given as column index Integer, or column name as Symbol or String
  • downcast::Bool=false: controls whether columns detected as Int64 will be "downcast" to the smallest possible integer type like Int8, Int16, Int32, etc.
  • stringtype=WeakRefStrings.InlineString: controls how detected string columns will ultimately be returned; default is InlineString, which stores string data in a fixed-size primitive type that helps avoid excessive heap memory usage; if a column has values longer than 32 bytes, it will default to String. If String is passed, all string columns will just be normal String values. If PosLenString is passed, string columns will be returned as PosLenStringVector, which is a special "lazy" AbstractVector that acts as a "view" into the original file data. This can lead to the most efficient parsing times, but note that the "view" nature of PosLenStringVector makes it read-only, so operations like push!, append!, or setindex! are not supported. It also keeps a reference to the entire input dataset source, so trying to modify or delete the underlying file, for example, may fail
  • strict::Bool=false: whether invalid values should throw a parsing error or be replaced with missing
  • silencewarnings::Bool=false: if strict=false, whether invalid value warnings should be silenced
  • maxwarnings::Int=100: if more than maxwarnings number of warnings are printed while parsing, further warnings will be silenced by default; for multithreaded parsing, each parsing task will print up to maxwarnings
  • debug::Bool=false: passing true will result in many informational prints while a dataset is parsed; can be useful when reporting issues or figuring out what is going on internally while a dataset is parsed

Iteration options:

  • reusebuffer=false: [only supported by CSV.Rows] while iterating, whether a single row buffer should be allocated and reused on each iteration; only use if each row will be iterated once and not re-used (e.g. it's not safe to use this option if doing collect(CSV.Rows(file)) because only current iterated row is "valid")
CSV.Chunks (Type)
CSV.Chunks(source; ntasks::Integer=Threads.nthreads(), kwargs...) => CSV.Chunks

Returns a file "chunk" iterator. Accepts all the same inputs and keyword arguments as CSV.File, see those docs for explanations of each keyword argument.

The ntasks keyword argument specifies how many chunks a file should be split up into, defaulting to the # of threads available to Julia (i.e. JULIA_NUM_THREADS environment variable) or 8 if Julia is run single-threaded.

Each iteration of CSV.Chunks produces the next chunk of a file as a CSV.File. While initial file metadata detection is done only once (to determine # of columns, column names, etc), each iteration does independent type inference on columns. This is significant as different chunks may end up with different column types than previous chunks as new values are encountered in the file. Note that, as with CSV.File, types may be passed manually via the type or types keyword arguments.

This functionality is new and thus considered experimental; please open an issue if you run into any problems/bugs.
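
For example, a minimal sketch of processing a hypothetical large_file.csv chunk-by-chunk:

using CSV, DataFrames

for chunk in CSV.Chunks("large_file.csv"; ntasks=8)
    # each chunk is a CSV.File, so it works with any Tables.jl sink
    df = DataFrame(chunk)
    # ... process df before moving on to the next chunk ...
end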

Arguments

File layout options:

  • header=1: how column names should be determined; if given as an Integer, indicates the row to parse for column names; as an AbstractVector{<:Integer}, indicates a set of rows to be concatenated together as column names; Vector{Symbol} or Vector{String} give column names explicitly (should match # of columns in dataset); if a dataset doesn't have column names, either provide them as a Vector, or set header=0 or header=false and column names will be auto-generated (Column1, Column2, etc.). Note that if a row number header and comment or ignoreemptyrows are provided, the header row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header row will actually be the next non-commented row.
  • normalizenames::Bool=false: whether column names should be "normalized" into valid Julia identifier symbols; useful when using the tbl.col1 getproperty syntax or iterating rows and accessing column values of a row via getproperty (e.g. row.col1)
  • skipto::Integer: specifies the row where the data starts in the csv file; by default, the next row after the header row(s) is used. If header=0, then the 1st row is assumed to be the start of data; providing a skipto argument does not affect the header argument. Note that if a row number skipto and comment or ignoreemptyrows are provided, the data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the data row will actually be the next non-commented row.
  • footerskip::Integer: number of rows at the end of a file to skip parsing. Do note that commented rows (see the comment keyword argument) do not count towards the row number provided for footerskip, they are completely ignored by the parser
  • transpose::Bool: read a csv file "transposed", i.e. each column is parsed as a row
  • comment::String: string that will cause rows that begin with it to be skipped while parsing. Note that if a row number header or skipto and comment are provided, the header/data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header/data row will actually be the next non-commented row.
  • ignoreemptyrows::Bool=true: whether empty rows in a file should be ignored (if false, each column will be assigned missing for that empty row)
  • select: an AbstractVector of Integer, Symbol, String, or Bool, or a "selector" function of the form (i, name) -> keep::Bool; only columns in the collection or for which the selector function returns true will be parsed and accessible in the resulting CSV.File. Invalid values in select are ignored.
  • drop: inverse of select; an AbstractVector of Integer, Symbol, String, or Bool, or a "drop" function of the form (i, name) -> drop::Bool; columns in the collection or for which the drop function returns true will be ignored in the resulting CSV.File. Invalid values in drop are ignored.
  • limit: an Integer to indicate a limited number of rows to parse in a csv file; use in combination with skipto to read a specific, contiguous chunk within a file; note for large files when multiple threads are used for parsing, the limit argument may not result in an exact # of rows parsed; use ntasks=1 to ensure an exact limit if necessary
  • ntasks::Integer=Threads.nthreads(): [not applicable to CSV.Rows] for multithreaded parsed files, this controls the number of tasks spawned to read a file in concurrent chunks; defaults to the # of threads Julia was started with (i.e. JULIA_NUM_THREADS environment variable or julia -t N); setting ntasks=1 will avoid any calls to Threads.@spawn and just read the file serially on the main thread; a single thread will also be used for smaller files by default (< 5_000 cells)
  • rows_to_check::Integer=30: [not applicable to CSV.Rows] a multithreaded parsed file will be split up into ntasks # of equal chunks; rows_to_check controls the # of rows checked to ensure parsing correctly found valid rows; for certain files with very large quoted text fields, rows_to_check may need to be higher to ensure parsing correctly finds these rows

Parsing options:

  • missingstring: either a String, or Vector{String} to use as sentinel values that will be parsed as missing; by default, only an empty field (two consecutive delimiters) is considered missing
  • delim=',': a Char or String that indicates how columns are delimited in a file; if no argument is provided, parsing will try to detect the most consistent delimiter on the first 10 rows of the file
  • ignorerepeated::Bool=false: whether repeated (consecutive/sequential) delimiters should be ignored while parsing; useful for fixed-width files with delimiter padding between cells
  • quoted::Bool=true: whether parsing should check for quotechar at the start/end of cells
  • quotechar='"', openquotechar, closequotechar: a Char (or different start and end characters) that indicate a quoted field which may contain textual delimiters or newline characters
  • escapechar='"': the Char used to escape quote characters in a quoted field
  • dateformat::Union{String, Dates.DateFormat, Nothing, AbstractDict}: a date format string to indicate how Date/DateTime columns are formatted for the entire file; if given as an AbstractDict, date format strings to indicate how the Date/DateTime columns corresponding to the keys are formatted. The Dict can map column index Int, or name Symbol or String to the format string for that column.
  • decimal='.': a Char indicating how decimals are separated in floats, i.e. 3.14 uses '.', or 3,14 uses a comma ','
  • truestrings, falsestrings: Vector{String}s that indicate how true or false values are represented; by default "true", "True", "TRUE", "T", "1" are used to detect true and "false", "False", "FALSE", "F", "0" are used to detect false; note that columns with only 1 and 0 values will default to Int64 column type unless explicitly requested to be Bool via types keyword argument

Column Type Options:

  • types: a single Type, AbstractVector or AbstractDict of types to be used for column types; if a single Type is provided, all columns will be parsed with that single type; an AbstractDict can map column index Integer, or name Symbol or String to type for a column, i.e. Dict(1=>Float64) will set the first column as a Float64, Dict(:column1=>Float64) will set the column named column1 to Float64, and Dict("column1"=>Float64) will likewise set the column named column1 to Float64; if a Vector is provided, it must match the # of columns provided or detected in header
  • typemap::Dict{Type, Type}: a mapping of a type that should be replaced in every instance with another type, i.e. Dict(Float64=>String) would change every detected Float64 column to be parsed as String; only "standard" types are allowed to be mapped to another type, i.e. Int64, Float64, Date, DateTime, Time, and Bool. If a column of one of those types is "detected", it will be mapped to the specified type.
  • pool::Union{Bool, Real, AbstractVector, AbstractDict}=0.25: [not supported by CSV.Rows] controls whether columns will be built as PooledArray; if true, all columns detected as String will be pooled; alternatively, the proportion of unique values below which String columns should be pooled (by default 0.25, meaning that if the # of unique strings in a column is under 25.0%, it will be pooled); if an AbstractVector, each element should be Bool or Real and the # of elements should match the # of columns in the dataset; if an AbstractDict, a Bool or Real value can be provided for individual columns where the dict key is given as column index Integer, or column name as Symbol or String
  • downcast::Bool=false: controls whether columns detected as Int64 will be "downcast" to the smallest possible integer type like Int8, Int16, Int32, etc.
  • stringtype=WeakRefStrings.InlineString: controls how detected string columns will ultimately be returned; default is InlineString, which stores string data in a fixed-size primitive type that helps avoid excessive heap memory usage; if a column has values longer than 32 bytes, it will default to String. If String is passed, all string columns will just be normal String values. If PosLenString is passed, string columns will be returned as PosLenStringVector, which is a special "lazy" AbstractVector that acts as a "view" into the original file data. This can lead to the most efficient parsing times, but note that the "view" nature of PosLenStringVector makes it read-only, so operations like push!, append!, or setindex! are not supported. It also keeps a reference to the entire input dataset source, so trying to modify or delete the underlying file, for example, may fail
  • strict::Bool=false: whether invalid values should throw a parsing error or be replaced with missing
  • silencewarnings::Bool=false: if strict=false, whether invalid value warnings should be silenced
  • maxwarnings::Int=100: if more than maxwarnings number of warnings are printed while parsing, further warnings will be silenced by default; for multithreaded parsing, each parsing task will print up to maxwarnings
  • debug::Bool=false: passing true will result in many informational prints while a dataset is parsed; can be useful when reporting issues or figuring out what is going on internally while a dataset is parsed

Iteration options:

  • reusebuffer=false: [only supported by CSV.Rows] while iterating, whether a single row buffer should be allocated and reused on each iteration; only use if each row will be iterated once and not re-used (e.g. it's not safe to use this option if doing collect(CSV.Rows(file)) because only current iterated row is "valid")
CSV.Rows (Type)
CSV.Rows(source; kwargs...) => CSV.Rows

Read a csv input returning a CSV.Rows object.

The source argument can be one of:

  • filename given as a string or FilePaths.jl type
  • an AbstractVector{UInt8} like a byte buffer or codeunits(string)
  • an IOBuffer

To read a csv file from a url, use the HTTP.jl package, where the HTTP.Response body can be passed like:

f = CSV.Rows(HTTP.get(url).body)

For other IO or Cmd inputs, you can pass them like: f = CSV.Rows(read(obj)).

While similar to CSV.File, CSV.Rows provides a slightly different interface; the tradeoffs include:

  • Very minimal memory footprint; while iterating, only the current row values are buffered
  • Only provides row access via iteration; to access columns, one can stream the rows into a table type
  • Performs no type inference; each column/cell is essentially treated as Union{String, Missing}, users can utilize the performant Parsers.parse(T, str) to convert values to a more specific type if needed, or pass types upon construction using the type or types keyword arguments

Opens the file and uses passed arguments to detect the number of columns, but not column types (column types default to String unless otherwise manually provided). The returned CSV.Rows object supports the Tables.jl interface and can iterate rows. Each row object supports propertynames, getproperty, and getindex to access individual row values. Note that duplicate column names will be detected and adjusted to ensure uniqueness (duplicate column name a will become a_1). For example, one could iterate over a csv file with column names a, b, and c by doing:

for row in CSV.Rows(file)
    println("a=$(row.a), b=$(row.b), c=$(row.c)")
end
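
Because values are returned as strings, the Parsers.parse(T, str) mentioned above can convert them on demand; a minimal sketch, assuming data.csv has an integer column a and a float column b (Parsers.jl is a dependency of CSV.jl):

using CSV, Parsers

for row in CSV.Rows("data.csv")
    a = Parsers.parse(Int, row.a)
    b = Parsers.parse(Float64, row.b)
    # ... work with the typed values ...
end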

Arguments

File layout options:

  • header=1: how column names should be determined; if given as an Integer, indicates the row to parse for column names; as an AbstractVector{<:Integer}, indicates a set of rows to be concatenated together as column names; Vector{Symbol} or Vector{String} give column names explicitly (should match # of columns in dataset); if a dataset doesn't have column names, either provide them as a Vector, or set header=0 or header=false and column names will be auto-generated (Column1, Column2, etc.). Note that if a row number header and comment or ignoreemptyrows are provided, the header row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header row will actually be the next non-commented row.
  • normalizenames::Bool=false: whether column names should be "normalized" into valid Julia identifier symbols; useful when using the tbl.col1 getproperty syntax or iterating rows and accessing column values of a row via getproperty (e.g. row.col1)
  • skipto::Integer: specifies the row where the data starts in the csv file; by default, the next row after the header row(s) is used. If header=0, then the 1st row is assumed to be the start of data; providing a skipto argument does not affect the header argument. Note that if a row number skipto and comment or ignoreemptyrows are provided, the data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the data row will actually be the next non-commented row.
  • footerskip::Integer: number of rows at the end of a file to skip parsing. Do note that commented rows (see the comment keyword argument) do not count towards the row number provided for footerskip, they are completely ignored by the parser
  • transpose::Bool: read a csv file "transposed", i.e. each column is parsed as a row
  • comment::String: string that will cause rows that begin with it to be skipped while parsing. Note that if a row number header or skipto and comment are provided, the header/data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header/data row will actually be the next non-commented row.
  • ignoreemptyrows::Bool=true: whether empty rows in a file should be ignored (if false, each column will be assigned missing for that empty row)
  • select: an AbstractVector of Integer, Symbol, String, or Bool, or a "selector" function of the form (i, name) -> keep::Bool; only columns in the collection or for which the selector function returns true will be parsed and accessible in the resulting CSV.File. Invalid values in select are ignored.
  • drop: inverse of select; an AbstractVector of Integer, Symbol, String, or Bool, or a "drop" function of the form (i, name) -> drop::Bool; columns in the collection or for which the drop function returns true will be ignored in the resulting CSV.File. Invalid values in drop are ignored.
  • limit: an Integer to indicate a limited number of rows to parse in a csv file; use in combination with skipto to read a specific, contiguous chunk within a file; note for large files when multiple threads are used for parsing, the limit argument may not result in an exact # of rows parsed; use ntasks=1 to ensure an exact limit if necessary
  • ntasks::Integer=Threads.nthreads(): [not applicable to CSV.Rows] for multithreaded parsed files, this controls the number of tasks spawned to read a file in concurrent chunks; defaults to the # of threads Julia was started with (i.e. JULIA_NUM_THREADS environment variable or julia -t N); setting ntasks=1 will avoid any calls to Threads.@spawn and just read the file serially on the main thread; a single thread will also be used for smaller files by default (< 5_000 cells)
  • rows_to_check::Integer=30: [not applicable to CSV.Rows] a multithreaded parsed file will be split up into ntasks # of equal chunks; rows_to_check controls the # of rows checked to ensure parsing correctly found valid rows; for certain files with very large quoted text fields, rows_to_check may need to be higher to ensure parsing correctly finds these rows

Parsing options:

  • missingstring: either a String, or Vector{String} to use as sentinel values that will be parsed as missing; by default, only an empty field (two consecutive delimiters) is considered missing
  • delim=',': a Char or String that indicates how columns are delimited in a file; if no argument is provided, parsing will try to detect the most consistent delimiter on the first 10 rows of the file
  • ignorerepeated::Bool=false: whether repeated (consecutive/sequential) delimiters should be ignored while parsing; useful for fixed-width files with delimiter padding between cells
  • quoted::Bool=true: whether parsing should check for quotechar at the start/end of cells
  • quotechar='"', openquotechar, closequotechar: a Char (or different start and end characters) that indicate a quoted field which may contain textual delimiters or newline characters
  • escapechar='"': the Char used to escape quote characters in a quoted field
  • dateformat::Union{String, Dates.DateFormat, Nothing, AbstractDict}: a date format string to indicate how Date/DateTime columns are formatted for the entire file; if given as an AbstractDict, date format strings to indicate how the Date/DateTime columns corresponding to the keys are formatted. The Dict can map column index Int, or name Symbol or String to the format string for that column.
  • decimal='.': a Char indicating how decimals are separated in floats, i.e. 3.14 uses '.', or 3,14 uses a comma ','
  • truestrings, falsestrings: Vector{String}s that indicate how true or false values are represented; by default "true", "True", "TRUE", "T", "1" are used to detect true and "false", "False", "FALSE", "F", "0" are used to detect false; note that columns with only 1 and 0 values will default to Int64 column type unless explicitly requested to be Bool via types keyword argument

Column Type Options:

  • types: a single Type, AbstractVector or AbstractDict of types to be used for column types; if a single Type is provided, all columns will be parsed with that single type; an AbstractDict can map column index Integer, or name Symbol or String to type for a column, i.e. Dict(1=>Float64) will set the first column as a Float64, Dict(:column1=>Float64) will set the column named column1 to Float64, and Dict("column1"=>Float64) will likewise set the column named column1 to Float64; if a Vector is provided, it must match the # of columns provided or detected in header
  • typemap::Dict{Type, Type}: a mapping of a type that should be replaced in every instance with another type, i.e. Dict(Float64=>String) would change every detected Float64 column to be parsed as String; only "standard" types are allowed to be mapped to another type, i.e. Int64, Float64, Date, DateTime, Time, and Bool. If a column of one of those types is "detected", it will be mapped to the specified type.
  • pool::Union{Bool, Real, AbstractVector, AbstractDict}=0.25: [not supported by CSV.Rows] controls whether columns will be built as PooledArray; if true, all columns detected as String will be pooled; alternatively, the proportion of unique values below which String columns should be pooled (by default 0.25, meaning that if the # of unique strings in a column is under 25.0%, it will be pooled); if an AbstractVector, each element should be Bool or Real and the # of elements should match the # of columns in the dataset; if an AbstractDict, a Bool or Real value can be provided for individual columns where the dict key is given as column index Integer, or column name as Symbol or String
  • downcast::Bool=false: controls whether columns detected as Int64 will be "downcast" to the smallest possible integer type like Int8, Int16, Int32, etc.
  • stringtype=WeakRefStrings.InlineString: controls how detected string columns will ultimately be returned; default is InlineString, which stores string data in a fixed-size primitive type that helps avoid excessive heap memory usage; if a column has values longer than 32 bytes, it will default to String. If String is passed, all string columns will just be normal String values. If PosLenString is passed, string columns will be returned as PosLenStringVector, which is a special "lazy" AbstractVector that acts as a "view" into the original file data. This can lead to the most efficient parsing times, but note that the "view" nature of PosLenStringVector makes it read-only, so operations like push!, append!, or setindex! are not supported. It also keeps a reference to the entire input dataset source, so trying to modify or delete the underlying file, for example, may fail
  • strict::Bool=false: whether invalid values should throw a parsing error or be replaced with missing
  • silencewarnings::Bool=false: if strict=false, whether invalid value warnings should be silenced
  • maxwarnings::Int=100: if more than maxwarnings number of warnings are printed while parsing, further warnings will be silenced by default; for multithreaded parsing, each parsing task will print up to maxwarnings
  • debug::Bool=false: passing true will result in many informational prints while a dataset is parsed; can be useful when reporting issues or figuring out what is going on internally while a dataset is parsed

Iteration options:

  • reusebuffer=false: [only supported by CSV.Rows] while iterating, whether a single row buffer should be allocated and reused on each iteration; only use if each row will be iterated once and not re-used (e.g. it's not safe to use this option if doing collect(CSV.Rows(file)) because only current iterated row is "valid")
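
For example, a minimal sketch of streaming rows with a reused buffer (data.csv and its column a are hypothetical):

using CSV

# safe because each row is fully processed before the next iteration
for row in CSV.Rows("data.csv"; reusebuffer=true)
    println(row.a)
end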
CSV.write (Function)
CSV.write(file, table; kwargs...) => file
table |> CSV.write(file; kwargs...) => file

Write a Tables.jl interface input to a csv file, given as an IO argument or String/FilePaths.jl type representing the file name to write to. Alternatively, CSV.RowWriter creates a row iterator, producing a csv-formatted string for each row in an input table.
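
For example, a minimal sketch of writing a table (out.csv is a hypothetical output path):

using CSV, DataFrames

df = DataFrame(a = 1:3, b = ["x", "y", "z"])

# write to a new file
CSV.write("out.csv", df)

# append more rows without writing the column names again
CSV.write("out.csv", df; append=true)

# write to an in-memory IO with a tab delimiter
io = IOBuffer()
CSV.write(io, df; delim='\t')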

Supported keyword arguments include:

  • bufsize::Int=2^22: The length of the buffer to use when writing each csv-formatted row; default 4MB; if a row is larger than the bufsize an error is thrown
  • delim::Union{Char, String}=',': a character or string to print out as the file's delimiter
  • quotechar::Char='"': ascii character to use for quoting text fields that may contain delimiters or newlines
  • openquotechar::Char: instead of quotechar, use openquotechar and closequotechar to support different starting and ending quote characters
  • escapechar::Char='"': ascii character used to escape quote characters in a text field
  • missingstring::String="": string to print for missing values
  • dateformat=Dates.default_format(T): the date format string to use for printing out Date & DateTime columns
  • append=false: whether to append to an existing file/IO; if true, column names will not be written by default
  • writeheader=!append: whether to write an initial row of delimited column names, not written by default if appending
  • header: pass a list of column names (Symbols or Strings) to use instead of the column names of the input table
  • newline='\n': character or string to use to separate rows (lines in the csv file)
  • quotestrings=false: whether to force all strings to be quoted or not
  • decimal='.': character to use as the decimal point when writing floating point numbers
  • transform=(col,val)->val: a function that is applied to every cell e.g. we can transform all nothing values to missing using (col, val) -> something(val, missing)
  • bom=false: whether to write a UTF-8 BOM header (0xEF 0xBB 0xBF) or not
  • partition::Bool=false: by passing true, the table argument is expected to implement Tables.partitions and the file argument can either be an indexable collection of IO, file Strings, or a single file String that will have an index appended to the name
CSV.RowWriter (Type)
CSV.RowWriter(table; kwargs...)

Creates an iterator that produces csv-formatted strings for each row in the input table.
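
For example, a minimal sketch of iterating the csv-formatted strings produced for a small table:

using CSV, DataFrames

df = DataFrame(a = 1:2, b = ["x", "y"])

for row in CSV.RowWriter(df)
    print(row)   # each `row` is a csv-formatted String
end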

Supported keyword arguments include:

  • bufsize::Int=2^22: The length of the buffer to use when writing each csv-formatted row; default 4MB; if a row is larger than the bufsize an error is thrown
  • delim::Union{Char, String}=',': a character or string to print out as the file's delimiter
  • quotechar::Char='"': ascii character to use for quoting text fields that may contain delimiters or newlines
  • openquotechar::Char: instead of quotechar, use openquotechar and closequotechar to support different starting and ending quote characters
  • escapechar::Char='"': ascii character used to escape quote characters in a text field
  • missingstring::String="": string to print for missing values
  • dateformat=Dates.default_format(T): the date format string to use for printing out Date & DateTime columns
  • header: pass a list of column names (Symbols or Strings) to use instead of the column names of the input table
  • newline='\n': character or string to use to separate rows (lines in the csv file)
  • quotestrings=false: whether to force all strings to be quoted or not
  • decimal='.': character to use as the decimal point when writing floating point numbers
  • transform=(col,val)->val: a function that is applied to every cell e.g. we can transform all nothing values to missing using (col, val) -> something(val, missing)
  • bom=false: whether to write a UTF-8 BOM header (0xEF 0xBB 0xBF) or not

Examples

Basic

File

col1,col2,col3,col4,col5,col6,col7,col8
,1,1.0,1,one,2019-01-01,2019-01-01T00:00:00,true
,2,2.0,2,two,2019-01-02,2019-01-02T00:00:00,false
,3,3.0,3.14,three,2019-01-03,2019-01-03T00:00:00,true

Syntax

CSV.File(file)

By default, CSV.File will automatically detect this file's delimiter ',', and the type of each column. By default, it treats "empty fields" as missing (the entire first column in this example). It also automatically handles promoting types, like the 4th column, where the first two values are Int, but the 3rd row has a Float64 value (3.14). The resulting column's type will be Float64. Parsing can detect Int64, Float64, Date, DateTime, Time and Bool types, with String as the fallback type for any column.

Auto-Delimiter Detection

File

col1|col2
1|2
3|4

Syntax

CSV.File(file)

By default, CSV.File will try to detect a file's delimiter from the first 10 lines of the file; candidate delimiters include ',', '\t', ' ', '|', ';', and ':'. If it can't auto-detect the delimiter, it will assume ','. If your file includes a different character or string delimiter, just pass delim=X where X is the character or string. For this file you could also do CSV.File(file; delim='|').

String Delimiter

File

col1::col2
1::2
3::4

Syntax

CSV.File(file; delim="::")

In this example, our file has fields separated by the string "::"; we can pass this as the delim keyword argument.

No Header

File

1,2,3
4,5,6
7,8,9

Syntax

CSV.File(file; header=false)
CSV.File(file; header=["col1", "col2", "col3"])
CSV.File(file; header=[:col1, :col2, :col3])

In this file, there is no header row that contains column names. In the first option, we pass header=false, and column names will be generated like [:Column1, :Column2, :Column3]. In the latter two examples, we pass our own explicit column names, either as Strings or Symbols.

Normalize Column Names

File

column one,column two, column three
1,2,3
4,5,6

Syntax

CSV.File(file; normalizenames=true)

In this file, our column names have spaces in them. It can be convenient with a CSV.File or DataFrame to access entire columns via property access, e.g. if f = CSV.File(file) with column names like [:col1, :col2], I can access the entire first column of the file like f.col1, or for the second, f.col2. The call of f.col1 actually gets rewritten to the function call getproperty(f, :col1), which is the function implemented in CSV.jl that returns the col1 column from the file. When a column name is not a single valid Julia identifier, this is inconvenient, because f.column one is not valid, so I would have to manually call getproperty(f, Symbol("column one")). normalizenames=true comes to our rescue; it will replace invalid identifier characters with underscores to ensure each column is a valid Julia identifier, so for this file, we would end up with column names like [:column_one, :column_two]. You can call propertynames(f) on any CSV.File to see the parsed column names.
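
A quick sketch of the resulting property access (using the file above):

f = CSV.File(file; normalizenames=true)

f.column_one                  # entire first column via property access
getproperty(f, :column_two)   # the lowered form of f.column_two
propertynames(f)              # inspect the normalized column names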

Datarow

File

col1,col2,col3
metadata1,metadata2,metadata3
extra1,extra2,extra3
1,2,3
4,5,6
7,8,9

Syntax

CSV.File(file; skipto=4)

This file has extra rows in between our header row col1,col2,col3 and the start of our data 1,2,3 on row 4. We can use the skipto keyword argument to provide a row number where the "data" of our file begins.

Reading Chunks

File

col1,col2,col3
1,2,3
4,5,6
7,8,9
10,11,12
13,14,15
16,17,18
19,20,21

Syntax

CSV.File(file; limit=3)
CSV.File(file; skipto=4, limit=1)
CSV.File(file; skipto=7, footerskip=1)

In this example, we desire to only read a subset of rows from the file. Using the limit, skipto, and footerskip keyword arguments, we can specify the exact rows we wish to parse.

Transposed Data

File

col1,1,2,3
col2,4,5,6
col3,7,8,9

Syntax

CSV.File(file; transpose=true)

This file has the column names in the first column, and data that extends along rows horizontally. The data for col1 is all on the first row, and similarly for col2, whose data is on row 2. In this case, we wish to read the file "transposed", i.e. treating rows as columns. By passing transpose=true, CSV.jl will read column names from the first column, and the data for each column from its corresponding row.

Commented Rows

File

col1,col2,col3
# this row is commented and we'd like to ignore it while parsing
1,2,3
4,5,6

Syntax

CSV.File(file; comment="#")
CSV.File(file; skipto=3)

This file has some rows that begin with the "#" string and denote breaks in the data for commentary. We wish to ignore these rows for purposes of reading data. We can pass comment="#" and parsing will ignore any row that begins with this string. Alternatively, we can pass skipto=3 for this example specifically since there is only the one row to skip.

Missing Strings

File

code,age,score
0,21,3.42
1,42,6.55
-999,81,NA
-999,83,NA

Syntax

CSV.File(file; missingstring="-999")
CSV.File(file; missingstring=["-999", "NA"])

In this file, our code column has two expected codes, 0 and 1, but also a few "invalid" codes, which are input as -999. We'd like to read the column as Int64, but treat the -999 values as "missing" values. By passing missingstring="-999", we signal that this value should be replaced with the literal missing value built into the Julia language. We can then do things like skipmissing(f.code) to ignore those values, for example. In the second syntax example, we also want to treat the NA values in our score column as missing, so we pass both strings like missingstring=["-999", "NA"].
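
A small sketch of working with the resulting missing values (using the file above):

using Statistics

f = CSV.File(file; missingstring=["-999", "NA"])

mean(f.age)                  # no missing values in this column
mean(skipmissing(f.score))   # ignore the missing scores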

Fixed Width Files

File

col1    col2 col3
123431  2    3421
2355    346  7543

Syntax

CSV.File(file; delim=' ', ignorerepeated=true)

This is an example of a "fixed width" file, where each column is the same number of characters away from each other on each row. This is different from a normal delimited file where each occurence of a delimiter indicates a separate field. With fixed width, however, fields are "padded" with extra delimiters (in this case ' ') so that each column is the same number of characters each time. In addition to our delim, we can pass ignorerepeated=true, which tells parsing that consecutive delimiters should be treated as a single delimiter.

Quoted & Escaped Fields

File

col1,col2
"quoted field with a delimiter , inside","quoted field that contains a \\n newline and ""inner quotes"""
unquoted field,unquoted field with "inner quotes"

Syntax

CSV.File(file; quotechar='"', escapechar='"')
CSV.File(file; openquotechar='"', closequotechar='"', escapechar='"')

In this file, we have a few "quoted" fields, which means the field's value starts and ends with quotechar (or openquotechar and closequotechar, respectively). Quoted fields allow the field to contain characters that would otherwise be significant to parsing, such as delimiters or newline characters. When quoted, parsing will ignore these otherwise signficant characters until the closing quote character is found. For quoted fields that need to also include the quote character itself, an escape character is provided to tell parsing to ignore the next character when looking for a close quote character. In the syntax examples, the keyword arguments are passed explicitly, but these also happen to be the default values, so just doing CSV.File(file) would result in successful parsing.

DateFormat

File

code,date
0,2019/01/01
1,2019/01/02

Syntax

CSV.File(file; dateformat="yyyy/mm/dd")

In this file, our date column has dates that are formatted like yyyy/mm/dd. We can pass just such a string to the dateformat keyword argument to tell parsing to use it when looking for Date or DateTime columns. Note that a single dateformat string applies to the entire file; to use different formats for different columns, pass an AbstractDict mapping column index or name to a format string, as described for the dateformat keyword argument above.

Custom Decimal Separator

File

col1;col2;col3
1,01;2,02;3,03
4,04;5,05;6,06

Syntax

CSV.File(file; delim=';', decimal=',')

In many places in the world, floating point number decimals are separated with a comma instead of a period (3,14 vs. 3.14). We can correctly parse these numbers by passing in the decimal=',' keyword argument. Note that we probably need to explicitly pass delim=';' in this case, since the parser will probably think that it detected ',' as the delimiter.

Custom Bool Strings

File

id,paid,attended
0,T,TRUE
1,F,TRUE
2,T,FALSE
3,F,FALSE

Syntax

CSV.File(file; truestrings=["T", "TRUE"], falsestrings=["F", "FALSE"])

By default, parsing considers a limited set of strings as valid Bool values (see the truestrings and falsestrings defaults above). To control exactly which strings are treated as true or false, we can pass a Vector{String} to the truestrings and falsestrings keyword arguments.

Matrix-like Data

File

1.0 0.0 0.0
0.0 1.0 0.0
0.0 0.0 1.0

Syntax

CSV.File(file; header=false)
CSV.File(file; header=false, delim=' ', types=Float64)

This file contains a 3x3 identity matrix of Float64. By default, parsing will detect the delimiter and type, but we can also explicitly pass delim=' ' and types=Float64, which tells parsing to treat each column as Float64, without having to guess the type on its own.
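
If an actual Matrix is desired, the parsed file is a Tables.jl table and can be converted with Tables.matrix; a minimal sketch:

using CSV, Tables

m = Tables.matrix(CSV.File(file; header=false, delim=' '))   # 3x3 Matrix{Float64}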

Providing Types

File

col1,col2,col3
1,2,3
4,5,invalid
6,7,8

Syntax

CSV.File(file; types=Dict(3 => Int))
CSV.File(file; types=Dict(:col3 => Int))
CSV.File(file; types=Dict("col3" => Int))
CSV.File(file; types=[Int, Int, Int])
CSV.File(file; types=[Int, Int, Int], silencewarnings=true)
CSV.File(file; types=[Int, Int, Int], strict=true)

In this file, our 3rd column has an invalid value on the 2nd row, invalid. Let's imagine we'd still like to treat it as an Int column, and ignore the invalid value. The syntax examples provide several ways we can tell parsing to treat the 3rd column as Int, by referring to column index 3, or column name with Symbol or String. We can also provide an entire Vector of types for each column (which must match the number of columns in the file). There are two additional keyword arguments that control parsing behavior; in the first 4 syntax examples, we would see a warning printed like "warning: invalid Int64 value on row 2, column 3". In the fifth example, passing silencewarnings=true will suppress these warnings. In the last syntax example, passing strict=true will result in an error being thrown during parsing.

Typemap

File

zipcode,score
03494,9.9
12345,6.7
84044,3.4

Syntax

CSV.File(file; typemap=Dict(Int => String))
CSV.File(file; types=Dict(:zipcode => String))

In this file, we have U.S. zipcodes in the first column that we'd rather not treat as Int, but parsing will detect it as such. In the first syntax example, we pass typemap=Dict(Int => String), which tells parsing to treat any detected Int columns as String instead. In the second syntax example, we alternatively set the zipcode column type manually.

Pooled Values

File

id,code
A18E9,AT
BF392,GC
93EBC,AT
54EE1,AT
8CD2E,GC

Syntax

CSV.File(file)
CSV.File(file; pool=0.4)
CSV.File(file; pool=0.6)

In this file, we have an id column and a code column. There can be advantages with various DataFrame/table operations like joining and grouping when String values are "pooled", meaning each unique value is mapped to a UInt64. By default, pool=0.25, so string columns with low cardinality are pooled by default. Via the pool keyword argument, we can provide greater control: pool=0.4 means that if 40% or fewer of a column's values are unique, then it will be pooled.

Select/Drop Columns From File

File

a,b,c
1,2,3
4,5,6
7,8,9

Syntax

# select
CSV.File(file; select=[1, 3])
CSV.File(file; select=[:a, :c])
CSV.File(file; select=["a", "c"])
CSV.File(file; select=[true, false, true])
CSV.File(file; select=(i, nm) -> i in (1, 3))
# drop
CSV.File(file; drop=[2])
CSV.File(file; drop=[:b])
CSV.File(file; drop=["b"])
CSV.File(file; drop=[false, true, false])
CSV.File(file; drop=(i, nm) -> i == 2)

For this file, we have columns a, b, and c; we might only be interested in the data in columns a and c. The select and drop keyword arguments allow columns to be chosen efficiently from a file; columns not selected or dropped will be skipped while parsing, allowing for performance boosts. The argument to select or drop can be one of: an AbstractVector{Int}, a collection of column indices; an AbstractVector{Symbol} or AbstractVector{String}, a collection of column names as Symbol or String; an AbstractVector{Bool}, a collection of Bool equal in length to the # of columns signaling whether a column should be selected or dropped; or a selector/drop function of the form (i, name) -> keep_or_drop::Bool, i.e. it takes a column index i and column name name and returns a Bool signaling whether a column should be selected or dropped.

Non-UTF-8 character encodings

Like Julia in general, CSV.jl interprets strings as being encoded in UTF-8. The StringEncodings package has to be used to read or write CSV files in other character encodings.

Example: writing to and reading from a file encoded in ISO-8859-1

using CSV, DataFrames, StringEncodings

# writing to ISO-8859-1 file
a = DataFrame(a = ["café", "noël"])
open("a.csv", enc"ISO-8859-1", "w") do io
    CSV.write(io, a)
end

# reading from ISO-8859-1 file
CSV.File(open(read, "a.csv", enc"ISO-8859-1")) |> DataFrame

# alternative: reencode data to UTF-8 in a new file and read from it
open("a2.csv", "w") do io
    foreach(x -> println(io, x), eachline("a.csv", enc"ISO-8859-1"))
end
CSV.File("a2.csv") |> DataFrame

Reencoding to a new file as in the last example above avoids storing an additional copy of the data in memory, which may be useful for large files that do not fit in RAM.

Reading CSV from gzip (.gz) and zip files

Example: reading from a gzip (.gz) file

using CSV, DataFrames, CodecZlib, Mmap
a = DataFrame(a = 1:3)
CSV.write("a.csv", a)

# Windows users who do not have gzip available on the PATH should manually gzip the CSV
;gzip a.csv

a_copy = CSV.File(transcode(GzipDecompressor, Mmap.mmap("a.csv.gz"))) |> DataFrame

a == a_copy # true; restored successfully

Example: reading from a zip file

using ZipFile, CSV, DataFrames

a = DataFrame(a = 1:3)
CSV.write("a.csv", a)

# zip the file; Windows users who do not have zip available on the PATH can manually zip the CSV
# or write directly into the zip archive as shown below
;zip a.zip a.csv

# alternatively, write directly into the zip archive (without creating an unzipped csv file first)
z = ZipFile.Writer("a2.zip")
f = ZipFile.addfile(z, "a.csv", method=ZipFile.Deflate)
a |> CSV.write(f)
close(z)

# read file from zip archive
z = ZipFile.Reader("a.zip") # or "a2.zip"

# identify the right file in zip
a_file_in_zip = filter(x->x.name == "a.csv", z.files)[1]

a_copy = CSV.File(read(a_file_in_zip)) |> DataFrame

a == a_copy