Reading
This section goes through the various inputs/options supported by CSV.File/CSV.read, with notes about compatibility with the other reading functionality (CSV.Rows, CSV.Chunks, etc.).
input
A required argument for reading. Input data should be ASCII or UTF-8 encoded text; for other text encodings, use the StringEncodings.jl package to convert to UTF-8.
Any delimited input is ultimately converted to a byte buffer (Vector{UInt8}) for parsing/processing, so with that in mind, let's look at the various supported input types:
- File name as a String or FilePath; parsing will call Mmap.mmap(string(file)) to get a byte buffer to the file data. For gzip compressed inputs, like file.gz, the CodecZlib.jl package will be used to decompress the data to a temporary file first, then mmapped to a byte buffer. Decompression can also be done in memory by passing buffer_in_memory=true. Note that only gzip-compressed data is automatically decompressed; for other forms of compressed data, seek out the appropriate package to decompress and pass an IO or Vector{UInt8} of decompressed data as input.
- Vector{UInt8} or SubArray{UInt8, 1, Vector{UInt8}}: if you already have a byte buffer from wherever, you can pass it in directly. If you have a csv-formatted string, you can pass it like CSV.File(IOBuffer(str)).
- IO or Cmd: you can pass an IO or Cmd directly, which will be consumed into a temporary file, then mmapped as a byte vector; to avoid a temp file and instead buffer data in memory, pass buffer_in_memory=true.
- For files from the web, you can call HTTP.get(url).body to request the file, then access the data as a Vector{UInt8} from the body field, which can be passed directly for parsing. For Julia 1.6+, you can also use the Downloads stdlib, like Downloads.download(url), which can be passed to parsing.
Examples
- StringEncodings.jl example
- Vector of inputs example
- Gzip input
- Delimited data in a string
- Data from the web
- Data in zip archive
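For instance, a few of these input forms in action (the data.csv path is hypothetical):
using CSV
# a csv-formatted string wrapped in an IOBuffer
f = CSV.File(IOBuffer("a,b,c\n1,2,3"))
# a raw byte buffer
f = CSV.File(Vector{UInt8}(codeunits("a,b,c\n1,2,3")))
# a file name on disk
f = CSV.File("data.csv")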
header
The header keyword argument controls how column names are treated when processing files. By default, it is assumed that the column names are the first row/line of the input, i.e. header=1. Alternative valid arguments for header include:
- Integer, e.g. header=2: provide the row number as an Integer where the column names can be found
- Bool, e.g. header=false: no column names exist in the data; column names will be auto-generated depending on the # of columns, like Column1, Column2, etc.
- Vector{String} or Vector{Symbol}: manually provide column names as strings or symbols; should match the # of columns in the data. A copy of the Vector will be made and converted to Vector{Symbol}
- AbstractVector{<:Integer}: in rare cases, there may be multi-row headers; by passing a collection of row numbers, each row will be parsed and the values for each row will be concatenated to form the final column names
Examples
- Column names on second row
- No column names in the data
- Manually provide column names
- Multi-row column names
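A quick sketch of each form, using small in-memory inputs:
using CSV
# column names on the second row
f = CSV.File(IOBuffer("file metadata\na,b,c\n1,2,3"); header=2)
# no column names in the data; Column1, Column2, Column3 are auto-generated
f = CSV.File(IOBuffer("1,2,3"); header=false)
# manually provide column names
f = CSV.File(IOBuffer("1,2,3"); header=["x", "y", "z"])
# multi-row header: rows 1 and 2 are concatenated into names like a_1, b_2, c_3
f = CSV.File(IOBuffer("a,b,c\n1,2,3\n4,5,6"); header=[1, 2])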
normalizenames
Controls whether column names will be "normalized" to valid Julia identifiers. By default, this is false. If normalizenames=true, then column names with spaces, or that start with numbers, will be adjusted with underscores to become valid Julia identifiers. This is useful when you want to access columns via dot-access or getproperty, like file.col1. The identifier that comes after the . must be valid, so spaces or identifiers starting with numbers aren't allowed.
Examples
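For example:
using CSV
f = CSV.File(IOBuffer("my col,2nd col\n1,2"); normalizenames=true)
f.my_col   # normalized from "my col"
f._2nd_col # normalized from "2nd col"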
skipto
An Integer can be provided that specifies the row number where the data is located. By default, the row immediately following the header row is assumed to be the start of data. If header=false, or column names are provided manually as Vector{String} or Vector{Symbol}, the data is assumed to start on row 1, i.e. skipto=1.
Examples
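For example, when column names are on row 1 but the data doesn't begin until row 3:
using CSV
f = CSV.File(IOBuffer("a,b,c\nnotes about the data\n1,2,3"); skipto=3)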
footerskip
An Integer argument specifying the number of rows to ignore at the end of a file. This works by the parser starting at the end of the file and parsing in reverse until footerskip # of rows have been parsed, then parsing the entire file, stopping at the newly adjusted "end of file".
Examples
transpose
If transpose=true is passed, data will be read "transposed", so each row will be parsed as a column, and each column in the data will be returned as a row. Useful when data is extremely wide (many columns), but you want to process it in a "long" format (many rows). Note that multithreaded parsing is not supported when parsing is transposed.
Examples
comment
A String argument that, when encountered at the start of a row while parsing, will cause the row to be skipped. When providing header, skipto, or footerskip arguments, it should be noted that commented rows, while ignored, still count as "rows" when skipping to a specific row. In this way, you can visually identify, for example, that column names are on row 6, and pass header=6, even if row 5 is a commented row and will be ignored.
Examples
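For example:
using CSV
data = "# file metadata\na,b,c\n# a commented data row\n1,2,3"
f = CSV.File(IOBuffer(data); comment="#")  # header found on row 2; only 1,2,3 parsed as data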
ignoreemptyrows
This argument specifies whether "empty rows", where consecutive newlines are parsed, should be ignored or not. By default, they are. If ignoreemptyrows=false, then for an empty row, all existing columns will have missing assigned to their value for that row. Similar to commented rows, empty rows also still count as "rows" when any of the header, skipto, or footerskip arguments are provided.
Examples
select / drop
Arguments that control which columns from the input data will actually be parsed and available after processing. select controls which columns will be accessible after parsing, while drop controls which columns to ignore. Either argument can be provided as a vector of Integer, String, or Symbol, specifying the column numbers or names to include/exclude. A vector of Bool matching the number of columns in the input data can also be provided, where each element specifies whether the corresponding column should be included/excluded. Finally, these arguments can also be given as boolean functions, of the form (i, name) -> Bool, where each column number and name will be given as arguments and the result of the function will determine if the column will be included/excluded.
Examples
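A sketch of the three forms:
using CSV
data = "a,b,c\n1,2,3\n4,5,6"
# keep only columns a and c, by name
f = CSV.File(IOBuffer(data); select=[:a, :c])
# drop the 2nd column, by number
f = CSV.File(IOBuffer(data); drop=[2])
# selector function: keep only columns whose name starts with "a"
f = CSV.File(IOBuffer(data); select=(i, name) -> startswith(String(name), "a"))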
limit
An Integer argument to specify the number of rows that should be read from the data. Can be used in conjunction with skipto to read contiguous chunks of a file. Note that with multithreaded parsing (when the data is deemed large enough), it can be difficult for parsing to determine the exact # of rows to limit to, so it may or may not return exactly limit number of rows. To ensure an exact limit on larger files, also pass ntasks=1 to force single-threaded parsing.
Examples
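For example:
using CSV
# read only the first 2 data rows; on large files, also pass ntasks=1 for an exact limit
f = CSV.File(IOBuffer("a,b\n1,2\n3,4\n5,6"); limit=2)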
ntasks
NOTE: not applicable to CSV.Rows
For large enough data inputs, ntasks controls the number of multithreaded tasks used to concurrently parse the data. By default, it uses Threads.nthreads(), which is the number of threads the julia process was started with, either via julia -t N or the JULIA_NUM_THREADS environment variable. To avoid multithreaded parsing, even on large files, pass ntasks=1. This argument is only applicable to CSV.File, not CSV.Rows. For CSV.Chunks, it controls the total number of chunk iterations a large file will be split up into for parsing.
rows_to_check
NOTE: not applicable to CSV.Rows
When input data is large enough, parsing will attempt to "chunk" up the data for multithreaded tasks to parse concurrently. To chunk up the data, it is split into even chunks, then initial parsers attempt to identify the correct start of the first row of that chunk. Once the start of the chunk's first row is found, each parser will check rows_to_check number of rows to ensure the expected number of columns are present.
source
NOTE: only applicable to a vector of inputs passed to CSV.File
A Symbol, String, or Pair of Symbol or String to Vector. As a single Symbol or String, provides the column name that will be added to the parsed columns; the values of the column will be the input "name" (usually the file name) of the input from whence each value was parsed. As a Pair, the 2nd part of the pair should be a Vector of values matching the length of the # of inputs, where each value will be used instead of the input name for that input's values in the auto-added column.
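A sketch, with hypothetical file names:
using CSV
# adds a :month column, where rows from jan.csv get the value "jan.csv", etc.
f = CSV.File(["jan.csv", "feb.csv"]; source=:month)
# or provide custom values, one per input
f = CSV.File(["jan.csv", "feb.csv"]; source=:month => ["jan", "feb"])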
missingstring
Argument to control how missing values are handled while parsing input data. The default is missingstring="", which means two consecutive delimiters, like ,,, will result in a cell being set as a missing value. Otherwise, you can pass a single string to use as a "sentinel", like missingstring="NA", or a vector of strings, where each will be checked for when parsing, like missingstring=["NA", "NAN", "NULL"], and if any match, the cell will be set to missing. By passing missingstring=nothing, no missing values will be checked for while parsing.
Examples
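For example:
using CSV
# "NA" and "NULL" cells become missing
f = CSV.File(IOBuffer("a,b\n1,NA\nNULL,4"); missingstring=["NA", "NULL"])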
delim
A Char or String argument that parsing looks for in the data input that separates distinct columns on each row. If no argument is provided (the default), parsing will try to detect the most consistent delimiter on the first 10 rows of the input, falling back to a single comma (,) if no other delimiter can be detected consistently.
Examples
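For example:
using CSV
# tab-delimited input
f = CSV.File(IOBuffer("a\tb\n1\t2"); delim='\t')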
ignorerepeated
A Bool argument, default false, that, if set to true, will cause parsing to ignore any number of consecutive delimiters between columns. This option can often be used to accurately parse fixed-width data inputs, where columns are delimited with a fixed number of delimiters, or a row is fixed-width and columns may have a variable number of delimiters between them based on the length of cell values.
Examples
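For example, for fixed-width data padded with spaces:
using CSV
data = "a  b  c\n1  2  3\n10 20 30"
f = CSV.File(IOBuffer(data); delim=' ', ignorerepeated=true)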
quoted
A Bool argument that controls whether parsing will check for opening/closing quote characters at the start/end of cells. Default true. If you happen to know a file has no quoted cells, it can simplify parsing to pass quoted=false, so parsing avoids treating the quotechar or openquotechar/closequotechar arguments specially.
Examples
quotechar / openquotechar / closequotechar
An ASCII Char argument (or arguments if both openquotechar and closequotechar are provided) that parsing uses to handle "quoted" cells. If a cell string value contains the delim argument, or a newline, it should start and end with quotechar, or start with openquotechar and end with closequotechar, so parsing knows to treat the delim or newline as part of the cell value instead of as significant parsing characters. If the quotechar or closequotechar characters also need to appear in the cell value, they should be properly escaped via the escapechar argument.
Examples
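For example:
using CSV
# a cell containing the delimiter must be quoted
f = CSV.File(IOBuffer("a,b\n\"hey, there\",2"))
# embedded quotes escaped with a backslash escapechar
f = CSV.File(IOBuffer("a,b\n\"she said \\\"hi\\\"\",2"); escapechar='\\')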
escapechar
An ASCII Char argument that parsing uses when parsing quoted cells and the quotechar or closequotechar characters appear in a cell string value. If the escapechar character is encountered inside a quoted cell, it will be "skipped", and the following character will not be checked for parsing significance, but just treated as another character in the value of the cell. Note the escapechar is not included in the value of the cell, but is ignored completely.
dateformat
A String or AbstractDict argument that controls how parsing detects datetime values in the data input. As a single String (or DateFormat) argument, the same format will be applied to all columns in the file. For columns without type information provided otherwise, parsing will use the provided format string to check if the cell is parseable, and if so, will attempt to parse the entire column as the datetime type (Time, Date, or DateTime). By default, if no dateformat argument is explicitly provided, parsing will try to detect any of Time, Date, or DateTime types following the standard Dates.ISOTimeFormat, Dates.ISODateFormat, or Dates.ISODateTimeFormat formats, respectively. If a datetime type is provided for a column (see the types argument), then the dateformat format string needs to match the format of values in that column, otherwise a warning will be emitted and the value will be replaced with a missing value (this behavior is also configurable via the strict and silencewarnings arguments). If an AbstractDict is provided, different dateformat strings can be provided for specific columns; the provided dict can map either an Integer for column number, or a String, Symbol, or Regex for column name, to the dateformat string that should be used for that column. Columns not mapped in the dict argument will use the default format strings mentioned above.
Examples
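For example:
using CSV
# one format applied to all datetime columns
f = CSV.File(IOBuffer("d\n12/24/2021"); dateformat="mm/dd/yyyy")
# per-column formats via a dict keyed on column name
f = CSV.File(IOBuffer("d,t\n12/24/2021,23:01:02"); dateformat=Dict(:d => "mm/dd/yyyy"))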
decimal
An ASCII Char argument that is used when parsing float values that indicates where the fractional portion of the float value begins. I.e. for the value 3.14 (pi truncated to two decimal places), the '.' character separates the whole part 3 from the fractional part 14, whereas in 3,14 (common European notation), the ',' character separates the fractional portion. By default, decimal='.'.
Examples
groupmark / thousands separator
A "groupmark" is a symbol that separates groups of digits so that it is easier for humans to read a number. Thousands separators are a common example of groupmarks. The argument groupmark, if provided, must be an ASCII Char which will be ignored during parsing when it occurs between two digits on the left hand side of the decimal. E.g. the groupmark in the integer 1,729 is ',' and the groupmark for the US social security number 875-39-3196 is -. By default, groupmark=nothing, which indicates that there are no stray characters separating digits.
Examples
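A sketch of both decimal and groupmark, using ';' as the delimiter so ',' is free to play other roles:
using CSV
# European-style decimals: 3,14 parses as 3.14
f = CSV.File(IOBuffer("x;y\n3,14;1,5"); delim=';', decimal=',')
# thousands separators: 1,000 parses as 1000
f = CSV.File(IOBuffer("amount\n1,000\n22,500"); delim=';', groupmark=',')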
truestrings / falsestrings
These arguments can be provided as Vector{String} to specify custom values that should be treated as the Bool true/false values for all the columns of a data input. By default, ["true", "True", "TRUE", "T", "1"] string values are used to detect true values, and ["false", "False", "FALSE", "F", "0"] string values are used to detect false values. Note that even though "1" and "0" can be used to parse true/false values, in terms of auto detecting column types, those values will be parsed as Int64 first, instead of Bool. To instead parse those values as Bools for a column, you can manually provide that column's type as Bool (see the types argument).
Examples
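For example:
using CSV
# "1"/"0" columns detect as Int64 by default; request Bool explicitly
f = CSV.File(IOBuffer("flag\n1\n0"); types=Dict(:flag => Bool))
# custom true/false strings
f = CSV.File(IOBuffer("flag\nyes\nno"); truestrings=["yes"], falsestrings=["no"])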
types
Argument to control the types of columns that get parsed in the data input. Can be provided as a single Type, an AbstractVector of types, an AbstractDict, or a function.
- If a single type is provided, like types=Float64, then all columns in the data input will be parsed as Float64. If a column's value isn't a valid Float64 value, a warning will be emitted, unless silencewarnings=true is passed, in which case no warning will be printed. However, if strict=true is passed, then an error will be thrown instead, regardless of the silencewarnings argument.
- If an AbstractVector{Type} is provided, then the length of the vector should match the number of columns in the data input, and each element gives the type of the corresponding column in order.
- If an AbstractDict, then specific columns can have their column type specified, with the key of the dict being an Integer for column number, a String or Symbol for column name, or a Regex matching column names, and the dict value being the column type. Unspecified columns will have their column type auto-detected while parsing.
- If a function, it should be of the form (i, name) -> Union{T, Nothing}, and will be applied to each detected column during initial parsing. Returning nothing from the function will result in the column's type being automatically detected during parsing.
By default types=nothing, which means all column types in the data input will be detected while parsing. Note that it isn't necessary to pass types=Union{Float64, Missing} if the data input contains missing values. Parsing will detect missing values if present, and promote any manually provided column types from the singular (Float64) to the missing equivalent (Union{Float64, Missing}) automatically. Standard types will be auto-detected in the following order when not otherwise specified: Int64, Float64, Date, DateTime, Time, Bool, String.
Non-standard types can be provided, like Dec64 from the DecFP.jl package, but they must support the Base.tryparse(T, str) function for parsing a value from a string. This allows, for example, easily defining a custom type, like struct Float64Array; values::Vector{Float64}; end, as long as a corresponding Base.tryparse definition is defined, like Base.tryparse(::Type{Float64Array}, str) = Float64Array(map(x -> parse(Float64, x), split(str, ';'))), where a single cell in the data input is like 1.23;4.56;7.89.
Note that the default stringtype can be overridden by providing a column's type manually, like CSV.File(source; types=Dict(1 => String), stringtype=PosLenString), where the first column will be parsed as a String, while any other string columns will have the PosLenString type.
Examples
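A sketch of a few forms, including the custom Float64Array type described above:
using CSV
# parse all columns as Float64
f = CSV.File(IOBuffer("a,b\n1,2"); types=Float64)
# per-column types, by index and by name
f = CSV.File(IOBuffer("a,b\n1,2"); types=Dict(1 => Float64, :b => String))
# a custom type, parseable via a Base.tryparse definition
struct Float64Array
    values::Vector{Float64}
end
Base.tryparse(::Type{Float64Array}, str) = Float64Array(map(x -> parse(Float64, x), split(str, ';')))
f = CSV.File(IOBuffer("a\n1.23;4.56;7.89"); delim=',', types=Float64Array)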
typemap
An AbstractDict{Type, Type} argument that allows replacing a non-String standard type with another type when a column's type is auto-detected. Most commonly, this would be used to force all numeric columns to be Float64, like typemap=IdDict(Int64 => Float64), which would cause any columns detected as Int64 to be parsed as Float64 instead. Another common case would be wanting all columns of a specific type to be parsed as strings instead, like typemap=IdDict(Date => String), which will cause any columns detected as Date to be parsed as String instead.
Examples
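For example:
using CSV
# any auto-detected Int64 column is parsed as Float64 instead
f = CSV.File(IOBuffer("a,b\n1,2.5"); typemap=IdDict(Int64 => Float64))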
pool
Argument that controls whether columns will be returned as PooledArrays. Can be provided as a Bool, Float64, Tuple{Float64, Int}, vector, dict, or a function of the form (i, name) -> Union{Bool, Real, Tuple{Float64, Int}, Nothing}. As a Bool, controls absolutely whether a column will be pooled or not; if passed as a single Bool argument like pool=true, then all string columns will be pooled, regardless of cardinality. When passed as a Float64, the value should be between 0.0 and 1.0 to indicate the threshold under which the % of unique values found in the column will result in the column being pooled. For example, if pool=0.1, then all string columns with a unique value % less than 10% will be returned as PooledArray, while other string columns will be normal string vectors. If pool is provided as a tuple, like (0.2, 500), the first tuple element is the same as a single Float64 value, which represents the % cardinality allowed. The second tuple element is an upper limit on the # of unique values allowed to pool the column. So, for example, pool=(0.2, 500) means that if a String column has 500 or fewer unique values and the # of unique values is less than 20% of the total # of values, it will be pooled; otherwise, it won't. As mentioned, when the pool argument is a single Bool, Real, or Tuple{Float64, Int}, only string columns will be considered for pooling. When a vector or dict is provided, the pooling for any column can be provided as a Bool, Float64, or Tuple{Float64, Int}. Similar to the types argument, providing a vector to pool should have an element for each column in the data input, while a dict argument can map column number/name to Bool, Float64, or Tuple{Float64, Int} for specific columns. Unspecified columns will not be pooled when the argument is a dict.
Examples
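For example:
using CSV
data = "x\na\nb\na\na"
# always pool string columns
f = CSV.File(IOBuffer(data); pool=true)
# pool only when at most 50% of values are unique and there are at most 500 unique values
f = CSV.File(IOBuffer(data); pool=(0.5, 500))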
downcast
A Bool argument that controls whether Integer detected column types will be "shrunk" to the smallest possible integer type. Argument is false by default. Only applies to auto-detected column types; i.e. if a column type is provided manually as Int64, it will not be shrunk. Useful for shrinking the overall memory footprint of parsed data, though care should be taken when processing the results, as Julia integer arithmetic wraps on overflow by default, which becomes increasingly likely the smaller the integer type.
stringtype
An argument that controls the precise type of string columns. Supported values are InlineString (the default), PosLenString, or String. The various string types are aimed at being mostly transparent to most users. In certain workflows, however, it can be advantageous to be more specific. Here's a quick rundown of the possible options:
- InlineString: a set of fixed-width, stack-allocated primitive types. Can take memory pressure off the GC because they aren't reference types/on the heap. For very large files with string columns that have a fairly low variance in string length, this can provide much better GC interaction than String. When string length has a high variance, it can lead to lots of "wasted space", since an entire column will be promoted to the smallest InlineString type that fits the longest string value. For small strings, that can mean a lot of wasted space when they're promoted to a high fixed-width.
- PosLenString: results in columns returned as PosLenStringVector (or ChainedVector{PosLenStringVector} for the multithreaded case), which holds a reference to the original input data, and acts as one large "view" vector into the original data where each cell begins/ends. Can result in the smallest memory footprint for string columns. PosLenStringVector, however, does not support traditional mutable operations like regular Vectors, like push!, append!, or deleteat!.
- String: each string must be heap-allocated, which can result in higher GC pressure in very large files. But columns are returned as normal Vector{String} (or ChainedVector{Vector{String}}), which can be processed normally, including any mutating operations.
strict / silencewarnings / maxwarnings
Arguments that control error behavior when invalid values are encountered while parsing. Only applicable when types are provided manually by the user via the types argument. If a column type is manually provided, but an invalid value is encountered, the default behavior is to set the value for that cell to missing and emit a warning (i.e. silencewarnings=false and strict=false), but only up to 100 total warnings, after which they'll be silenced (i.e. maxwarnings=100). If strict=true, then invalid values will result in an error being thrown instead of any warnings emitted.
debug
A Bool argument that controls the printing of extra "debug" information while parsing. Can be useful if parsing doesn't produce the expected result or a bug is suspected in parsing.
API Reference
CSV.read — Function
CSV.read(source, sink::T; kwargs...) => T
Read and parse a delimited file or files, materializing directly using the sink function. Allows avoiding excessive copies of columns for certain sinks like DataFrame.
Example
julia> using CSV, DataFrames
julia> path = tempname();
julia> write(path, "a,b,c\n1,2,3");
julia> CSV.read(path, DataFrame)
1×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
julia> CSV.read(path, DataFrame; header=false)
2×3 DataFrame
Row │ Column1 Column2 Column3
│ String1 String1 String1
─────┼───────────────────────────
1 │ a b c
2 │ 1 2 3
Arguments
File layout options:
- header=1: how column names should be determined; if given as an Integer, indicates the row to parse for column names; as an AbstractVector{<:Integer}, indicates a set of rows to be concatenated together as column names; Vector{Symbol} or Vector{String} give column names explicitly (should match the # of columns in the dataset); if a dataset doesn't have column names, either provide them as a Vector, or set header=0 or header=false and column names will be auto-generated (Column1, Column2, etc.). Note that if a row number header and comment or ignoreemptyrows are provided, the header row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header row will actually be the next non-commented row.
- normalizenames::Bool=false: whether column names should be "normalized" into valid Julia identifier symbols; useful when using the tbl.col1 getproperty syntax or iterating rows and accessing column values of a row via getproperty (e.g. row.col1)
- skipto::Integer: specifies the row where the data starts in the csv file; by default, the next row after the header row(s) is used. If header=0, then the 1st row is assumed to be the start of data; providing a skipto argument does not affect the header argument. Note that if a row number skipto and comment or ignoreemptyrows are provided, the data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the data row will actually be the next non-commented row.
- footerskip::Integer: number of rows at the end of a file to skip parsing. Do note that commented rows (see the comment keyword argument) do not count towards the row number provided for footerskip; they are completely ignored by the parser
- transpose::Bool: read a csv file "transposed", i.e. each column is parsed as a row
- comment::String: string that will cause rows that begin with it to be skipped while parsing. Note that if a row number header or skipto and comment are provided, the header/data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header/data row will actually be the next non-commented row.
- ignoreemptyrows::Bool=true: whether empty rows in a file should be ignored (if false, each column will be assigned missing for that empty row)
- select: an AbstractVector of Integer, Symbol, String, or Bool, or a "selector" function of the form (i, name) -> keep::Bool; only columns in the collection or for which the selector function returns true will be parsed and accessible in the resulting CSV.File. Invalid values in select are ignored.
- drop: inverse of select; an AbstractVector of Integer, Symbol, String, or Bool, or a "drop" function of the form (i, name) -> drop::Bool; columns in the collection or for which the drop function returns true will be ignored in the resulting CSV.File. Invalid values in drop are ignored.
- limit: an Integer to indicate a limited number of rows to parse in a csv file; use in combination with skipto to read a specific, contiguous chunk within a file; note that for large files when multiple threads are used for parsing, the limit argument may not result in an exact # of rows parsed; use ntasks=1 to ensure an exact limit if necessary
- buffer_in_memory: a Bool, default false, which controls whether a Cmd, IO, or gzipped source will be read/decompressed in memory vs. using a temporary file.
- ntasks::Integer=Threads.nthreads(): [not applicable to CSV.Rows] for multithreaded parsed files, this controls the number of tasks spawned to read a file in concurrent chunks; defaults to the # of threads Julia was started with (i.e. the JULIA_NUM_THREADS environment variable or julia -t N); setting ntasks=1 will avoid any calls to Threads.@spawn and just read the file serially on the main thread; a single thread will also be used for smaller files by default (< 5_000 cells)
- rows_to_check::Integer=30: [not applicable to CSV.Rows] a multithreaded parsed file will be split up into ntasks # of equal chunks; rows_to_check controls the # of rows that are checked to ensure parsing correctly found valid rows; for certain files with very large quoted text fields, rows_to_check may need to be higher (10, 30, etc.) to ensure parsing correctly finds these rows
- source: [only applicable for a vector of inputs to CSV.File] a Symbol, String, or Pair of Symbol or String to Vector. As a single Symbol or String, provides the column name that will be added to the parsed columns; the values of the column will be the input "name" (usually the file name) of the input from whence each value was parsed. As a Pair, the 2nd part of the pair should be a Vector of values matching the length of the # of inputs, where each value will be used instead of the input name for that input's values in the auto-added column.
Parsing options:
- missingstring: either nothing, a String, or a Vector{String} to use as sentinel values that will be parsed as missing; if nothing is passed, no sentinel/missing values will be parsed; by default, missingstring="", which means only an empty field (two consecutive delimiters) is considered missing
- delim=',': a Char or String that indicates how columns are delimited in a file; if no argument is provided, parsing will try to detect the most consistent delimiter on the first 10 rows of the file
- ignorerepeated::Bool=false: whether repeated (consecutive/sequential) delimiters should be ignored while parsing; useful for fixed-width files with delimiter padding between cells
- quoted::Bool=true: whether parsing should check for quotechar at the start/end of cells
- quotechar='"', openquotechar, closequotechar: a Char (or different start and end characters) that indicates a quoted field which may contain textual delimiters or newline characters
- escapechar='"': the Char used to escape quote characters in a quoted field
- dateformat::Union{String, Dates.DateFormat, Nothing, AbstractDict}: a date format string to indicate how Date/DateTime columns are formatted for the entire file; if given as an AbstractDict, date format strings to indicate how the Date/DateTime columns corresponding to the keys are formatted. The Dict can map column index Int, or name Symbol or String, to the format string for that column.
- decimal='.': a Char indicating how decimals are separated in floats, i.e. 3.14 uses '.', while 3,14 uses a comma ','
- groupmark=nothing: optionally specify a single-byte character denoting the number grouping mark; this allows parsing of numbers that have, e.g., thousands separators (1,000.00).
- truestrings, falsestrings: Vector{String}s that indicate how true or false values are represented; by default "true", "True", "TRUE", "T", "1" are used to detect true and "false", "False", "FALSE", "F", "0" are used to detect false; note that columns with only 1 and 0 values will default to the Int64 column type unless explicitly requested to be Bool via the types keyword argument
- stripwhitespace=false: if true, leading and trailing whitespace are stripped from string values, including column names
Column Type Options:
- types: a single Type, an AbstractVector or AbstractDict of types, or a function of the form (i, name) -> Union{T, Nothing} to be used for column types; if a single Type is provided, all columns will be parsed with that single type; an AbstractDict can map column index Integer, or name Symbol or String, to the type for a column, i.e. Dict(1=>Float64) will set the first column as a Float64, Dict(:column1=>Float64) will set the column named column1 to Float64, and Dict("column1"=>Float64) will set the column1 column to Float64; if a Vector is provided, it must match the # of columns provided or detected in header. If a function is provided, it takes a column index and name as arguments, and should return the desired column type for the column, or nothing to signal the column's type should be detected while parsing.
- typemap::IdDict{Type, Type}: a mapping of a type that should be replaced in every instance with another type, i.e. IdDict(Float64=>String) would change every detected Float64 column to be parsed as String; only "standard" types are allowed to be mapped to another type, i.e. Int64, Float64, Date, DateTime, Time, and Bool. If a column of one of those types is "detected", it will be mapped to the specified type.
- pool::Union{Bool, Real, AbstractVector, AbstractDict, Function, Tuple{Float64, Int}}=(0.2, 500): [not supported by CSV.Rows] controls whether columns will be built as PooledArray; if true, all columns detected as String will be pooled; alternatively, the proportion of unique values below which String columns should be pooled (meaning that if the # of unique strings in a column is under 25%, pool=0.25, it will be pooled). If provided as a Tuple{Float64, Int} like (0.2, 500), it represents the percent cardinality threshold as the 1st tuple element (0.2), and an upper limit for the # of unique values (500), under which the column will be pooled; this is the default (pool=(0.2, 500)). If an AbstractVector, each element should be Bool, Real, or Tuple{Float64, Int} and the # of elements should match the # of columns in the dataset; if an AbstractDict, a Bool, Real, or Tuple{Float64, Int} value can be provided for individual columns where the dict key is given as column index Integer, or column name as Symbol or String. If a function is provided, it should take a column index and name as 2 arguments, and return a Bool, Real, Tuple{Float64, Int}, or nothing for each column.
- downcast::Bool=false: controls whether columns detected as Int64 will be "downcast" to the smallest possible integer type like Int8, Int16, Int32, etc.
- stringtype=InlineStrings.InlineString: controls how detected string columns will ultimately be returned; the default is InlineString, which stores string data in a fixed-size primitive type that helps avoid excessive heap memory usage; if a column has values longer than 32 bytes, it will default to String. If String is passed, all string columns will just be normal String values. If PosLenString is passed, string columns will be returned as PosLenStringVector, which is a special "lazy" AbstractVector that acts as a "view" into the original file data. This can lead to the most efficient parsing times, but note that the "view" nature of PosLenStringVector makes it read-only, so operations like push!, append!, or setindex! are not supported. It also keeps a reference to the entire input dataset source, so trying to modify or delete the underlying file, for example, may fail
- strict::Bool=false: whether invalid values should throw a parsing error or be replaced with missing
- silencewarnings::Bool=false: if strict=false, whether invalid value warnings should be silenced
- maxwarnings::Int=100: if more than maxwarnings number of warnings are printed while parsing, further warnings will be silenced by default; for multithreaded parsing, each parsing task will print up to maxwarnings
- debug::Bool=false: passing true will result in many informational prints while a dataset is parsed; can be useful when reporting issues or figuring out what is going on internally while a dataset is parsed
- validate::Bool=true: whether or not to validate that columns specified in the types, dateformat, and pool keywords are actually found in the data. If false, no validation is done, meaning no error will be thrown if types/dateformat/pool specify settings for columns not actually found in the data.
Iteration options:
- reusebuffer=false: [only supported by CSV.Rows] while iterating, whether a single row buffer should be allocated and reused on each iteration; only use if each row will be iterated once and not re-used (e.g. it's not safe to use this option if doing collect(CSV.Rows(file)) because only the currently iterated row is "valid")
CSV.File — Type
CSV.File(input; kwargs...) => CSV.File
Read a UTF-8 CSV input and return a CSV.File object, which is like a lightweight table/dataframe, allowing dot-access to columns and iteration of rows. Satisfies the Tables.jl interface, so it can be passed to any valid sink; to avoid unnecessary copies of data, use CSV.read(input, sink; kwargs...) instead if the CSV.File intermediate object isn't needed.
The input argument can be one of:
- a filename given as a string or FilePaths.jl type
- a Vector{UInt8} or SubArray{UInt8, 1, Vector{UInt8}} byte buffer
- a CodeUnits object, which wraps a String, like codeunits(str)
- a csv-formatted string, which can be passed wrapped like IOBuffer(str)
- a Cmd or other IO
- a gzipped file (or gzipped data in any of the above), which will automatically be decompressed for parsing
- a Vector of any of the above, which will parse and vertically concatenate each source, returning a single, "long" CSV.File
To read a csv file from a url, use the Downloads.jl stdlib or the HTTP.jl package, where the resulting downloaded tempfile or HTTP.Response body can be passed like:
using Downloads, CSV
f = CSV.File(Downloads.download(url))
# or
using HTTP, CSV
f = CSV.File(HTTP.get(url).body)
Opens the file or files and uses passed arguments to detect the number of columns and column types, unless column types are provided manually via the types keyword argument. Note that passing column types manually can slightly increase performance for each column type provided (column types can be given as a Vector for all columns, or specified per column via name or index in a Dict).
When a Vector of inputs is provided, the column names and types of each separate file/input must match in order to be vertically concatenated. A separate thread will be used to parse each input, each parsing its input on a single thread. The results of all threads are then vertically concatenated using ChainedVectors to lazily concatenate each thread's columns.
For text encodings other than UTF-8, load the StringEncodings.jl package and call e.g. CSV.File(open(read, input, enc"ISO-8859-1")).
The returned CSV.File object supports the Tables.jl interface and can iterate rows, each a CSV.Row. CSV.Row supports propertynames and getproperty to access individual row values. CSV.File also supports entire column access like a DataFrame via direct property access on the file object, like f = CSV.File(file); f.col1, or by getindex access with column names, like f[:col1] or f["col1"]. The returned columns are AbstractArray subtypes, including: SentinelVector (for integers), regular Vector, PooledVector for pooled columns, MissingVector for columns of all missing values, PosLenStringVector when stringtype=PosLenString is passed, and ChainedVector, which chains one of the previous array types together for data inputs that use multiple threads to parse (each thread parses a single "chain" of the input). Note that duplicate column names will be detected and adjusted to ensure uniqueness (a duplicate column name a will become a_1). For example, one could iterate over a csv file with column names a, b, and c by doing:
for row in CSV.File(file)
println("a=$(row.a), b=$(row.b), c=$(row.c)")
end
By supporting the Tables.jl interface, a CSV.File can also be a table input to any other table sink function, like:
# materialize a csv file as a DataFrame, copying columns from CSV.File
df = CSV.File(file) |> DataFrame
# to avoid making a copy of parsed columns, use CSV.read
df = CSV.read(file, DataFrame)
# load a csv file directly into an sqlite database table
db = SQLite.DB()
tbl = CSV.File(file) |> SQLite.load!(db, "sqlite_table")
Arguments
CSV.File accepts the same keyword arguments as CSV.read; see the Arguments section under CSV.read above for the full descriptions of the file layout, parsing, column type, and iteration options.
CSV.Chunks — Type
CSV.Chunks(source; ntasks::Integer=Threads.nthreads(), kwargs...) => CSV.Chunks
Returns a file "chunk" iterator. Accepts all the same inputs and keyword arguments as CSV.File; see those docs for explanations of each keyword argument.
The ntasks keyword argument specifies how many chunks a file should be split up into, defaulting to the # of threads available to Julia (i.e. the JULIA_NUM_THREADS environment variable) or 8 if Julia is run single-threaded.
Each iteration of CSV.Chunks produces the next chunk of a file as a CSV.File. While initial file metadata detection is done only once (to determine the # of columns, column names, etc.), each iteration does independent type inference on columns. This is significant, as different chunks may end up with different column types than previous chunks as new values are encountered in the file. Note that, as with CSV.File, types may be passed manually via the types keyword argument.
This functionality is new and thus considered experimental; please open an issue if you run into any problems/bugs.
Arguments
File layout options:
header=1
: how column names should be determined; if given as anInteger
, indicates the row to parse for column names; as anAbstractVector{<:Integer}
, indicates a set of rows to be concatenated together as column names;Vector{Symbol}
orVector{String}
give column names explicitly (should match # of columns in dataset); if a dataset doesn't have column names, either provide them as aVector
, or setheader=0
orheader=false
and column names will be auto-generated (Column1
,Column2
, etc.). Note that if a row number header andcomment
orignoreemptyrows
are provided, the header row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header row will actually be the next non-commented row.normalizenames::Bool=false
: whether column names should be "normalized" into valid Julia identifier symbols; useful when using thetbl.col1
getproperty
syntax or iterating rows and accessing column values of a row viagetproperty
(e.g.row.col1
)skipto::Integer
: specifies the row where the data starts in the csv file; by default, the next row after theheader
row(s) is used. Ifheader=0
, then the 1st row is assumed to be the start of data; providing askipto
argument does not affect theheader
argument. Note that if a row numberskipto
andcomment
orignoreemptyrows
are provided, the data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the data row will actually be the next non-commented row.footerskip::Integer
: number of rows at the end of a file to skip parsing. Do note that commented rows (see thecomment
keyword argument) do not count towards the row number provided forfooterskip
, they are completely ignored by the parsertranspose::Bool
: read a csv file "transposed", i.e. each column is parsed as a rowcomment::String
: string that will cause rows that begin with it to be skipped while parsing. Note that if a row number header orskipto
andcomment
are provided, the header/data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header/data row will actually be the next non-commented row.ignoreemptyrows::Bool=true
: whether empty rows in a file should be ignored (iffalse
, each column will be assignedmissing
for that empty row)select
: anAbstractVector
ofInteger
,Symbol
,String
, orBool
, or a "selector" function of the form(i, name) -> keep::Bool
; only columns in the collection or for which the selector function returnstrue
will be parsed and accessible in the resultingCSV.File
. Invalid values inselect
are ignored.drop
: inverse ofselect
; anAbstractVector
ofInteger
,Symbol
,String
, orBool
, or a "drop" function of the form(i, name) -> drop::Bool
; columns in the collection or for which the drop function returnstrue
will ignored in the resultingCSV.File
. Invalid values indrop
are ignored.limit
: anInteger
to indicate a limited number of rows to parse in a csv file; use in combination withskipto
to read a specific, contiguous chunk within a file; note for large files when multiple threads are used for parsing, thelimit
argument may not result in an exact # of rows parsed; usentasks=1
to ensure an exact limit if necessarybuffer_in_memory
: aBool
, defaultfalse
, which controls whether aCmd
,IO
, or gzipped source will be read/decompressed in memory vs. using a temporary file.ntasks::Integer=Threads.nthreads()
: [not applicable toCSV.Rows
] for multithreaded parsed files, this controls the number of tasks spawned to read a file in concurrent chunks; defaults to the # of threads Julia was started with (i.e.JULIA_NUM_THREADS
environment variable orjulia -t N
); settingntasks=1
will avoid any calls toThreads.@spawn
and just read the file serially on the main thread; a single thread will also be used for smaller files by default (< 5_000 cells)rows_to_check::Integer=30
: [not applicable toCSV.Rows
] a multithreaded parsed file will be split up intontasks
# of equal chunks;rows_to_check
controls the # of rows are checked to ensure parsing correctly found valid rows; for certain files with very large quoted text fields,lines_to_check
may need to be higher (10, 30, etc.) to ensure parsing correctly finds these rowssource
: [only applicable for vector of inputs toCSV.File
] aSymbol
,String
, orPair
ofSymbol
orString
toVector
. As a singleSymbol
orString
, provides the column name that will be added to the parsed columns, the values of the column will be the input "name" (usually file name) of the input from whence the value was parsed. As aPair
, the 2nd part of the pair should be aVector
of values matching the length of the # of inputs, where each value will be used instead of the input name for that inputs values in the auto-added column.
Parsing options:
- `missingstring`: either `nothing`, a `String`, or a `Vector{String}` of sentinel values that will be parsed as `missing`; if `nothing` is passed, no sentinel/missing values will be parsed; by default, `missingstring=""`, which means only an empty field (two consecutive delimiters) is considered `missing`.
- `delim=','`: a `Char` or `String` that indicates how columns are delimited in a file; if no argument is provided, parsing will try to detect the most consistent delimiter on the first 10 rows of the file.
- `ignorerepeated::Bool=false`: whether repeated (consecutive/sequential) delimiters should be ignored while parsing; useful for fixed-width files with delimiter padding between cells.
- `quoted::Bool=true`: whether parsing should check for `quotechar` at the start/end of cells.
- `quotechar='"'`, `openquotechar`, `closequotechar`: a `Char` (or different start and end characters) that indicates a quoted field which may contain textual delimiters or newline characters.
- `escapechar='"'`: the `Char` used to escape quote characters in a quoted field.
- `dateformat::Union{String, Dates.DateFormat, Nothing, AbstractDict}`: a date format string indicating how Date/DateTime columns are formatted for the entire file; if given as an `AbstractDict`, date format strings indicating how the Date/DateTime columns corresponding to the keys are formatted. The dict can map a column index `Int`, or a name `Symbol` or `String`, to the format string for that column (a combined sketch follows this list).
- `decimal='.'`: a `Char` indicating how decimals are separated in floats, e.g. `3.14` uses `'.'`, while `3,14` uses a comma `','`.
- `groupmark=nothing`: optionally specify a single-byte character denoting the number grouping mark; this allows parsing of numbers that have, e.g., thousands separators (`1,000.00`).
- `truestrings`, `falsestrings`: `Vector{String}`s that indicate how `true` or `false` values are represented; by default `"true", "True", "TRUE", "T", "1"` are used to detect `true` and `"false", "False", "FALSE", "F", "0"` are used to detect `false`; note that columns with only `1` and `0` values will default to the `Int64` column type unless explicitly requested to be `Bool` via the `types` keyword argument.
- `stripwhitespace=false`: if `true`, leading and trailing whitespace are stripped from string values, including column names.
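A minimal sketch combining several of the parsing options above; `data.csv` and its layout are hypothetical:

```julia
using CSV

# Hypothetical semicolon-delimited file:
#   date;value;note
#   02/01/2021;3.14;NA
f = CSV.File("data.csv";
    delim=';',                         # explicit delimiter instead of auto-detection
    missingstring=["NA", "N/A", ""],   # several sentinel values parsed as `missing`
    dateformat="dd/mm/yyyy")           # how Date columns are formatted in this file
```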
Column Type Options:
- `types`: a single `Type`, an `AbstractVector` or `AbstractDict` of types, or a function of the form `(i, name) -> Union{T, Nothing}` to be used for column types; if a single `Type` is provided, all columns will be parsed with that single type; an `AbstractDict` can map a column index `Integer`, or a name `Symbol` or `String`, to the type for a column, e.g. `Dict(1=>Float64)` will set the first column to `Float64`, `Dict(:column1=>Float64)` will set the column named `column1` to `Float64`, and `Dict("column1"=>Float64)` will do the same; if a `Vector` is provided, it must match the # of columns provided or detected in `header`. If a function is provided, it takes a column index and name as arguments, and should return the desired column type for the column, or `nothing` to signal the column's type should be detected while parsing (a combined sketch follows this list).
- `typemap::IdDict{Type, Type}`: a mapping of a type that should be replaced in every instance with another type, e.g. `IdDict(Float64=>String)` would change every detected `Float64` column to be parsed as `String`; only "standard" types are allowed to be mapped to another type, i.e. `Int64`, `Float64`, `Date`, `DateTime`, `Time`, and `Bool`. If a column of one of those types is detected, it will be mapped to the specified type.
- `pool::Union{Bool, Real, AbstractVector, AbstractDict, Function, Tuple{Float64, Int}}=(0.2, 500)`: [not supported by `CSV.Rows`] controls whether columns will be built as `PooledArray`s; if `true`, all columns detected as `String` will be pooled; alternatively, a `Real` gives the proportion of unique values below which `String` columns should be pooled (e.g. with `pool=0.25`, a column is pooled if fewer than 25% of its values are unique). If provided as a `Tuple{Float64, Int}` like `(0.2, 500)`, the 1st tuple element (`0.2`) is the percent-cardinality threshold and the 2nd (`500`) an upper limit on the # of unique values, under which the column will be pooled; this is the default (`pool=(0.2, 500)`). If an `AbstractVector`, each element should be a `Bool`, `Real`, or `Tuple{Float64, Int}` and the # of elements should match the # of columns in the dataset; if an `AbstractDict`, a `Bool`, `Real`, or `Tuple{Float64, Int}` value can be provided for individual columns, where the dict key is a column index `Integer`, or a column name `Symbol` or `String`. If a function is provided, it should take a column index and name as 2 arguments, and return a `Bool`, `Real`, `Tuple{Float64, Int}`, or `nothing` for each column.
- `downcast::Bool=false`: controls whether columns detected as `Int64` will be "downcast" to the smallest possible integer type, like `Int8`, `Int16`, `Int32`, etc.
- `stringtype=InlineStrings.InlineString`: controls how detected string columns will ultimately be returned; the default is `InlineString`, which stores string data in a fixed-size primitive type that helps avoid excessive heap memory usage; if a column has values longer than 32 bytes, it will default to `String`. If `String` is passed, all string columns will just be normal `String` values. If `PosLenString` is passed, string columns will be returned as `PosLenStringVector`, which is a special "lazy" `AbstractVector` that acts as a "view" into the original file data. This can lead to the most efficient parsing times, but note that the "view" nature of `PosLenStringVector` makes it read-only, so operations like `push!`, `append!`, or `setindex!` are not supported. It also keeps a reference to the entire input dataset source, so trying to modify or delete the underlying file, for example, may fail.
- `strict::Bool=false`: whether invalid values should throw a parsing error or be replaced with `missing`.
- `silencewarnings::Bool=false`: if `strict=false`, whether invalid-value warnings should be silenced.
- `maxwarnings::Int=100`: if more than `maxwarnings` warnings are printed while parsing, further warnings will be silenced by default; for multithreaded parsing, each parsing task will print up to `maxwarnings`.
- `debug::Bool=false`: passing `true` will result in many informational prints while a dataset is parsed; can be useful when reporting issues or figuring out what is going on internally while a dataset is parsed.
- `validate::Bool=true`: whether to validate that columns specified in the `types`, `dateformat`, and `pool` keywords are actually found in the data. If `false`, no validation is done, meaning no error will be thrown if `types`/`dateformat`/`pool` specify settings for columns not actually found in the data.
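A minimal sketch combining several of the column-type options above; `data.csv` and the column names `id`, `score`, and `region` are hypothetical:

```julia
using CSV, Dates

f = CSV.File("data.csv";
    types=Dict(:id => Int64, "score" => Float64),  # per-column types by Symbol or String name
    typemap=IdDict(Date => String),                # any detected Date column is parsed as String instead
    pool=Dict(:region => 0.5),                     # pool `region` if < 50% of its values are unique
    downcast=true,                                 # e.g. a column of small ints becomes Int8
    stringtype=String)                             # plain String columns instead of InlineString
```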
Iteration options:
- `reusebuffer=false`: [only supported by `CSV.Rows`] while iterating, whether a single row buffer should be allocated and reused on each iteration; only use if each row will be iterated once and not kept afterwards (e.g. it's not safe to use this option when doing `collect(CSV.Rows(file))`, because only the currently iterated row is "valid"); see the sketch below.
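A minimal sketch of the `reusebuffer` caveat; `data.csv` is hypothetical:

```julia
using CSV

# Safe: each row is consumed before the next iteration overwrites the buffer.
for row in CSV.Rows("data.csv"; reusebuffer=true)
    println(row[1])
end

# Unsafe: all collected rows alias the same reused buffer, so earlier
# elements are silently overwritten by later iterations:
# rows = collect(CSV.Rows("data.csv"; reusebuffer=true))
```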
CSV.Rows — Type

```julia
CSV.Rows(source; kwargs...) => CSV.Rows
```

Read a csv input returning a `CSV.Rows` object.
The `input` argument can be one of:
- a filename given as a `String` or FilePaths.jl type
- a `Vector{UInt8}` or `SubArray{UInt8, 1, Vector{UInt8}}` byte buffer
- a `CodeUnits` object, which wraps a `String`, like `codeunits(str)`
- a csv-formatted string, which can also be passed like `IOBuffer(str)`
- a `Cmd` or other `IO`
- a gzipped file (or gzipped data in any of the above), which will automatically be decompressed for parsing
To read a csv file from a url, use the HTTP.jl package, where the `HTTP.Response` body can be passed like:

```julia
f = CSV.Rows(HTTP.get(url).body)
```

For other `IO` or `Cmd` inputs, you can pass them like: `f = CSV.Rows(read(obj))`.
While similar to `CSV.File`, `CSV.Rows` provides a slightly different interface, with tradeoffs including:
- Very minimal memory footprint; while iterating, only the current row's values are buffered
- Only provides row access via iteration; to access columns, one can stream the rows into a table type (see the streaming sketch at the end of this section)
- Performs no type inference; each column/cell is essentially treated as `Union{String, Missing}`. Users can utilize the performant `Parsers.parse(T, str)` to convert values to a more specific type if needed (see the sketch after the example below), or pass types upon construction using the `type` or `types` keyword arguments
Opens the file and uses passed arguments to detect the number of columns, **but not** column types (column types default to `String` unless otherwise manually provided). The returned `CSV.Rows` object supports the Tables.jl interface and can iterate rows. Each row object supports `propertynames`, `getproperty`, and `getindex` to access individual row values. Note that duplicate column names will be detected and adjusted to ensure uniqueness (a duplicate column name `a` will become `a_1`). For example, one could iterate over a csv file with column names `a`, `b`, and `c` by doing:

```julia
for row in CSV.Rows(file)
    println("a=$(row.a), b=$(row.b), c=$(row.c)")
end
```
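Building on the loop above, a minimal sketch of converting string cells with `Parsers.parse`; it assumes a hypothetical integer column `a` with no missing values:

```julia
using CSV, Parsers

# CSV.Rows yields every cell as a string, so convert explicitly:
total = sum(row -> Parsers.parse(Int, row.a), CSV.Rows(file))
```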
Arguments
File layout options:
- `header=1`: how column names should be determined; if given as an `Integer`, indicates the row to parse for column names; as an `AbstractVector{<:Integer}`, indicates a set of rows to be concatenated together as column names; a `Vector{Symbol}` or `Vector{String}` gives the column names explicitly (and should match the # of columns in the dataset); if a dataset doesn't have column names, either provide them as a `Vector`, or set `header=0` or `header=false` and column names will be auto-generated (`Column1`, `Column2`, etc.). Note that if a row-number `header` and `comment` or `ignoreemptyrows` are provided, the header row will be the first non-commented/non-empty row after that row number, meaning that if the provided row number is a commented row, the header row will actually be the next non-commented row (a combined sketch follows this list).
- `normalizenames::Bool=false`: whether column names should be "normalized" into valid Julia identifier symbols; useful when using the `tbl.col1` `getproperty` syntax or iterating rows and accessing column values of a row via `getproperty` (e.g. `row.col1`).
- `skipto::Integer`: specifies the row where the data starts in the csv file; by default, the next row after the `header` row(s) is used. If `header=0`, then the 1st row is assumed to be the start of data; providing a `skipto` argument does not affect the `header` argument. Note that if a row-number `skipto` and `comment` or `ignoreemptyrows` are provided, the data row will be the first non-commented/non-empty row after that row number, meaning that if the provided row number is a commented row, the data row will actually be the next non-commented row.
- `footerskip::Integer`: number of rows at the end of the file to skip parsing. Note that commented rows (see the `comment` keyword argument) do not count towards the row number provided for `footerskip`; they are completely ignored by the parser.
- `transpose::Bool`: read the csv file "transposed", i.e. each column is parsed as a row.
- `comment::String`: a string that will cause rows beginning with it to be skipped while parsing. Note that if a row-number `header` or `skipto` and `comment` are provided, the header/data row will be the first non-commented/non-empty row after that row number, meaning that if the provided row number is a commented row, the header/data row will actually be the next non-commented row.
- `ignoreemptyrows::Bool=true`: whether empty rows in a file should be ignored (if `false`, each column will be assigned `missing` for that empty row).
- `select`: an `AbstractVector` of `Integer`, `Symbol`, `String`, or `Bool`, or a "selector" function of the form `(i, name) -> keep::Bool`; only columns in the collection, or for which the selector function returns `true`, will be parsed and accessible in the resulting `CSV.File`. Invalid values in `select` are ignored.
- `drop`: inverse of `select`; an `AbstractVector` of `Integer`, `Symbol`, `String`, or `Bool`, or a "drop" function of the form `(i, name) -> drop::Bool`; columns in the collection, or for which the drop function returns `true`, will be ignored in the resulting `CSV.File`. Invalid values in `drop` are ignored.
- `limit`: an `Integer` limiting the number of rows to parse in a csv file; use in combination with `skipto` to read a specific, contiguous chunk within a file. Note that for large files parsed with multiple threads, the `limit` argument may not result in an exact # of rows parsed; use `ntasks=1` to ensure an exact limit if necessary.
- `buffer_in_memory`: a `Bool`, default `false`, which controls whether a `Cmd`, `IO`, or gzipped source will be read/decompressed in memory vs. using a temporary file.
- `ntasks::Integer=Threads.nthreads()`: [not applicable to `CSV.Rows`] for multithreaded parsing, this controls the number of tasks spawned to read a file in concurrent chunks; defaults to the # of threads Julia was started with (i.e. the `JULIA_NUM_THREADS` environment variable or `julia -t N`); setting `ntasks=1` will avoid any calls to `Threads.@spawn` and just read the file serially on the main thread; a single thread will also be used by default for smaller files (< 5_000 cells).
- `rows_to_check::Integer=30`: [not applicable to `CSV.Rows`] a multithreaded parse splits the file into `ntasks` equal chunks; `rows_to_check` controls the # of rows checked to ensure parsing correctly found valid row starts; for certain files with very large quoted text fields, `rows_to_check` may need to be higher to ensure parsing correctly finds these rows.
- `source`: [only applicable for a vector of inputs to `CSV.File`] a `Symbol`, `String`, or `Pair` of `Symbol` or `String` to `Vector`. As a single `Symbol` or `String`, provides the name of an extra column that will be added to the parsed columns, whose values are the input "name" (usually the file name) of the input from which each row was parsed. As a `Pair`, the 2nd part of the pair should be a `Vector` of values matching the # of inputs, where each value will be used instead of the input name for that input's values in the auto-added column.
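A minimal sketch of the layout keywords above; `report.csv` and its layout are hypothetical (two preamble lines, column names on row 3, data starting on row 5, and a trailing totals line):

```julia
using CSV

rows = CSV.Rows("report.csv";
    header=3,
    skipto=5,
    footerskip=1,
    comment="#",            # rows beginning with '#' are ignored entirely
    normalizenames=true)    # e.g. a "unit price" column becomes `unit_price`
```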
Parsing options:
- `missingstring`: either `nothing`, a `String`, or a `Vector{String}` of sentinel values that will be parsed as `missing`; if `nothing` is passed, no sentinel/missing values will be parsed; by default, `missingstring=""`, which means only an empty field (two consecutive delimiters) is considered `missing`.
- `delim=','`: a `Char` or `String` that indicates how columns are delimited in a file; if no argument is provided, parsing will try to detect the most consistent delimiter on the first 10 rows of the file.
- `ignorerepeated::Bool=false`: whether repeated (consecutive/sequential) delimiters should be ignored while parsing; useful for fixed-width files with delimiter padding between cells.
- `quoted::Bool=true`: whether parsing should check for `quotechar` at the start/end of cells.
- `quotechar='"'`, `openquotechar`, `closequotechar`: a `Char` (or different start and end characters) that indicates a quoted field which may contain textual delimiters or newline characters.
- `escapechar='"'`: the `Char` used to escape quote characters in a quoted field.
- `dateformat::Union{String, Dates.DateFormat, Nothing, AbstractDict}`: a date format string indicating how Date/DateTime columns are formatted for the entire file; if given as an `AbstractDict`, date format strings indicating how the Date/DateTime columns corresponding to the keys are formatted. The dict can map a column index `Int`, or a name `Symbol` or `String`, to the format string for that column.
- `decimal='.'`: a `Char` indicating how decimals are separated in floats, e.g. `3.14` uses `'.'`, while `3,14` uses a comma `','`.
- `groupmark=nothing`: optionally specify a single-byte character denoting the number grouping mark; this allows parsing of numbers that have, e.g., thousands separators (`1,000.00`) (a combined sketch follows this list).
- `truestrings`, `falsestrings`: `Vector{String}`s that indicate how `true` or `false` values are represented; by default `"true", "True", "TRUE", "T", "1"` are used to detect `true` and `"false", "False", "FALSE", "F", "0"` are used to detect `false`; note that columns with only `1` and `0` values will default to the `Int64` column type unless explicitly requested to be `Bool` via the `types` keyword argument.
- `stripwhitespace=false`: if `true`, leading and trailing whitespace are stripped from string values, including column names.
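A minimal sketch of the number-format and boolean-spelling options above; the file and its European-style formatting are hypothetical:

```julia
using CSV

# Hypothetical file:
#   amount;flag
#   1.000,5;ja
f = CSV.File("data.csv";
    delim=';',
    decimal=',',               # 1.000,5 has a comma decimal...
    groupmark='.',             # ...and '.' as the thousands grouping mark => 1000.5
    truestrings=["ja"],        # custom spellings detected as `true`
    falsestrings=["nein"])     # ...and as `false`
```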
Column Type Options:
- `types`: a single `Type`, an `AbstractVector` or `AbstractDict` of types, or a function of the form `(i, name) -> Union{T, Nothing}` to be used for column types; if a single `Type` is provided, all columns will be parsed with that single type; an `AbstractDict` can map a column index `Integer`, or a name `Symbol` or `String`, to the type for a column, e.g. `Dict(1=>Float64)` will set the first column to `Float64`, `Dict(:column1=>Float64)` will set the column named `column1` to `Float64`, and `Dict("column1"=>Float64)` will do the same; if a `Vector` is provided, it must match the # of columns provided or detected in `header`. If a function is provided, it takes a column index and name as arguments, and should return the desired column type for the column, or `nothing` to signal the column's type should be detected while parsing.
- `typemap::IdDict{Type, Type}`: a mapping of a type that should be replaced in every instance with another type, e.g. `IdDict(Float64=>String)` would change every detected `Float64` column to be parsed as `String`; only "standard" types are allowed to be mapped to another type, i.e. `Int64`, `Float64`, `Date`, `DateTime`, `Time`, and `Bool`. If a column of one of those types is detected, it will be mapped to the specified type.
- `pool::Union{Bool, Real, AbstractVector, AbstractDict, Function, Tuple{Float64, Int}}=(0.2, 500)`: [not supported by `CSV.Rows`] controls whether columns will be built as `PooledArray`s; if `true`, all columns detected as `String` will be pooled; alternatively, a `Real` gives the proportion of unique values below which `String` columns should be pooled (e.g. with `pool=0.25`, a column is pooled if fewer than 25% of its values are unique). If provided as a `Tuple{Float64, Int}` like `(0.2, 500)`, the 1st tuple element (`0.2`) is the percent-cardinality threshold and the 2nd (`500`) an upper limit on the # of unique values, under which the column will be pooled; this is the default (`pool=(0.2, 500)`). If an `AbstractVector`, each element should be a `Bool`, `Real`, or `Tuple{Float64, Int}` and the # of elements should match the # of columns in the dataset; if an `AbstractDict`, a `Bool`, `Real`, or `Tuple{Float64, Int}` value can be provided for individual columns, where the dict key is a column index `Integer`, or a column name `Symbol` or `String`. If a function is provided, it should take a column index and name as 2 arguments, and return a `Bool`, `Real`, `Tuple{Float64, Int}`, or `nothing` for each column.
- `downcast::Bool=false`: controls whether columns detected as `Int64` will be "downcast" to the smallest possible integer type, like `Int8`, `Int16`, `Int32`, etc.
- `stringtype=InlineStrings.InlineString`: controls how detected string columns will ultimately be returned; the default is `InlineString`, which stores string data in a fixed-size primitive type that helps avoid excessive heap memory usage; if a column has values longer than 32 bytes, it will default to `String`. If `String` is passed, all string columns will just be normal `String` values. If `PosLenString` is passed, string columns will be returned as `PosLenStringVector`, which is a special "lazy" `AbstractVector` that acts as a "view" into the original file data. This can lead to the most efficient parsing times, but note that the "view" nature of `PosLenStringVector` makes it read-only, so operations like `push!`, `append!`, or `setindex!` are not supported. It also keeps a reference to the entire input dataset source, so trying to modify or delete the underlying file, for example, may fail.
- `strict::Bool=false`: whether invalid values should throw a parsing error or be replaced with `missing` (a combined sketch of the error-handling options follows this list).
- `silencewarnings::Bool=false`: if `strict=false`, whether invalid-value warnings should be silenced.
- `maxwarnings::Int=100`: if more than `maxwarnings` warnings are printed while parsing, further warnings will be silenced by default; for multithreaded parsing, each parsing task will print up to `maxwarnings`.
- `debug::Bool=false`: passing `true` will result in many informational prints while a dataset is parsed; can be useful when reporting issues or figuring out what is going on internally while a dataset is parsed.
- `validate::Bool=true`: whether to validate that columns specified in the `types`, `dateformat`, and `pool` keywords are actually found in the data. If `false`, no validation is done, meaning no error will be thrown if `types`/`dateformat`/`pool` specify settings for columns not actually found in the data.
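A minimal sketch of the error-handling options above; `data.csv` is hypothetical and assumed to contain a few malformed cells in a `Float64` column `score`:

```julia
using CSV

f = CSV.File("data.csv";
    types=Dict(:score => Float64),
    strict=false,            # malformed cells become `missing` (the default behavior)
    silencewarnings=true,    # ...without a warning printed per bad cell
    validate=false)          # no error even if `score` isn't actually in the file
```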
Iteration options:
- `reusebuffer=false`: [only supported by `CSV.Rows`] while iterating, whether a single row buffer should be allocated and reused on each iteration; only use if each row will be iterated once and not kept afterwards (e.g. it's not safe to use this option when doing `collect(CSV.Rows(file))`, because only the currently iterated row is "valid").
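As noted in the tradeoffs above, rows can be streamed into a table type; a minimal sketch, where `data.csv` is hypothetical:

```julia
using CSV, Tables

# CSV.Rows supports the Tables.jl interface, so rows can be streamed into
# any table sink; without `types`, columns arrive as Union{String, Missing}.
cols = Tables.columntable(CSV.Rows("data.csv"))

# Or with the DataFrames.jl package:
# using DataFrames
# df = DataFrame(CSV.Rows("data.csv"))
```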
Utilities

CSV.detect — Function

```julia
CSV.detect(str::String)
```

Use the same logic `CSV.File` uses to detect column types in order to parse a value from a plain string. This can be useful in conjunction with the `CSV.Rows` type, which returns each cell of a file as a `String`. The order of types attempted is: `Int`, `Float64`, `Date`, `DateTime`, `Bool`, and if all fail, the input `String` is returned. No errors are thrown. For advanced usage, you can pass your own `Parsers.Options` type as a keyword argument `option=ops` for sentinel value detection.
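For illustration, the detection cascade in action; the results shown as comments follow from the type order described above:

```julia
using CSV

CSV.detect("101")        # 101 (Int64)
CSV.detect("3.14")       # 3.14 (Float64)
CSV.detect("2021-01-02") # Date("2021-01-02")
CSV.detect("true")       # true (Bool)
CSV.detect("hey there")  # "hey there" (String; all other types failed)
```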
Common terms
Standard types
The types that are detected by default when column types are not otherwise provided by the user. They include: `Int64`, `Float64`, `Date`, `DateTime`, `Time`, `Bool`, and `String`.
Newlines
For all parsing functionality, newlines are detected/parsed automatically, regardless of whether they're present in the data as a single newline character (`'\n'`), a single return character (`'\r'`), or a full CRLF sequence (`"\r\n"`).
Cardinality
Refers to the ratio of unique values to the total number of values in a column. Columns with "low cardinality" have a low percentage of unique values; put another way, there are only a few unique values that are repeated many times over the entire column. Columns with "high cardinality" have a high percentage of unique values relative to the total number of values: think "id-like" columns, where each or almost each value is a unique identifier with no (or few) repeats.
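A small illustration of the ratio in plain Julia; the sample vectors are hypothetical:

```julia
state = ["CA", "CA", "NY", "CA", "NY", "CA"]   # 2 unique / 6 total ≈ 0.33 (low)
id    = ["u1", "u2", "u3", "u4", "u5", "u6"]   # 6 unique / 6 total = 1.0 (high)

cardinality(col) = length(unique(col)) / length(col)
cardinality(state)  # ≈ 0.33 — a good candidate for pooling (cf. `pool` above)
cardinality(id)     # 1.0 — pooling would only add overhead
```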