Reading

This section walks through the various inputs and options supported by CSV.File/CSV.read, with notes about compatibility with the other reading functionality (CSV.Rows, CSV.Chunks, etc.).

input

A required argument for reading. Input data should be ASCII or UTF-8 encoded text; for other text encodings, use the StringEncodings.jl package to convert to UTF-8.

Any delimited input is ultimately converted to a byte buffer (Vector{UInt8}) for parsing/processing, so with that in mind, let's look at the various supported input types:

  • File name as a String or FilePath; parsing will call Mmap.mmap(string(file)) to get a byte buffer to the file data. For gzip compressed inputs, like file.gz, the CodecZlib.jl package will be used to decompress the data to a temporary file first, then mmapped to a byte buffer. Decompression can also be done in memory by passing buffer_in_memory=true. Note that only gzip-compressed data is automatically decompressed; for other forms of compressed data, seek out the appropriate package to decompress and pass an IO or Vector{UInt8} of decompressed data as input.
  • Vector{UInt8} or SubArray{UInt8, 1, Vector{UInt8}}: if you already have a byte buffer from wherever, you can just pass it in directly. If you have a csv-formatted string, you can pass it like CSV.File(IOBuffer(str))
  • IO or Cmd: you can pass an IO or Cmd directly, which will be consumed into a temporary file, then mmapped as a byte vector; to avoid a temp file and instead buffer data in memory, pass buffer_in_memory=true.
  • For files from the web, you can call HTTP.get(url).body to request the file, then pass the resulting Vector{UInt8} from the body field directly for parsing. For Julia 1.6+, you can also use the Downloads stdlib; Downloads.download(url) returns a path to a downloaded temporary file that can be passed directly for parsing.

Examples
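
A couple of hedged sketches of common input forms (the file path and data below are made up):

using CSV

f1 = CSV.File("data.csv")                # hypothetical file path
f2 = CSV.File(IOBuffer("a,b\n1,2"))      # csv-formatted string wrapped in an IOBuffer
f3 = CSV.File(read("data.csv"))          # a Vector{UInt8} byte buffer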

The header keyword argument controls how column names are determined when processing files. By default, the column names are assumed to be the first row/line of the input, i.e. header=1. Alternative valid arguments for header include:

  • Integer, e.g. header=2: provide the row number as an Integer where the column names can be found
  • Bool, e.g. header=false: no column names exist in the data; column names will be auto-generated depending on the # of columns, like Column1, Column2, etc.
  • Vector{String} or Vector{Symbol}: manually provide column names as strings or symbols; should match the # of columns in the data. A copy of the Vector will be made and converted to Vector{Symbol}
  • AbstractVector{<:Integer}: in rare cases, there may be multi-row headers; by passing a collection of row numbers, each row will be parsed and the values for each row will be concatenated to form the final column names

Examples
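
A few quick sketches (the data below is made up; an IOBuffer of csv-formatted text behaves the same as a file):

using CSV

f1 = CSV.File(IOBuffer("skip me\na,b\n1,2"); header=2)          # column names found on row 2
f2 = CSV.File(IOBuffer("1,2\n3,4"); header=false)               # auto-generated Column1, Column2
f3 = CSV.File(IOBuffer("1,2\n3,4"); header=["x", "y"])          # manually provided names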

normalizenames

Controls whether column names will be "normalized" to valid Julia identifiers. By default, this is false. If normalizenames=true, then column names with spaces, or that start with numbers, will be adjusted with underscores to become valid Julia identifiers. This is useful when you want to access columns via dot-access or getproperty, like file.col1. The identifier that comes after the . must be valid, so spaces or identifiers starting with numbers aren't allowed.

Examples
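
For example (made-up column names containing a space and a leading digit):

using CSV

f = CSV.File(IOBuffer("my col,2nd col\n1,2"); normalizenames=true)
f.my_col    # names become :my_col and :_2nd_col, which are valid Julia identifiers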

skipto

An Integer can be provided that specifies the row number where the data is located. By default, the row immediately following the header row is assumed to be the start of data. If header=false, or column names are provided manually as Vector{String} or Vector{Symbol}, the data is assumed to start on row 1, i.e. skipto=1.

Examples
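
A small sketch with a made-up input where the data starts on row 3:

using CSV

# row 1 holds the column names, row 2 is a note to skip, data begins on row 3
f = CSV.File(IOBuffer("a,b\nnot data\n1,2\n3,4"); skipto=3)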

footerskip

An Integer argument specifying the number of rows to ignore at the end of a file. This works by the parser starting at the end of the file and parsing in reverse until footerskip # of rows have been parsed, then parsing the entire file, stopping at the newly adjusted "end of file".

Examples
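
For instance, to ignore a trailing summary row in a made-up input:

using CSV

f = CSV.File(IOBuffer("a,b\n1,2\n3,4\n4,6"); footerskip=1)   # the final "4,6" row is skipped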

transpose

If transpose=true is passed, data will be read "transposed", so each row will be parsed as a column, and each column in the data will be returned as a row. Useful when data is extremely wide (many columns), but you want to process it in a "long" format (many rows). Note that multithreaded parsing is not supported when parsing is transposed.

Examples
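
A rough sketch of a transposed, made-up input where each line holds a column name followed by its values:

using CSV

f = CSV.File(IOBuffer("a,1,2,3\nb,4,5,6"); transpose=true)
# f.a == [1, 2, 3]; f.b == [4, 5, 6]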

comment

A String argument that, when encountered at the start of a row while parsing, will cause the row to be skipped. When providing header, skipto, or footerskip arguments, it should be noted that commented rows, while ignored, still count as "rows" when skipping to a specific row. In this way, you can visually identify, for example, that column names are on row 6, and pass header=6, even if row 5 is a commented row and will be ignored.

Examples
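
For example, skipping made-up rows that begin with "#":

using CSV

data = "# file generated on some date\na,b\n1,2\n# trailing note\n3,4"
f = CSV.File(IOBuffer(data); comment="#")   # commented rows are ignored while parsing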

ignoreemptyrows

This argument specifies whether "empty rows", where consecutive newlines are parsed, should be ignored or not. By default, they are. If ignoreemptyrows=false, then for an empty row, all existing columns will have missing assigned to their value for that row. Similar to commented rows, empty rows also still count as "rows" when any of the header, skipto, or footerskip arguments are provided.

Examples
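
For example, with a made-up input containing a blank line:

using CSV

data = "a,b\n1,2\n\n3,4"
f1 = CSV.File(IOBuffer(data))                          # empty row ignored: 2 rows
f2 = CSV.File(IOBuffer(data); ignoreemptyrows=false)   # empty row kept as all-missing: 3 rows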

select / drop

Arguments that control which columns from the input data will actually be parsed and available after processing. select controls which columns will be accessible after parsing while drop controls which columns to ignore. Either argument can be provided as a vector of Integer, String, or Symbol, specifying the column numbers or names to include/exclude. A vector of Bool matching the number of columns in the input data can also be provided, where each element specifies whether the corresponding column should be included/excluded. Finally, these arguments can also be given as boolean functions, of the form (i, name) -> Bool, where each column number and name will be given as arguments and the result of the function will determine if the column will be included/excluded.

Examples
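
A few equivalent sketches using made-up data:

using CSV

data = "a,b,c\n1,2,3\n4,5,6"
f1 = CSV.File(IOBuffer(data); select=[:a, :c])             # keep only columns a and c
f2 = CSV.File(IOBuffer(data); drop=[2])                    # same result: ignore the 2nd column
f3 = CSV.File(IOBuffer(data); select=(i, name) -> i != 2)  # selector function form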

limit

An Integer argument to specify the number of rows that should be read from the data. Can be used in conjunction with skipto to read contiguous chunks of a file. Note that with multithreaded parsing (when the data is deemed large enough), it can be difficult for parsing to determine the exact # of rows to limit to, so it may or may not return exactly limit number of rows. To ensure an exact limit on larger files, also pass ntasks=1 to force single-threaded parsing.

Examples
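
For instance, reading a contiguous chunk of a made-up input:

using CSV

data = "a,b\n1,2\n3,4\n5,6\n7,8"
f = CSV.File(IOBuffer(data); skipto=3, limit=2)   # only rows 3 and 4 are parsed
# for large files, also pass ntasks=1 to guarantee the exact limit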

ntasks

NOTE: not applicable to CSV.Rows

For large enough data inputs, ntasks controls the number of multithreaded tasks used to concurrently parse the data. By default, it uses Threads.nthreads(), which is the number of threads the julia process was started with, either via julia -t N or the JULIA_NUM_THREADS environment variable. To avoid multithreaded parsing, even on large files, pass ntasks=1. This argument is only applicable to CSV.File, not CSV.Rows. For CSV.Chunks, it controls the total number of chunk iterations a large file will be split up into for parsing.

rows_to_check

NOTE: not applicable to CSV.Rows

When input data is large enough, parsing will attempt to "chunk" up the data for multithreaded tasks to parse concurrently. To chunk up the data, it is split up into even chunks, then initial parsers attempt to identify the correct start of the first row of that chunk. Once the start of the chunk's first row is found, each parser will check rows_to_check number of rows to ensure the expected number of columns are present.

source

NOTE: only applicable to vector of inputs passed to CSV.File

A Symbol, String, or Pair of Symbol or String to Vector. As a single Symbol or String, provides the column name that will be added to the parsed columns; the values of the column will be the input "name" (usually the file name) of the input from which each value was parsed. As a Pair, the 2nd part of the pair should be a Vector of values matching the number of inputs, where each value will be used instead of the input name for that input's values in the auto-added column.

missingstring

Argument to control how missing values are handled while parsing input data. The default is missingstring="", which means two consecutive delimiters, like ,,, will result in a cell being set as a missing value. Otherwise, you can pass a single string to use as a "sentinel", like missingstring="NA", or a vector of strings, where each will be checked for when parsing, like missingstring=["NA", "NAN", "NULL"], and if any match, the cell will be set to missing. By passing missingstring=nothing, no missing values will be checked for while parsing.

Examples
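
For example, with made-up sentinel values:

using CSV

f = CSV.File(IOBuffer("a,b\n1,NA\nNULL,2"); missingstring=["NA", "NULL"])
# f.a == [1, missing]; f.b == [missing, 2]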

delim

A Char or String argument that parsing looks for in the data input that separates distinct columns on each row. If no argument is provided (the default), parsing will try to detect the most consistent delimiter on the first 10 rows of the input, falling back to a single comma (,) if no other delimiter can be detected consistently.

Examples
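
A couple of sketches with made-up, non-comma-delimited data:

using CSV

f1 = CSV.File(IOBuffer("a;b\n1;2"); delim=';')
f2 = CSV.File(IOBuffer("a::b\n1::2"); delim="::")   # multi-character delimiters are also allowed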

ignorerepeated

A Bool argument, default false, that, if set to true, will cause parsing to ignore any number of consecutive delimiters between columns. This option can often be used to accurately parse fixed-width data inputs, where columns are delimited with a fixed number of delimiters, or a row is fixed-width and columns may have a variable number of delimiters between them based on the length of cell values.

Examples
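
For instance, parsing made-up space-padded data:

using CSV

data = "a  b  c\n1  2  3\n10 20 30"
f = CSV.File(IOBuffer(data); delim=' ', ignorerepeated=true)   # runs of spaces collapse to one delimiter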

quoted

A Bool argument that controls whether parsing will check for opening/closing quote characters at the start/end of cells. Default true. If you happen to know a file has no quoted cells, it can simplify parsing to pass quoted=false, so parsing avoids treating the quotechar or openquotechar/closequotechar arguments specially.

Examples
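
A sketch with made-up data known to contain no quoted cells:

using CSV

f = CSV.File(IOBuffer("a,b\n1,2\n3,4"); quoted=false)   # skip any special handling of quote characters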

quotechar / openquotechar / closequotechar

An ASCII Char argument (or arguments if both openquotechar and closequotechar are provided) that parsing uses to handle "quoted" cells. If a cell string value contains the delim argument, or a newline, it should start and end with quotechar, or start with openquotechar and end with closequotechar so parsing knows to treat the delim or newline as part of the cell value instead of as significant parsing characters. If the quotechar or closequotechar characters also need to appear in the cell value, they should be properly escaped via the escapechar argument.

Examples
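
A sketch with a made-up quoted cell that contains the delimiter:

using CSV

f1 = CSV.File(IOBuffer("a,b\n\"hey, there\",2"))   # default quotechar='"' handles the embedded comma
f2 = CSV.File(IOBuffer("a,b\n<hey, there>,2"); openquotechar='<', closequotechar='>')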

escapechar

An ASCII Char argument that parsing uses when parsing quoted cells and the quotechar or closequotechar characters appear in a cell string value. If the escapechar character is encountered inside a quoted cell, it will be "skipped", and the following character will not be checked for parsing significance, but just treated as another character in the value of the cell. Note the escapechar is not included in the value of the cell, but is ignored completely.

dateformat

A String or AbstractDict argument that controls how parsing detects datetime values in the data input. As a single String (or DateFormat) argument, the same format will be applied to all columns in the file. For columns without type information provided otherwise, parsing will use the provided format string to check if the cell is parseable and if so, will attempt to parse the entire column as the datetime type (Time, Date, or DateTime). By default, if no dateformat argument is explicitly provided, parsing will try to detect any of Time, Date, or DateTime types following the standard Dates.ISOTimeFormat, Dates.ISODateFormat, or Dates.ISODateTimeFormat formats, respectively. If a datetime type is provided for a column, (see the types argument), then the dateformat format string needs to match the format of values in that column, otherwise, a warning will be emitted and the value will be replaced with a missing value (this behavior is also configurable via the strict and silencewarnings arguments). If an AbstractDict is provided, different dateformat strings can be provided for specific columns; the provided dict can map either an Integer for column number or a String, Symbol or Regex for column name to the dateformat string that should be used for that column. Columns not mapped in the dict argument will use the default format strings mentioned above.

Examples
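
A couple of sketches with made-up date columns:

using CSV, Dates

f1 = CSV.File(IOBuffer("date,value\n03/15/2022,1.5"); dateformat="mm/dd/yyyy")
# f1.date[1] == Date(2022, 3, 15)

# per-column formats via a dict keyed by column number or name
f2 = CSV.File(IOBuffer("d,x\n15/03/2022,1"); dateformat=Dict(1 => "dd/mm/yyyy"))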

decimal

An ASCII Char argument that is used when parsing float values and indicates where the fractional portion of the value begins; e.g. in 3.14 (a truncated value of pi), the '.' character separates the integral part 3 from the fractional part 14, whereas in 3,14 (common European notation), the ',' character does. By default, decimal='.'.

Examples
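
For example, made-up European-style floats delimited with ';':

using CSV

f = CSV.File(IOBuffer("a;b\n3,14;2,72"); delim=';', decimal=',')
# f.a == [3.14]; f.b == [2.72]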

groupmark / thousands separator

A "groupmark" is a symbol that separates groups of digits so that it easier for humans to read a number. Thousands separators are a common example of groupmarks. The argument groupmark, if provided, must be an ASCII Char which will be ignored during parsing when it occurs between two digits on the left hand side of the decimal. e.g the groupmark in the integer 1,729 is ',' and the groupmark for the US social security number 875-39-3196 is -. By default, groupmark=nothing which indicates that there are no stray characters separating digits.

Examples
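
For instance, made-up values with thousands separators, using ';' as the delimiter so the ',' groupmark is unambiguous:

using CSV

f = CSV.File(IOBuffer("a;b\n1,000;2,500,000"); delim=';', groupmark=',')
# f.a == [1000]; f.b == [2500000]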

truestrings / falsestrings

These arguments can be provided as Vector{String} to specify custom values that should be treated as the Bool true/false values for all the columns of a data input. By default, ["true", "True", "TRUE", "T", "1"] string values are used to detect true values, and ["false", "False", "FALSE", "F", "0"] string values are used to detect false values. Note that even though "1" and "0" can be used to parse true/false values, in terms of auto detecting column types, those values will be parsed as Int64 first, instead of Bool. To instead parse those values as Bools for a column, you can manually provide that column's type as Bool (see the type argument).

Examples
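
A sketch with made-up custom boolean strings:

using CSV

f = CSV.File(IOBuffer("a,b\nyes,1\nno,0"); truestrings=["yes"], falsestrings=["no"], types=Dict(:a => Bool))
# column a parses as Bool via the custom strings; column b (1/0) is detected as Int64 unless also given Bool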

types

Argument to control the types of columns that get parsed in the data input. Can be provided as a single Type, an AbstractVector of types, an AbstractDict, or a function.

  • If a single type is provided, like types=Float64, then all columns in the data input will be parsed as Float64. If a column's value isn't a valid Float64 value, then a warning will be emitted, unless silencewarnings=true is passed, in which case no warning will be printed. However, if strict=true is passed, then an error will be thrown instead, regardless of the silencewarnings argument.
  • If an AbstractVector{Type} is provided, then the length of the vector should match the number of columns in the data input, and each element gives the type of the corresponding column in order.
  • If an AbstractDict, then specific columns can have their column type specified with the key of the dict being an Integer for column number, or String or Symbol for column name or Regex matching column names, and the dict value being the column type. Unspecified columns will have their column type auto-detected while parsing.
  • If a function, then it should be of the form (i, name) -> Union{T, Nothing}, and will be applied to each detected column during initial parsing. Returning nothing from the function will result in the column's type being automatically detected during parsing.

By default types=nothing, which means all column types in the data input will be detected while parsing. Note that it isn't necessary to pass types=Union{Float64, Missing} if the data input contains missing values. Parsing will detect missing values if present, and promote any manually provided column types from the singular (Float64) to the missing equivalent (Union{Float64, Missing}) automatically. Standard types will be auto-detected in the following order when not otherwise specified: Int64, Float64, Date, DateTime, Time, Bool, String.

Non-standard types can be provided, like Dec64 from the DecFP.jl package, but must support the Base.tryparse(T, str) function for parsing a value from a string. This allows, for example, easily defining a custom type, like struct Float64Array; values::Vector{Float64}; end, as long as a corresponding Base.tryparse definition is defined, like Base.tryparse(::Type{Float64Array}, str) = Float64Array(map(x -> parse(Float64, x), split(str, ';'))), where a single cell in the data input is like 1.23;4.56;7.89.

Note that the default stringtype can be overridden by providing a column's type manually, like CSV.File(source; types=Dict(1 => String), stringtype=PosLenString), where the first column will be parsed as a String, while any other string columns will have the PosLenString type.

Examples
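
A few sketches with made-up data showing the different forms:

using CSV

data = "a,b,c\n1,2,3\n4,5,6"
f1 = CSV.File(IOBuffer(data); types=Dict(:a => Float64))          # only column a is forced to Float64
f2 = CSV.File(IOBuffer(data); types=[Int64, Float64, String])     # one type per column
f3 = CSV.File(IOBuffer(data); types=(i, name) -> i == 1 ? Float64 : nothing)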

typemap

An AbstractDict{Type, Type} argument that allows replacing a non-String standard type with another type when a column's type is auto-detected. Most commonly, this would be used to force all numeric columns to be Float64, like typemap=IdDict(Int64 => Float64), which would cause any columns detected as Int64 to be parsed as Float64 instead. Another common case would be wanting all columns of a specific type to be parsed as strings instead, like typemap=IdDict(Date => String), which will cause any columns detected as Date to be parsed as String instead.

Examples
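
For example, with made-up integer data:

using CSV

# any column detected as Int64 will be parsed as Float64 instead
f = CSV.File(IOBuffer("a,b\n1,2\n3,4"); typemap=IdDict(Int64 => Float64))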

pool

Argument that controls whether columns will be returned as PooledArrays. Can be provided as a Bool, Float64, Tuple{Float64, Int}, vector, dict, or a function of the form (i, name) -> Union{Bool, Real, Tuple{Float64, Int}, Nothing}.

  • As a Bool, controls absolutely whether a column will be pooled or not; if passed as a single Bool argument like pool=true, then all string columns will be pooled, regardless of cardinality.
  • As a Float64, the value should be between 0.0 and 1.0 to indicate the threshold under which the % of unique values found in the column will result in the column being pooled. For example, if pool=0.1, then all string columns with a unique value % less than 10% will be returned as PooledArray, while other string columns will be normal string vectors.
  • As a tuple, like (0.2, 500), the first tuple element is the same as a single Float64 value, representing the % cardinality allowed, and the second tuple element is an upper limit on the # of unique values allowed to pool the column. So pool=(0.2, 500) means that if a String column has 500 or fewer unique values and the # of unique values is less than 20% of the total # of values, it will be pooled; otherwise, it won't.

When the pool argument is a single Bool, Real, or Tuple{Float64, Int}, only string columns will be considered for pooling. When a vector or dict is provided, the pooling for any column can be provided as a Bool, Float64, or Tuple{Float64, Int}. Similar to the types argument, a vector passed to pool should have an element for each column in the data input, while a dict argument can map column number/name to Bool, Float64, or Tuple{Float64, Int} for specific columns. Unspecified columns will not be pooled when the argument is a dict.

Examples
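
A few sketches with a made-up string column:

using CSV

data = "a,b\nx,1\ny,2\nx,3"
f1 = CSV.File(IOBuffer(data); pool=true)                # pool every string column
f2 = CSV.File(IOBuffer(data); pool=0.5)                 # pool string columns with < 50% unique values
f3 = CSV.File(IOBuffer(data); pool=Dict(:a => false))   # never pool column a; others use the default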

downcast

A Bool argument that controls whether integer-detected column types will be "shrunk" to the smallest possible integer type. The argument is false by default. Only applies to auto-detected column types; i.e. if a column type is provided manually as Int64, it will not be shrunk. Useful for shrinking the overall memory footprint of parsed data, though care should be taken when processing the results, as Julia by default has integer overflow behavior, which is increasingly likely the smaller the integer type.

stringtype

An argument that controls the precise type of string columns. Supported values are InlineString (the default), PosLenString, or String. The various string types are aimed at being mostly transparent to most users. In certain workflows, however, it can be advantageous to be more specific. Here's a quick rundown of the possible options:

  • InlineString: a set of fixed-width, stack-allocated primitive types. Can take memory pressure off the GC because they aren't reference types/on the heap. For very large files with string columns that have a fairly low variance in string length, this can provide much better GC interaction than String. When string length has a high variance, it can lead to lots of "wasted space", since an entire column will be promoted to the smallest InlineString type that fits the longest string value. For small strings, that can mean a lot of wasted space when they're promoted to a high fixed-width.
  • PosLenString: results in columns returned as PosLenStringVector (or ChainedVector{PosLenStringVector} for the multithreaded case), which holds a reference to the original input data, and acts as one large "view" vector into the original data where each cell begins/ends. Can result in the smallest memory footprint for string columns. PosLenStringVector, however, does not support traditional mutable operations like regular Vectors, like push!, append!, or deleteat!.
  • String: each string must be heap-allocated, which can result in higher GC pressure in very large files. But columns are returned as normal Vector{String} (or ChainedVector{Vector{String}}), which can be processed normally, including any mutating operations.

strict / silencewarnings / maxwarnings

Arguments that control error behavior when invalid values are encountered while parsing. Only applicable when types are provided manually by the user via the types argument. If a column type is manually provided, but an invalid value is encountered, the default behavior is to set the value for that cell to missing, emit a warning (i.e. silencewarnings=false and strict=false), but only up to 100 total warnings and then they'll be silenced (i.e. maxwarnings=100). If strict=true, then invalid values will result in an error being thrown instead of any warnings emitted.

debug

A Bool argument that controls the printing of extra "debug" information while parsing. Can be useful if parsing doesn't produce the expected result or a bug is suspected in parsing somehow.

API Reference

CSV.read (Function)

CSV.read(source, sink::T; kwargs...) => T

Reads and parses a delimited file or files, materializing directly using the sink function. Allows avoiding excessive copies of columns for certain sinks like DataFrame.

Example

julia> using CSV, DataFrames

julia> path = tempname();

julia> write(path, "a,b,c\n1,2,3");

julia> CSV.read(path, DataFrame)
1×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      3

julia> CSV.read(path, DataFrame; header=false)
2×3 DataFrame
 Row │ Column1  Column2  Column3
     │ String1  String1  String1
─────┼───────────────────────────
   1 │ a        b        c
   2 │ 1        2        3

Arguments

File layout options:

  • header=1: how column names should be determined; if given as an Integer, indicates the row to parse for column names; as an AbstractVector{<:Integer}, indicates a set of rows to be concatenated together as column names; Vector{Symbol} or Vector{String} give column names explicitly (should match # of columns in dataset); if a dataset doesn't have column names, either provide them as a Vector, or set header=0 or header=false and column names will be auto-generated (Column1, Column2, etc.). Note that if a row number header and comment or ignoreemptyrows are provided, the header row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header row will actually be the next non-commented row.
  • normalizenames::Bool=false: whether column names should be "normalized" into valid Julia identifier symbols; useful when using the tbl.col1 getproperty syntax or iterating rows and accessing column values of a row via getproperty (e.g. row.col1)
  • skipto::Integer: specifies the row where the data starts in the csv file; by default, the next row after the header row(s) is used. If header=0, then the 1st row is assumed to be the start of data; providing a skipto argument does not affect the header argument. Note that if a row number skipto and comment or ignoreemptyrows are provided, the data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the data row will actually be the next non-commented row.
  • footerskip::Integer: number of rows at the end of a file to skip parsing. Do note that commented rows (see the comment keyword argument) do not count towards the row number provided for footerskip, they are completely ignored by the parser
  • transpose::Bool: read a csv file "transposed", i.e. each column is parsed as a row
  • comment::String: string that will cause rows that begin with it to be skipped while parsing. Note that if a row number header or skipto and comment are provided, the header/data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header/data row will actually be the next non-commented row.
  • ignoreemptyrows::Bool=true: whether empty rows in a file should be ignored (if false, each column will be assigned missing for that empty row)
  • select: an AbstractVector of Integer, Symbol, String, or Bool, or a "selector" function of the form (i, name) -> keep::Bool; only columns in the collection or for which the selector function returns true will be parsed and accessible in the resulting CSV.File. Invalid values in select are ignored.
  • drop: inverse of select; an AbstractVector of Integer, Symbol, String, or Bool, or a "drop" function of the form (i, name) -> drop::Bool; columns in the collection or for which the drop function returns true will be ignored in the resulting CSV.File. Invalid values in drop are ignored.
  • limit: an Integer to indicate a limited number of rows to parse in a csv file; use in combination with skipto to read a specific, contiguous chunk within a file; note for large files when multiple threads are used for parsing, the limit argument may not result in an exact # of rows parsed; use ntasks=1 to ensure an exact limit if necessary
  • buffer_in_memory: a Bool, default false, which controls whether a Cmd, IO, or gzipped source will be read/decompressed in memory vs. using a temporary file.
  • ntasks::Integer=Threads.nthreads(): [not applicable to CSV.Rows] for multithreaded parsed files, this controls the number of tasks spawned to read a file in concurrent chunks; defaults to the # of threads Julia was started with (i.e. JULIA_NUM_THREADS environment variable or julia -t N); setting ntasks=1 will avoid any calls to Threads.@spawn and just read the file serially on the main thread; a single thread will also be used for smaller files by default (< 5_000 cells)
  • rows_to_check::Integer=30: [not applicable to CSV.Rows] a multithreaded parsed file will be split up into ntasks # of equal chunks; rows_to_check controls the # of rows that are checked to ensure parsing correctly found valid rows; for certain files with very large quoted text fields, rows_to_check may need to be higher to ensure parsing correctly finds these rows
  • source: [only applicable for vector of inputs to CSV.File] a Symbol, String, or Pair of Symbol or String to Vector. As a single Symbol or String, provides the column name that will be added to the parsed columns; the values of the column will be the input "name" (usually the file name) of the input from which each value was parsed. As a Pair, the 2nd part of the pair should be a Vector of values matching the number of inputs, where each value will be used instead of the input name for that input's values in the auto-added column.

Parsing options:

  • missingstring: either a nothing, String, or Vector{String} to use as sentinel values that will be parsed as missing; if nothing is passed, no sentinel/missing values will be parsed; by default, missingstring="", which means only an empty field (two consecutive delimiters) is considered missing
  • delim=',': a Char or String that indicates how columns are delimited in a file; if no argument is provided, parsing will try to detect the most consistent delimiter on the first 10 rows of the file
  • ignorerepeated::Bool=false: whether repeated (consecutive/sequential) delimiters should be ignored while parsing; useful for fixed-width files with delimiter padding between cells
  • quoted::Bool=true: whether parsing should check for quotechar at the start/end of cells
  • quotechar='"', openquotechar, closequotechar: a Char (or different start and end characters) that indicate a quoted field which may contain textual delimiters or newline characters
  • escapechar='"': the Char used to escape quote characters in a quoted field
  • dateformat::Union{String, Dates.DateFormat, Nothing, AbstractDict}: a date format string to indicate how Date/DateTime columns are formatted for the entire file; if given as an AbstractDict, date format strings to indicate how the Date/DateTime columns corresponding to the keys are formatted. The Dict can map column index Int, or name Symbol or String to the format string for that column.
  • decimal='.': a Char indicating how decimals are separated in floats, i.e. 3.14 uses '.', or 3,14 uses a comma ','
  • groupmark=nothing: optionally specify a single-byte character denoting the number grouping mark, this allows parsing of numbers that have, e.g., thousand separators (1,000.00).
  • truestrings, falsestrings: Vector{String}s that indicate how true or false values are represented; by default "true", "True", "TRUE", "T", "1" are used to detect true and "false", "False", "FALSE", "F", "0" are used to detect false; note that columns with only 1 and 0 values will default to Int64 column type unless explicitly requested to be Bool via types keyword argument
  • stripwhitespace=false: if true, leading and trailing whitespace are stripped from string values, including column names

Column Type Options:

  • types: a single Type, AbstractVector or AbstractDict of types, or a function of the form (i, name) -> Union{T, Nothing} to be used for column types; if a single Type is provided, all columns will be parsed with that single type; an AbstractDict can map column index Integer, or name Symbol or String, to the type for a column, i.e. Dict(1=>Float64) will set the first column as a Float64, Dict(:column1=>Float64) will set the column named column1 to Float64, and Dict("column1"=>Float64) will set the column named column1 to Float64; if a Vector is provided, it must match the # of columns provided or detected in header. If a function is provided, it takes a column index and name as arguments, and should return the desired column type for the column, or nothing to signal the column's type should be detected while parsing.
  • typemap::IdDict{Type, Type}: a mapping of a type that should be replaced in every instance with another type, i.e. Dict(Float64=>String) would change every detected Float64 column to be parsed as String; only "standard" types are allowed to be mapped to another type, i.e. Int64, Float64, Date, DateTime, Time, and Bool. If a column of one of those types is "detected", it will be mapped to the specified type.
  • pool::Union{Bool, Real, AbstractVector, AbstractDict, Function, Tuple{Float64, Int}}=(0.2, 500): [not supported by CSV.Rows] controls whether columns will be built as PooledArray; if true, all columns detected as String will be pooled; alternatively, the proportion of unique values below which String columns should be pooled (meaning that if the # of unique strings in a column is under 25%, pool=0.25, it will be pooled). If provided as a Tuple{Float64, Int} like (0.2, 500), it represents the percent cardinality threshold as the 1st tuple element (0.2), and an upper limit for the # of unique values (500), under which the column will be pooled; this is the default (pool=(0.2, 500)). If an AbstractVector, each element should be Bool, Real, or Tuple{Float64, Int} and the # of elements should match the # of columns in the dataset; if an AbstractDict, a Bool, Real, or Tuple{Float64, Int} value can be provided for individual columns where the dict key is given as column index Integer, or column name as Symbol or String. If a function is provided, it should take a column index and name as 2 arguments, and return a Bool, Real, Tuple{Float64, Int}, or nothing for each column.
  • downcast::Bool=false: controls whether columns detected as Int64 will be "downcast" to the smallest possible integer type like Int8, Int16, Int32, etc.
  • stringtype=InlineStrings.InlineString: controls how detected string columns will ultimately be returned; default is InlineString, which stores string data in a fixed-size primitive type that helps avoid excessive heap memory usage; if a column has values longer than 32 bytes, it will default to String. If String is passed, all string columns will just be normal String values. If PosLenString is passed, string columns will be returned as PosLenStringVector, which is a special "lazy" AbstractVector that acts as a "view" into the original file data. This can lead to the most efficient parsing times, but note that the "view" nature of PosLenStringVector makes it read-only, so operations like push!, append!, or setindex! are not supported. It also keeps a reference to the entire input dataset source, so trying to modify or delete the underlying file, for example, may fail
  • strict::Bool=false: whether invalid values should throw a parsing error or be replaced with missing
  • silencewarnings::Bool=false: if strict=false, whether invalid value warnings should be silenced
  • maxwarnings::Int=100: if more than maxwarnings number of warnings are printed while parsing, further warnings will be silenced by default; for multithreaded parsing, each parsing task will print up to maxwarnings
  • debug::Bool=false: passing true will result in many informational prints while a dataset is parsed; can be useful when reporting issues or figuring out what is going on internally while a dataset is parsed
  • validate::Bool=true: whether or not to validate that columns specified in the types, dateformat and pool keywords are actually found in the data. If false no validation is done, meaning no error will be thrown if types/dateformat/pool specify settings for columns not actually found in the data.

Iteration options:

  • reusebuffer=false: [only supported by CSV.Rows] while iterating, whether a single row buffer should be allocated and reused on each iteration; only use if each row will be iterated once and not re-used (e.g. it's not safe to use this option if doing collect(CSV.Rows(file)) because only current iterated row is "valid")

CSV.File (Type)
CSV.File(input; kwargs...) => CSV.File

Read a UTF-8 CSV input and return a CSV.File object, which is like a lightweight table/dataframe, allowing dot-access to columns and iterating rows. Satisfies the Tables.jl interface, so can be passed to any valid sink, yet to avoid unnecessary copies of data, use CSV.read(input, sink; kwargs...) instead if the CSV.File intermediate object isn't needed.

The input argument can be one of:

  • filename given as a string or FilePaths.jl type
  • a Vector{UInt8} or SubArray{UInt8, 1, Vector{UInt8}} byte buffer
  • a CodeUnits object, which wraps a String, like codeunits(str)
  • a csv-formatted string can also be passed like IOBuffer(str)
  • a Cmd or other IO
  • a gzipped file (or gzipped data in any of the above), which will automatically be decompressed for parsing
  • a Vector of any of the above, which will parse and vertically concatenate each source, returning a single, "long" CSV.File

To read a csv file from a url, use the Downloads.jl stdlib or HTTP.jl package, where the resulting downloaded tempfile or HTTP.Response body can be passed like:

using Downloads, CSV
f = CSV.File(Downloads.download(url))

# or

using HTTP, CSV
f = CSV.File(HTTP.get(url).body)

Opens the file or files and uses passed arguments to detect the number of columns and column types, unless column types are provided manually via the types keyword argument. Note that passing column types manually can slightly increase performance for each column type provided (column types can be given as a Vector for all columns, or specified per column via name or index in a Dict).

When a Vector of inputs is provided, the column names and types of each separate file/input must match in order to be vertically concatenated. A separate task will be used to parse each input, with each input parsed on a single thread. The columns from all inputs are then lazily vertically concatenated using ChainedVector.

For text encodings other than UTF-8, load the StringEncodings.jl package and call e.g. CSV.File(open(read, input, enc"ISO-8859-1")).

The returned CSV.File object supports the Tables.jl interface and can be iterated, producing CSV.Row values. CSV.Row supports propertynames and getproperty to access individual row values. CSV.File also supports entire column access like a DataFrame via direct property access on the file object, like f = CSV.File(file); f.col1, or by getindex access with column names, like f[:col1] or f["col1"]. The returned columns are AbstractArray subtypes, including: SentinelVector (for integers), regular Vector, PooledVector for pooled columns, MissingVector for columns of all missing values, PosLenStringVector when stringtype=PosLenString is passed, and ChainedVector, which chains one of the previous array types together for data inputs that use multiple threads to parse (each thread parses a single "chain" of the input). Note that duplicate column names will be detected and adjusted to ensure uniqueness (a duplicate column name a will become a_1). For example, one could iterate over a csv file with column names a, b, and c by doing:

for row in CSV.File(file)
    println("a=$(row.a), b=$(row.b), c=$(row.c)")
end

By supporting the Tables.jl interface, a CSV.File can also be a table input to any other table sink function. Like:

# materialize a csv file as a DataFrame, copying columns from CSV.File
df = CSV.File(file) |> DataFrame

# to avoid making a copy of parsed columns, use CSV.read
df = CSV.read(file, DataFrame)

# load a csv file directly into an sqlite database table
db = SQLite.DB()
tbl = CSV.File(file) |> SQLite.load!(db, "sqlite_table")

Arguments

File layout options:

  • header=1: how column names should be determined; if given as an Integer, indicates the row to parse for column names; as an AbstractVector{<:Integer}, indicates a set of rows to be concatenated together as column names; Vector{Symbol} or Vector{String} give column names explicitly (should match # of columns in dataset); if a dataset doesn't have column names, either provide them as a Vector, or set header=0 or header=false and column names will be auto-generated (Column1, Column2, etc.). Note that if a row number header and comment or ignoreemptyrows are provided, the header row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header row will actually be the next non-commented row.
  • normalizenames::Bool=false: whether column names should be "normalized" into valid Julia identifier symbols; useful when using the tbl.col1 getproperty syntax or iterating rows and accessing column values of a row via getproperty (e.g. row.col1)
  • skipto::Integer: specifies the row where the data starts in the csv file; by default, the next row after the header row(s) is used. If header=0, then the 1st row is assumed to be the start of data; providing a skipto argument does not affect the header argument. Note that if a row number skipto and comment or ignoreemptyrows are provided, the data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the data row will actually be the next non-commented row.
  • footerskip::Integer: number of rows at the end of a file to skip parsing. Do note that commented rows (see the comment keyword argument) do not count towards the row number provided for footerskip, they are completely ignored by the parser
  • transpose::Bool: read a csv file "transposed", i.e. each column is parsed as a row
  • comment::String: string that will cause rows that begin with it to be skipped while parsing. Note that if a row number header or skipto and comment are provided, the header/data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header/data row will actually be the next non-commented row.
  • ignoreemptyrows::Bool=true: whether empty rows in a file should be ignored (if false, each column will be assigned missing for that empty row)
  • select: an AbstractVector of Integer, Symbol, String, or Bool, or a "selector" function of the form (i, name) -> keep::Bool; only columns in the collection or for which the selector function returns true will be parsed and accessible in the resulting CSV.File. Invalid values in select are ignored.
  • drop: inverse of select; an AbstractVector of Integer, Symbol, String, or Bool, or a "drop" function of the form (i, name) -> drop::Bool; columns in the collection or for which the drop function returns true will be ignored in the resulting CSV.File. Invalid values in drop are ignored.
  • limit: an Integer to indicate a limited number of rows to parse in a csv file; use in combination with skipto to read a specific, contiguous chunk within a file; note for large files when multiple threads are used for parsing, the limit argument may not result in an exact # of rows parsed; use ntasks=1 to ensure an exact limit if necessary
  • buffer_in_memory: a Bool, default false, which controls whether a Cmd, IO, or gzipped source will be read/decompressed in memory vs. using a temporary file.
  • ntasks::Integer=Threads.nthreads(): [not applicable to CSV.Rows] for multithreaded parsed files, this controls the number of tasks spawned to read a file in concurrent chunks; defaults to the # of threads Julia was started with (i.e. JULIA_NUM_THREADS environment variable or julia -t N); setting ntasks=1 will avoid any calls to Threads.@spawn and just read the file serially on the main thread; a single thread will also be used for smaller files by default (< 5_000 cells)
  • rows_to_check::Integer=30: [not applicable to CSV.Rows] a multithreaded parsed file will be split up into ntasks # of equal chunks; rows_to_check controls the # of rows that are checked to ensure parsing correctly found valid rows; for certain files with very large quoted text fields, rows_to_check may need to be higher to ensure parsing correctly finds these rows
  • source: [only applicable for vector of inputs to CSV.File] a Symbol, String, or Pair of Symbol or String to Vector. As a single Symbol or String, provides the column name that will be added to the parsed columns; the values of the column will be the input "name" (usually the file name) of the input from which each value was parsed. As a Pair, the 2nd part of the pair should be a Vector of values matching the number of inputs, where each value will be used instead of the input name for that input's values in the auto-added column.

Parsing options:

  • missingstring: either a nothing, String, or Vector{String} to use as sentinel values that will be parsed as missing; if nothing is passed, no sentinel/missing values will be parsed; by default, missingstring="", which means only an empty field (two consecutive delimiters) is considered missing
  • delim=',': a Char or String that indicates how columns are delimited in a file; if no argument is provided, parsing will try to detect the most consistent delimiter on the first 10 rows of the file
  • ignorerepeated::Bool=false: whether repeated (consecutive/sequential) delimiters should be ignored while parsing; useful for fixed-width files with delimiter padding between cells
  • quoted::Bool=true: whether parsing should check for quotechar at the start/end of cells
  • quotechar='"', openquotechar, closequotechar: a Char (or different start and end characters) that indicate a quoted field which may contain textual delimiters or newline characters
  • escapechar='"': the Char used to escape quote characters in a quoted field
  • dateformat::Union{String, Dates.DateFormat, Nothing, AbstractDict}: a date format string to indicate how Date/DateTime columns are formatted for the entire file; if given as an AbstractDict, date format strings to indicate how the Date/DateTime columns corresponding to the keys are formatted. The Dict can map column index Int, or name Symbol or String to the format string for that column.
  • decimal='.': a Char indicating how decimals are separated in floats, i.e. 3.14 uses '.', or 3,14 uses a comma ','
  • groupmark=nothing: optionally specify a single-byte character denoting the number grouping mark, this allows parsing of numbers that have, e.g., thousand separators (1,000.00).
  • truestrings, falsestrings: Vector{String}s that indicate how true or false values are represented; by default "true", "True", "TRUE", "T", "1" are used to detect true and "false", "False", "FALSE", "F", "0" are used to detect false; note that columns with only 1 and 0 values will default to Int64 column type unless explicitly requested to be Bool via types keyword argument
  • stripwhitespace=false: if true, leading and trailing whitespace are stripped from string values, including column names

Column Type Options:

  • types: a single Type, AbstractVector or AbstractDict of types, or a function of the form (i, name) -> Union{T, Nothing} to be used for column types; if a single Type is provided, all columns will be parsed with that single type; an AbstractDict can map column index Integer, or name Symbol or String, to the type for a column, i.e. Dict(1=>Float64) will set the first column as a Float64, Dict(:column1=>Float64) will set the column named column1 to Float64, and Dict("column1"=>Float64) will set the column named column1 to Float64; if a Vector is provided, it must match the # of columns provided or detected in header. If a function is provided, it takes a column index and name as arguments, and should return the desired column type for the column, or nothing to signal the column's type should be detected while parsing.
  • typemap::IdDict{Type, Type}: a mapping of a type that should be replaced in every instance with another type, i.e. Dict(Float64=>String) would change every detected Float64 column to be parsed as String; only "standard" types are allowed to be mapped to another type, i.e. Int64, Float64, Date, DateTime, Time, and Bool. If a column of one of those types is "detected", it will be mapped to the specified type.
  • pool::Union{Bool, Real, AbstractVector, AbstractDict, Function, Tuple{Float64, Int}}=(0.2, 500): [not supported by CSV.Rows] controls whether columns will be built as PooledArray; if true, all columns detected as String will be pooled; alternatively, the proportion of unique values below which String columns should be pooled (meaning that if the # of unique strings in a column is under 25%, pool=0.25, it will be pooled). If provided as a Tuple{Float64, Int} like (0.2, 500), it represents the percent cardinality threshold as the 1st tuple element (0.2), and an upper limit for the # of unique values (500), under which the column will be pooled; this is the default (pool=(0.2, 500)). If an AbstractVector, each element should be Bool, Real, or Tuple{Float64, Int} and the # of elements should match the # of columns in the dataset; if an AbstractDict, a Bool, Real, or Tuple{Float64, Int} value can be provided for individual columns where the dict key is given as column index Integer, or column name as Symbol or String. If a function is provided, it should take a column index and name as 2 arguments, and return a Bool, Real, Tuple{Float64, Int}, or nothing for each column.
  • downcast::Bool=false: controls whether columns detected as Int64 will be "downcast" to the smallest possible integer type like Int8, Int16, Int32, etc.
  • stringtype=InlineStrings.InlineString: controls how detected string columns will ultimately be returned; default is InlineString, which stores string data in a fixed-size primitive type that helps avoid excessive heap memory usage; if a column has values longer than 32 bytes, it will default to String. If String is passed, all string columns will just be normal String values. If PosLenString is passed, string columns will be returned as PosLenStringVector, which is a special "lazy" AbstractVector that acts as a "view" into the original file data. This can lead to the most efficient parsing times, but note that the "view" nature of PosLenStringVector makes it read-only, so operations like push!, append!, or setindex! are not supported. It also keeps a reference to the entire input dataset source, so trying to modify or delete the underlying file, for example, may fail
  • strict::Bool=false: whether invalid values should throw a parsing error or be replaced with missing
  • silencewarnings::Bool=false: if strict=false, whether invalid value warnings should be silenced
  • maxwarnings::Int=100: if more than maxwarnings number of warnings are printed while parsing, further warnings will be silenced by default; for multithreaded parsing, each parsing task will print up to maxwarnings
  • debug::Bool=false: passing true will result in many informational prints while a dataset is parsed; can be useful when reporting issues or figuring out what is going on internally while a dataset is parsed
  • validate::Bool=true: whether or not to validate that columns specified in the types, dateformat and pool keywords are actually found in the data. If false no validation is done, meaning no error will be thrown if types/dateformat/pool specify settings for columns not actually found in the data.

Iteration options:

  • reusebuffer=false: [only supported by CSV.Rows] while iterating, whether a single row buffer should be allocated and reused on each iteration; only use if each row will be iterated once and not re-used (e.g. it's not safe to use this option if doing collect(CSV.Rows(file)) because only current iterated row is "valid")

CSV.Chunks (Type)
CSV.Chunks(source; ntasks::Integer=Threads.nthreads(), kwargs...) => CSV.Chunks

Returns a file "chunk" iterator. Accepts all the same inputs and keyword arguments as CSV.File, see those docs for explanations of each keyword argument.

The ntasks keyword argument specifies how many chunks a file should be split up into, defaulting to the # of threads available to Julia (i.e. JULIA_NUM_THREADS environment variable) or 8 if Julia is run single-threaded.

Each iteration of CSV.Chunks produces the next chunk of a file as a CSV.File. While initial file metadata detection is done only once (to determine # of columns, column names, etc), each iteration does independent type inference on columns. This is significant as different chunks may end up with different column types than previous chunks as new values are encountered in the file. Note that, as with CSV.File, types may be passed manually via the type or types keyword arguments.

This functionality is new and thus considered experimental; please open an issue if you run into any problems/bugs.

Arguments

File layout options:

  • header=1: how column names should be determined; if given as an Integer, indicates the row to parse for column names; as an AbstractVector{<:Integer}, indicates a set of rows to be concatenated together as column names; Vector{Symbol} or Vector{String} give column names explicitly (should match # of columns in dataset); if a dataset doesn't have column names, either provide them as a Vector, or set header=0 or header=false and column names will be auto-generated (Column1, Column2, etc.). Note that if a row number header and comment or ignoreemptyrows are provided, the header row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header row will actually be the next non-commented row.
  • normalizenames::Bool=false: whether column names should be "normalized" into valid Julia identifier symbols; useful when using the tbl.col1 getproperty syntax or iterating rows and accessing column values of a row via getproperty (e.g. row.col1)
  • skipto::Integer: specifies the row where the data starts in the csv file; by default, the next row after the header row(s) is used. If header=0, then the 1st row is assumed to be the start of data; providing a skipto argument does not affect the header argument. Note that if a row number skipto and comment or ignoreemptyrows are provided, the data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the data row will actually be the next non-commented row.
  • footerskip::Integer: number of rows at the end of a file to skip parsing. Do note that commented rows (see the comment keyword argument) do not count towards the row number provided for footerskip, they are completely ignored by the parser
  • transpose::Bool: read a csv file "transposed", i.e. each column is parsed as a row
  • comment::String: string that will cause rows that begin with it to be skipped while parsing. Note that if a row number header or skipto and comment are provided, the header/data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header/data row will actually be the next non-commented row.
  • ignoreemptyrows::Bool=true: whether empty rows in a file should be ignored (if false, each column will be assigned missing for that empty row)
  • select: an AbstractVector of Integer, Symbol, String, or Bool, or a "selector" function of the form (i, name) -> keep::Bool; only columns in the collection or for which the selector function returns true will be parsed and accessible in the resulting CSV.File. Invalid values in select are ignored.
  • drop: inverse of select; an AbstractVector of Integer, Symbol, String, or Bool, or a "drop" function of the form (i, name) -> drop::Bool; columns in the collection or for which the drop function returns true will be ignored in the resulting CSV.File. Invalid values in drop are ignored.
  • limit: an Integer to indicate a limited number of rows to parse in a csv file; use in combination with skipto to read a specific, contiguous chunk within a file; note for large files when multiple threads are used for parsing, the limit argument may not result in an exact # of rows parsed; use ntasks=1 to ensure an exact limit if necessary
  • buffer_in_memory: a Bool, default false, which controls whether a Cmd, IO, or gzipped source will be read/decompressed in memory vs. using a temporary file.
  • ntasks::Integer=Threads.nthreads(): [not applicable to CSV.Rows] for multithreaded parsed files, this controls the number of tasks spawned to read a file in concurrent chunks; defaults to the # of threads Julia was started with (i.e. JULIA_NUM_THREADS environment variable or julia -t N); setting ntasks=1 will avoid any calls to Threads.@spawn and just read the file serially on the main thread; a single thread will also be used for smaller files by default (< 5_000 cells)
  • rows_to_check::Integer=30: [not applicable to CSV.Rows] a multithreaded parsed file will be split up into ntasks # of equal chunks; rows_to_check controls the # of rows that are checked to ensure parsing correctly found valid rows; for certain files with very large quoted text fields, rows_to_check may need to be higher to ensure parsing correctly finds these rows
  • source: [only applicable for vector of inputs to CSV.File] a Symbol, String, or Pair of Symbol or String to Vector. As a single Symbol or String, provides the column name that will be added to the parsed columns; the values of the column will be the input "name" (usually the file name) of the input from which each value was parsed. As a Pair, the 2nd part of the pair should be a Vector of values matching the number of inputs, where each value will be used instead of the input name for that input's values in the auto-added column.

Parsing options:

  • missingstring: either a nothing, String, or Vector{String} to use as sentinel values that will be parsed as missing; if nothing is passed, no sentinel/missing values will be parsed; by default, missingstring="", which means only an empty field (two consecutive delimiters) is considered missing
  • delim=',': a Char or String that indicates how columns are delimited in a file; if no argument is provided, parsing will try to detect the most consistent delimiter on the first 10 rows of the file
  • ignorerepeated::Bool=false: whether repeated (consecutive/sequential) delimiters should be ignored while parsing; useful for fixed-width files with delimiter padding between cells
  • quoted::Bool=true: whether parsing should check for quotechar at the start/end of cells
  • quotechar='"', openquotechar, closequotechar: a Char (or different start and end characters) that indicate a quoted field which may contain textual delimiters or newline characters
  • escapechar='"': the Char used to escape quote characters in a quoted field
  • dateformat::Union{String, Dates.DateFormat, Nothing, AbstractDict}: a date format string to indicate how Date/DateTime columns are formatted for the entire file; if given as an AbstractDict, date format strings to indicate how the Date/DateTime columns corresponding to the keys are formatted. The Dict can map column index Int, or name Symbol or String to the format string for that column.
  • decimal='.': a Char indicating how decimals are separated in floats, i.e. 3.14 uses '.', or 3,14 uses a comma ','
  • groupmark=nothing: optionally specify a single-byte character denoting the number grouping mark; this allows parsing of numbers that have, e.g., thousands separators (1,000.00).
  • truestrings, falsestrings: Vector{String}s that indicate how true or false values are represented; by default "true", "True", "TRUE", "T", "1" are used to detect true and "false", "False", "FALSE", "F", "0" are used to detect false; note that columns with only 1 and 0 values will default to Int64 column type unless explicitly requested to be Bool via types keyword argument
  • stripwhitespace=false: if true, leading and trailing whitespace are stripped from string values, including column names
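
A minimal sketch combining several parsing options (the file name, delimiter, formats, and sentinel strings below are assumptions):

using CSV

file = CSV.File("data.csv";
    delim = ';',                          # semicolon-delimited file
    missingstring = ["", "NA", "NULL"],   # treat any of these as missing
    dateformat = "yyyy-mm-dd HH:MM:SS",   # how Date/DateTime columns are formatted
    decimal = ',',                        # e.g. "3,14" parses as 3.14
    truestrings = ["yes"], falsestrings = ["no"])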

Column Type Options:

  • types: a single Type, AbstractVector or AbstractDict of types, or a function of the form (i, name) -> Union{T, Nothing} to be used for column types (a short example follows this list); if a single Type is provided, all columns will be parsed with that single type; an AbstractDict can map a column index Integer, or name Symbol or String, to the type for that column, i.e. Dict(1=>Float64) will set the first column to Float64, Dict(:column1=>Float64) will set the column named column1 to Float64, and Dict("column1"=>Float64) will likewise set column1 to Float64; if a Vector is provided, it must match the # of columns provided or detected in header. If a function is provided, it takes a column index and name as arguments, and should return the desired column type for the column, or nothing to signal the column's type should be detected while parsing.
  • typemap::IdDict{Type, Type}: a mapping of a type that should be replaced in every instance with another type, i.e. IdDict(Float64=>String) would change every detected Float64 column to be parsed as String; only "standard" types are allowed to be mapped to another type, i.e. Int64, Float64, Date, DateTime, Time, and Bool. If a column of one of those types is "detected", it will be mapped to the specified type.
  • pool::Union{Bool, Real, AbstractVector, AbstractDict, Function, Tuple{Float64, Int}}=(0.2, 500): [not supported by CSV.Rows] controls whether columns will be built as PooledArray; if true, all columns detected as String will be pooled; alternatively, a Real gives the proportion of unique values below which String columns should be pooled (e.g. with pool=0.25, a column will be pooled if the # of unique strings is under 25% of the total). If provided as a Tuple{Float64, Int} like (0.2, 500), it represents the percent cardinality threshold as the 1st tuple element (0.2), and an upper limit for the # of unique values (500), under which the column will be pooled; this is the default (pool=(0.2, 500)). If an AbstractVector, each element should be Bool, Real, or Tuple{Float64, Int} and the # of elements should match the # of columns in the dataset; if an AbstractDict, a Bool, Real, or Tuple{Float64, Int} value can be provided for individual columns where the dict key is given as a column index Integer, or a column name as Symbol or String. If a function is provided, it should take a column index and name as 2 arguments, and return a Bool, Real, Tuple{Float64, Int}, or nothing for each column.
  • downcast::Bool=false: controls whether columns detected as Int64 will be "downcast" to the smallest possible integer type like Int8, Int16, Int32, etc.
  • stringtype=InlineStrings.InlineString: controls how detected string columns will ultimately be returned; default is InlineString, which stores string data in a fixed-size primitive type that helps avoid excessive heap memory usage; if a column has values longer than 32 bytes, it will default to String. If String is passed, all string columns will just be normal String values. If PosLenString is passed, string columns will be returned as PosLenStringVector, which is a special "lazy" AbstractVector that acts as a "view" into the original file data. This can lead to the most efficient parsing times, but note that the "view" nature of PosLenStringVector makes it read-only, so operations like push!, append!, or setindex! are not supported. It also keeps a reference to the entire input dataset source, so trying to modify or delete the underlying file, for example, may fail
  • strict::Bool=false: whether invalid values should throw a parsing error or be replaced with missing
  • silencewarnings::Bool=false: if strict=false, whether invalid value warnings should be silenced
  • maxwarnings::Int=100: if more than maxwarnings warnings are printed while parsing, further warnings will be silenced; for multithreaded parsing, each parsing task will print up to maxwarnings warnings
  • debug::Bool=false: passing true will result in many informational prints while a dataset is parsed; can be useful when reporting issues or figuring out what is going on internally while a dataset is parsed
  • validate::Bool=true: whether to validate that columns specified in the types, dateformat and pool keywords are actually found in the data. If false, no validation is done, meaning no error will be thrown if types/dateformat/pool specify settings for columns not actually found in the data.
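
A hedged sketch of the column type options (the file and column names are hypothetical):

using CSV, Dates

file = CSV.File("data.csv";
    types = Dict(:id => Int64, :score => Float64),  # fix the types of two columns
    typemap = IdDict(Date => String),               # parse any detected Date column as String
    pool = 0.1,                                     # pool string columns with < 10% unique values
    downcast = true,                                # shrink Int64 columns where possible
    stringtype = String)                            # return plain String columns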

Iteration options:

  • reusebuffer=false: [only supported by CSV.Rows] while iterating, whether a single row buffer should be allocated and reused on each iteration; only use if each row will be consumed once during iteration and not kept around (e.g. it's not safe to use this option when doing collect(CSV.Rows(file)) because only the currently iterated row is "valid")
CSV.Rows (Type)
CSV.Rows(source; kwargs...) => CSV.Rows

Read a csv input returning a CSV.Rows object.

The input argument can be one of:

  • filename given as a string or FilePaths.jl type
  • a Vector{UInt8} or SubArray{UInt8, 1, Vector{UInt8}} byte buffer
  • a CodeUnits object, which wraps a String, like codeunits(str)
  • an IOBuffer wrapping a csv-formatted string, e.g. IOBuffer(str)
  • a Cmd or other IO
  • a gzipped file (or gzipped data in any of the above), which will automatically be decompressed for parsing

To read a csv file from a url, use the HTTP.jl package, where the HTTP.Response body can be passed like:

f = CSV.Rows(HTTP.get(url).body)

For other IO or Cmd inputs, you can pass them like: f = CSV.Rows(read(obj)).

While similar to CSV.File, CSV.Rows provides a slightly different interface, with the following tradeoffs:

  • Very minimal memory footprint; while iterating, only the current row values are buffered
  • Only provides row access via iteration; to access columns, one can stream the rows into a table type
  • Performs no type inference; each column/cell is essentially treated as Union{String, Missing}; users can utilize the performant Parsers.parse(T, str) to convert values to a more specific type if needed (see the example after the loop below), or pass types upon construction using the type or types keyword arguments

Opens the file and uses passed arguments to detect the number of columns, but not column types (column types default to String unless otherwise manually provided). The returned CSV.Rows object supports the Tables.jl interface and can iterate rows. Each row object supports propertynames, getproperty, and getindex to access individual row values. Note that duplicate column names will be detected and adjusted to ensure uniqueness (duplicate column name a will become a_1). For example, one could iterate over a csv file with column names a, b, and c by doing:

for row in CSV.Rows(file)
    println("a=$(row.a), b=$(row.b), c=$(row.c)")
end
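
Because CSV.Rows performs no type inference, each accessed cell is returned as a string (or missing). As noted above, Parsers.parse(T, str) can be used to convert values; a minimal sketch (the column names and target types here are assumptions):

using CSV, Parsers

for row in CSV.Rows(file)
    a = Parsers.parse(Int, row.a)      # convert the cell's string to an Int
    b = Parsers.parse(Float64, row.b)  # convert to a Float64
    println(a + b)                     # assumes neither cell is missing
end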

Arguments

File layout options:

  • header=1: how column names should be determined; if given as an Integer, indicates the row to parse for column names; as an AbstractVector{<:Integer}, indicates a set of rows to be concatenated together as column names; Vector{Symbol} or Vector{String} give column names explicitly (should match # of columns in dataset); if a dataset doesn't have column names, either provide them as a Vector, or set header=0 or header=false and column names will be auto-generated (Column1, Column2, etc.). Note that if a row number header and comment or ignoreemptyrows are provided, the header row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header row will actually be the next non-commented row.
  • normalizenames::Bool=false: whether column names should be "normalized" into valid Julia identifier symbols; useful when using the tbl.col1 getproperty syntax or iterating rows and accessing column values of a row via getproperty (e.g. row.col1)
  • skipto::Integer: specifies the row where the data starts in the csv file; by default, the next row after the header row(s) is used. If header=0, then the 1st row is assumed to be the start of data; providing a skipto argument does not affect the header argument. Note that if a row number skipto and comment or ignoreemptyrows are provided, the data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the data row will actually be the next non-commented row.
  • footerskip::Integer: number of rows at the end of a file to skip parsing. Do note that commented rows (see the comment keyword argument) do not count towards the row number provided for footerskip; they are completely ignored by the parser
  • transpose::Bool: read a csv file "transposed", i.e. each column is parsed as a row
  • comment::String: string that will cause rows that begin with it to be skipped while parsing. Note that if a row number header or skipto and comment are provided, the header/data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header/data row will actually be the next non-commented row.
  • ignoreemptyrows::Bool=true: whether empty rows in a file should be ignored (if false, each column will be assigned missing for that empty row)
  • select: an AbstractVector of Integer, Symbol, String, or Bool, or a "selector" function of the form (i, name) -> keep::Bool; only columns in the collection or for which the selector function returns true will be parsed and accessible in the resulting CSV.File. Invalid values in select are ignored.
  • drop: inverse of select; an AbstractVector of Integer, Symbol, String, or Bool, or a "drop" function of the form (i, name) -> drop::Bool; columns in the collection or for which the drop function returns true will be ignored in the resulting CSV.File. Invalid values in drop are ignored.
  • limit: an Integer to indicate a limited number of rows to parse in a csv file; use in combination with skipto to read a specific, contiguous chunk within a file; note for large files when multiple threads are used for parsing, the limit argument may not result in an exact # of rows parsed; use ntasks=1 to ensure an exact limit if necessary
  • buffer_in_memory: a Bool, default false, which controls whether a Cmd, IO, or gzipped source will be read/decompressed in memory vs. using a temporary file.
  • ntasks::Integer=Threads.nthreads(): [not applicable to CSV.Rows] for multithreaded parsing, this controls the number of tasks spawned to read a file in concurrent chunks; defaults to the # of threads Julia was started with (i.e. JULIA_NUM_THREADS environment variable or julia -t N); setting ntasks=1 will avoid any calls to Threads.@spawn and just read the file serially on the main thread; a single thread will also be used for smaller files by default (< 5_000 cells)
  • rows_to_check::Integer=30: [not applicable to CSV.Rows] a multithreaded parsed file will be split up into ntasks # of equal chunks; rows_to_check controls the # of rows that are checked to ensure parsing correctly found valid rows; for certain files with very large quoted text fields, rows_to_check may need to be increased to ensure parsing correctly finds these rows
  • source: [only applicable for vector of inputs to CSV.File] a Symbol, String, or Pair of Symbol or String to Vector. As a single Symbol or String, provides the name of the column that will be added to the parsed columns; the values of the column will be the input "name" (usually the file name) of the input from which each row was parsed. As a Pair, the 2nd part of the pair should be a Vector of values matching the # of inputs, where each value will be used instead of the input name for that input's values in the auto-added column.

Parsing options:

  • missingstring: either nothing, a String, or a Vector{String} to use as sentinel values that will be parsed as missing; if nothing is passed, no sentinel/missing values will be parsed; by default, missingstring="", which means only an empty field (two consecutive delimiters) is considered missing
  • delim=',': a Char or String that indicates how columns are delimited in a file; if no argument is provided, parsing will try to detect the most consistent delimiter on the first 10 rows of the file
  • ignorerepeated::Bool=false: whether repeated (consecutive/sequential) delimiters should be ignored while parsing; useful for fixed-width files with delimiter padding between cells
  • quoted::Bool=true: whether parsing should check for quotechar at the start/end of cells
  • quotechar='"', openquotechar, closequotechar: a Char (or different start and end characters) that indicate a quoted field which may contain textual delimiters or newline characters
  • escapechar='"': the Char used to escape quote characters in a quoted field
  • dateformat::Union{String, Dates.DateFormat, Nothing, AbstractDict}: a date format string to indicate how Date/DateTime columns are formatted for the entire file; if given as an AbstractDict, date format strings to indicate how the Date/DateTime columns corresponding to the keys are formatted. The Dict can map column index Int, or name Symbol or String to the format string for that column.
  • decimal='.': a Char indicating how decimals are separated in floats, i.e. 3.14 uses '.', or 3,14 uses a comma ','
  • groupmark=nothing: optionally specify a single-byte character denoting the number grouping mark; this allows parsing of numbers that have, e.g., thousands separators (1,000.00).
  • truestrings, falsestrings: Vector{String}s that indicate how true or false values are represented; by default "true", "True", "TRUE", "T", "1" are used to detect true and "false", "False", "FALSE", "F", "0" are used to detect false; note that columns with only 1 and 0 values will default to Int64 column type unless explicitly requested to be Bool via types keyword argument
  • stripwhitespace=false: if true, leading and trailing whitespace are stripped from string values, including column names

Column Type Options:

  • types: a single Type, AbstractVector or AbstractDict of types, or a function of the form (i, name) -> Union{T, Nothing} to be used for column types; if a single Type is provided, all columns will be parsed with that single type; an AbstractDict can map a column index Integer, or name Symbol or String, to the type for that column, i.e. Dict(1=>Float64) will set the first column to Float64, Dict(:column1=>Float64) will set the column named column1 to Float64, and Dict("column1"=>Float64) will likewise set column1 to Float64; if a Vector is provided, it must match the # of columns provided or detected in header. If a function is provided, it takes a column index and name as arguments, and should return the desired column type for the column, or nothing to signal the column's type should be detected while parsing.
  • typemap::IdDict{Type, Type}: a mapping of a type that should be replaced in every instance with another type, i.e. IdDict(Float64=>String) would change every detected Float64 column to be parsed as String; only "standard" types are allowed to be mapped to another type, i.e. Int64, Float64, Date, DateTime, Time, and Bool. If a column of one of those types is "detected", it will be mapped to the specified type.
  • pool::Union{Bool, Real, AbstractVector, AbstractDict, Function, Tuple{Float64, Int}}=(0.2, 500): [not supported by CSV.Rows] controls whether columns will be built as PooledArray; if true, all columns detected as String will be pooled; alternatively, a Real gives the proportion of unique values below which String columns should be pooled (e.g. with pool=0.25, a column will be pooled if the # of unique strings is under 25% of the total). If provided as a Tuple{Float64, Int} like (0.2, 500), it represents the percent cardinality threshold as the 1st tuple element (0.2), and an upper limit for the # of unique values (500), under which the column will be pooled; this is the default (pool=(0.2, 500)). If an AbstractVector, each element should be Bool, Real, or Tuple{Float64, Int} and the # of elements should match the # of columns in the dataset; if an AbstractDict, a Bool, Real, or Tuple{Float64, Int} value can be provided for individual columns where the dict key is given as a column index Integer, or a column name as Symbol or String. If a function is provided, it should take a column index and name as 2 arguments, and return a Bool, Real, Tuple{Float64, Int}, or nothing for each column.
  • downcast::Bool=false: controls whether columns detected as Int64 will be "downcast" to the smallest possible integer type like Int8, Int16, Int32, etc.
  • stringtype=InlineStrings.InlineString: controls how detected string columns will ultimately be returned; default is InlineString, which stores string data in a fixed-size primitive type that helps avoid excessive heap memory usage; if a column has values longer than 32 bytes, it will default to String. If String is passed, all string columns will just be normal String values. If PosLenString is passed, string columns will be returned as PosLenStringVector, which is a special "lazy" AbstractVector that acts as a "view" into the original file data. This can lead to the most efficient parsing times, but note that the "view" nature of PosLenStringVector makes it read-only, so operations like push!, append!, or setindex! are not supported. It also keeps a reference to the entire input dataset source, so trying to modify or delete the underlying file, for example, may fail
  • strict::Bool=false: whether invalid values should throw a parsing error or be replaced with missing
  • silencewarnings::Bool=false: if strict=false, whether invalid value warnings should be silenced
  • maxwarnings::Int=100: if more than maxwarnings warnings are printed while parsing, further warnings will be silenced; for multithreaded parsing, each parsing task will print up to maxwarnings warnings
  • debug::Bool=false: passing true will result in many informational prints while a dataset is parsed; can be useful when reporting issues or figuring out what is going on internally while a dataset is parsed
  • validate::Bool=true: whether to validate that columns specified in the types, dateformat and pool keywords are actually found in the data. If false, no validation is done, meaning no error will be thrown if types/dateformat/pool specify settings for columns not actually found in the data.

Iteration options:

  • reusebuffer=false: [only supported by CSV.Rows] while iterating, whether a single row buffer should be allocated and reused on each iteration; only use if each row will be consumed once during iteration and not kept around (e.g. it's not safe to use this option when doing collect(CSV.Rows(file)) because only the currently iterated row is "valid"); see the sketch after this list
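
A minimal sketch of the reusebuffer tradeoff (the file name here is a placeholder):

using CSV

# Safe: each row is consumed immediately, so one reusable buffer is enough.
for row in CSV.Rows("data.csv"; reusebuffer=true)
    println(row.a)
end

# Not safe with reusebuffer=true: collected rows would all alias the same buffer,
# so keep the default reusebuffer=false when materializing rows.
rows = collect(CSV.Rows("data.csv"))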

Utilities

CSV.detect (Function)
CSV.detect(str::String)

Uses the same logic as CSV.File's column-type detection to parse a value from a plain string. This can be useful in conjunction with the CSV.Rows type, which returns each cell of a file as a String. The order of types attempted is: Int, Float64, Date, DateTime, Bool, and if all fail, the input String is returned. No errors are thrown. For advanced usage, you can pass your own Parsers.Options type as a keyword argument option=ops for sentinel value detection.
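
For example, given the detection order above, one would expect (illustrative inputs and results):

using CSV

CSV.detect("101")        # 101 (an Int)
CSV.detect("3.14")       # 3.14 (a Float64)
CSV.detect("2021-01-01") # a Date
CSV.detect("hello")      # "hello" (no other type matched; the String is returned)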


Common terms

Standard types

The types that are detected by default when column types are not provided by the user otherwise. They include: Int64, Float64, Date, DateTime, Time, Bool, and String.

Newlines

For all parsing functionality, newlines are detected/parsed automatically, regardless if they're present in the data as a single newline character ('\n'), single return character ('\r'), or full CRLF sequence ("\r\n").

Cardinality

Refers to the ratio of unique values to the total number of values in a column. Columns with "low cardinality" have a low % of unique values; put another way, there are only a few unique values that are repeated many times across the entire column. Columns with "high cardinality" have a high % of unique values relative to the total number of values. Think of these as "id-like" columns, where each (or almost each) value is a unique identifier with no (or few) repeated values.
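
As an illustrative calculation (the values are made up) relating cardinality to the pool keyword described earlier:

# Cardinality = (# of unique values) / (total # of values)
col = ["a", "b", "a", "a", "c", "b", "a", "b"]   # hypothetical column
cardinality = length(unique(col)) / length(col)  # 3 / 8 = 0.375
# With the default pool=(0.2, 500), this column would not be pooled,
# since 0.375 exceeds the 0.2 cardinality threshold.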