CSV.jl Documentation
GitHub Repo: https://github.com/JuliaData/CSV.jl
Welcome to CSV.jl! A pure-Julia package for handling delimited text data, be it comma-delimited (csv), tab-delimited (tsv), or otherwise.
Installation
You can install CSV by typing the following in the Julia REPL:

```julia
] add CSV
```

followed by

```julia
using CSV
```

to load the package.
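If you prefer the functional Pkg API from Julia's standard library over the `]` REPL mode, the equivalent is:

```julia
# equivalent installation using the Pkg API instead of Pkg REPL mode
using Pkg
Pkg.add("CSV")
```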
Overview
To start out, let's discuss the high-level functionality provided by the package, which hopefully will help direct you to more specific documentation for your use-case:
`CSV.File`: the most commonly used function for ingesting delimited data; will read an entire data input, or a vector of data inputs, detecting the number of columns and rows, along with the type of data for each column. Returns a `CSV.File` object, which is like a lightweight table/DataFrame. Assuming `file` is a `CSV.File` object, individual columns can be accessed like `file.col1`, `file[:col1]`, or `file["col1"]`. You can see the parsed column names via `file.names`. A `CSV.File` can also be iterated, where a `CSV.Row` is produced on each iteration, which allows access to each value in the row via `row.col1`, `row[:col1]`, or `row[1]`. You can also index a `CSV.File` directly, like `file[1]`, to return the entire `CSV.Row` at the provided index/row number. Multiple threads will be used while parsing the input data if the input is large enough, and full column buffers to hold the parsed data will be allocated. `CSV.File` satisfies the Tables.jl "source" interface, and so can be passed to valid sink functions like `DataFrame`, `SQLite.load!`, `Arrow.write`, etc. Supports a number of keyword arguments to control parsing, column types, and other file metadata options.
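For example, a minimal sketch of reading and inspecting a file (the file name and the column names `col1`/`col2` are hypothetical):

```julia
using CSV

file = CSV.File("data.csv")   # hypothetical input file

file.names    # parsed column names
file.col1     # access a column; file[:col1] and file["col1"] also work

# iterating a CSV.File produces a CSV.Row per iteration
for row in file
    println(row.col1, " | ", row[:col2])
end

file[1]       # index directly to get the entire first CSV.Row
```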
`CSV.read`: a convenience function identical to `CSV.File`, but used when a `CSV.File` will be passed directly to a sink function, like a `DataFrame`. In some cases, sinks may make copies of incoming data for their own safety; by calling `CSV.read(file, DataFrame)`, no copies of the parsed `CSV.File` will be made, and the `DataFrame` will take direct ownership of the `CSV.File`'s columns, which is more efficient than doing `CSV.File(file) |> DataFrame`, which will result in an extra copy of each column being made. Keyword arguments are identical to `CSV.File`. Any valid Tables.jl sink function/table type can be passed as the 2nd argument. Like `CSV.File`, a vector of data inputs can be passed as the 1st argument, which will result in a single "long" table of all the inputs vertically concatenated. Each input must have an identical schema (column names and types).
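A short sketch of both forms (input file names are hypothetical):

```julia
using CSV, DataFrames

# parse directly into a DataFrame; the DataFrame takes ownership of the columns
df = CSV.read("data.csv", DataFrame)

# a vector of inputs with identical schemas is vertically concatenated
df_all = CSV.read(["jan.csv", "feb.csv"], DataFrame)
```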
`CSV.Rows`: an alternative approach for consuming delimited data, where the input is only consumed one row at a time, which allows "streaming" the data with a lower memory footprint than `CSV.File`. Supports many of the same options as `CSV.File`, except column type handling is a little different. By default, every column type will be essentially `Union{Missing, String}`, i.e. no automatic type detection is done, but column types can be provided manually. Multithreading is not used while parsing. After constructing a `CSV.Rows` object, rows can be "streamed" by iterating, where each iteration produces a `CSV.Row2` object, which operates similarly to `CSV.File`'s `CSV.Row` type, where individual row values can be accessed via `row.col1`, `row[:col1]`, or `row[1]`. If each row is processed individually, additional memory can be saved by passing `reusebuffer=true`, which means a single buffer will be allocated to hold the values of only the currently iterated row. `CSV.Rows` also supports the Tables.jl interface and can likewise be passed to valid sink functions.
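For example, a minimal streaming sketch (file and column names are hypothetical):

```julia
using CSV

# stream one row at a time; with reusebuffer=true a single buffer is reused,
# so each row's values must be consumed before the next iteration
for row in CSV.Rows("big.csv"; reusebuffer=true)
    println(row.col1)   # values are essentially Union{Missing, String} by default
end
```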
`CSV.Chunks`: similar to `CSV.File`, but allows passing an `ntasks::Integer` keyword argument, which will cause the input file to be "chunked" up into `ntasks` number of chunks. After constructing a `CSV.Chunks` object, each iteration of the object will return a `CSV.File` of the next parsed chunk. Useful for processing extremely large files in "chunks". Because each iterated element is a valid Tables.jl "source", `CSV.Chunks` satisfies the `Tables.partitions` interface, so sinks that can process input partitions can operate by passing `CSV.Chunks` as the "source".
`CSV.write`: a valid Tables.jl "sink" function for writing any valid input table out in a delimited text format. Supports many options for controlling the output, like the delimiter, quote characters, etc. Writes data to an internal buffer, which is flushed out when full; the buffer size is configurable. Also supports writing out partitioned inputs as separate output files, one file per input partition. To write out a `DataFrame`, for example, it's simply `CSV.write("data.csv", df)`, or to write out a matrix, it's `using Tables; CSV.write("data.csv", Tables.table(mat))`.
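A runnable sketch of both cases (output file names are arbitrary):

```julia
using CSV, DataFrames, Tables

df = DataFrame(a=[1, 2], b=["x", "y"])
CSV.write("data.csv", df)                 # write a DataFrame

mat = [1 2; 3 4]
CSV.write("mat.csv", Tables.table(mat))   # wrap a matrix as a table, then write

CSV.write("data.tsv", df; delim='\t')     # tab-delimited output
```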
`CSV.RowWriter`: an alternative way to produce csv output; takes any valid Tables.jl input, and on each iteration, produces a single csv-formatted string from one row of the input table.
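For example, a minimal sketch:

```julia
using CSV, DataFrames

df = DataFrame(a=[1, 2], b=["x", "y"])

# each iteration produces a single csv-formatted string
for rowstring in CSV.RowWriter(df)
    print(rowstring)
end
```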
That's quite a bit! Let's boil down a TL;DR:
- Just want to read a delimited file, or collection of files, and do basic stuff with the data? Use `CSV.File(file)` or `CSV.read(file, DataFrame)`.
- Don't need the data as a whole, or want to stream through a large file row-by-row? Use `CSV.Rows`.
- Want to process a large file in "batches"/chunks? Use `CSV.Chunks`.
- Need to produce a csv? Use `CSV.write`.
- Want to iterate an input table and produce a single csv string per row? Use `CSV.RowWriter`.
For the rest of the manual, we're going to have two big sections, Reading and Writing, where we'll walk through the various options to `CSV.File`/`CSV.read`/`CSV.Rows`/`CSV.Chunks` and `CSV.write`/`CSV.RowWriter`.
- Reading
  - `input`
  - `header`
  - `normalizenames`
  - `skipto`
  - `footerskip`
  - `transpose`
  - `comment`
  - `ignoreemptyrows`
  - `select` / `drop`
  - `limit`
  - `ntasks`
  - `rows_to_check`
  - `source`
  - `missingstring`
  - `delim`
  - `ignorerepeated`
  - `quoted`
  - `quotechar` / `openquotechar` / `closequotechar`
  - `escapechar`
  - `dateformat`
  - `decimal`
  - `groupmark` / thousands separator
  - `truestrings` / `falsestrings`
  - `types`
  - `typemap`
  - `pool`
  - `downcast`
  - `stringtype`
  - `strict` / `silencewarnings` / `maxwarnings`
  - `debug`
- API Reference
- Common terms
- Writing
- Examples
  - Non-UTF-8 character encodings
  - Concatenate multiple inputs at once
  - Gzipped input
  - Delimited data in a string
  - Data from the web/a url
  - Reading from a zip file
  - Column names on 2nd row
  - No column names in data
  - Manually provide column names
  - Multi-row column names
  - Normalizing column names
  - Skip to specific row where data starts
  - Skipping trailing useless rows
  - Reading transposed data
  - Ignoring commented rows
  - Ignoring empty rows
  - Including/excluding columns
  - Limiting number of rows from data
  - Specifying custom missing strings
  - String delimiter
  - Fixed width files
  - Turning off quoted cell parsing
  - Quoted & escaped fields
  - DateFormat
  - Custom decimal separator
  - Thousands separator
  - Custom groupmarks
  - Custom bool strings
  - Matrix-like Data
  - Providing types
  - Typemap
  - Pooled values
  - Non-string pooled values
  - Pool with absolute threshold