You have probably already noticed that reading CSV files in Julia is slow as hell. It is much slower than R's readr::read_csv or the latest version of data.table.
For example, reading a 40 MB file with 515,356 rows and 25 columns can take up to 7 seconds and 607 MiB of memory (the benchmark below shows a median of about 5 seconds), and that's on a MacBook Pro with an SSD.
julia> @benchmark CSV.read(file_path, DataTable, nullable = false, types = data_types)
memory estimate: 607.77 MiB
allocs estimate: 37767548
minimum time: 4.684 s (0.79% GC)
median time: 5.032 s (2.04% GC)
mean time: 5.032 s (2.04% GC)
maximum time: 5.379 s (3.13% GC)
So what’s the solution?
The solution is to save your data in the Feather format and read it back from there. Feather was created by Wes McKinney and Hadley Wickham on top of Apache Arrow as a very fast file format for storing data frames.
What is Feather?
Feather is a fast, lightweight, and easy-to-use binary file format for storing data frames. It has a few specific design goals:
- Lightweight, minimal API: make pushing data frames in and out of memory as simple as possible
- Language agnostic: Feather files are the same whether written by Julia, Python or R code.
- High read and write performance. When possible, Feather operations should be bound by local disk performance.
How good is it?
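I saved the same CSV file as Feather and then read it back in Julia. According to the results below, Feather was far more than 25 times faster: comparing median times (5.032 s vs. 73.757 ms), it was roughly 68 times faster, and it allocated about 14 times less memory (43.96 MiB vs. 607.77 MiB).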
julia> @benchmark Feather.read("opens/data/data.contacts.feather")
memory estimate: 43.96 MiB
allocs estimate: 4860
minimum time: 47.767 ms (1.92% GC)
median time: 73.757 ms (22.80% GC)
mean time: 118.652 ms (52.99% GC)
maximum time: 183.505 ms (68.02% GC)
It's a no-brainer: I will use it the next time I have to save files.
The Feather.jl package is registered in METADATA.jl, so it can be installed with Pkg.add("Feather").
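The save-then-read round trip described above can be sketched roughly as follows. This is a minimal illustration, not the exact code behind the benchmarks: it assumes a recent Julia with the DataFrames and Feather packages installed (the benchmark used DataTable instead), and the small table and the output path are made up for the example.

```julia
# One-time install, if needed:
#   import Pkg; Pkg.add(["DataFrames", "Feather"])
using DataFrames
using Feather

# Small stand-in table; the real data from the benchmarks is not reproduced here.
df = DataFrame(id = [1, 2, 3], name = ["a", "b", "c"])

# Hypothetical output path in the system temp directory.
path = joinpath(tempdir(), "example.feather")

# Write the data frame to a Feather file, then read it back.
Feather.write(path, df)
restored = Feather.read(path)
```

To convert an existing CSV file, the same pattern should apply: load it once with CSV.read, pass the resulting table to Feather.write, and use Feather.read for every later load.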