Julia: switch from slow CSV to Feather

You have most likely already noticed that reading CSV files in Julia is slow as hell. It is much slower than R's readr::read_csv or the latest version of data.table.

For example, reading a 40 MB file with 515356 rows and 25 columns can take up to 7 seconds and about 607 MiB of memory, and that’s on a MacBook Pro with an SSD.

julia> @benchmark CSV.read(file_path, DataTable, nullable = false, types = data_types)
memory estimate: 607.77 MiB
allocs estimate: 37767548
minimum time: 4.684 s (0.79% GC)
median time: 5.032 s (2.04% GC)
mean time: 5.032 s (2.04% GC)
maximum time: 5.379 s (3.13% GC)
samples: 2
evals/sample: 1

So what’s the solution?


Julia: getting the best performance out of your data types

Anyone using Julia?

When you are just starting out with Julia, every book suggests that you forget about type annotations and let the compiler infer the types. For example:
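A minimal sketch of what that advice looks like in practice. The names `x`, `y` and `total` are hypothetical and chosen only for illustration. Nothing is annotated, and Julia works out the types on its own:

```julia
# Hypothetical illustration: no type annotations anywhere.
x = 10                # inferred as Int (Int64 on a 64-bit system)
y = 2.5               # inferred as Float64

# A generic function; Julia compiles a specialised method
# for the argument types it is actually called with.
total(a, b) = a + b

z = total(x, y)       # Int + Float64 is promoted to Float64
println(typeof(x), " ", typeof(y), " ", typeof(z))
```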

Yes, it works great… in 99% of cases. I followed the same approach until I noticed enormous memory consumption when working with large datasets.

Today I would like to show how using the right data type can go a long way toward minimising problems, optimising performance and reducing memory consumption.
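As a rough sketch of why the element type matters, assume a column of one million small integers (hypothetical data, not the dataset from this post). The same values stored as Int8 need an eighth of the memory they need as Int64:

```julia
# Hypothetical column of one million small integer values.
n = 1_000_000

wide   = zeros(Int64, n)   # 8 bytes per element
narrow = zeros(Int8, n)    # 1 byte per element

# sizeof on an array returns the size of its data in bytes.
println("Int64 column: ", sizeof(wide), " bytes")
println("Int8 column:  ", sizeof(narrow), " bytes")
```

A narrow type only works if every value fits into its range (-128 to 127 for Int8), so it is worth checking the data before converting.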
