For example, reading a 40 MB file with 515,356 rows and 25 columns can take up to 7 seconds and 607 MiB of memory, and that’s on a MacBook Pro with an SSD.

`julia> @benchmark CSV.read(file_path, DataTable, nullable = false, types = data_types)`

BenchmarkTools.Trial:

memory estimate: 607.77 MiB

allocs estimate: 37767548

--------------

minimum time: 4.684 s (0.79% GC)

median time: 5.032 s (2.04% GC)

mean time: 5.032 s (2.04% GC)

maximum time: 5.379 s (3.13% GC)

--------------

samples: 2

evals/sample: 1

So what’s the solution?

The solution is to save and read files in the Feather format. Feather was created by Wes McKinney and Hadley Wickham, on top of Apache Arrow, as a very fast file format for storing data frames.

**What is Feather?**

Feather is a fast, lightweight, and easy-to-use binary file format for storing data frames. It has a few specific design goals:

- Lightweight, minimal API: make pushing data frames in and out of memory as simple as possible
- Language agnostic: Feather files are the same whether written by Julia, Python or R code.
- High read and write performance. When possible, Feather operations should be bound by local disk performance.

**How good is it?**

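I saved the same CSV file as Feather and then read it back in Julia. According to the results below, Feather was at least 25 times faster (about 68 times by median: 5.032 s vs 73.757 ms) and allocated about 14 times less memory (607.77 MiB vs 43.96 MiB).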

`julia> @benchmark Feather.read("opens/data/data.contacts.feather")`

BenchmarkTools.Trial:

memory estimate: 43.96 MiB

allocs estimate: 4860

--------------

minimum time: 47.767 ms (1.92% GC)

median time: 73.757 ms (22.80% GC)

mean time: 118.652 ms (52.99% GC)

maximum time: 183.505 ms (68.02% GC)

--------------

samples: 43

evals/sample: 1

It’s a no-brainer: I will use it the next time I have to save files.

The package is registered in METADATA.jl and so can be installed with Pkg.add.
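A minimal sketch of the round trip, assuming a current Julia where `Pkg` is loaded with `using Pkg`, and using DataFrames.jl rather than the `DataTable` type from the benchmark above. The sample table and the temporary file path are made up for illustration.

```julia
using Pkg
Pkg.add(["Feather", "DataFrames"])   # one-time install

using Feather, DataFrames

# A tiny illustrative table (not the data set benchmarked above).
df = DataFrame(id = 1:3, name = ["a", "b", "c"])

# Write to a temporary location, then read it back.
path = joinpath(mktempdir(), "example.feather")
Feather.write(path, df)
df2 = Feather.read(path)

println(size(df2))
```

The same file should also be readable from Python or R, which is the point of the language-agnostic design goal listed above.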

]]>

When you are just starting with Julia, every book suggests forgetting about type annotations and letting the compiler infer the types. For example:
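The snippet that originally followed did not survive in this copy. Judging by the `Int8` version shown later in the post, it was presumably a plain array literal along these lines (a reconstruction, not the author's exact code):

```julia
# No type annotation: Julia picks the element type itself.
a = [1, 2, 3, 4]

println(eltype(a))   # Int64 on a 64-bit system
println(sizeof(a))   # bytes used by the four elements
```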

Yes, it works great… in 99% of the scenarios. I followed the same approach until I noticed an enormous memory consumption when working with large datasets.

Today I would like to show how using the right data type can go a long way toward minimising problems, optimising performance and reducing memory consumption.

The reason for the high memory consumption is the number of bits used to store each value. On a 64-bit system, which you are most likely using, Int64 is the default integer type.

Type | Signed? | Number of bits | Smallest value | Largest value |
---|---|---|---|---|
Int8 | ✓ | 8 | -2^7 | 2^7 – 1 |
UInt8 | | 8 | 0 | 2^8 – 1 |
Int16 | ✓ | 16 | -2^15 | 2^15 – 1 |
UInt16 | | 16 | 0 | 2^16 – 1 |
Int32 | ✓ | 32 | -2^31 | 2^31 – 1 |
UInt32 | | 32 | 0 | 2^32 – 1 |
Int64 | ✓ | 64 | -2^63 | 2^63 – 1 |
UInt64 | | 64 | 0 | 2^64 – 1 |
Int128 | ✓ | 128 | -2^127 | 2^127 – 1 |
UInt128 | | 128 | 0 | 2^128 – 1 |
Bool | N/A | 8 | false (0) | true (1) |
Float16 | N/A | 16 | N/A | N/A |
Float32 | N/A | 32 | N/A | N/A |
Float64 | N/A | 64 | N/A | N/A |
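The integer rows of the table can be reproduced from the REPL. This is a small sketch using only Base functions (`sizeof`, `typemin`, `typemax`); the list of types is simply the integer types from the table.

```julia
int_types = (Int8, UInt8, Int16, UInt16, Int32, UInt32,
             Int64, UInt64, Int128, UInt128)

for T in int_types
    bits = 8 * sizeof(T)   # sizeof reports bytes
    # big(...) prints unsigned limits in decimal rather than hex
    println(rpad(string(T), 8), lpad(bits, 4), " bits: ",
            big(typemin(T)), " to ", big(typemax(T)))
end
```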

So the example we had above could easily be defined as an array of Int8, saving 56 bits per element, or 56*4 = 224 bits (28 bytes) in total:

`a = Int8[1, 2, 3, 4]`
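A quick way to check that arithmetic is `sizeof`, which reports the number of bytes used by an array's elements. The variable names here are only for illustration.

```julia
a64 = Int64[1, 2, 3, 4]
a8  = Int8[1, 2, 3, 4]

saved_bytes = sizeof(a64) - sizeof(a8)

println(sizeof(a64))        # 32 bytes
println(sizeof(a8))         # 4 bytes
println(8 * saved_bytes)    # 224 bits, i.e. 56 * 4
```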

But does it really matter that much? I will run some tests below to prove that choosing the appropriate data type matters.

We will randomly generate integers from 0 to 100 and store them as Int64, Float32, Int8 and compare benchmarks.

*NB: The script below runs in a global scope and that can affect the results.*

From the gist below you can see that both speed and memory consumption are closely tied to the data types. **There is a difference of over 60 MB between Int64 and Int8!**

Now imagine having a multi-dimensional array. What would the difference be there?

We chose Julia because its speed is close to that of C and Fortran, so let’s be careful.
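The embedded gist is not included in this copy. As a rough stand-in, here is a minimal sketch that compares only the memory footprint. The vector length `n` is my assumption, not the value from the original script, so the numbers will not necessarily match the figure quoted above.

```julia
n = 10_000_000   # assumed size, not taken from the original script

# Random integers from 0 to 100, stored in three different element types.
as_int64   = rand(Int64(0):Int64(100), n)
as_float32 = Float32.(as_int64)
as_int8    = Int8.(as_int64)      # safe: every value fits in -128..127

for (name, v) in (("Int64", as_int64), ("Float32", as_float32), ("Int8", as_int8))
    println(rpad(name, 8), sizeof(v) / 10^6, " MB")
end
```

With this `n`, the element storage is 8, 4 and 1 bytes per value respectively, so the Int64 vector takes eight times the memory of the Int8 one. Timing comparisons are better done with BenchmarkTools inside a function, for the global-scope reason noted above.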

]]>

The following topics were covered in the book and classes

- The Product and Quotient Rules
- The Chain Rule and the General Power Rule
- Implicit Differentiation and Related Rates

This time I decided not to write a long story on how to work with derivatives but instead to link to Khan Academy. Sal from Khan explains the product and quotient rules and the other topics in great detail.

]]>

- Describing Graphs of Functions
- The First- and Second-Derivative Rules
- The First- and Second-Derivative Tests and Curve Sketching
- Curve Sketching (Conclusion)
- Optimization Problems
- Further Optimization Problems
- Applications of Derivatives to Business and Economics

And as always, a little bit of notes/ theory from the topics learned.

**First-Derivative Rule**

- If f′(a) > 0, then f(x) is increasing at x = a. If f′(a) < 0, then f(x) is decreasing at x = a.
- If f′(a) = 0, the function might be increasing or decreasing or have a relative extreme point at x = a.

**Second-Derivative Rule**

- If f′′(a) > 0, then f(x) is concave up at x = a. If f′′(a) < 0, then f(x) is concave down at x = a.

**What’s Concave Up or Down?**

**And combined results**

**The First-Derivative Test** (for local extreme points) Suppose that f′(a) = 0.

- If f′ changes from positive to negative at x = a, then f has a local maximum at a.
- If f′ changes from negative to positive at x = a, then f has a local minimum at a.
- If f′ does not change sign at a, then f has no local extremum at a.

**The Second-Derivative Test** (for local extreme points)

- If f′(a) = 0 and f′′(a) < 0, then f has a local maximum at a.
- If f′(a) = 0 and f′′(a) > 0, then f has a local minimum at a.
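To make the second-derivative test concrete, here is a small sketch for the function f(x) = x^3 − 3x. The function is my own choice of example (not one from the book), and the derivatives are written out by hand using the power rule.

```julia
f(x)   = x^3 - 3 * x
fp(x)  = 3 * x^2 - 3    # f′(x)
fpp(x) = 6 * x          # f′′(x)

# f′(x) = 0 at x = -1 and x = 1; classify each point by the sign of f′′.
classify(a) = fpp(a) < 0 ? "local maximum" :
              fpp(a) > 0 ? "local minimum" : "test inconclusive"

for a in (-1, 1)
    println("x = ", a, ": f′ = ", fp(a), ", f′′ = ", fpp(a), ", ", classify(a))
end
```

At x = −1 the second derivative is −6, so f has a local maximum there (f(−1) = 2); at x = 1 it is 6, so f has a local minimum (f(1) = −2).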

I won’t cover how this can be used in business, but I really suggest you go through the chapter yourself to understand it.

A great chance to practice your skills is to complete interactive exercises on Khan.

]]>

The following topics were covered by Chapter 1:

- The Slope of a Straight Line
- The Slope of a Curve at a Point
- The Derivative and Limits
- Limits and the Derivative
- Differentiability and Continuity
- Some Rules for Differentiation
- More About Derivatives
- The Derivative as a Rate of Change

So far so good, but I guess my pace is too high right now. I will spend some time going through the book once again just in case I missed anything.

~450 pages left to finish. I might slow down a little bit but hope to be done with the book by the end of next week.

A little bit of theory. Let’s start with **The Slope of a Straight Line**:

We can compute the slope of a line by knowing two points on the line. If (x1,y1) and (x2,y2) are on the line, the slope of the line (m) is:

`m = (y2 − y1) / (x2 − x1)`

**The slope of a curve at a point P** is defined to be the slope of the tangent line to the curve at P. It is given by a slope formula, for example:

` slope of the graph of y = x^2 at the point (x,y) = 2x`

**The slope formula** that gives the **slope of the curve y = f(x) at any point** is called the **derivative** of f(x) and is written f′(x). In other words – the derivative f′(a) **measures the rate of change** of f(x) at x = a.

There are 3 rules we need to remember:

- If f(x) = mx + b, then we have f′(x) = m
- The derivative of a constant function f(x) = b is zero. That is, f′(x) = 0
- Power rule: let r be any number and let f(x) = x^r. Then f′(x) = rx^(r−1)

Examples of **Power rule**

- If f(x) = x^2, then its derivative is the function 2x. That is, f′(x) = 2x
- If f(x) = x^3, then the derivative is 3x^2. That is, f′(x) = 3x^2
- If f(x) = 1/x = x^(−1), then f′(x) = −x^(−2) = −1/x^2 (x ≠ 0)

And now we can also write **Equation of the Tangent Line**

`y − f(a) = f′(a)(x − a)`

Another important thing to remember is that on the tangent line, **the change in y, for one unit change in x, is equal to the slope f′(a)**

`f(a + 1) − f(a) ≈ f′(a) OR f(a + 1) ≈ f(a) + f′(a) `
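As a quick numeric check of the tangent line and the one-unit approximation, here is a sketch for f(x) = x^2 at a = 3. The choice of point is arbitrary, and the derivative 2x comes from the power rule above.

```julia
f(x)  = x^2
fp(x) = 2 * x                      # derivative by the power rule

# Tangent line at a:  y = f(a) + f′(a)(x − a)
tangent(a, x) = f(a) + fp(a) * (x - a)

a = 3
println(tangent(a, a + 1))   # 15, the estimate f(a) + f′(a)
println(f(a + 1))            # 16, the true value
```

The estimate is off by 1 here because the curve bends away from its tangent line; the approximation improves as the step in x gets smaller.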

And to summarise everything

]]>

It consisted purely of school math, namely:

- Functions and Their Graphs
- Some Important Functions
- The Algebra of Functions
- Zeros of Functions—The Quadratic Formula and Factoring
- Exponents and Power Functions
- Functions and Graphs in Applications

I use most of this material on a daily basis, but it was still great to refresh my memory.

The book itself is amazing, with great examples and solutions throughout each chapter. I enjoyed paging through it.

The most interesting part so far has been compound interest, whose formula you can find below.

When money is deposited in a savings account, interest is paid at stated intervals. If this interest is added to the account and thereafter earns interest itself, then the interest is called compound interest. The original amount deposited is called the principal amount. The principal amount plus the compound interest is called the compound amount. The interval between interest payments is referred to as the interest period. In formulas for compound interest, the interest rate is expressed as a decimal rather than a percentage. Thus, 6% is written as .06.

If $1000 is deposited at 6% annual interest, compounded annually, the compound amount at the end of the 3rd year will be:

`A = 1000(1 + .06)^3 OR A = P * (1 + i)^n`
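The formula translates directly into code. A minimal sketch, where the function name `compound_amount` is my own and `P`, `i`, `n` stand for the principal, the interest rate per period as a decimal, and the number of periods:

```julia
# Compound amount after n interest periods: A = P * (1 + i)^n
compound_amount(P, i, n) = P * (1 + i)^n

A = compound_amount(1000, 0.06, 3)
println(round(A, digits = 2))   # 1191.02
```

So $1000 at 6% compounded annually grows to about $1191.02 after three years, compared with $1180 under simple interest.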

I hope to keep the same pace and complete the book chapter by chapter. That’s all for today!

]]>

My name is Dmitry and I have set a target to improve my statistics and machine learning skills.

Throughout the year I will be completing different MOOCs from Carnegie Mellon University, participating in machine learning competitions and contributing to Julia Lang. You will have a chance to follow my progress directly here.

Carnegie Mellon University provides access to the syllabus and course materials. On top of that, I was lucky to find a page describing the full path to becoming a machine learning expert.

I will start with the basics: Calculus 1 (21-111). The textbook for this course is Brief Calculus & Its Applications by Larry J. Goldstein, David C. Lay, and David I. Schneider.

Major Requirements

**Theory Requirements**

Course Topic/Title | Course Number | Units | Prerequisites |
---|---|---|---|
Calculus | 21-111 and 112, or 21-120 | 20 or 10 | |
Integration and Approximation | 21-122 | 10 | 21-112 or 21-120 |
Multivariate Calc/Analysis | 21-256, 21-259, or 21-268 | 9–10 | 21-112 or 21-120 |
Concepts of Mathematics | 21-127 | 10 | |
Linear/Matrix Algebra | 21-240, 21-241, or 21-242 | 10 | |
Probability | 36-217, 21-325, 15-359, or 36-225 | 9 | 21-112, 21-122, 21-123, 21-256, or 21-259 |
Statistical Inference | 36-226 or 36-326 | 9 | C or higher in 36-217, 36-225, 21-325, or 15-359 |

**Data-Analysis Requirements (Option 1)**

Course Topic/Title | Course Number | Units | Prerequisites |
---|---|---|---|
Beginning Data Analysis | 36-201, 36-220, or 36-247 | 9 | |
Intermediate Data Analysis | 36-202, 36-208, or 36-309 | 9 | various |
Advanced Elective | 36-315, 36-303, 36-490, or 36-46x | 9 | various |
Advanced Elective | 36-315, 36-303, 36-490, or 36-46x | 9 | various |
Modern Regression | 36-401 | 9 | C or higher in 36-226, 36-326, or 36-625 and pass 21-240 or 21-241 |
Advanced Methods for Data Analysis | 36-402 | 9 | C or higher in 36-401 |

**Data-Analysis Requirements (Option 2)**

Course Topic/Title | Course Number | Units | Prerequisites |
---|---|---|---|
Advanced Elective | 36-315, 36-303, 36-490, or 36-46x | 9 | various |
Advanced Elective | 36-315, 36-303, 36-490, or 36-46x | 9 | various |
Advanced Elective | 36-315, 36-303, 36-490, or 36-46x | 9 | various |
Modern Regression | 36-401 | 9 | C or higher in 36-226, 36-326, or 36-625 and pass 21-240 or 21-241 |
Advanced Methods for Data Analysis | 36-402 | 9 | C or higher in 36-401 |

**Computing Requirements**

Course Topic/Title | Course Number | Units | Prerequisites |
---|---|---|---|
Statistical Computing | 36-350 or 36-650/750 | 9 | 36-202, 36-208, 36-309, 70-208, or equivalent |
Fundamentals of Programming | 15-112 | 12 | |
Principles of Imperative Computation | 15-122 | 10 | C or higher in 15-112 |
Machine Learning | 10-401/601/701 | 12 | C or higher in (15-122 or 15-123) and (15-151 or 21-127) and (36-217 or 36-225 or 21-325 or 15-359) |
Algorithms and Advanced Data Structures | 15-351/451 | 12 | 15-111, 15-123, 15-121, or 15-122 |
Large Data Sets | 10-405/605 or Advanced Machine Learning Elective (10-605, 15-381, 15-386, 16-720, 16-311, 11-411, or 11-761) | | |

Fingers crossed and let’s get started!

]]>