At this point, I am officially 1/3 of the way through the Galvanize Data Science Immersive. It’s amazing to think about how much I’ve learned in just a few weeks. My programming skills are certainly leaps and bounds above where they were when I started, in large part due to spending hours coding each and every day. Practice really does pay off!
We spent this week building on the algorithms we learned last week (mainly decision trees). We learned how to make better predictions by combining multiple models into what are called ensemble methods: Random Forests and bagged or boosted trees. While I won’t delve into details at this point, the big picture is that combining so-called “weak learners” (that is, models that are only slightly better than random guessing) can yield a better-performing predictive model.
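To make that big picture concrete, here is a minimal, idealized simulation (not any particular ensemble method from the course): each of 25 hypothetical weak learners is assumed to classify a sample correctly with probability 0.6, independently of the others. A simple majority vote over them is right far more often than any single learner.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_learners, p_correct = 10_000, 25, 0.6

# Each entry: whether a given weak learner got a given sample right (True/False).
# Independence between learners is the key (and optimistic) assumption here.
correct = rng.random((n_samples, n_learners)) < p_correct

# Majority vote: the ensemble is right when more than half the learners are right.
ensemble_correct = correct.sum(axis=1) > n_learners / 2

single_acc = correct.mean()             # ~0.6, the weak-learner accuracy
ensemble_acc = ensemble_correct.mean()  # noticeably higher

print(f"average single learner: {single_acc:.3f}")
print(f"majority-vote ensemble: {ensemble_acc:.3f}")
```

Real ensembles like Random Forests work on the same principle, though their members’ errors are correlated, so the gain is smaller than in this independent-errors sketch.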
A piece of advice I’ve gleaned, relevant for anyone interested in getting into data science, is that a solid understanding of linear algebra will help you when it comes to implementing machine learning algorithms. Thinking about the shape of your data at every step can save you a lot of painful debugging. You can of course use existing Python libraries like scikit-learn that will take care of much of this for you, but to really understand what’s going on under the hood, matrix multiplication (and also some calculus) is very helpful.
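As a small illustration of “thinking about the shape of your data at every step,” here is a toy linear-model prediction in NumPy (the data and weights are made up for the example). Checking shapes before multiplying catches the classic transposed-matrix bug before it becomes a silent error downstream.

```python
import numpy as np

# Hypothetical toy data: 5 samples, 3 features.
X = np.arange(15, dtype=float).reshape(5, 3)   # shape (5, 3)
w = np.array([0.5, -1.0, 2.0])                 # weights, shape (3,)

# (5, 3) @ (3,) -> (5,): one prediction per sample.
y_hat = X @ w
assert y_hat.shape == (5,)

# A common bug: multiplying in the wrong order fails immediately,
# because the inner dimensions of (3,) @ (5, 3) don't match.
try:
    w @ X
except ValueError as e:
    print("shape error:", e)
```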
One highlight for me this week was applying what we’ve learned to a concrete business problem. We worked through an example of predicting churn for a telecommunications company, and then building profit curves for various approaches to the modeling problem. Basically, we assigned real costs or benefits to the model’s correct and incorrect predictions. I think it’s important for data scientists to have insight into both the math/science and the business perspectives. Similarly, I’ve heard from data scientists working in industry that much of their job is communicating results to non-data scientists. The skill this requires is not to be overlooked.
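The core of that profit-curve idea can be sketched in a few lines. All the dollar amounts and counts below are hypothetical, not from the actual exercise: pair a cost-benefit matrix (what each kind of prediction is worth) with a confusion matrix (how often the model makes each kind of prediction) and take the weighted average.

```python
import numpy as np

# Hypothetical cost-benefit matrix for a churn intervention, in dollars per
# customer. Rows = actual (churn, no churn), cols = predicted (churn, no churn).
# E.g. a true positive nets $80 (retained revenue minus a $20 retention offer),
# a false positive just costs the $20 offer, and the rest are $0.
cost_benefit = np.array([[80,   0],
                         [-20,  0]])

# A model's confusion matrix at some decision threshold, same layout (counts).
confusion = np.array([[300,  100],
                      [200, 1400]])

# Expected profit per customer: elementwise product, summed, divided by total.
profit_per_customer = (cost_benefit * confusion).sum() / confusion.sum()
print(f"expected profit per customer: ${profit_per_customer:.2f}")
```

Sweeping the classification threshold, recomputing the confusion matrix each time, and plotting this number against the threshold is what produces the profit curve, and the threshold that maximizes it is the one the business case supports.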