Prepping for Simple Image Recognition

This week at Galvanize, we covered a variety of topics, from web scraping to clustering techniques. I want to focus on dimensionality reduction today, as it’s a challenging but crucial technique for working with real-world data.

This week also marked our first foray into working with text and image data. Up to this point, we’d always started from nice tabular, numerical data. Machine learning algorithms really only understand numbers, so we must first translate our text or images into something the machine understands. We want to give our algorithm a feature matrix– really, just something resembling a spreadsheet, with our features as columns across the top and each data point as a row. Each ‘cell’ would then be a number representing that data point’s value for that feature. How would you turn an image into such a table?

It turns out you break each image down into pixels, and assign a value that corresponds to the shade of that pixel. Our features will be each pixel location. If we’re talking grayscale images, the numeric value in each row of our table will be how light or dark that specific pixel is in the row’s image, usually on a scale from 0 (black) to 255 (white). You can imagine this turns images into huge amounts of data– for example, for one little 8×8 pixel image, you now have 64 numerical features.
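
To make that concrete, here’s a minimal sketch in NumPy (using a made-up stack of images rather than real data) of how a batch of 8×8 grayscale images becomes a feature matrix with 64 pixel columns:

```python
import numpy as np

# Pretend we have 100 grayscale images, each 8x8 pixels,
# with values from 0 (black) to 255 (white).
images = np.random.randint(0, 256, size=(100, 8, 8))

# Flatten each image into a single row of 64 pixel features.
X = images.reshape(100, -1)

print(X.shape)  # (100, 64) -- 100 data points, 64 features
```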

Text and image data quickly snowball into thousands of features, which is a problem for some of our models. In fact, it’s such a problem that it has its own name: the “curse of dimensionality.” To address it, there are methods to identify and extract the most important components, or transformed features, and use only those in your model.

MNIST, a classic machine learning dataset for image recognition.

We illustrated this with the MNIST dataset, a classic dataset for image recognition. MNIST is a bunch of handwritten digits, as seen above in a sample of 100 such images. It’s easy for us to look at any given image and recognize it as a 2 or a 4, but how could we train a computer to do this?

You might recognize this as a classification problem (is the image class ‘2’, class ‘3’, class ‘4’, etc.?). Our goal in one exercise this week was to perform just the pre-processing required before one could use a classifier. We used a dimensionality reduction technique called principal component analysis (PCA) to pull out the most important transformed features, or components, from the dataset.

The image below shows what happens when you project the transformed dataset (here using only 0s – 5s) down into 2 dimensions, so it can be easily visualized. The points show the true labels, or classes, of the images. The x- and y-axes are transformed features from our original feature matrix of pixels. One drawback of PCA is that the transformation makes the features challenging to interpret– they are no longer the columns from our original feature matrix, but some weird combination of them that we can’t easily name.

After PCA, the data projected into two dimensions

Already, you’ll notice that the 4’s are showing up near each other, the 0’s are grouped away from the 4’s with little overlap, while the 2’s and 3’s overlap a lot– those digits are visually much more similar than a 0 and 4, no? If we were to cluster this dataset, you can imagine getting pretty decent results from using only these first 2 principal components.
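
If you want to play with this yourself, here’s a rough sketch of the same idea using scikit-learn’s built-in digits dataset, a small MNIST-like set of 8×8 images. The details of our actual exercise differed, but the workflow is the same:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Load a small MNIST-like dataset of 8x8 digit images,
# keeping only the digits 0 through 5.
digits = load_digits(n_class=6)
X, y = digits.data, digits.target  # X has 64 pixel features per image

# Project the 64-dimensional pixel space down to 2 principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Plot the projection, colored by the true digit label.
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap='tab10', s=10)
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.colorbar(label='Digit')
plt.show()
```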

Applying Data Science to Business Problems

At this point, I am officially 1/3 of the way through the Galvanize Data Science Immersive. It’s amazing to think about how much I’ve learned in just a few weeks. My programming skills are certainly leaps and bounds above where they were when I started, in large part due to spending hours coding each and every day. Practice really does pay off!

We spent this week building on the algorithms learned last week (mainly decision trees). We learned how to make better predictions by combining multiple models into what are called ensemble methods: Random Forests and bagged or boosted trees. While I won’t delve into details at this point, the big picture is that combining many so-called “weak learners” (that is, models that are only slightly better than random guessing) yields a better-performing predictive model.
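
As a rough illustration (on a synthetic dataset, not the ones we used in class), here’s how you might compare a single decision tree to a random forest in scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# A synthetic classification problem, just for illustration.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# A single decision tree vs. an ensemble of 100 trees.
tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print('Single tree:', cross_val_score(tree, X, y, cv=5).mean())
print('Random forest:', cross_val_score(forest, X, y, cv=5).mean())
```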

A piece of advice I’ve gleaned, relevant for anyone interested in getting into data science, is that a solid understanding of linear algebra will help you when it comes to implementing machine learning algorithms. Thinking about the shape of your data at every step can save you a lot of painful debugging. You can of course use existing Python libraries like scikit-learn that will take care of much of this for you, but to really understand what’s going on under the hood, matrix multiplication (and some calculus) is very helpful.
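
Here’s the kind of shape bookkeeping I mean: a toy example that fits ordinary least squares via the normal equation, tracking matrix dimensions at each step (the data are invented):

```python
import numpy as np

# Toy data: 100 observations, 3 features.
X = np.random.rand(100, 3)
true_coefs = np.array([2.0, -1.0, 0.5])
y = X @ true_coefs + np.random.normal(scale=0.1, size=100)

# Normal equation for least squares: beta = (X^T X)^{-1} X^T y.
# Tracking shapes: (3,100)@(100,3) -> (3,3); then (3,3)@(3,100)@(100,) -> (3,)
beta = np.linalg.inv(X.T @ X) @ X.T @ y

print(beta.shape)  # (3,) -- one coefficient per feature
print(beta)        # close to [2.0, -1.0, 0.5]
```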

Testing different classifiers on a churn problem

One highlight for me this week was applying what we’ve learned to a concrete business problem. We worked through an example of predicting churn for a telecommunications company, and then building profit curves for various approaches to the modeling problem. Basically, we assigned real costs or benefits to a model making predictions correctly or not. I think it’s so important for data scientists to have insight into both the math/science and the business perspectives. Similarly, I’ve heard from data scientists working in industry that much of their job is communicating results to non-data scientists. The skill this requires is not to be overlooked.
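
The core of the idea fits in a few lines. Below is a bare-bones sketch with invented labels, predictions, and dollar values (not the numbers from our telecom exercise): score the model’s confusion matrix against a cost-benefit matrix to get an expected profit per customer.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions for a churn model (1 = churns).
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 0, 0, 1, 1])

# Hypothetical dollar values: a retention offer costs $10, and
# keeping a customer who would have churned is worth $100.
#                  predicted: stays  churns
cost_benefit = np.array([[      0,    -10],   # actual: stays
                         [      0,     90]])  # actual: churns (100 - 10)

cm = confusion_matrix(y_true, y_pred)  # rows = actual, cols = predicted
profit_per_customer = (cm * cost_benefit).sum() / cm.sum()
print(profit_per_customer)
```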


Making Predictions from Data

This week at Galvanize we are truly starting to dig into the fun part of data science — making data-driven predictions! We covered some simple and common regression and classification models– linear and logistic regression, k-nearest neighbors, and decision trees so far. Lost yet? I’ll try to break it down.

Regression: trying to predict a numeric value for a certain observation. Say I know the income, profession, and credit limit of a person. I might predict their average credit card balance (a dollar amount) based on that information.

Classification: trying to predict which category a specific observation falls into. For a silly example, if you know the height and weight of an animal, you might predict if it is a dog or a horse. More realistically, when you apply for a loan, a bank is likely to look at your income, profession, credit rating, etc. to predict if you are likely to default on your loan (Yes/No).
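
Here’s a toy version of the dog-or-horse example using a k-nearest neighbors classifier in scikit-learn (every number below is invented):

```python
from sklearn.neighbors import KNeighborsClassifier

# Invented training data: [height in cm, weight in kg].
X_train = [[30, 8], [40, 12], [35, 10],         # dogs
           [160, 500], [150, 450], [170, 550]]  # horses
y_train = ['dog', 'dog', 'dog', 'horse', 'horse', 'horse']

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# Predict the class of a new, unseen animal.
print(model.predict([[45, 15]]))  # ['dog']
```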

Dogs are more likely to wear sweaters.

You make either type of prediction by first building a model (maybe a mathematical equation) based on some set of training data. Training data are observations (e.g., people or animals) for which you already know the answer. Once you’ve built a model that you like, you test it against new data that the model hasn’t seen before. You still know the answers for these data, so you can compare what your model predicted to the true answer for each observation. This gives you a sense of how good your model is at predicting the right answer.
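
In scikit-learn, that train-then-test workflow looks roughly like this, using a synthetic dataset as a stand-in for real observations:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic observations where we already know the answers.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# Hold out 20% of the data that the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression()
model.fit(X_train, y_train)

# Compare predictions on the held-out data to the known answers.
print(accuracy_score(y_test, model.predict(X_test)))
```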

Once you have a model that’s performing well, you could release it into the wild and try to make predictions on brand new data. Hopefully, you can learn some actionable information from your model– like whether you should pet or ride this animal, or whether to extend a loan to this person.

Google’s DeepMind and Healthcare

I’d filed away for later a Fast Company article whose headline proclaimed that Google will be applying artificial intelligence to healthcare problems. Upon returning to read it, I was a bit disappointed to see how speculative the article was. Basically, Google acquired a company with a messaging app for hospital staff that streamlines communication. It’s been hinted that Google might apply artificial intelligence tools to help identify patients at risk of kidney failure whom a clinician might not deem at risk.

Joining forces?

Even if machine learning were applied to predict which patients might be at risk of kidney failure, that’s not really the end of the story. Once patients are identified, there have to be effective interventions to help them, and, more importantly, the entire system in which this occurs needs to allow for those predictions and interventions to take place. Looking at the big picture, is identification of at-risk patients and communication among clinicians the true ‘problem’ that needs to be solved? Or is there a different systemic problem or bottleneck that is truly responsible for delaying care? While reading the article, I found myself nodding in agreement with this excerpt:

[S]ome health experts fear that this kind of technology is just putting a Band-Aid on a broken system… “Some people have this utopian plan that you can sprinkle some AI on a broken health system and make things better,” says Jordan Shlain, a Bay Area-based doctor and entrepreneur who has advised the NHS.

Overall, I’m really excited about the idea of using data and machine learning to improve care, but it’s important to be realistic about where these tools can help. I think the promise to fix “broken systems” is overinflated. Artificial intelligence might help us identify at-risk patients, make better diagnoses, or select specific treatment plans, but at the end of the day, healthcare systems are built by and made up of people– and I’m not sure machines can fix those systems.