Entering the Project Phase

We’ve finished up the structured curriculum at Galvanize and are now working on our capstone projects. I did a bit of flip-flopping in picking a project topic. While I’m (obviously) really interested in healthcare applications of data science, there are fewer open data sources available in that space. I poked around a few available datasets but ultimately decided on a non-healthcare project that I think will challenge me more and give me more opportunity to show off my newly developed skills. Also, it’s food-related, which is probably right up there with healthcare on my interests list.

A table set for brunch, with waffles on the plate. Yum!

For the past week, I’ve been working on a cross-domain recommender system that will allow a user to input one of their favorite restaurants in SF and receive a recipe they might enjoy. So far, I have a simple model that computes the similarity between the given restaurant’s menu and all the recipes in my database, based on text analysis of the menu’s item descriptions and the recipes’ ingredient lists. I also have a Flask app up and running (only on my local machine so far, so no links to it yet, but soon!). I’m pretty happy with my progress this week, as I still have the upcoming week to play around with the model. I’d like to try some more sophisticated text mining techniques that will hopefully result in better recommendations.
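
If you’re curious what that similarity computation looks like, here’s a rough sketch of the general approach (TF-IDF vectors plus cosine similarity via scikit-learn). The recipes and menu text below are made-up stand-ins for my actual data, not the real thing:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up stand-ins for the real data: one text blob per recipe (its
# ingredient list) and one for the restaurant (its menu item descriptions).
recipes = {
    "ricotta waffles": "flour ricotta eggs maple syrup butter vanilla",
    "spicy ramen": "ramen noodles pork belly chili oil scallions soft egg",
}
menu_text = "buttermilk waffles with whipped ricotta and maple butter"

# Fit TF-IDF on the recipe corpus, then project the menu into the same space
vectorizer = TfidfVectorizer(stop_words="english")
recipe_matrix = vectorizer.fit_transform(recipes.values())
menu_vector = vectorizer.transform([menu_text])

# Cosine similarity between the menu and every recipe; recommend the top hit
scores = cosine_similarity(menu_vector, recipe_matrix).ravel()
best_recipe = max(zip(scores, recipes), key=lambda pair: pair[0])[1]
print(best_recipe)  # the recipe whose ingredient text best matches the menu
```

The real model works on much longer documents, but the idea is the same: whichever recipe’s ingredient text is most similar to the menu text wins.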

If you’re interested in building simple web apps or websites, I definitely recommend checking out Bootstrap templates. See Start Bootstrap for some free, downloadable templates. With just a tiny bit of HTML and CSS knowledge, you can customize the templates and make your site look really slick. It took me maybe an hour to go from a barebones, text-only website to a nicely formatted, image-rich design. Can’t wait to share the finished product!

Halfway there!

I’m officially halfway through the data science immersive (DSI) program at Galvanize. I actually had the past week off from classes as a chance to review and solidify what we’ve learned so far and to start thinking about what I’d like to do for my capstone project. The capstone project serves as (a start to) a data science portfolio. We will eventually present our projects to recruiters from various tech companies, so it’s really a chance to show off what I’ve learned over the course of the DSI.

I’m mulling over a few different ideas– ideally, I’d love to do a health-themed project, as that’s a topic I’m both interested in and have subject matter expertise in. Unfortunately, finding good datasets can be a challenge. I’d previously been advised to get experience working with patient-level or claims data, but obviously there are fewer open sources of patient data, since patient privacy is a pretty clear concern. CMS (the Centers for Medicare & Medicaid Services) does have some limited, de-identified datasets available, so perhaps I can find an interesting question to answer based on that data. It’s funny– it feels like a bit of a backwards approach (starting with a dataset instead of a question), but it’s reasonable given our time constraints. We have about two and a half weeks to build our projects, so we can’t get too hung up on constructing unique datasets.


Prepping for Simple Image Recognition

This week at Galvanize, we covered a variety of topics, from web scraping to clustering techniques. I want to focus on dimensionality reduction today, as it’s a challenging but crucial technique for working with real-world data.

This week also marked our first foray into working with text and image data. Up to this point, we’d always started from nice tabular, numerical data. Machine learning algorithms really only understand numbers, so we must first translate our text or images into something the machine understands. We want to give our algorithm a feature matrix– really, just something resembling a spreadsheet, with our features as columns across the top and each data point as a row. Each ‘cell’ would then be a number representing that data point’s value for that feature. How would you turn an image into such a table?

It turns out you break each image down into pixels, and assign a value that corresponds to the shade of that pixel. Our features will be each pixel location. If we’re talking grayscale images, the numeric value in each row of our table will be how light or dark that specific pixel is in the row’s image, usually on a scale from 0 (black) to 255 (white). You can imagine this turns images into huge amounts of data– for example, for one little 8×8 pixel image, you now have 64 numerical features.
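
Here’s a quick sketch of that idea using scikit-learn’s built-in digits dataset (one caveat: its pixel intensities happen to run from 0 to 16 rather than 0 to 255):

```python
from sklearn.datasets import load_digits

# scikit-learn's built-in digits dataset: 8x8 grayscale images of handwritten digits
digits = load_digits()

image = digits.images[0]       # shape (8, 8): one intensity value per pixel
row = image.reshape(-1)        # shape (64,): one feature per pixel location

print(image.shape, row.shape)  # (8, 8) (64,)
print(digits.data.shape)       # (1797, 64): the full feature matrix, one row per image
```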

Text and image data quickly snowball into thousands of features, which is a problem for some of our models. In fact, it’s such a problem that it has a name: the “curse of dimensionality.” To address this, there are methods to identify and extract the most important components, or transformed features, and use those in your model.

MNIST, a classic machine learning dataset for image recognition.

We illustrated this with the MNIST dataset, a classic dataset for image recognition. MNIST is a bunch of handwritten digits, as seen above in a sample of 100 such images. It’s easy for us to look at any given image and recognize it as a 2 or a 4, but how could we train a computer to do this?

You might recognize this as a classification problem (is the image class ‘2’, class ‘3’, class ‘4’, etc.?). Our goal in one exercise this week was to perform just the pre-processing required before one could use a classifier. We used a dimensionality reduction technique called principal component analysis (PCA) to pull out the most important transformed features, or components, from the dataset.

The image below shows what happens when you project the transformed dataset (here using only 0s – 5s) down into 2 dimensions, so it can be easily visualized. Each point represents an image, labeled by its true class. The x- and y-axes are transformed features derived from our original feature matrix of pixels. One drawback of PCA is that the transformation makes the features challenging to interpret– they are no longer the columns from our original feature matrix, but some weird combination of them that we can’t easily name.

After PCA, the data projected into 2 dimensions.

Already, you’ll notice that the 4’s are showing up near each other, the 0’s are grouped away from the 4’s with little overlap, while the 2’s and 3’s overlap a lot– those digits are visually much more similar than a 0 and 4, no? If we were to cluster this dataset, you can imagine getting pretty decent results from using only these first 2 principal components.
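
If you’d like to recreate a plot along these lines, here’s roughly how a projection like the one above can be produced with scikit-learn (the exact styling of our class plot differed, but the idea is the same):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Digits 0-5 only, to mirror the plot described above
digits = load_digits(n_class=6)

# Project the 64 pixel features down to the first 2 principal components
pca = PCA(n_components=2)
projected = pca.fit_transform(digits.data)

# Color each projected point by its true digit label
plt.scatter(projected[:, 0], projected[:, 1], c=digits.target, cmap="tab10", s=10)
plt.xlabel("first principal component")
plt.ylabel("second principal component")
plt.colorbar(label="digit")
plt.show()
```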

Applying Data Science to Business Problems

At this point, I am officially 1/3 of the way through the Galvanize Data Science Immersive. It’s amazing to think about how much I’ve learned in just a few weeks. My programming skills are certainly leaps and bounds above where they were when I started, in large part due to spending hours coding each and every day. Practice really does pay off!

We spent this week building on the algorithms we learned last week (mainly decision trees). We learned how to make better predictions by combining multiple models into what are called ensemble methods: Random Forests and bagged or boosted trees. While I won’t delve into details at this point, the big picture is that combining many so-called “weak learners” (that is, models that are only slightly better than random guessing) gives you a better-performing predictive model.
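
As a rough illustration of the payoff (on synthetic data, not anything from class), here’s how you might compare a single shallow tree, a “weak-ish” learner, against a random forest in scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, just to compare a single shallow tree against an ensemble
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

weak_learner = DecisionTreeClassifier(max_depth=2, random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0)

print("single shallow tree:", cross_val_score(weak_learner, X, y, cv=5).mean())
print("random forest:      ", cross_val_score(forest, X, y, cv=5).mean())
```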

A piece of advice I’ve gleaned, relevant for anyone interested in getting into data science, is that a solid understanding of linear algebra will help you when it comes to implementing machine learning algorithms. Thinking about the shape of your data at every step can save you a lot of painful debugging. You can of course use existing Python libraries like scikit-learn that will take care of much of this for you, but to really understand what’s going on under the hood, matrix multiplication (and also some calculus) is very helpful.
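
To show the kind of shape bookkeeping I mean, here’s a small NumPy sketch on made-up data, solving ordinary least squares via the normal equation and noting the expected shape at each step:

```python
import numpy as np

# Made-up data: 100 observations, 3 features
X = np.random.rand(100, 3)
y = np.random.rand(100)

# Ordinary least squares via the normal equation: beta = (X^T X)^-1 X^T y
# Writing out the expected shape at each step catches bugs early.
XtX = X.T @ X                     # (3, 100) @ (100, 3) -> (3, 3)
Xty = X.T @ y                     # (3, 100) @ (100,)   -> (3,)
beta = np.linalg.solve(XtX, Xty)  # (3,): one coefficient per feature

print(beta.shape)  # (3,)
```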

A line graph of profits: testing different classifiers on a churn problem.

One highlight for me this week was applying what we’ve learned to a concrete business problem. We worked through an example of predicting churn for a telecommunications company, and then building profit curves for various approaches to the modeling problem. Basically, we assigned real costs and benefits to the model’s correct and incorrect predictions. I think it’s so important for data scientists to have insight into both the math/science and the business perspectives. Similarly, I’ve heard from data scientists working in industry that much of their job is communicating results to non-data scientists. The skill this requires is not to be overlooked.
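
To give a flavor of the profit-curve idea, here’s a simplified sketch with entirely made-up dollar values and a hypothetical helper function (not the exact exercise from class): a cost-benefit matrix assigns a dollar value to each cell of the confusion matrix, and we compute the average profit per customer at a few different classification thresholds.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def average_profit(y_true, y_prob, threshold, cost_benefit):
    """Average profit per customer when flagging churners at a given threshold.

    cost_benefit is a 2x2 matrix laid out like a confusion matrix:
    rows = actual class (0, 1), columns = predicted class (0, 1).
    """
    y_pred = (y_prob >= threshold).astype(int)
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    return float(np.sum(cm * cost_benefit)) / len(y_true)

# Made-up economics: a retention offer costs $20; retaining a true churner is
# worth $80, so a true positive nets $60 and a false positive costs $20.
cost_benefit = np.array([[0, -20],
                         [0,  60]])

y_true = np.array([0, 0, 1, 1, 0, 1])                # 1 = churned
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])   # model's churn probabilities

for threshold in (0.3, 0.5, 0.7):
    print(threshold, average_profit(y_true, y_prob, threshold, cost_benefit))
```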


Making Predictions from Data

This week at Galvanize we are truly starting to dig into the fun part of data science — making data-driven predictions! We covered some simple and common regression and classification models– linear and logistic regression, k-nearest neighbors, and decision trees so far. Lost yet? I’ll try to break it down.

Regression: trying to predict a numeric value for a given observation. Say I know the income, profession, and credit limit of a person. I might predict their average credit card balance (a dollar amount) based on that information.

Classification: trying to predict which category a specific observation falls into. For a silly example, if you know the height and weight of an animal, you might predict if it is a dog or a horse. More realistically, when you apply for a loan, a bank is likely to look at your income, profession, credit rating, etc. to predict if you are likely to default on your loan (Yes/No).
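
Sticking with the silly dog-vs-horse example, here’s a toy sketch of what a classifier like k-nearest neighbors does with completely made-up heights and weights:

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy, made-up measurements: [height in cm, weight in kg]
X = [[30, 8], [45, 20], [60, 30],          # dogs
     [150, 450], [165, 500], [175, 550]]   # horses
y = ["dog", "dog", "dog", "horse", "horse", "horse"]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)

# Two new animals the model has never seen
print(model.predict([[50, 25], [160, 480]]))  # expect ['dog' 'horse']
```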

A pug in a sweater. Dogs are more likely to wear sweaters.

You make either type of prediction by first building a model (maybe a mathematical equation) based on some set of training data. Training data are sets of observations (e.g., people or animals) for whom you already know the answer. Once you’ve built a model that you like, you test it against new data that the model hasn’t seen before. You still know the answers for this data, so you can compare what your model predicted to the true answer for each observation. This gives you a sense of how good your model is at predicting the right answer.
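
In scikit-learn terms, that train/test workflow looks something like the sketch below (using synthetic data as a stand-in for, say, loan applicants):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Made-up data standing in for loan applicants (features) and defaults (labels)
X, y = make_classification(n_samples=500, n_features=5, random_state=42)

# Hold out 25% of the observations; the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)          # build the model on the training data
print(model.score(X_test, y_test))   # accuracy on data the model hasn't seen
```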

Once you have a model that’s performing well, you could release it into the wild and try to make predictions on brand new data. Hopefully, you can learn some actionable information from your model–  like whether you should pet or ride this animal, or whether to extend a loan to this person.

Google’s DeepMind and Healthcare

I’d filed away for later a Fast Company article whose headline proclaimed that Google would be applying artificial intelligence to healthcare problems. Upon returning to read it, I was a bit disappointed to see how speculative the article was. Basically, Google acquired a company with a messaging app for hospital staff that streamlines communication. It’s been hinted that Google might apply artificial intelligence tools to help identify patients at risk of kidney failure whom a clinician might not deem at risk.

A stethoscope and a smartphone: joining forces?

Even if machine learning were applied to predict which patients might be at risk of kidney failure, that’s not really the end of the story. Once patients are identified, there have to be effective interventions to help them, and perhaps more importantly, the entire system where this is occurring needs to allow for those predictions and interventions to take place. Looking at the big picture, is identification of at-risk patients and communication among clinicians the true ‘problem’ that needs to be solved? Or is there a different systemic problem or bottleneck that is truly responsible for delaying care? While reading the article, I found myself nodding in agreement with this excerpt:

[S]ome health experts fear that this kind of technology is just putting a Band-Aid on a broken system… “Some people have this utopian plan that you can sprinkle some AI on a broken health system and make things better,” says Jordan Shlain, a Bay Area-based doctor and entrepreneur who has advised the NHS.

Overall, I’m really excited about the idea of using data and machine learning to improve care, but it’s important to be realistic about where these tools can help. I think the promise to fix “broken systems” is overinflated. Artificial intelligence might help us identify at-risk patients, make better diagnoses, or select specific treatment plans, but at the end of the day, healthcare systems are built by and made up of people– and I’m not sure machines can fix those systems.

My First Hackathon

When I was first getting started with programming and data science, one piece of advice I heard repeatedly was to attend a hackathon. Initially, I was pretty skeptical– I pictured staying up all night, fueled by Red Bull, vying to win a prize. This was unappealing because 1) I’m not overly competitive, and 2) I like to sleep. But when I found a hackathon weekend focused on “civic hacking,” or using publicly available data for civic good, I figured it was worth a shot.

I also pictured coffee. Lots of coffee.

And so I found myself one weekend in June at the 2015 SF Day of Civic Hacking at SF State. One of the projects pitched was around the health impacts of climate change, which appealed to me. My hope for the day was to pair with a more experienced Python developer and hopefully glean some knowledge from them. Well, instead I found myself in a group with a few other people who were interested in the topic, but none of us were particularly confident in our programming skills. I was the most familiar with R and Shiny apps, so I became the de facto software developer– definitely not what I had expected!

We hoped to build a visualization for policy makers that compared climate change to changes in community health, which would ideally increase concern about global warming. We had a lot of ideas about how we could do so and what our visualization might look like. However, when we got down to it, finding relevant data (and cleaning it) took a lot longer than any of us anticipated.

Our final product by the end of the weekend was a simple R Shiny app that showed changing average temperatures and the number of West Nile Virus cases for each county in California. As a caveat, this visualization is primarily hypothesis-generating, as a simple correlation certainly shouldn’t imply a causal link. We initially hoped to include more measures of health than WNV incidence, but the WNV data was easy to find, so we included it in the prototype. I also think the tool might be more useful if it were more granular (i.e., more local than the county level), which might help it ‘come alive’ for people by showing that climate change is having an effect on their own community.


The “Health and Climate Impacts” app. Ok, so it’s not the prettiest visualization you’ve ever seen.

To my astonishment, we placed 3rd in the competition. In a weekend full of surprises, this was certainly a happy one. I chalk it up in large part to having a working prototype. Sure, it wasn’t perfect, and maybe other people’s ideas were more impressive, but we at least had a small tool that you could click on and interact with.

In the end, I found the weekend pretty fun and I did learn a lot about going from an idea to a working product. I also gained more confidence in building a quick and dirty Shiny app (code is here if you’re interested).

Personal Belief Exemptions in California

I am a confessed health nerd, and have been working on immunization projects for several years, so it’s only natural that one of my first data projects was around vaccination stats.

In Fall of 2014, I took an excellent introduction to R through the Berkeley Extension school. I know several academics who use R in their research, and though I had only been briefly exposed to it in grad school, I was interested in really learning it. So I enrolled in the course, a little unsure about what I was going to find. Ultimately I think I really benefited from having a well-organized overview of working with data programmatically, and it was the first time I was introduced to a few key computer science concepts. The instructor was even kind enough to coach me through writing my first for loop.

The course culminated in a final project, in which we were to clean and analyze a dataset of our choosing and generate a few data visualizations with it. I knew that the state of California had an open data portal with some health-related datasets, and I was intrigued to find school-level data on immunization rates and personal belief exemptions for kindergarten and 7th-grade students.

Personal belief exemptions (PBE) occur when a parent requests exemption from the immunization requirement for school entry because some or all immunizations are contrary to the parent’s beliefs. These exemptions are pretty controversial, and they have almost certainly contributed to outbreaks of vaccine-preventable diseases. In June of 2015, California eliminated these non-medical exemptions, turning California from one of the most lenient states to one of the strictest in enforcing vaccine requirements for school entry.

You can see my code here and download the datasets from the California Department of Public Health.

My main findings for this project included a statistically significant difference in PBE rates between public and private schools in California, and between small and large schools (though rates in both groups are low in the aggregate– more on that later). These are probably two sides of the same coin, as private schools tend to be smaller than public schools.
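
For illustration only (this isn’t my original R analysis, and the counts below are made up rather than taken from the CDPH data), a public-vs-private comparison like that can be run as a chi-square test on a two-by-two table:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Entirely made-up counts (not the CDPH data):
# rows = school type (public, private), columns = (students with PBE, students without PBE)
counts = np.array([[300, 19700],
                   [150,  4850]])

chi2, p_value, dof, expected = chi2_contingency(counts)
print(f"chi-square = {chi2:.1f}, p-value = {p_value:.3g}")
```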

PBE rates by county.

At the time, I was interested in finding which counties had the highest PBE rates in their school systems. While one usually hears about high PBE rates in private schools in LA or Marin County, it was interesting that the counties with the highest PBE rates are mostly sparsely populated counties in northern California. Looking back on it now, those are also counties with fewer schools (and students) overall, so this is probably not the best method of assessing risk. Overlaying PBE rates onto a map of California would probably be a more interesting way to visually represent this data, but alas, that was beyond my skills at the time. Maybe I’ll find the time soon to go back and do that.

There are also major caveats with relying on school records for assessing immunization rates– are those students actually unvaccinated, or are there just incomplete records at the school?

I’ve since read and contributed to other research on this issue (research that used much more advanced methods than mine here). Immunization rates often have extremely local consequences, and aggregating across counties is not necessarily that helpful. Unvaccinated or undervaccinated communities are often quite small but concentrated. While the rate in a given county may seem low, there are often specific schools with shockingly high PBE rates. That results in pretty high risk for people in those schools with many unvaccinated students (who are mostly interacting with other at-risk, unvaccinated individuals).

While this project certainly wasn’t anything groundbreaking and I’m sure there are much better statistical approaches to the dataset, it was a nice way to mingle my first forays into data science and analysis with my interest in public health. Also, can I just say that I love ggplot? Trying to plot things in Python makes me really appreciate how much simpler I found plotting in R.