I am a self-confessed health nerd and have been working on immunization projects for several years, so it’s only natural that one of my first data projects dealt with vaccination stats.
In the fall of 2014, I took an excellent introduction to R through the Berkeley Extension school. I know several academics who use R in their research, and though I had only been briefly exposed to it in grad school, I was interested in really learning it. So I enrolled in the course, a little unsure about what I was going to find. Ultimately, I think I really benefited from having a well-organized overview of working with data programmatically, and it was the first time I was introduced to a few key computer science concepts. The instructor was even kind enough to coach me through writing my first for loop.
The course culminated in a final project, in which we were to clean and analyze a dataset of our choosing and generate a few data visualizations with it. I knew that the state of California had an open data portal with some health-related datasets, and I was intrigued to find school-level data on immunization rates and personal belief exemptions for kindergarten and 7th-grade students.
Personal belief exemptions (PBEs) occur when a parent requests exemption from the immunization requirement for school entry because all or some immunizations are contrary to the parent’s beliefs. These exemptions are pretty controversial, and have almost certainly contributed to outbreaks of vaccine-preventable diseases. In June of 2015, California passed a law eliminating these non-medical exemptions, turning the state from one of the most lenient to one of the strictest in enforcing vaccine requirements for school entry.
You can see my code here and download the datasets from the California Department of Public Health.
My main findings for this project were that there is a statistically significant difference in PBE rates between public and private schools in California, and between small and large schools (though rates in all of these groups are low in the aggregate; more on that later). These are probably two sides of the same coin, as private schools tend to be smaller than public schools.
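For readers curious what a comparison like this can look like in practice, here is a minimal sketch in Python. It is not the R code from the project, and the post does not say which statistical test was used; a two-proportion z-test is simply one common choice for comparing two rates. The function name and the counts below are invented purely for illustration and are not taken from the California data.

```python
import math


def two_proportion_z_test(exempt_a, total_a, exempt_b, total_b):
    """Two-sided z-test for a difference between two proportions.

    Returns (rate_a, rate_b, z, p_value). Assumes both groups are large
    enough for the normal approximation to be reasonable.
    """
    rate_a = exempt_a / total_a
    rate_b = exempt_b / total_b
    # Pooled rate under the null hypothesis that both groups share one rate.
    pooled = (exempt_a + exempt_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a, rate_b, z, p_value


# Hypothetical counts (NOT real data): students with a PBE, total enrolled.
public_exempt, public_total = 2500, 100000
private_exempt, private_total = 500, 10000

rate_pub, rate_priv, z, p = two_proportion_z_test(
    public_exempt, public_total, private_exempt, private_total
)
print(f"public: {rate_pub:.3%}, private: {rate_priv:.3%}, z = {z:.2f}, p = {p:.3g}")
```

With counts this large, even a gap of a couple of percentage points comes out as statistically significant, which is worth keeping in mind: a significant difference can coexist with rates that are low in both groups.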
At the time, I was interested in finding which counties had the highest PBE rates in their school systems. While one usually hears about high PBE rates in private schools in LA or Marin County, it was interesting that the counties with the highest PBE rates are mostly sparsely populated counties in northern California. Looking back on it now, those are also counties with fewer schools (and students) overall, so ranking counties by rate alone is probably not the best method of assessing risk. Overlaying PBE rates onto a map of California would probably be a more interesting way to visually represent this data, but alas, that was beyond my skills at the time. Maybe I’ll find the time soon to go back and do that.
There are also major caveats with relying on school records for assessing immunization rates: are those students actually unvaccinated, or are the records at the school just incomplete?
I’ve since read and contributed to other research on this issue (which used much more advanced methods than I did here). Immunization rates often have extremely local consequences, and aggregating across counties is not necessarily that helpful. Unvaccinated and undervaccinated communities are often quite small but concentrated. While the rate in a given county may seem low, there are often specific schools with shockingly high PBE rates. That results in pretty high risk for people in those schools with many unvaccinated students (who are mostly interacting with other at-risk, unvaccinated individuals).
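To make the aggregation point concrete, here is a small sketch in Python. The records, county and school names, the 20% threshold, and the helper functions are all hypothetical, chosen only to show how a modest county-level rate can hide a single school with a very high rate. It is not drawn from the actual dataset or from the research mentioned above.

```python
from collections import defaultdict

# Hypothetical records (NOT real data): (county, school, enrolled, exemptions).
schools = [
    ("County A", "School 1", 120, 2),
    ("County A", "School 2", 95, 1),
    ("County A", "School 3", 30, 12),
    ("County B", "School 4", 110, 4),
    ("County B", "School 5", 100, 3),
]


def county_rates(records):
    """Aggregate PBE rate per county: total exemptions / total enrollment."""
    enrolled = defaultdict(int)
    exempt = defaultdict(int)
    for county, _school, n, pbe in records:
        enrolled[county] += n
        exempt[county] += pbe
    return {c: exempt[c] / enrolled[c] for c in enrolled if enrolled[c] > 0}


def flag_schools(records, threshold):
    """Schools whose own PBE rate is at or above the given threshold."""
    return [
        (county, school, pbe / n)
        for county, school, n, pbe in records
        if n > 0 and pbe / n >= threshold
    ]


rates = county_rates(schools)
flagged = flag_schools(schools, threshold=0.20)  # threshold is an arbitrary example

for county, rate in sorted(rates.items()):
    print(f"{county}: {rate:.1%} overall")
for county, school, rate in flagged:
    print(f"  {school} ({county}): {rate:.1%}")
```

In this made-up example, County A has an overall rate of about 6%, yet one of its schools sits at 40%. Looking only at the county figure would miss exactly the kind of concentrated cluster described above.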
While this project certainly wasn’t anything groundbreaking, and I’m sure there are much better statistical approaches to the dataset, it was a nice way to combine my first forays into data science and analysis with my interest in public health. Also, can I just say that I love ggplot? Trying to plot things in Python makes me really appreciate how much simpler I found plotting in R.