This week at Galvanize we are truly starting to dig into the fun part of data science: making data-driven predictions! So far we've covered some simple, common regression and classification models, namely linear and logistic regression, k-nearest neighbors, and decision trees. Lost yet? I'll try to break it down.
Regression: trying to predict a numeric value for a certain observation. Say I know a person's income, profession, and credit limit. I might predict their average credit card balance (a dollar amount) based on that information.
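To make that concrete, here's a rough sketch of what a regression model might look like in Python with scikit-learn. Everything here is made up for illustration (I've dropped profession to keep the features numeric), so treat it as a toy, not a real credit model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data: each row is one person we already know the answer for.
# Columns: annual income ($ thousands), credit limit ($ thousands).
X_train = np.array([[40, 3], [75, 8], [120, 15], [60, 5], [95, 10]])
# Known average credit card balance for each person, in dollars.
y_train = np.array([500, 1200, 2100, 800, 1500])

model = LinearRegression()
model.fit(X_train, y_train)  # learn a linear equation from the training data

# Predict a dollar amount for a new person: $85k income, $9k credit limit.
print(model.predict(np.array([[85, 9]])))
```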
Classification: trying to predict which category a specific observation falls into. For a silly example, if you know the height and weight of an animal, you might predict whether it is a dog or a horse. More realistically, when you apply for a loan, a bank is likely to look at your income, profession, credit rating, etc. to predict whether you are likely to default on your loan (Yes/No).
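Here's the dog-vs-horse version as a quick sketch using k-nearest neighbors, one of the models we covered this week. Again, the heights and weights are invented just to show the mechanics.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up training data: height (cm) and weight (kg) of labeled animals.
X_train = np.array([[60, 25], [55, 30], [50, 20],        # dogs
                    [160, 500], [150, 450], [170, 550]])  # horses
y_train = np.array(["dog", "dog", "dog", "horse", "horse", "horse"])

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)  # store the labeled examples

# Classify a new animal by looking at its 3 nearest labeled neighbors.
print(model.predict(np.array([[58, 28]])))  # -> ['dog']
```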
You make either type of prediction by first building a model (maybe a mathematical equation) based on some set of training data. Training data are sets of observations (e.g. people, animals) for which you already know the answer. Once you've built a model that you like, you would test it against new data that the model hasn't seen before. But you still know the answers for this data, so you can compare what your model predicted to the true answer for each observation. This gives you a sense of how good your model is at predicting the right answer.
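In scikit-learn, that train/test workflow might look something like the sketch below. I'm using a synthetic dataset as a stand-in for real labeled data, and a decision tree as the model, but the same pattern works for any of the models above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for real labeled data: 500 observations, 4 features,
# and a known Yes/No answer (y) for every one of them.
X, y = make_classification(n_samples=500, n_features=4, random_state=0)

# Hold out 25% of the labeled data; the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)         # build the model on the training data

# Compare predictions on the held-out data to the true answers we kept back.
print(model.score(X_test, y_test))  # fraction of correct predictions
```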
Once you have a model that's performing well, you could release it into the wild and try to make predictions on brand new data. Hopefully, you can learn some actionable information from your model, like whether you should pet or ride this animal, or whether to extend a loan to this person.