My Experience with Flatiron School’s Immersive Data Science Boot Camp — Part 6

Hamilton Chang
3 min read · Mar 27, 2020


Part 6!

Hello Everyone, this week we're going back to discussing my experience as a Data Science student at Flatiron School's Immersive Data Science Boot Camp. I promise we'll get to more Data Science-y stuff next week, but I'm currently working on a Kaggle competition, and it's been occupying most of my time and brain power. So I don't have a lot of room in the old noggin for taking on additional side projects at the moment.

So, Week 5 started off with a bit more Pandas, learning how to use Pandas' various commands to clean data and isolate useful features. We also learned one of the most useful Pandas tools for manipulating data, pd.get_dummies(), which lets us turn categorical features, like words, into numeric features, like numbers. Numeric features are desirable because they let us run any number of mathematical models, from linear regression to other supervised and unsupervised learning techniques. After all, most of data science is math, and you can't just plug "Red" or "Green" into an equation. get_dummies() replaces a categorical column with a new column for each category, filled with 1s and 0s. You aren't encoding a magnitude, just the presence or absence of that particular category.
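To illustrate (a minimal sketch with made-up data, not an exercise from class), here's how pd.get_dummies() turns a categorical column into 0/1 indicator columns:

```python
import pandas as pd

# A tiny made-up dataset with one categorical feature
df = pd.DataFrame({
    "color": ["Red", "Green", "Red", "Blue"],
    "price": [10, 12, 9, 15],
})

# One-hot encode the categorical column: each category becomes
# its own column holding 1 (present) or 0 (absent)
dummies = pd.get_dummies(df, columns=["color"])
print(dummies)
```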

This was all combined into a complete lesson on best practices for cleaning messy data: reorganizing it into a shape we could use both for EDA and for applying various models. Knowing these things is crucial to becoming a good data scientist.

The rest of the week was devoted to learning linear regression and how to use it manually and in a coding environment.

To start, we were taught the basics of linear regression: how a fitted line drawn through plotted data helps us both determine whether there is a correlation between values and predict future values along the line's path.

Points that lie off the fitted line are measured by their vertical distance from it. This distance is called the observed error, but is more commonly known as the residual.
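A quick sketch of the idea, using NumPy on made-up points rather than our class data: fit a line, then compute each residual as the gap between the observed value and the line's prediction.

```python
import numpy as np

# Made-up points with some scatter around a linear trend
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit a straight line (degree-1 polynomial) through the points
slope, intercept = np.polyfit(x, y, 1)

# Residual: observed value minus the value the line predicts
predicted = slope * x + intercept
residuals = y - predicted
print(residuals)
```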

We also learned about the Pearson correlation coefficient, which is a measure of the linear correlation between two variables, and we tested it out in Python using the pearsonr function from SciPy's stats module.

Here we tested Pearson's coefficient on two different data frames: the first a list of heights and weights, the second an IMDB database of movies with each film's budget and total gross. In the first run, we found that height and weight are strongly correlated. In the second run, budget and gross were not as closely correlated, but there was still a relationship, since the p-value calculated by the function was effectively 0.
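As a rough sketch of that workflow (with invented height/weight numbers, since the class data sets aren't reproduced here), pearsonr returns both the coefficient and the p-value in one call:

```python
from scipy.stats import pearsonr

# Invented heights (inches) and weights (lbs) for illustration
heights = [63, 65, 68, 70, 72, 74]
weights = [120, 135, 150, 165, 180, 200]

# pearsonr returns the correlation coefficient r and a p-value
r, p_value = pearsonr(heights, weights)
print(f"r = {r:.3f}, p = {p_value:.5f}")
```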

Lastly, we learned about ANOVA and the situations in which it is best applied.
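For example, here's a minimal sketch with invented group data, using SciPy's one-way ANOVA (scipy.stats.f_oneway) rather than anything specific from the lesson:

```python
from scipy.stats import f_oneway

# Invented samples from three groups
group_a = [23, 25, 27, 22, 26]
group_b = [30, 32, 29, 31, 33]
group_c = [24, 26, 25, 27, 23]

# One-way ANOVA tests whether the group means differ
f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```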

We finished the week out with our linear regression project, which I've outlined in previous blogs: using OLS to predict the price of a used car.
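The project itself is covered in those earlier posts, but as a rough sketch of the OLS step (with a hypothetical mileage/age/price data set, using statsmodels):

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical used-car data: mileage, age, and sale price
df = pd.DataFrame({
    "mileage": [30000, 60000, 90000, 120000, 45000],
    "age":     [2, 4, 7, 9, 3],
    "price":   [22000, 17000, 12000, 8000, 19500],
})

# Fit price as a linear function of mileage and age
X = sm.add_constant(df[["mileage", "age"]])
model = sm.OLS(df["price"], X).fit()
print(model.summary())
```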

Thanks very much for tuning in, and next week I’ll have more for you!

Click the link below for Part 7!

https://medium.com/@hammychang/my-experience-with-flatiron-schools-immersive-data-science-boot-camp-ebbf7ffefdeb
