My Experience With Flatiron School’s Immersive Data Science Boot Camp

Part 7

Hamilton Chang
4 min read · Apr 10, 2020

Hello and welcome back to my series on my Flatiron School experience. This week, we are ramping up into some true Data Science projects.

We started off the week learning more about linear algebra. Why linear algebra? Because it’s the foundation upon which a lot of machine learning models are based!

Let’s start with some basic principles. First, linear algebra introduces us to the following data types.

Scalars — Represented by a single number, or a point in space; a scalar has only magnitude, or size, as a feature.

Vectors — An array of numbers with both magnitude and direction, such that the coordinates tell you where the tip of the vector sits relative to its origin, and the magnitude is its length.

Matrices — A 2-dimensional grid of numbers, used to work with multiple vectors simultaneously.

Tensors — Higher-dimensional arrays made up of multiple matrices with the same dimensions.

We can visualize these using NumPy arrays like so:
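The original screenshot didn’t survive, so here’s a minimal sketch of what those four data types look like as NumPy arrays (the specific numbers are just illustrative):

```python
import numpy as np

scalar = np.array(7)                    # 0-D: a single magnitude
vector = np.array([3, 4])               # 1-D: magnitude and direction
matrix = np.array([[1, 2],
                   [3, 4]])             # 2-D: a stack of vectors
tensor = np.array([[[1, 2], [3, 4]],
                   [[5, 6], [7, 8]]])   # 3-D: a stack of same-shaped matrices

for name, arr in [("scalar", scalar), ("vector", vector),
                  ("matrix", matrix), ("tensor", tensor)]:
    print(name, "ndim:", arr.ndim, "shape:", arr.shape)
```

Note how `ndim` climbs from 0 to 3 — each type is just the previous one stacked along a new axis.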

These different data shapes or types are the underlying foundation for a lot of machine learning, because they efficiently translate things like words and images into a format that can be computed. They are also central to the calculations behind classification and linear regression, which are key tasks in machine learning.

For example, let’s say we want to teach one of our models how to recognize an image. Well, how would we display an image to a mathematical model? What would be the best way to feed that information into what is, essentially, marks on paper? Ideally, we would break it down into numbers. Each pixel of an image can be represented as a number based on its shade, and all of those numbers together form a matrix that represents the image itself!

source: https://towardsdatascience.com/histograms-in-image-processing-with-skimage-python-be5938962935
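To make the pixels-to-matrix idea concrete, here’s a toy sketch with a hypothetical 5×5 grayscale “image,” where each entry is a pixel’s shade (0 = black, 255 = white):

```python
import numpy as np

# A hypothetical 5x5 grayscale image of a plus sign:
# each entry is one pixel's shade, 0 = black, 255 = white.
image = np.array([
    [  0,   0, 255,   0,   0],
    [  0,   0, 255,   0,   0],
    [255, 255, 255, 255, 255],
    [  0,   0, 255,   0,   0],
    [  0,   0, 255,   0,   0],
])

print(image.shape)  # the "image" is just a 2-D matrix of numbers
```

A real photo works the same way, just with far more pixels (and three such matrices for the red, green, and blue channels of a color image).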

Now this is going a little beyond what we covered that week, but I think it’s just an amazing example of the use of linear algebra in modern-day machine learning methods.

The next thing we learned about was multiple linear regression. Linear regression works well in a 2-dimensional format. That is, if we had data with only 2 features, we could easily use linear regression to fit a model that gives us accurate predictions. Like the graph below:

But what if we had more than 2 features, what if there were 3? What would the plot look like?

Oof, that graph is tough to read. You’d have to be able to animate it and rotate it. It wouldn’t work on paper; you’d actually have to physically build a model out of…wire and cardboard, maybe. What if we had even more features? That wouldn’t be physically possible, not without the ability to transcend our dimension!

So once we have more than 2 features, we tend to move into the realm of mathematics as an abstract. There’s no physical plotting, only what we know to be true from our mathematical calculations. We can pierce the 4th, the 5th, to a near infinite level of dimensions, but we cannot physically observe them.

In any case, we practiced doing multiple linear regression using statsmodels’ OLS module. We were also introduced to the concepts of Adjusted R², which is R² with a penalty added for each extra feature, and multicollinearity, which is when two independent features have a high degree of correlation with each other. All of these are used to test the strength of our model, and having more of one or the other can cause problems depending on the purpose of our model.

For example, multicollinearity only affects the variables in which it is exhibited. If you aren’t particularly concerned with those variables, you can safely ignore it. It also only affects coefficients and p-values; it does not affect predictions or goodness of fit. So if your model is producing good predictions, and that’s really ALL you care about, then you can ignore multicollinearity as well.
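One common way to detect multicollinearity is the variance inflation factor (VIF), which statsmodels provides. A sketch with fabricated data where one predictor is nearly a copy of another (a VIF above roughly 10 is a common rule of thumb for trouble):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors: x2 is nearly a copy of x1, so the two
# are highly collinear; x3 is independent of both.
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
X = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.1, size=200),
    "x3": rng.normal(size=200),
})

# VIF for each column: large values flag collinear features.
for i, col in enumerate(X.columns):
    print(col, round(variance_inflation_factor(X.values, i), 1))
```

Here x1 and x2 should show very large VIFs while x3 stays near 1 — which is exactly the pattern that tells you which coefficients and p-values not to trust.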

Lastly, we did a more thorough review of EDA (exploratory data analysis): what to look for, best practices when examining data, and how to tease out relationships visually with matplotlib before doing the hard math.

I think that’s where we’ll end things this week, bye for now!

Click the link below for Part 8!

https://medium.com/@hammychang/my-experience-with-flatiron-schools-immersive-data-science-boot-camp-d780c81ae9d9


Hamilton Chang

Data Scientist, Financial Planner. Trying to educate and make information accessible to EVERYONE. Let’s Connect! shorturl.at/aBGY5