My Experience With Flatiron School’s Immersive Data Science Bootcamp

Part 12!

Hamilton Chang
6 min read · Jun 11, 2020

Hello hello, everyone! Welcome to my hard-hitting exposé on what it's like inside Flatiron School! Haha, just kidding. I hope you all realize that I'm writing this so that those of you who are considering a career in Data Science, and want to join a boot camp to do so, can make an informed decision about what's involved. I want to take a moment and go over what I've learned for those of you just joining.

Credit: Andrea Sutinen. Source: https://www.redbubble.com/people/petakov-kirk/shop#profile

The first month we were learning programming, not to become super software engineers, but because Python and SQL are the tools by which we operate our machines. A Data Scientist doesn't have to be a Python wizard to become a great Data Scientist. It certainly helps, though!

The second month we were learning Statistics, Probability, Linear Algebra, and Calculus. If you have difficulty with math, this part of the Data Science curriculum can be very tough for you. During that period, I was doing a lot of reading, a lot of writing out equations, and a lot of asking people for math help to make sense of it all. Khan Academy definitely helped a great deal with figuring out Bayes' Theorem. I even wrote a blog about probability and the lottery, and even though I was PRETTY sure my math was sound, I still asked the coaches and one of my math-major classmates to confirm it! To go along with our newfound math skills, we were introduced to linear regression and our first attempt at modeling predictions. If staring at dots and graphed lines all day is tough for you, then you'll have a hard time here too!

Our third month was our first foray into Machine Learning. I've said this before, and I'll say it again: Machine Learning is just really complicated math that's done by computers. It is not going to take over the world; for now, it's there to make your life easier, if a little creepier. I've given a brief overview of Machine Learning in a previous post. Suffice it to say, you must be very, very good at probability, statistics, and linear algebra if you want to completely understand machine learning. You can gain a rudimentary understanding and still work with machine learning, but to be truly useful, it is important to understand its inner workings.

So, enough of a recap. This week, week 13, was devoted exclusively to working on our Machine Learning projects. I went into some detail about them in my last post, but here we're going to dive a bit deeper. One of the most useful tools for evaluating the effectiveness of your machine learning models is the Confusion Matrix, so called because it is a lattice of results that displays where the classifier gets confused. At least that's the speculation, since it was originally introduced in 1904 by Karl Pearson, the father of modern statistics and thus incredibly important to Data Science, and no one asked him to clarify the name. Let me share a useful bit of code with you first to get it started:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

labels = ['Scotland', 'USA', 'Canada', 'Ireland', 'Japan', 'Rest_of_World']

def plot_confusion_matrix(y_true, y_pred, classes, normalize=False,
                          title=None, cmap=plt.cm.Blues):
    # Print a quick cross-tab with row/column totals for reference
    cmat = pd.crosstab(y_true, y_pred, rownames=['True'],
                       colnames=['Predicted'], margins=True)
    print(cmat)
    # Compute the confusion matrix, optionally normalized by true class
    cm = confusion_matrix(y_true, y_pred)
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')
    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")
    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax

np.set_printoptions(precision=1)
# Plot the non-normalized confusion matrix for your own predictions, e.g.:
# plot_confusion_matrix(y_test, y_pred, classes=labels)
# plt.show()

This is a fun function that helps create confusion matrices, and it is particularly helpful when you're dealing with multiclass classification, as I was with classifying whiskeys. You'll notice I'm establishing the labels before defining the function; feel free to swap in whatever labels fit the models you're building.
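If you want a feel for what's inside the matrix the function draws, here's a minimal sketch with made-up labels (these are not my actual whiskey data); the diagonal holds the correct classifications:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels standing in for whiskey origins (invented for illustration)
classes = ['Canada', 'Scotland', 'USA']
y_true = ['Scotland', 'Scotland', 'USA', 'Canada', 'USA', 'Scotland']
y_pred = ['Scotland', 'Canada',   'USA', 'Canada', 'USA', 'Scotland']

cm = confusion_matrix(y_true, y_pred, labels=classes)
# Row i = true class, column j = predicted class,
# so the diagonal counts the correct calls
correct = np.trace(cm)
accuracy = correct / cm.sum()  # 5 of 6 correct here
```

Passing `labels=` pins the row/column order; otherwise scikit-learn sorts the classes alphabetically, which is easy to mislabel in a plot.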

The end result is not only a confusion matrix, but a confusion matrix rendered as a heatmap, I guess, so that successful labels are more visible, as seen below:

Left: Decision Trees. Right: K-Nearest Neighbors

These confusion matrices tell us how well our models have performed by showing us the number of true positives, true negatives, false positives, and false negatives. If what I'm saying is confusing to you, I may do another post to dive into what those mean.
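As a quick illustration (using a made-up 3×3 matrix, not my whiskey results), here is how those four counts fall out of a multiclass confusion matrix for any one class:

```python
import numpy as np

# Toy confusion matrix: rows = true class, columns = predicted class.
# The numbers are invented purely for illustration.
cm = np.array([[5, 2, 0],
               [1, 6, 1],
               [0, 3, 4]])

k = 0  # index of the class we're examining
tp = cm[k, k]                 # predicted k, and truly k
fp = cm[:, k].sum() - tp      # predicted k, but truly something else
fn = cm[k, :].sum() - tp      # truly k, but predicted something else
tn = cm.sum() - tp - fp - fn  # everything that involves neither error for k
```

Each class gets its own four counts this way, which is why "one-number" summaries can hide a lot in the multiclass setting.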

Either way, using the confusion matrix, we can determine whether our model accurately predicted the national origin of each whiskey. We trained our models on labeled data, and hopefully they learned enough to accurately classify whiskeys they had not seen before. As you can see, Decision Trees correctly classified only 28 Scotches and guessed the rest were from Canada, Ireland, Japan, and the rest of the world, but it did correctly classify US bourbons. KNN correctly classified 145 Scotches, but it also lumped a lot of the Rest of the World whiskies in as Scotch.

Seeing these two confusion matrices, which do you think is the more accurate one? This is where domain knowledge of whiskey plays a role in our decision making. Initially, I thought KNN was great, because it got all the Scotches. That's a pretty damn high success rate! But when you realize it's just calling EVERYTHING a Scotch, that's not so good.
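To see why "it got all the Scotches" can be misleading, here's a toy sketch with invented numbers: a degenerate model that predicts Scotland for everything still scores high accuracy on a Scotch-heavy test set, yet never finds a single bourbon:

```python
from sklearn.metrics import accuracy_score, recall_score

# Imagined imbalanced test set: 90 Scotches, 10 American whiskeys
y_true = ['Scotland'] * 90 + ['USA'] * 10
# A degenerate "Scotch detector" that predicts Scotland for everything
y_pred = ['Scotland'] * 100

acc = accuracy_score(y_true, y_pred)  # 0.90 overall, which looks impressive
recalls = recall_score(y_true, y_pred, labels=['Scotland', 'USA'],
                       average=None)
# Scotland recall is 1.0, but USA recall is 0.0:
# the model never identifies a bourbon
```

This is why per-class numbers in the confusion matrix matter more than a single accuracy score when the classes are imbalanced.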

The first thing you need to pack into your domain knowledge is that Scotch whisky is considered the gold standard of the whiskey industry. The Scotch whisky industry has existed for hundreds of years and has its process down to somewhat of a science. That time and knowledge have also allowed its distillers to experiment with hundreds of different variations and settle on improvements. Any new distillery is going to start with what it knows, and what it knows is Scotch whisky. So a lot of first-run whiskies are going to taste similar to Scotches, because distillers know that if it is good, it will sell. Once they've developed enough of a following, they can move on to more interesting things. Only in developed markets with a hardcore fanbase can varieties like bourbon proliferate.

So, knowing that Scotches tend to dominate, and that the rest of the world tends to imitate them, do we think the KNN or the Decision Trees model is better? Well, if we want our model to actually CLASSIFY whiskies, and not just be a Scotch Detector, we should prefer the model that makes mistakes on Scotch but distinguishes non-Scotches better. Thus, in this context I would prefer the Decision Trees model in spite of its poorer performance on Scotches.

That's it from me this week. I may expand on this in next week's blog, because I feel like there's more to be said about the Scotch industry and our confusion matrix. Until next time!

Click the link below for the Final Part: Finding a J O B!

https://medium.com/towards-data-science/i-attended-flatiron-school-and-got-a-job-in-data-science-aeda2b69b02b


Hamilton Chang

Data Scientist, Financial Planner. Trying to educate and make information accessible to EVERYONE. Let’s Connect! shorturl.at/aBGY5