My Experience with Flatiron School’s Immersive Data Science Boot Camp

Part 9!

Hamilton Chang
6 min read · May 1, 2020

Hello everyone and welcome back to my ongoing serial on my time at Flatiron School learning to become a Data Scientist.

This week was a tough one, partly because we were introduced to Time Series, and to be honest, my cohort and I felt that we didn’t get a lot of lecture time or a very thorough explanation of time series and its applications. We were told that the focus of the data science world has moved beyond time series and more towards machine learning, but that a lot of companies, like banks, still use it for basic analysis.

As a small editorial aside, at the time I attended, Flatiron was still figuring out how to fit Time Series into the curriculum. As our instructor explained, the previous cohort was given its time series lessons and project at the end of the training period, just before starting on the capstone project. While we were told that the focus was indeed moving away from time series and that more interesting work was being done with RNNs, our instructor did tell us that she had gotten an interview question from Morgan Stanley asking her to define the three major components of ARIMA (p, d, and q: the number of autoregressive terms we are going to include, the number of times we are differencing our data, and the number of moving average terms we are including, respectively). Naturally, we took a lot of notes on this.

Our first lesson was an introduction to working with time series data. Some classic examples include the temperature in January recorded daily, the weekly average price of a stock over the course of a year, or the annual budget of a government agency over 30 years. Basically, it is any quantity that can be measured at regular intervals over a period of time. We examine this data and try to identify trends that will help us make predictions about future values. Sometimes this is easy, and sometimes it requires a lot of work.

An example of time series data
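
The original figure is omitted here, but a quick sketch like the following (with made-up numbers, purely illustrative) produces exactly this kind of data: a value recorded at regular intervals, indexed by date in pandas.

```python
# A minimal sketch (synthetic data): daily temperature readings for
# January, stored as a pandas Series with a date index.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
dates = pd.date_range("2020-01-01", periods=31, freq="D")
temps = pd.Series(30 + rng.normal(0, 3, size=31), index=dates, name="temp_f")

print(temps.head())
temps.plot(title="Daily January temperatures (synthetic)")  # needs matplotlib
```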

Time series data can be messy, so what data scientists look for is a property known as stationarity. A series is stationary when its summary statistics, like the mean and variance, do not change over time. Data that does change with time may simply be trending, or it may be the result of seasonality, such as an uptick in average sales during Christmas time at Walmart. Seasonality can be accounted for, and there are models to compensate for it; general non-stationarity can be a bit more difficult. As we saw in linear regression, sometimes data doesn’t behave in a way that we can easily work with, so we do things to it like taking logs or squaring the data to regularize it and build better models. The same goes for time series data, only here we apply what’s called differencing to remove the data’s dependence on time.

Differencing is a fairly simple concept. In order to stabilize our data with differencing, all we have to do is replace each data point with the difference between it and the previous one. Transforming with logs helps stabilize the variance in our data, while differencing helps stabilize the mean, and this reduces the effect of trends and seasonality.
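
In pandas, differencing is a one-liner. Here’s a minimal sketch (the trending series is synthetic, purely for illustration):

```python
# First differencing in pandas: each value minus the one before it.
import numpy as np
import pandas as pd

dates = pd.date_range("2020-01-01", periods=100, freq="D")
rng = np.random.default_rng(0)
trend = pd.Series(0.5 * np.arange(100) + rng.normal(0, 1, 100), index=dates)

diffed = trend.diff().dropna()  # y_t - y_{t-1}; the first value is NaN, so drop it
print(trend.head(3))
print(diffed.head(3))
```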

There are a number of ways we can test for stationarity. One of the most popular is the Dickey-Fuller test, a statistical test to determine if the data set has some time-dependent structure behind it. It returns p-values because it is testing for a unit root: the Null Hypothesis is that a unit root is present, which indicates there is some sort of time-dependent trend in the data, and the Alternate Hypothesis is that there is no unit root and the data is stationary. A p-value less than or equal to .05 indicates that we may reject the Null Hypothesis and conclude there is no unit root. Any p-value over .05 means we fail to reject the Null Hypothesis, so we cannot treat the data as stationary.
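
statsmodels ships this test as adfuller. A minimal sketch, with a synthetic random walk (which has a unit root by construction) standing in for real data:

```python
# The (augmented) Dickey-Fuller test with statsmodels.
import numpy as np
from statsmodels.tsa.stattools import adfuller

series = np.random.default_rng(1).normal(size=200).cumsum()  # random walk: unit root

stat, pvalue, *_ = adfuller(series)
print(f"ADF statistic: {stat:.3f}, p-value: {pvalue:.3f}")
if pvalue <= 0.05:
    print("Reject the null: no unit root, data looks stationary.")
else:
    print("Fail to reject the null: treat the data as non-stationary.")
```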

Moving on from the Dickey-Fuller test, we can also use Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots to examine stationarity. An ACF plot visualization will help us determine if there is in fact stationarity. For stationary data, the ACF drops rapidly to zero; if we see a slow decline or a great deal of variance, we know we don’t have stationarity. Below are some examples of messy ACF and PACF plots.

Slow drop to 0, and past it.
We see the same here.
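
For reference, here’s a minimal sketch of drawing these plots yourself with statsmodels (the series below is a synthetic random walk, not the data from the figures above):

```python
# ACF and PACF plots with statsmodels.
import matplotlib.pyplot as plt
import numpy as np
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

series = np.random.default_rng(2).normal(size=200).cumsum()  # non-stationary on purpose

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
plot_acf(series, ax=axes[0], lags=30)   # slow decay toward zero hints at non-stationarity
plot_pacf(series, ax=axes[1], lags=30)
plt.show()
```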

Beyond these tests, we learned how to use ARIMA and ARIMAX. ARIMA stands for AutoRegressive Integrated Moving Average, and it is a model that combines autoregressive models with moving average models. What are those? I’ll drop a simple definition for you below:

AR Models: An autoregression model assumes that the observations at previous time steps are useful for predicting the value at the next time step. It is one of the simplest time series models: we use a linear model to predict the value at the present time from the value at the previous time.

Expressed as: Today = constant + slope×yesterday + noise

MA Models: If a previous time period experiences a shock, it may throw off future predictions if we rely on that value alone. A moving average model helps address this behavior: it assumes the present value is related to errors in the past, so it includes a memory of past errors.

Expressed as: Today = Mean + Noise + Slope×yesterday’s noise
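
To make both recursions concrete, here they are written out directly in numpy (the coefficients are made up for illustration):

```python
# The two "Expressed as" lines above, as literal loops.
import numpy as np

rng = np.random.default_rng(3)
n = 200
noise = rng.normal(size=n)
constant, slope_ar, mean, slope_ma = 1.0, 0.7, 0.0, 0.5

ar = np.zeros(n)
ma = np.zeros(n)
for t in range(1, n):
    ar[t] = constant + slope_ar * ar[t - 1] + noise[t]  # constant + slope*yesterday + noise
    ma[t] = mean + noise[t] + slope_ma * noise[t - 1]   # mean + noise + slope*yesterday's noise
```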

So what can we do with ARIMA models? Well, as you can probably tell from the breakdowns, by combining AR and MA models we can more accurately predict future data points. It is important to note that the data an ARIMA model works with must be stationary; that is what the “Integrated” part is for: the d parameter tells the model how many times to difference the data before the AR and MA terms are fit. Not only is it accurate, it’s also simple enough not to overfit the data compared to other more complicated models, yet it still possesses the ability to capture relationships in our data, especially out in the wild.
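
A minimal sketch of fitting one, using statsmodels’ newer ARIMA class; the order (1, 1, 1) is just a placeholder, not a recommendation:

```python
# Fitting an ARIMA(p, d, q) with statsmodels on a synthetic random walk.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

series = np.random.default_rng(4).normal(size=200).cumsum()

model = ARIMA(series, order=(1, 1, 1))  # d=1: the model differences the data once itself
result = model.fit()
print(result.summary())
print(result.forecast(steps=5))  # predict the next 5 points
```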

ARIMAX is similar to ARIMA; the X stands for eXogenous variables, and it is suitable when we have multivariate data, that is, data that has more than one explanatory variable. For example, the GDP of a country has more than one variable that causes it to rise or fall, and thus ARIMAX would be a more appropriate model.
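
statsmodels doesn’t have a class literally named ARIMAX, but its SARIMAX class accepts exogenous regressors through an exog argument. A minimal sketch, with synthetic data:

```python
# An ARIMAX-style fit via statsmodels' SARIMAX (all data here is synthetic).
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(5)
exog = rng.normal(size=(200, 1))                       # one external explanatory variable
series = rng.normal(size=200).cumsum() + 2.0 * exog[:, 0]

model = SARIMAX(series, exog=exog, order=(1, 1, 1))
result = model.fit(disp=False)
print(result.params)  # includes a fitted coefficient for the exogenous variable
```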

At the end of our two lectures we were assigned a new time series project. This project differed from the others in that it was simply an exercise in applying what we learned from our lectures. We were given the necessary data, and we partnered up to work with the new models we had learned.

That’s all from me this week, hope you all enjoyed the ride!

Click the link below for Part 10!

https://medium.com/@hammychang/my-experience-with-flatiron-schools-immersive-data-science-bootcamp-f89a42f4a65
