
My home city of Madison, Wisconsin is a very bikeable city. It has a number of bike paths that go along useful commuter-friendly routes. While in grad school, I commuted to school via bike and was impressed by how the paths were dependably plowed and cleared of snow in the winter.

The city has installed two “eco-totems” 1 in the last decade: electronic displays that show the time along with daily and yearly counts of the bicycles passing them. Each totem tallies bicyclists using loops embedded in the path. One was installed along the Southwest path near the UW-Madison football stadium, and the other along the Capital path on John Nolen Drive. 2

It’s worth noting that these two locations are relatively arbitrary places to put the totems within the city. The Southwest totem is on the edge of the university campus, further south and west than most students live. I would imagine that putting it somewhere further east, near Park Street, would lead to much higher counts. The Capital totem is located at a busy and scenic crossroads south and west of downtown, where many people bike daily, though, again, the path runs along the lake.

There are probably better places to put these totems to capture data on when and how many people bike in the city. From an analyst’s perspective, this means that any conclusions I reach using this data shouldn’t be used to make large generalizations about biking in Madison. The data is an observational sample of biking along two specific routes within a much larger city.

The eco-totem data is available on the city’s data portal. 3 4 The Capital path sample runs from April 23, 2015 to June 30, 2020. The Southwest path sample covers January 1, 2015 to June 30, 2020. The number of bikes passing each totem is logged hourly.

Aggregating the data for each day and plotting across the entire sample, we can see that a clear trend emerges with the seasons.
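As a rough illustration of the daily aggregation step, here is a minimal Python sketch. It assumes the hourly data has been read into (timestamp, count) pairs; the timestamp format, the function name `daily_totals`, and the sample rows are all made up for this example and do not reflect the portal's actual column layout.

```python
from collections import defaultdict
from datetime import datetime


def daily_totals(hourly_rows):
    """Sum hourly counts into one total per calendar day.

    hourly_rows: iterable of (timestamp_string, count) pairs. The
    "%Y-%m-%d %H:%M" timestamp format is an assumption of this sketch.
    """
    totals = defaultdict(int)
    for stamp, count in hourly_rows:
        day = datetime.strptime(stamp, "%Y-%m-%d %H:%M").date()
        totals[day.isoformat()] += count
    # Sort by date so the result is ready to plot as a time series.
    return dict(sorted(totals.items()))


# Made-up rows for illustration only (not real eco-totem counts).
sample = [
    ("2019-07-01 07:00", 40),
    ("2019-07-01 08:00", 95),
    ("2019-07-02 07:00", 12),
]
daily = daily_totals(sample)
```

Each day's total is simply the sum of its hourly readings, so missing hours would lower a day's count rather than raise an error.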

People love to ride bikes in the summer, not so much in the winter. Yet some brave souls stick it out through the brutal cold. It’s cool to see how “smooth” this up-and-down plot ends up being – people are out biking for many different reasons, with different decision-making processes and different thresholds for their tolerance of the current weather. All this complexity can be distilled into the simple wave pattern seen in this plot.

Breaking the data down by month in these box plots further confirms that the time of year is the source of this undulating effect.

Another interesting thing we can see in this plot is that there isn’t a huge student effect. Given that the Capital path is removed from campus while the Southwest path is on the edge of it, it is interesting that the two show little difference in their trends around September. The mean number of riders in September on the Southwest path is nearly as high as in August, but it is still lower, even though somewhere between 30 and 40 thousand extra people move into the area at this time of year, when classes start. Perhaps the Southwest totem is not in a good location to measure student bike traffic, as I said in the intro. Also, in my own experience, the vast majority of students walk to class. My guess is that a little bit of both of these things contributes to the absence of a jump in traffic.

Box plots are also good for spotting outliers, as they are plotted along with the means and percentiles. Note that the outliers occur mostly at specific times of the year: positive outliers in February, March and November, and negative outliers in the summer. I’d guess that if the high temperatures for those days were added to the data, the positive outliers would turn out to be unseasonably warm days. Conversely, the low outliers during the summer are probably cold and rainy days.
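For reference, box plots conventionally flag a point as an outlier when it falls more than 1.5 times the interquartile range beyond the quartiles. Whether the plots here used that default is an assumption on my part. A minimal Python sketch of the rule, with a hypothetical function name and made-up counts:

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Return (low, high) outliers using the conventional box-plot fences."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    low = [v for v in values if v < lower]
    high = [v for v in values if v > upper]
    return low, high


# Made-up daily counts for one winter month, including one warm day.
february = [100, 110, 120, 130, 140, 150, 160, 900]
low, high = tukey_outliers(february)
```

Here only the 900-rider day lands above the upper fence, which is the kind of point a box plot would draw individually.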

Time Series Analysis

Since these data are time series, I wanted to use them to gain some experience with time series analysis. My two objectives: test whether there is a significant trend in bike counts over time, and build a model to forecast these counts into the future.

Decomposition

First, we will look for trend in the data. To do this, we want to decompose the data into different parts. Time series decomposition separates the data into three parts – trend, seasonal, and remainder. The idea is to separate seasonal and non-seasonal variation, in order to examine whether there are trends in the data apart from variation due to seasonality.

There are several standard decomposition methods, including X11 and SEATS decomposition, but I chose to use STL decomposition 5 as it accommodates daily data, whereas the other methods are for quarterly or monthly data.
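STL itself is more involved, since it relies on loess smoothing, so the sketch below shows classical additive decomposition with a centered moving average instead. It is a simpler stand-in that splits a series into the same three parts, not the method used for the plots in this post. It assumes an odd seasonal period, and the function name and toy series are made up.

```python
def decompose_additive(y, period):
    """Classical additive decomposition; this sketch assumes an odd period."""
    half = period // 2
    n = len(y)
    # Trend: centered moving average, undefined near both ends.
    trend = [None] * n
    for t in range(half, n - half):
        window = y[t - half : t + half + 1]
        trend[t] = sum(window) / period
    # Seasonal: average detrended value at each position in the cycle.
    # (Assumes every position has at least one detrended observation.)
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(y[t] - trend[t])
    means = [sum(b) / len(b) for b in buckets]
    offset = sum(means) / period  # center the seasonal effects on zero
    seasonal = [means[t % period] - offset for t in range(n)]
    # Remainder: whatever trend and seasonality do not explain.
    remainder = [
        None if trend[t] is None else y[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, remainder


# Toy series: a linear trend plus a repeating 3-step pattern, no noise.
pattern = [-2.0, 0.0, 2.0]
y = [10.0 + 0.5 * t + pattern[t % 3] for t in range(12)]
trend, seasonal, remainder = decompose_additive(y, period=3)
```

Because the toy series has no noise, the remainder comes out as essentially zero wherever the trend is defined, and the seasonal component recovers the repeating pattern.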

The most interesting takeaway is the change in variance observed in the remainder – variance is higher in the summer months and low in the dead of winter. This makes intuitive sense – in the dead of winter, people who are committed to biking will do it no matter the weather, while in the summer a nice day or a weekend will get a lot of people out and a rainy day will lead people to spend their leisure time elsewhere or commute a different way.

Testing Trend Significance

We see in the trend lines of the previous two plots that there is some movement in the trend for both eco-totems. However, note that the scale of these plots is much smaller than the scale of the seasonal and remainder plots, indicating that the variation of the trend is much smaller than that of the seasonal and noise components. This suggests that the trend is probably not significant. Still, it is worthwhile to do a formal test. We wish to test the significance of the variation in the trend component – if the trend has significant variation, then the variation of the trend should be much larger than the variation of the remainder. Taking the ratio of the variance of the remainder to the variance of the trend plus remainder, subtracting it from 1, and enforcing a floor of 0 creates an F statistic. It lies between 0 and 1: values near 1 indicate a strong trend, and values near 0 indicate little or none.

\[F_T = \max\left(0, 1 - \frac{\mathrm{Var}(R_t)}{\mathrm{Var}(T_t + R_t)}\right)\]

## [1] "Capital Path Trend F-statistic"
## [1] 0.03140268
## [1] "Southwest Path Trend F-statistic"
## [1] 0.07738391

Both statistics are very low, indicating that there is no significant trend to be found in the data.
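The statistic above can be computed directly from the trend and remainder components of a decomposition. A minimal Python sketch, assuming the components are plain lists of equal length and that variances (not raw values) enter the ratio, as the text describes. The function name and toy numbers are made up for illustration and are unrelated to the values reported above.

```python
import statistics


def trend_strength(trend, remainder):
    """Compute max(0, 1 - Var(R) / Var(T + R)) from two components."""
    seasonally_adjusted = [t + r for t, r in zip(trend, remainder)]
    ratio = statistics.pvariance(remainder) / statistics.pvariance(seasonally_adjusted)
    return max(0.0, 1.0 - ratio)


# Toy case 1: a steadily rising trend with small noise around it.
strong = trend_strength(
    trend=[float(t) for t in range(10)],
    remainder=[0.1 * (-1) ** t for t in range(10)],
)

# Toy case 2: a flat trend, so all variation comes from the remainder.
weak = trend_strength(
    trend=[5.0] * 10,
    remainder=[float((-1) ** t) for t in range(10)],
)
```

In the first case the value is close to 1, and in the second it is 0, which matches the reading of the statistic as a measure of how much of the non-seasonal variation the trend accounts for.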

Forecasting

Next, let’s forecast the data into the future. Given the strong seasonal trend in the data, I expect that forecasting can be done pretty accurately in this instance.

To do this, I will use an ARIMA model. ARIMA (Auto-Regressive Integrated Moving Average) is exactly what the full name states: it combines differencing, auto-regressive, and moving average components into one model.

Differencing

Differencing means subtracting a previous observation from each observation. This helps remove trend and seasonality, producing data that are closer to stationary. A common form of differencing is first order differencing, where the immediately preceding observation is subtracted. The notation used to denote differencing is: \[y'_t = y_t - y_{t-1}\]
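A small worked example of differencing, as a Python sketch. The `lag` argument is my own addition to show how the same operation extends to seasonal differencing (subtracting the value one season earlier), and the counts are made up.

```python
def difference(y, lag=1):
    """Return y[t] - y[t - lag] for every t where both values exist."""
    return [y[t] - y[t - lag] for t in range(lag, len(y))]


counts = [120, 135, 150, 165, 180]  # made-up values with a steady rise
first = difference(counts)          # first order differences
second = difference(first)          # differencing the differences
```

The steady rise in `counts` becomes a constant series after one round of differencing, and a series of zeros after two, which is the sense in which differencing removes trend.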

Auto-regressive models

An auto-regressive model is one where the current value depends on the most recent values in the time series, up to p observations before the current time t.

\[y_t = c + \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} + \epsilon_t\] p is referred to as the order of the model.
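To make the auto-regressive equation concrete, here is a minimal Python sketch of a one-step prediction. The coefficients are hypothetical values chosen by hand, not fitted to the eco-totem data, and the optional constant `c` defaults to 0. The error term is left out because its expected value is zero.

```python
def ar_predict(history, phi, c=0.0):
    """One-step AR(p) prediction: c + phi[0]*y[t-1] + ... + phi[p-1]*y[t-p]."""
    p = len(phi)
    recent = history[-p:][::-1]  # most recent observation first
    return c + sum(coef * value for coef, value in zip(phi, recent))


# Hypothetical AR(2) coefficients and a made-up history.
prediction = ar_predict([10.0, 12.0, 11.0], phi=[0.5, 0.25])
```

With these numbers the prediction is 0.5 × 11 + 0.25 × 12 = 8.5; only the last p = 2 observations are used.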

Moving average models

A moving average model uses past forecast errors in a regression-like model.

\[y_t = c + \epsilon_{t} + \theta_1\epsilon_{t-1} + ... + \theta_q\epsilon_{t-q}\]

ARIMA

ARIMA is composed of an auto-regressive model with differenced data, and a moving average model, added together. Fitting the model involves choosing the order of each of the three components, and then learning the parameters. \[y'_t = c + \phi_1 y'_{t-1} + \dots + \phi_p y'_{t-p} + \theta_1\epsilon_{t-1} + \dots + \theta_q\epsilon_{t-q} + \epsilon_t\] The model’s three hyperparameters:

  • p: order of the auto-regressive part.
  • d: degree of first differencing.
  • q: order of moving average.

These parameters are commonly written with the syntax ARIMA(p,d,q).
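Putting the three pieces together, the sketch below computes a one-step forecast for an ARIMA(p, 1, q) model in Python. It assumes the coefficients and past errors are already known, so the fitting step (choosing the orders and learning the parameters) is not shown. All names and numbers are hypothetical.

```python
def arima_one_step(y, past_errors, phi, theta, c=0.0):
    """One-step forecast for ARIMA(p, 1, q) with known coefficients."""
    # d = 1: work with first differences of the series.
    diffs = [y[t] - y[t - 1] for t in range(1, len(y))]
    # Auto-regressive part, applied to the most recent differences.
    ar_part = sum(coef * d for coef, d in zip(phi, reversed(diffs)))
    # Moving average part, applied to the most recent forecast errors.
    ma_part = sum(coef * e for coef, e in zip(theta, reversed(past_errors)))
    next_diff = c + ar_part + ma_part
    # Undo the differencing to get back to the original scale.
    return y[-1] + next_diff


forecast = arima_one_step(
    y=[100.0, 104.0, 110.0],
    past_errors=[0.5, -1.0],
    phi=[0.5],     # hypothetical AR(1) coefficient
    theta=[0.2],   # hypothetical MA(1) coefficient
)
```

Here the last difference is 6 and the last error is -1, so the predicted difference is 0.5 × 6 + 0.2 × (-1) = 2.8 and the forecast is 110 + 2.8 = 112.8.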

The method I am using combines an ARIMA model with the STL decomposition computed earlier to produce forecasts. The ARIMA orders chosen automatically by the fitting algorithm are displayed in the title of each plot.

These results aren’t very surprising, but here they are! The model is able to replicate the strong seasonal trend that has been historically observed. Note that the error bars of the forecasts widen as the forecast horizon grows – this is typical of ARIMA forecasts, where uncertainty accumulates the further ahead the model predicts.
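To see why the error bars grow, consider the simplest differenced model, ARIMA(0,1,0), also known as a random walk. It is used here only as an illustration and is not the model fitted above. Forecasting h steps ahead from the last observation at time T adds h independent error terms, each with variance \(\sigma^2\):

\[y_{T+h} = y_T + \epsilon_{T+1} + \dots + \epsilon_{T+h}, \qquad \mathrm{Var}(y_{T+h} \mid y_T) = h\sigma^2\]

Assuming normally distributed errors, an approximate 95% interval is therefore

\[\hat{y}_{T+h} \pm 1.96\,\sigma\sqrt{h}\]

which widens in proportion to the square root of the horizon h.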

Conclusion

The eco-totem data describes a clear and strong seasonal trend in bicycle usage in Madison. The data does not indicate that there is a broader increase or decrease in cycling over the duration of the sample. This conclusion is the most interesting to me – I thought it likely that some increase in usage would be detectable in the data, if only due to long-term effects like population increase. But within the last five or so years, no trend like this is present.

Students arriving on campus in the fall do not appear to affect bike traffic as much as I would’ve expected, though I did not formally investigate this effect.

The data further suggests that daily weather is a likely cause of much of the day-to-day variation around the seasonal trend, an effect that I did not model. This is a direction that further research could take.

The seasonal trend can be used to forecast future path usage simply and easily. With further tuning, and with added covariates such as weather data, the forecasting model’s accuracy could be improved significantly.


  1. In my professional opinion, this name is kinda dumb.

  2. https://cityofmadison.com/news/madison-installs-first-visual-electronic-bicycle-counter

  3. https://data-cityofmadison.opendata.arcgis.com/datasets/eco-totem-southwest-path-bike-counts/explore

  4. https://data-cityofmadison.opendata.arcgis.com/datasets/eco-totem-capital-city-trail-bike-counts/explore

  5. STL stands for “Seasonal and Trend decomposition using Loess” and was introduced in R. B. Cleveland, Cleveland, McRae, & Terpenning (1990). https://www.wessa.net/download/stl.pdf