Time Series Analysis (Part I), the basics
This blog post effectively breaks my blog writing dry streak. The last official blog post (not stats related) was on my old blog platform (link) 1170 days ago! This post was motivated by a recent LinkedIn post by Joachim Schork about breaking down time series (see below and here) and the comments from the larger community. It's also motivated by a spate of time series decomposition analyses I've seen lately during meetings and discussions with colleagues.
Time series data
Definition
First, what is time series data? If you look in the dictionary it will say something along the lines of a series of values/measurements/observations obtained at successive times, often (but not always) with equal intervals between them. Simply put, it's data collected over time. In my corner of science that could mean daily water levels, weekly total phosphorus concentrations, annual seagrass coverage, etc. Once collected, these data can be analyzed in a variety of ways depending on why and where they are being collected. Again, in my realm of science, something we are interested in is the change in conditions over time for a variety of reasons including (but not limited to) climate change, landscape-scale changes (i.e. land-use alterations, dam removal, stream diversion, etc.), restoration activities, forecast modeling, etc. In this case, a time series analysis is needed to see how things are changing over time.
Formatting
When handling time series data in R you can either handle the data as a data.frame (or tibble if you are a tidyverse person … I personally am not, no judgement) or a ts object.
# Generate some example values
set.seed(123)
values <- runif(5) |>
  round(2)

# as a data.frame
data.frame(Yr = 2012:2016,
           value = values)
Yr value
1 2012 0.29
2 2013 0.79
3 2014 0.41
4 2015 0.88
5 2016 0.94
# as a ts object
ts(values, start = 2012)
Time Series:
Start = 2012
End = 2016
Frequency = 1
[1] 0.29 0.79 0.41 0.88 0.94
There are pros and cons to formatting the data either way. Most time series analysis functions can handle both in some way, but most prefer a ts object. The ts function essentially converts the data into a special kind of data matrix (in R it's called a class or object) with an added header containing some information about the time series (like the example above). Depending on the information you include in the ts function it makes some assumptions; for instance, frequency = 12 assumes monthly data and frequency = 365 assumes daily data.
rnorm(24) |>
  round(2) |>
  ts(frequency = 12)
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1 -1.69 1.24 -0.11 -0.12 0.18 1.28 -1.73 1.69 0.50 2.53 0.55 0.24
2 -1.05 1.29 0.83 -0.06 -0.78 -0.73 -0.22 -0.33 -1.09 -0.09 1.07 -0.15
See ?ts for more details. A very big drawback when using ts is that if you are working with daily data that includes a leap year (or a mix of years with and without leap years), ts doesn't know how to handle that extra day since the input is just a vector or list of data. There might be a way to coerce it but I've yet to figure it out, which is why I prefer to work in the data.frame world whenever possible.
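For example, here is a minimal sketch (the dates and column names are just illustrative) of keeping daily data that spans a leap year in a data.frame with a proper Date column, where the extra day is carried along naturally:

# Daily data spanning the 2020 leap year stored in a data.frame
dates <- seq(as.Date("2019-01-01"), as.Date("2021-12-31"), by = "1 day")
daily_dat <- data.frame(date = dates, value = rnorm(length(dates)))
nrow(daily_dat)                                  # 1096 rows (365 + 366 + 365)
sum(format(daily_dat$date, "%m-%d") == "02-29")  # the leap day (1) is retained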
Analyses
Most time series data exhibit some pattern; think about rainfall, climate cycles (ENSO), greenness of leaves, how many miles you drive/walk per unit time (i.e. week, month, etc.), and so on. Analytically, these patterns can be split into components that make up the overall time series. Usually when talking about time series data, terms like “trend” and “seasonality” come up, and these terms have very specific meanings.
Generally “trend” is defined as some change in the data over time. The most basic statistical analysis for time series data is the Kendall trend test. This test evaluates a time series to determine if the data are monotonically changing (increasing or decreasing) over time based on the Kendall-𝜏 coefficient. Statistical significance and the rate of change are also things to evaluate; more on that later. While the Kendall test looks for monotonic trends, not all trends have to be “linear”. I have another blog post planned (I hope) for that, stay tuned!
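As a quick, hypothetical illustration of the monotonic-but-not-linear point (not from the original post): the Kendall test only cares about the ordering of the values, so a strongly non-linear but steadily increasing series still comes back with a 𝜏 of 1.

# Monotonic but clearly non-linear series; Kendall's tau is still 1
t <- 1:50
y <- exp(0.1 * t)
cor.test(t, y, method = "kendall")$estimate
# tau = 1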
Seasonality is another important component for most, but not all, time series analyses. For instance, if you have annual or decadal data you won't be looking at seasonality, as those time steps aggregate across the seasons. Seasonality, as in when the seasons occur, is assumed to be fixed and at a known frequency (i.e. spring, summer, etc.).
Another term or component is cyclic or random. This is when the data show fluctuations that don't occur at a fixed frequency but happen throughout the data set.
Synthetic Data
As an example for this post (and the following series), rather than grabbing some random datasets I wanted to use something we have some knowledge about before digging into the specifics, including some of the components discussed above. To achieve this I put together some basic functions to simulate data. The first function, simulate_timeseries, is relatively basic; simulate_timeseries_vol is a little more complex and includes some volatility and randomness factors.
simulate_timeseries <- function(n = 1000,             # Number of observations
                                trend_slope = 0.01,   # Slope of linear trend
                                seasonal_amp = 2,     # Amplitude of seasonal component
                                seasonal_period = 12, # Period length
                                noise_mean = 1,       # Mean of noise
                                noise_sd = 0.5,       # Standard deviation of noise
                                seed.val = 123        # Value to set.seed(...)
){
  if(is.null(seed.val) == FALSE){set.seed(seed.val)}  # for reproducibility
  # Generate time points
  t <- 1:n

  # Create components
  # 1. Linear trend
  trend <- trend_slope * t
  # 2. Seasonal component using sine wave
  seasonal <- seasonal_amp * sin(2 * pi * t / seasonal_period)
  # 3. Random noise (stationary component)
  noise <- rnorm(n, mean = noise_mean, sd = noise_sd)

  # Combine components
  ts_data <- trend + seasonal + noise

  # Convert to time series object
  ts_result <- ts(ts_data, frequency = seasonal_period)

  # Return both the time series and its components for analysis
  return(list(
    timeseries = data.frame(time = t, value = as.numeric(ts_result)),
    components = list(
      trend = trend,
      seasonal = seasonal,
      noise = noise
    )
  ))
}
simulate_timeseries_vol <- function(n = 1000,              # Number of observations
                                    trend_slope = 0.01,    # Slope of linear trend
                                    seasonal_amp = 2,      # Seasonal amplitude
                                    seasonal_period = 12,  # Period length
                                    init_vol = 0.5,        # Initial volatility
                                    vol_persistence = 0.95,# Persistence in volatility
                                    rw_sd = 0.1,           # Random walk innovation SD
                                    seed.val = 123         # Value to set.seed(...)
){
  if(is.null(seed.val) == FALSE){set.seed(seed.val)}  # for reproducibility
  # Generate time points
  t <- 1:n

  # 1. Create non-linear trend (combining linear trend with random walk)
  linear_trend <- trend_slope * t
  random_walk <- cumsum(rnorm(n, 0, rw_sd))
  trend <- linear_trend + random_walk

  # 2. Create time-varying seasonal component
  #    Amplitude changes over time following a random walk
  varying_amplitude <- seasonal_amp + cumsum(rnorm(n, 0, rw_sd/5))
  seasonal <- varying_amplitude * sin(2 * pi * t / seasonal_period)

  # 3. Generate time-varying volatility (GARCH-like process)
  # Initialize volatility vector
  volatility <- numeric(n)
  volatility[1] <- init_vol
  # Generate volatility process
  for(i in 2:n) {
    # Volatility follows AR(1) process with innovations
    volatility[i] <- sqrt(0.01 +
                            vol_persistence * volatility[i-1]^2 +
                            (1 - vol_persistence) * rnorm(1, 0, 0.1)^2)
  }

  # 4. Generate heteroskedastic noise
  noise <- rnorm(n, 0, 1) * volatility

  # 5. Add structural breaks
  # Add random level shifts
  n_breaks <- max(1, round(n/200))  # Approximately one break every 200 observations
  break_points <- sort(sample(2:n, n_breaks))
  level_shifts <- cumsum(rnorm(n_breaks, 0, 2))  # Random shift magnitudes
  breaks <- numeric(n)
  current_break <- 1
  for(i in 1:n) {
    if(current_break <= length(break_points) && i >= break_points[current_break]) {
      breaks[i:n] <- level_shifts[current_break]
      current_break <- current_break + 1
    }
  }

  # Combine all components
  ts_data <- trend + seasonal + noise + breaks
  # Convert to time series object
  ts_result <- ts(ts_data, frequency = seasonal_period)

  # Return both the time series and its components
  return(list(
    timeseries = data.frame(time = t, value = as.numeric(ts_result)),
    components = list(
      trend = trend,
      seasonal = seasonal,
      noise = noise,
      breaks = breaks,
      volatility = volatility
    )
  ))
}
Here are some example simulated datasets. Let's assume the data are monthly for a 15-year period of record.
n.vals <- 15*12  # 15 years, 12 months per year

dat1 <- simulate_timeseries(
  n = n.vals,            # About 15 years of monthly data
  trend_slope = 0.025,   # Upward trend
  seasonal_amp = 2,      # Seasonal fluctuation of ±2
  seasonal_period = 12,  # Monthly seasonality
  noise_sd = 0.5,        # Moderate noise
  noise_mean = 3
)
## add some years and months
dat1$timeseries <- cbind(dat1$timeseries,
                         expand.grid(Mon = 1:12, Yr = 1990:2004))
dat2 <- simulate_timeseries_vol(
  n = n.vals,            # About 15 years of monthly data
  trend_slope = 0.05,    # Upward trend
  seasonal_amp = 2,      # Base seasonal amplitude
  seasonal_period = 10,  # 10-step seasonal period
  init_vol = 0.5,        # Initial volatility
  vol_persistence = 0.65,
  rw_sd = 0.3
)
## add some years and months
dat2$timeseries <- cbind(dat2$timeseries,
                         expand.grid(Mon = 1:12, Yr = 1990:2004))
dat3 <- simulate_timeseries(
  n = n.vals,            # About 15 years of monthly data
  trend_slope = 0.001,   # No trend
  seasonal_amp = 2,      # Seasonal fluctuation of ±2
  seasonal_period = 12,  # Monthly seasonality
  noise_sd = 0.5,        # Moderate noise
  noise_mean = 3
)
dat4 <- simulate_timeseries(
  n = n.vals,            # About 15 years of monthly data
  trend_slope = -0.05,   # Downward trend
  seasonal_amp = 2,      # Seasonal fluctuation of ±2
  seasonal_period = 12,  # Monthly seasonality
  noise_sd = 0.5,        # Moderate noise
  noise_mean = 15
)
Here is a quick plot of the data …
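(The plotting code isn't reproduced here; a minimal base R sketch along these lines, with layout and labels being my own choices, would give a comparable quick look.)

# Quick look at the four simulated series
par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
plot(value ~ time, dat1$timeseries, type = "l", las = 1, main = "dat1: upward trend")
plot(value ~ time, dat2$timeseries, type = "l", las = 1, main = "dat2: trend + volatility")
plot(value ~ time, dat3$timeseries, type = "l", las = 1, main = "dat3: no trend")
plot(value ~ time, dat4$timeseries, type = "l", las = 1, main = "dat4: downward trend")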
Basic Trend Analysis
A simple trend test would look like this … For simplicity's sake we are going to use time since it is already numeric, but a decimal date (see lubridate::decimal_date for more info) is also an option.
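If you do have calendar dates, a small sketch like this (assuming the lubridate package is installed; the dates are just illustrative) converts them to a numeric axis that works with cor.test:

# Convert calendar dates to decimal years for use as a numeric time axis
library(lubridate)
dates <- as.Date(c("2012-01-15", "2012-07-15", "2013-01-15"))
decimal_date(dates)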
cor.test(dat1$timeseries$time,dat1$timeseries$value,method = "kendall")
Kendall's rank correlation tau
data: dat1$timeseries$time and dat1$timeseries$value
z = 8.7492, p-value < 2.2e-16
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
0.438982
cor.test(dat2$timeseries$time,dat2$timeseries$value,method = "kendall")
Kendall's rank correlation tau
data: dat2$timeseries$time and dat2$timeseries$value
z = 11.996, p-value < 2.2e-16
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
0.6018622
Based on the plots you would have guessed that those datasets had an increasing trend. But here is what no trend (dat3) and a negative trend (dat4) look like.
cor.test(dat3$timeseries$time,dat3$timeseries$value,method = "kendall")
Kendall's rank correlation tau
data: dat3$timeseries$time and dat3$timeseries$value
z = -0.24248, p-value = 0.8084
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
-0.01216636
cor.test(dat4$timeseries$time,dat4$timeseries$value,method = "kendall")
Kendall's rank correlation tau
data: dat4$timeseries$time and dat4$timeseries$value
z = -13.708, p-value < 2.2e-16
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
-0.6877716
Sometimes we are more interested in long-term trends … be careful when using “long” and “short”; these terms are usually very context specific and care should be taken to delineate/define them as much as possible. For the sake of brevity let's look at an annual trend analysis.
First, the data needs to be aggregated:
dat1_yr <- aggregate(value ~ Yr, dat1$timeseries, mean)
You can also look at the annual variability or standard deviation using aggregate.
aggregate(value~Yr,dat1$timeseries,sd)
Yr value
1 1990 1.631108
2 1991 1.674150
3 1992 1.277267
4 1993 1.359000
5 1994 1.481055
6 1995 1.197387
7 1996 1.442100
8 1997 1.458378
9 1998 1.646581
10 1999 1.457668
11 2000 1.455360
12 2001 1.447669
13 2002 1.428132
14 2003 1.460503
15 2004 1.533227
Or do something more elaborate to calculate all sorts of statistics. My preference (again, not tidyverse) is the functionality in the plyr package.
library(plyr)
ddply(dat1$timeseries, c("Yr"), summarise,
      mean.val = mean(value),
      sd.val = sd(value),
      var.val = var(value),
      N.val = length(value), # or AnalystHelper::N.obs(value)
      SE = AnalystHelper::SE(value),
      med.val = median(value),
      min.val = min(value),
      max.val = max(value)) |>
  round(2) # added to consolidate the table down
Yr mean.val sd.val var.val N.val SE med.val min.val max.val
1 1990 3.26 1.63 2.66 12 0.47 3.61 0.84 5.85
2 1991 3.36 1.67 2.80 12 0.48 3.03 0.99 6.03
3 1992 3.85 1.28 1.63 12 0.37 4.20 1.92 6.09
4 1993 4.02 1.36 1.85 12 0.39 3.96 1.86 5.82
5 1994 4.48 1.48 2.19 12 0.43 4.82 1.65 6.40
6 1995 4.60 1.20 1.43 12 0.35 4.42 2.99 6.41
7 1996 4.96 1.44 2.08 12 0.42 4.88 3.03 7.14
8 1997 5.44 1.46 2.13 12 0.42 5.46 3.30 7.72
9 1998 5.48 1.65 2.71 12 0.48 5.27 3.15 7.95
10 1999 5.75 1.46 2.12 12 0.42 5.66 3.90 7.94
11 2000 6.20 1.46 2.12 12 0.42 6.26 3.74 8.05
12 2001 6.29 1.45 2.10 12 0.42 6.56 4.05 8.70
13 2002 6.70 1.43 2.04 12 0.41 6.46 4.61 8.78
14 2003 7.18 1.46 2.13 12 0.42 6.97 4.92 9.46
15 2004 7.42 1.53 2.35 12 0.44 7.79 5.12 9.17
Once the data is aggregated we can use the simple Kendall test to look at the overall annual trend in the data.
cor.test(dat1_yr$Yr,dat1_yr$value,method = "kendall")
Kendall's rank correlation tau
data: dat1_yr$Yr and dat1_yr$value
T = 105, p-value = 1.529e-12
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
1
Looks like a perfect increasing trend! The Kendall-𝜏 is a type of correlation coefficient, so the values range between -1 and 1. Based on the p-value (assuming an 𝛼 of 0.05) this is a statistically significant trend. (Let's not get into the p-value discussion!)
We can also look at the rate of change. The most common estimator is the Theil-Sen slope. There are several different packages out there that can estimate the Theil-Sen slope, but the most basic one that I've come across is the median-based linear model in the mblm package. We can discuss the basics of Theil-Sen and mblm in a later post, but after some digging and searching, what I've determined is that the Theil-Sen method and the median-based linear model are closely related. Generally, the Theil-Sen method is a non-parametric method that takes the median of the slopes of all lines through pairs of points in the data, and the intercept is often the median of the residuals after fitting the slope. Meanwhile, the median-based linear model is a regression model that uses medians as the fitting criterion and minimizes the sum of absolute residuals. In most cases the Theil-Sen slope and the slope determined by the median-based linear model are the same, with minuscule differences. Most trend routines use median-based linear models or quantile regressions as a basis to perform Theil-Sen estimation.
If you run a side-by-side of mblm, zyp and a separate Theil-Sen estimator pulled from the literature (found here), you'll undoubtedly come to similar, if not the same, values. Don't believe me? Let's give it a try.
library(zyp)
library(mblm)

# From the University of Virginia function (from literature)
theil_sen <- function(x, y){
  n <- length(x)
  max_n_slopes <- (n * (n - 1)) / 2
  slopes <- vector(mode = 'list', length = max_n_slopes) # list of NULL elements
  add_at <- 1
  # Calculate point-to-point slopes if x_i != x_j
  for (i in 1:(n - 1)) {
    slopes_i <- lapply((i + 1):n, function(j)
      if (x[j] != x[i]) { (y[j] - y[i]) / (x[j] - x[i]) })
    slopes[add_at:(add_at + length(slopes_i) - 1)] <- slopes_i
    add_at <- add_at + length(slopes_i)
  }
  # Calculate Theil-Sen slope
  slopes <- unlist(slopes) # drop NULL elements
  theil_sen_slope <- median(slopes, na.rm = TRUE)
  # Calculate Theil-Sen intercept
  intercepts <- y - (theil_sen_slope * x)
  theil_sen_intercept <- median(intercepts, na.rm = TRUE)
  # Return
  c('Theil-Sen intercept' = theil_sen_intercept,
    'Theil-Sen slope' = theil_sen_slope)
}
set.seed(123)
# Generate linear data
x <- 1:100
y <- 2*x + rnorm(100, 0, 10)
# Add some outliers
y[c(10, 30, 50, 70, 90)] <- y[c(10, 30, 50, 70, 90)] + 50

plot(y ~ x, las = 1)

zyp.sen <- zyp.sen(y ~ x)             # zyp package
mblm.sen <- mblm(y ~ x, repeated = F) # mblm package
UV.sen <- theil_sen(x, y)             # University of Virginia function (from literature)
Pack intercept Slope
1 zyp 0.40106 2.01976
2 mblm 0.40106 2.01976
3 UV 0.40106 2.01976
There you have it, all three methods produce the same estimates.
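The code that assembled the little comparison table above isn't shown in the post; a sketch along these lines could reproduce it (extracting the coefficients with coef() assumes the zyp and mblm fits store them in the usual coefficients element, which is worth double checking):

# Line up the intercept/slope pairs from the three fits
data.frame(Pack = c("zyp", "mblm", "UV"),
           intercept = c(coef(zyp.sen)[1], coef(mblm.sen)[1], UV.sen[1]),
           Slope = c(coef(zyp.sen)[2], coef(mblm.sen)[2], UV.sen[2]))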
Back to our aggregated dataset. The code is pretty straightforward but there are a couple of things to be aware of. If you've fit linear models using lm it's the same general format. However, currently the mblm function can't handle datasets with NAs (unlike lm), so you might have to clean up the data a little (i.e. na.omit() can be your friend). Also double check that both variables are numeric.
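If you do run into that, a small hypothetical sketch of the clean-up step (using the aggregated data from above) might look like:

# Drop rows with NAs and make sure both variables are numeric before calling mblm()
dat1_yr <- na.omit(dat1_yr)
dat1_yr$Yr <- as.numeric(dat1_yr$Yr)
str(dat1_yr) # both columns should show as numeric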
library(mblm)
dat1.theilsen <- mblm(value ~ Yr, dat1_yr, repeated = F)

dat1.theilsen
Call:
mblm(formula = value ~ Yr, dataframe = dat1_yr, repeated = F)
Coefficients:
(Intercept) Yr
-587.9833 0.2971
# extrapolate example
predict(dat1.theilsen,data.frame(Yr=2005:2014),interval ="confidence")
fit lwr upr
1 7.629230 7.500516 7.757944
2 7.926294 7.784963 8.067624
3 8.223357 8.069143 8.377572
4 8.520421 8.353116 8.687725
5 8.817484 8.636929 8.998040
6 9.114548 8.920614 9.308482
7 9.411611 9.204195 9.619028
8 9.708675 9.487692 9.929658
9 10.005739 9.771119 10.240358
10 10.302802 10.054488 10.551116
The Theil-Sen estimator for the annually aggregated data is 0.297 units per year; this is the annual rate of change. If this were something like sea-level rise, rainfall, population growth, etc., you could do a very basic analysis and ask, at this rate, what would the values be in 10 years (extrapolating … 10.3 units), or in the case of sea-level rise, what is the rise per decade (2.97 units per decade)?
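For instance, a small sketch of pulling the fitted slope back out of the model and expressing it per decade:

# Annual rate of change from the fitted model, scaled to a decade
slope_yr <- coef(dat1.theilsen)["Yr"]
slope_yr      # ~0.297 units per year
slope_yr * 10 # ~2.97 units per decade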
What's next?
To avoid making this blog post a novel, I'm going to cut it short here. We went over the basics of time series: what they are and how they are defined, what they look like in the R environment, generating some synthetic time series for examples, and a quick overall trend test. This post kicks off a series of posts that will delve into other aspects of time series analyses such as data requirements for trend analysis, seasonal trend analyses, stationarity, periodicity, time series decomposition and an alternative to time series decomposition.
- Session info ---------------------------------------------------------------
setting value
version R version 4.1.0 (2021-05-18)
os Windows 10 x64 (build 19045)
system x86_64, mingw32
ui RTerm
language (EN)
collate English_United States.1252
ctype English_United States.1252
tz America/New_York
date 2024-12-08
pandoc 2.19.2
quarto 1.3.450
- Packages -------------------------------------------------------------------
package * version date (UTC) lib source
AnalystHelper * 0.1.5 2024-10-31 [] local
Kendall * 2.2.1 2022-03-20 [] CRAN (R 4.1.3)
mblm * 0.12.1 2019-01-26 [] CRAN (R 4.1.1)
plyr * 1.8.8 2022-11-11 [] CRAN (R 4.1.3)
sessioninfo * 1.2.2 2021-12-06 [] CRAN (R 4.1.3)
zyp * 0.11 2023-01-26 [] CRAN (R 4.1.3)
------------------------------------------------------------------------------