Longitudinal data analysis

Steve Simon


This page is currently being updated from the earlier version of my website. Sorry that it is not yet fully available.

[This is a very early draft]

Longitudinal data are data where each patient is observed on multiple occasions over time. Analysis of longitudinal data is challenging because measurements on the same subject are correlated. Another way to think about this is that two measurements on the same subject will typically differ less from each other than two measurements on different subjects.
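Here is a small simulation that illustrates the point about variation. It is only a sketch under assumptions of my own choosing: each measurement is a subject-specific effect plus independent measurement error (a random intercept model), and the function name, standard deviations, sample size, and seed are all made up for illustration.

```python
import random
import statistics

def simulate_differences(n_subjects=5000, sd_subject=2.0, sd_error=1.0, seed=1):
    """Simulate two measurements per subject under a random intercept model.

    Returns the variance of same-subject differences and the variance of
    differences between measurements taken on different subjects.
    All default values are illustrative assumptions, not real data.
    """
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(n_subjects):
        subject_effect = rng.gauss(0.0, sd_subject)  # shared by both visits
        first.append(subject_effect + rng.gauss(0.0, sd_error))
        second.append(subject_effect + rng.gauss(0.0, sd_error))

    # Same subject: the shared subject effect cancels in the difference.
    within = [a - b for a, b in zip(first, second)]
    # Different subjects: pair each subject's first visit with the next
    # subject's second visit, so nothing cancels.
    between = [first[i] - second[(i + 1) % n_subjects] for i in range(n_subjects)]
    return statistics.variance(within), statistics.variance(between)

var_within, var_between = simulate_differences()
print(f"variance of same-subject differences:      {var_within:.2f}")
print(f"variance of different-subject differences: {var_between:.2f}")
```

Under these assumptions the same-subject differences should have a variance near 2 (twice the error variance), while the different-subject differences should have a variance near 10 (twice the sum of the subject and error variances). The gap between the two is the correlation at work.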

A closely related concept is the cluster design. A cluster design is one where the researcher selects clusters of patients rather than selecting patients individually. For example, a researcher might randomly select several families and evaluate all the children in each family. As another example, a researcher might randomly select several clinical practices and then evaluate a random group of patients at each practice. In a cluster design, two measurements on patients within the same cluster will typically differ less than measurements on two patients in different clusters. In genetics, this correlation is of great interest, and can help you understand concepts like heritability.

Many of the methods described below for longitudinal designs would also be useful for cluster designs. For simplicity, I will discuss these methods solely from the perspective of a longitudinal design.

If your data are continuous, then there are several “classical” approaches such as multivariate analysis of variance and repeated measures analysis of variance. These approaches work well for simple, well-structured longitudinal data.

An alternative is to use mixed linear models. These models handle missing data well and can handle situations where the times of measurement vary from one patient to another.

In a mixed linear model, you specify a particular structure for the correlations. For example, an autoregressive structure is commonly used when correlations are strongest for measurements close together in time and become weaker as the measurements are separated further in time.
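To make the autoregressive idea concrete, here is a minimal sketch of a first-order autoregressive, or AR(1), correlation matrix for equally spaced measurements. In this structure the correlation between measurements i and j is rho raised to the power |i - j|. The function name and the value rho = 0.5 are my own choices for illustration.

```python
def ar1_correlation(n_times, rho):
    """Return the n_times x n_times AR(1) correlation matrix as nested lists.

    Assumes equally spaced measurement times; rho is the correlation
    between two adjacent measurements.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return [[rho ** abs(i - j) for j in range(n_times)] for i in range(n_times)]

# Four measurement occasions with an adjacent-time correlation of 0.5.
for row in ar1_correlation(4, 0.5):
    print("  ".join(f"{value:.3f}" for value in row))
```

With rho = 0.5, the correlation falls from 0.5 for adjacent measurements to 0.25 for measurements two occasions apart and 0.125 for measurements three occasions apart. Notice that a single parameter describes the whole matrix, which is part of the appeal of this structure.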

In many situations, the correlations are not of direct interest; you account for them only because failing to do so will lead to incorrect inferences.

When you are examining the correlation structure, a statistic called the Akaike Information Criterion (AIC) is helpful. This statistic measures how closely the model fits the data, but it includes a penalty for overly complex models.

Unfortunately, there are two different formulas for AIC. For one formula, a large value of AIC is good, and for the other formula, a small value is good.
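Here is a short sketch of the two conventions, under the assumption that the formulas in question are the commonly seen forms -2 log L + 2k (small is good) and log L - k (large is good), where log L is the maximized log-likelihood and k is the number of estimated parameters. The function names, log-likelihood values, and parameter counts below are made up purely for illustration.

```python
def aic_smaller_is_better(log_likelihood, n_params):
    """AIC in the form -2 log L + 2k; prefer the model with the SMALLEST value."""
    return -2.0 * log_likelihood + 2.0 * n_params

def aic_larger_is_better(log_likelihood, n_params):
    """AIC in the form log L - k; prefer the model with the LARGEST value."""
    return log_likelihood - n_params

# Hypothetical fits of the same data with three correlation structures:
# (log-likelihood, number of covariance parameters). Not real results.
fits = {
    "compound symmetry": (-520.4, 2),
    "AR(1)": (-512.9, 2),
    "unstructured": (-509.8, 10),
}

for name, (loglik, k) in fits.items():
    small = aic_smaller_is_better(loglik, k)
    large = aic_larger_is_better(loglik, k)
    print(f"{name:18s} smaller-is-better: {small:8.1f}   larger-is-better: {large:8.1f}")

best_small = min(fits, key=lambda name: aic_smaller_is_better(*fits[name]))
best_large = max(fits, key=lambda name: aic_larger_is_better(*fits[name]))
print("preferred (smaller-is-better form):", best_small)
print("preferred (larger-is-better form): ", best_large)
```

The first form is just -2 times the second, so both always pick the same model. The danger is only in misreading the direction, so check your software's documentation to see which form it reports.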

AIC values should only be compared for models where the only change is in the correlation structure. It would not make sense to compare an AIC from a model with linear relationships to an AIC from a model with quadratic relationships.

What if your data are not continuous? L. Fang discussed some of the approaches commonly used when the data represent binomial counts.

Further reading

You can find an earlier version of this page on my original website.