The Difference Between Clustered, Longitudinal, and Repeated Measures Data

What is the difference between Clustered, Longitudinal, and Repeated Measures Data?  You can use mixed models to analyze all of them. But the issues involved and some of the specifications you choose will differ.

Just recently, I came across a nice discussion about these differences in West, Welch, and Galecki’s (2007) excellent book, Linear Mixed Models.

It’s a common question. There is a lot of overlap in both the study design and in how you analyze the data from these designs.

West et al. give a very nice summary of the three types. Here’s a paraphrase of the differences as they explain them:

  • In clustered data, the dependent variable is measured once for each subject, but the subjects themselves are somehow grouped (students grouped into classes, for example). There is no ordering to the subjects within the group, so their responses should be equally correlated.
  • In repeated measures data, the dependent variable is measured more than once for each subject. Usually, there is some independent variable (often called a within-subject factor) that changes with each measurement.
  • In longitudinal data, the dependent variable is measured at several time points for each subject, often over a relatively long period of time.
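To make the three layouts concrete, here is a minimal sketch in Python. The records, field names, and the helper `rows_per_subject` are all made up for illustration; none of them come from West et al. The only point is how many rows each subject contributes and what distinguishes those rows.

```python
from collections import Counter

# Hypothetical records, one dict per measurement (illustration only).

# Clustered: each student appears once; students share a class.
clustered = [
    {"cluster": "class_A", "subject": "s1", "y": 71},
    {"cluster": "class_A", "subject": "s2", "y": 64},
    {"cluster": "class_B", "subject": "s3", "y": 80},
    {"cluster": "class_B", "subject": "s4", "y": 77},
]

# Repeated measures: each subject appears once per condition
# (the within-subject factor).
repeated = [
    {"subject": "s1", "condition": "low_load", "y": 512},
    {"subject": "s1", "condition": "high_load", "y": 630},
    {"subject": "s2", "condition": "low_load", "y": 498},
    {"subject": "s2", "condition": "high_load", "y": 575},
]

# Longitudinal: each subject appears once per time point.
longitudinal = [
    {"subject": "s1", "year": 0, "y": 20.1},
    {"subject": "s1", "year": 1, "y": 21.4},
    {"subject": "s1", "year": 2, "y": 23.0},
    {"subject": "s2", "year": 0, "y": 18.7},
    {"subject": "s2", "year": 1, "y": 19.9},
    {"subject": "s2", "year": 2, "y": 20.2},
]


def rows_per_subject(records):
    """Count how many measurements each subject contributes."""
    return Counter(record["subject"] for record in records)


for name, records in [
    ("clustered", clustered),
    ("repeated measures", repeated),
    ("longitudinal", longitudinal),
]:
    print(name, dict(rows_per_subject(records)))
```

In the clustered layout every subject contributes one row, and the dependence comes from the shared class. In the other two layouts the dependence comes from the repeated rows within a subject, indexed by condition in one case and by time in the other.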

A Few Observations

West and colleagues also make the following good observations:

1. Dropout is usually not a problem in repeated measures studies, in which all data collection occurs in one sitting.  It is a huge issue in longitudinal studies, which usually require multiple contacts with participants for data collection.

2. Longitudinal data can also be clustered.  If you follow those students for two years, you have both clustered and longitudinal data.  You have to deal with both.

3. It can be hard to distinguish between repeated measures and longitudinal data if the repeated measures occur over time.  [My two cents:  A pre/post/followup design is a classic example].

4. From an analysis point of view, it doesn’t really matter which one you have.  All three are types of hierarchical, nested, or multilevel data. You would analyze them all with some sort of mixed or multilevel analysis.  You may of course have extra issues (like dropout) to deal with in some of these.
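As a rough illustration of what "some sort of mixed or multilevel analysis" can look like, here is the simplest version, a random intercept model. The notation is an assumption chosen for this sketch, not necessarily the notation West et al. use.

```latex
y_{ij} = \beta_0 + \beta_1 x_{ij} + u_j + \varepsilon_{ij},
\qquad u_j \sim N(0, \sigma_u^2),
\qquad \varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2)

\operatorname{Corr}(y_{ij}, y_{i'j}) = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_\varepsilon^2}
\quad \text{for } i \neq i'
```

Here j indexes the higher-level unit and i the observation within it. For clustered data, j would be the class and i the student. For repeated measures or longitudinal data, j would be the subject and i the measurement occasion. In this simple form, any two observations in the same unit have the same correlation, which matches the "equally correlated" description of clustered data. Ordered measurements may need more than this.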

My Own Observations

I agree with their observations, and I’d like to add a few from my own experience.

1. Repeated measures don’t have to be repeated over time.  They can be repeated over space (the right knee gets the control operation and the left knee gets the experimental operation). They can also be repeated over condition (each subject gets both the high and low cognitive load condition).  Longitudinal studies are pretty much always over time.

This becomes an issue mainly when you are choosing a covariance structure for the within-subject residuals (as determined by the Repeated statement in SAS’s Proc Mixed or SPSS Mixed).  An auto-regressive structure is often needed when some repeated measurements are closer to each other than others (over either time or space).  This is not an issue with purely clustered data, since there is no order to the observations within a cluster.
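To see why ordering matters for the covariance structure, here is a small sketch that builds two common structures by hand: compound symmetry, where every pair of measurements is equally correlated, and first-order auto-regressive, AR(1), where correlation decays with the distance between measurements. The function names and the values of `variance` and `rho` are hypothetical, and the AR(1) version assumes equally spaced measurements.

```python
def compound_symmetry(n, variance, rho):
    """Every pair of distinct measurements shares the same correlation rho."""
    return [
        [variance if i == j else variance * rho for j in range(n)]
        for i in range(n)
    ]


def ar1(n, variance, rho):
    """Correlation decays as rho ** |i - j| with the lag between measurements."""
    return [
        [variance * rho ** abs(i - j) for j in range(n)]
        for i in range(n)
    ]


# Four measurements per subject, unit variance, rho = 0.5 (illustrative values).
cs = compound_symmetry(4, variance=1.0, rho=0.5)
ar = ar1(4, variance=1.0, rho=0.5)

for name, matrix in (("compound symmetry", cs), ("AR(1)", ar)):
    print(name)
    for row in matrix:
        print("  ", [round(value, 3) for value in row])
```

Under compound symmetry the first and fourth measurements are as strongly correlated as the first and second. Under AR(1) the first and fourth are much less correlated than neighbors. With purely clustered data there is no "first" or "fourth" observation, so only the equal-correlation pattern makes sense.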

2. Time itself is often an important independent variable in longitudinal studies, but in repeated measures studies, it is usually confounded with some independent variable.

When you’re deciding on an analysis, it’s important to think about the role of time.  Time is not important in an experiment, where each measurement is a different condition (with order often randomized).  But it’s very important in a study designed to measure changes in a dependent variable over the course of 3 decades.

3. Time may be measured with some proxy like Age or Order.  But it’s still really about time.

4. A longitudinal study does not have to be over years.  You could be measuring reaction time every second for a minute.  In cases like this, dropout isn’t an issue, although time is an important predictor.

5. Consider whether it makes sense to think about time as continuous or categorical.  If you have only two time points, even if you have numerical measurements for them, there is no point in treating it as continuous.  You need at least three time points to fit a line, but more is always better.

6. Longitudinal data can be analyzed with many statistical methods, including structural equation modeling and survival analysis.  You only use multilevel modeling if the dependent variable is measured repeatedly and if the point of the model is to see how it changes (or differs).
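Observation 5 can be checked with a little arithmetic. The sketch below fits an ordinary least squares line for a single hypothetical subject, once with two time points and once with three. The helper `fit_line` and the data values are made up for illustration.

```python
def fit_line(times, values):
    """Ordinary least squares fit of y = intercept + slope * t."""
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    residuals = [y - (intercept + slope * t) for t, y in zip(times, values)]
    return intercept, slope, residuals


# Hypothetical measurements for one subject.
_, slope_two, res_two = fit_line([0, 1], [10.0, 14.0])
_, slope_three, res_three = fit_line([0, 1, 2], [10.0, 14.0, 15.0])

# Sum of squared residuals: how far the data fall from the fitted line.
sse_two = sum(r ** 2 for r in res_two)
sse_three = sum(r ** 2 for r in res_three)

print("two points:   slope =", slope_two, " SSE =", sse_two)
print("three points: slope =", slope_three, " SSE =", sse_three)
```

With two time points the line passes through both observations, so the residuals are zero and the slope is just the difference between the two measurements. Nothing is left over to tell you whether a straight line is a reasonable shape. With three points there is at least some information about how well the line fits.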

Naming a data structure, design, or analysis is most helpful if it is so specific that it defines yours exactly. Your repeated measures analysis may not be like the repeated measures example you’re trying to follow. Rather than trying to name the analysis or the data structure, think about the issues involved in your design, your hypotheses, and your data. Work with them accordingly.


Comments

  1. Marjoleine says

    The post got me thinking about the modelling of a flux, i.e. CO2 emissions by microbes in soil as a function of time. For this, several exponential models are being used, along with fit statistics such as adjusted R2, AIC, and RMSE. I would think that autocorrelation is not an issue here, but rather the essence of the analysis. Is that right? What other statistic could be used to differentiate between models, in particular for the prediction at a fixed time point?

  2. Nik says

    Hi Karen,

    Thanks very much for all of your posts; they really do clarify so many issues that tend to be left unsaid in this sphere! I was really hoping you might be able to shed some light on a problem I’ve been struggling with.

    [Trying to keep it brief]… I have a 3-visit baseline-pre-post intervention dataset where my primary question is to evaluate how cell-based treatment affects a single DV as compared to no treatment in patients with a degenerative illness. This single DV was measured twice at each timepoint per individual, from the same body part but on opposite sides of the body. The amount of time between each timepoint was variable. As opposed to a perhaps classical design, the treatment was administered after the 2nd timepoint, hence in this group we would expect symptoms to become worse between the first two visits, but better by visit 3. In addition, individuals entered the study at variable times after their diagnosis, and we expect this to be related to decline in the DV.

    I had initially tried using a linear mixed effects model (random intercepts and slopes), since this could take into account the effect of time (using duration of illness as a proxy) and the time between visits. However, because I expect the DV to decline before ascending following the 2nd timepoint in the treatment group, I only used the 2nd and 3rd timepoints for that group. For the control group, I retained all visits. However, I struggled with singularity and convergence problems. In addition, I was informed that LMEs require at least 3 timepoints per individual. I am slightly confused by this because I was also informed that you can (and should) include data with only 1 timepoint (i.e. incomplete cases, which I also have). Would you be able to shed some light?

    I then moved on (at the request of my supervisor and a statistician) to using an ANCOVA with repeated measurements and Body Side as within-subject factors and Treatment as a between-subjects factor (…and time since diagnosis [at baseline] as a covariate). But in doing so I am having to remove more than 50% of my data to include only complete cases and I’m feeling as if I’m far from adequately taking into account the effect of time and time since diagnosis [as it is modelled here as time-invariant].

    I actually feel like I’m losing my mind, and am in the same situation as Hannah above, with no or AWOL or exhausted avenues of help to pursue. If you could shed at least a little bit of light I would be eternally grateful.

    All the best,

    Nik

    P.S. I’m so sorry for the length of this post..

    • Karen Grace-Martin says

      Hi Nik,

      This stuff gets really complicated (as you’ve seen). You don’t always need three time points per individual, but there are situations you do (like if time is numerical, not categorical).

      We’d be happy to help you, but we’d have to have a conversation. Please take a look at our membership or consulting services.

  3. Sherly Meilianti says

    Hi Karen,

    Thank you very much for this very helpful article.
    I would like to ask your opinion about my research. Basically, I measured one continuous variable in many countries around the world, at 4 different points in time. My interest is actually to predict the value of that continuous variable in the future (let’s say in 2020). I believe this would be repeated measures, am I right? After reading your article, I am just a bit confused between repeated measures and longitudinal data. I analysed my data using a Linear Mixed Model. Is that the right analysis to do?
    Thank you very much for your help.

  4. E says

    Hi Karen,

    Thank you for the great article. I’m stepping into a role where my predecessor described repeating the same variable (e.g., a question or series of questions) across time and different, though similar, populations as a “longitudinal study”.

    Specifically, learners are presented with a question in an education. 6 months to a year later another education is released (to whoever wants to take it though some of the original learners may pull through), and that original question is present in the new education.

    Those and successive values were then plotted and presented as longitudinal… as far as I can see, percent change and independent samples t tests were run to identify whether the observed difference met significance. I know this method of testing is wildly incorrect…. but can the framework of this “operational definition of a longitudinal analysis” (the same question or questions repeated across time to different learners) actually be used i.e., with a mixed model?

  5. Lynda says

    Hi Karen,
    I’ve been thinking about longitudinal/repeated measures design and was wondering which one would be better for a design where you might have a control and an experimental group (e.g. no intervention and intervention) and an outcome measured at three different time points (before they received the intervention at age 2, right after they finished the intervention at age 5, and three years after they finished the intervention at age 7). The outcome would be something like language skills (continuous). Would a growth curve model (multilevel modeling/hierarchical modeling) analysis be appropriate? What would the difference be with a repeated measures ANOVA? Thank you!

    • Karen says

      Hi Lynda,

      Yes, I would use the growth curve model. You will get exactly the same results from RM ANOVA only if there is no missing data (which is pretty rare in real data sets).

  6. Oscar says

    Hi Karen,
    I need to model monthly maternal mortality data. I consider the subjects to be some selected regional hospitals, with the outcome measured monthly, from January to December. Which model can I use? I know this is purely count data. Is it clustered, longitudinal, or repeated measures? Should I use a Poisson regression model or generalized estimating equations? Thank you.

  7. Jan says

    Hi,
    I want to run a longitudinal analysis to show a causal relationship between obesity and later educational attainment/income/deprivation. I know that I should adjust for these outcomes at baseline (time 1) but I’m unsure what is the best model to use. My outcome is categorical (but could be assumed continuous), my IV is categorical also (normal weight, overweight, obese) but I could treat this as continuous BMI.

    Could anyone help? I have multiple time points and the same variables at each time point.

    Thanks,

    Jan

  8. Marcos says

    Hi Karen,

    My question is about how to apply panel-data models to analyze longitudinal data. My dataset includes patients for whom I collected clinical data (Infection, Gender and Age) and bacterial abundances as proportions over 10 time points. These patients all started healthy (time points 1 to 5) until they caught a viral infection (time points 6 to 10).

    I want to assess if variation in proportions for each bacteria (separately) is significantly associated with infection status (yes/no) while accounting for confounders (gender and age). To do so I thought of estimating odds ratios for each bacteria using generalized estimating equations with logistic regression with unstructured correlation and robust standard errors to take into account samples from the same subject and confounders.

    My concern with this analysis is that I’m using the same patients to compare non-infected (time points 1 to 5) and infected samples (time points 6 to 10) instead of comparing infected to non-infected patients across time points 1 to 10, but unfortunately I don’t have that kind of data. Consequently, time points between patients do not overlap. Is my statistical approach correct for my question and if so, is non-overlapping time points a problem? If my analysis is not appropriate, do you have any recommendation?

    The second goal of this study is slightly different. I want to assess if the proportions of each bacteria are significantly different between healthy (samples from time points 1 to 5) and infected samples (samples from time points 6 to 10) across all patients. I also want to account for the same confounders (gender and age) and take into account that samples within patients are not independent. What would be the right statistical method to apply here?

    Thanks in advance for your help and advice. Let me know if you have any questions.

    Sincerely,

    Marcos

  9. Elainey says

    Hi Karen,
    I am doing a study on some data which tries to predict a % measurement with respect to time, injury, gender etc. My errors are not normally distributed. Hence I look at using a GEE. I have 3 questions:
    1) How do you tell whether the variable is significant when just given naïve/robust standard errors?
    2) How do you identify the marginal distribution?
    3) How do you compare which GEE models are better? You can’t use AIC, and I have read that we use QIC?
    Thanks

  10. Manon says

    Hi Karen,

    Thanks for all the great information on your website. However, I am still unsure how to analyse my data, which consists of cross-sectional measurements in subsequent years, each year among 2-, 3-, 5-, 10-, and 14-year-olds. As I have data over 4 years, some of the subjects are measured twice (e.g. a 2-year-old in 2009 was measured again as a 3-year-old in 2010, and a 3-year-old in 2009 was measured as a 5-year-old in 2011), but some are single measurements (e.g. 10- and 14-year-olds are all measured once). Could I use mixed models?
    Thanks!

  11. Tina Birgitte says

    Hi Karen.

    I’m in the midst of finishing a longitudinal research project on child and adolescent suicide risk assessment. There are two IVs (severity of suicidal ideation and intensity of suicidal ideation) and one DV (number of suicide attempts). In the sample there are roughly 100 patients. By now I have collected T1, but I still need T2 and T3. The problem is that the data points will be unevenly spaced for all individuals (non-similar time intervals). Is there a statistical approach which solves that problem?

  12. pat says

    HI!! 😉

    i will make it quick.
    A study of weight gain, where investigators randomly assigned 30 rats to three treatment groups: treatment 1 is control, treatment 2 is thiouracil and treatment 3 is thyroxin. The treatment is added to the rats’ drinking water. Weight is measured at baseline (week 0), and at weeks 1, 2, 3 and 4.
    The data are in “wide” format.
    The data look like:
    ID Treat week0 week1 week2 week3 week4

    The aim: to assess how the two additives affect the weight gain of the rats.
    Note: due to an accident that occurred during the experiment, the data are unbalanced.
    Is this repeated measures data, or longitudinal clustered data?

    Thank You

    • Karen says

      Hi Pat,

      It’s definitely repeated measures. It wouldn’t be incorrect to call it longitudinal, but it probably won’t have issues of dropout that most longitudinal studies do. As for clustering, you haven’t mentioned anything that indicates clustering. If the rats are grouped into cages or litters, or something like that, then there would be clustering.

  13. pat says

    Aim of analysis: to assess how the two additives affect the weight gain of the rats.
    Note: due to an accident that occurred during the experiment, the data are unbalanced.

  14. Hannah says

    I am new to repeated measures. I have found the more I read the more I get confused on how best to analyze my data. I will try and be short and sweet. Here is what I’ve got: a long-term observational study assessing the influence of aquaculture on sea duck relative abundances over time. My whole study area has been partitioned into 259 1-minute grid cells. Data was pooled per grid cell per year for 19 years for a total of 4,921 observations (however, not all grid cells provide 19 years of observation, so there are actually only 4,266). Recorded at each observation are: relative species abundances of 6 bird species groups, total acres of aquaculture, and that total split further into acres cultivated for each of 4 shellfish species. I want to address how aquaculture acreage is influencing bird relative abundances. Specifically, how does growth in aquaculture acreage influence abundances over time? Further, are there differences seen by bird species group and/or species cultured?

    Is there such a thing as a longitudinal study of repeated measures data?

    Any direction or advice would be greatly appreciated!

    • Karen says

      Hi Hannah,

      It sounds like this is definitely longitudinal and possibly spatial as well, depending on how close together the observational grid points are. Is the repeat you’re referring to the spatial repeat or the fact that you have 6 species?

      It also sounds like it might be count data, which would indicate a negative binomial model.

      This is both the beauty and the curse of these kinds of models–they can accommodate many designs, but get complicated really fast. I would honestly suggest talking with someone experienced in mixed models first to get a good idea of an appropriate analysis, then start reading as much as you can. This will be very tricky to figure out on your own just from reading. The devil is definitely in the details.

      • Hannah says

        Hi Karen,

        Thank you for getting back to me! Your last few sentences made me chuckle because unfortunately, I do not have any resources to discuss the matter with and have resorted to figuring it out on my own just from reading.
        My observation sites are adjoining shoreline polygons generated in GIS of roughly equal size (1000 km²). The repeat is referring to the fact that I am summarizing and analyzing observation counts of my species groups and aquaculture acreage within the same study site every year.

        I believe a key factor in my analysis will be to appropriately determine random effects so I can assess both spatial and temporal variability of sea duck populations in response to aquaculture.

        I have requested your webinar recording on fixed and random effects and I am waiting on the email link. Hopefully that should shed some more light on the subject.

        Again, thank you so much for your response and providing such helpful resources!

  15. Buddhi says

    Hi!
    Thank you for the reply.
    It is basically about methods of analyzing binary and categorical repeated measures data.
    I am also looking forward to comparing some of the available methods, such as PROC CATMOD and PROC GENMOD in SAS.

  16. Buddhi says

    Hi !
    After reading your article, some of my doubts about repeated measures data and longitudinal data were resolved.
    In my opinion, theories for normally distributed repeated measures are well developed, compared to the case for binary and categorical repeated measures data.
    I am really looking forward to carrying out a study in this area. I would be much obliged if you could provide me any advice on essential reading materials that I must follow and the best way to do the study.
    Thank you.

    • Karen says

      Hi Buddhi,

      Thanks. Glad it helped. You are correct about normal data having more well-developed theory.

      What kind of study are you doing? If you tell me more I can tell you better what to read.

  17. Yann says

    Hi Karen,
    I studied the concentration of a blood biomarker induced by the ingestion of a xenobiotic in rats. I used 2 doses of the said xenobiotic (1 group of 18 rats per dose) and I sacrificed 3 rats of each group at days 0, 3, 7, 14, 21, and 28 in order to collect 3 mL of blood (I could not perform blood sampling without killing rats due to the volume required for performing the measurement).
    In this case I guess I am NOT in the case of true longitudinal data since triplicate measurements of the biomarker concentration in blood were performed on 3 different rats (one measurement per sacrificed rat) instead of on one single rat.
    For the same reason, I suppose I am also NOT in the case of a true repeated measure (like doing blood sampling at different dates on humans instead of killing them!).
    And my data are also NOT clustered data.
    So what should I use?
    Do you think it is ok if I use a repeated measure ANOVA from GLM with Bonferroni post-hoc multiple-comparisons test to compare my triplicates means at each Day of my kinetic study for a given dose (this would come down to consider that my triplicates were made on one single rat at each date. But do I have the right to do this since it is not the case) ?
    How should I treat my dataset rigorously?
    Thanks for your support and sorry for this probably naïve question (and my approximate English),
    Yann.

    • Karen says

      Hi Yann,

      Exactly. This design, as you describe it, is a great example of one where even though Time (aka Day) is an independent variable, it is NOT repeated measures, longitudinal, or clustered.

      Another way to say it is Day is a between-subjects factor, not within-subjects.

      Just run it as a regular GLM, with Day as an independent variable, assuming all distributional assumptions are met.

      You don’t mention what DV is, so I’m assuming it’s continuous and your residuals are normally distributed, and I’m also assuming there isn’t some other form of clustering, like rats being grouped into litters.

      Karen

  18. HdS says

    Hello,
    thank you for this interesting article.
    Do you have some information about analyzing longitudinal data with SEM? I found no good advice for this kind of research.

    In political science we have some other research designs like time-series and time-series-cross-sectional analysis. They seem to be just other words for longitudinal and longitudinal-clustered, if I’m not mistaken.

    • Karen says

      Hi Christian,

      I know that Singer & Willett’s “Applied Longitudinal Data Analysis” has a chapter on it. Chapter 8. It doesn’t discuss software specifically, but is a great explanation. Here’s the book’s companion site: http://gseacademic.harvard.edu/alda/

      I have also seen references to multilevel modeling in general with MPlus, which is one SEM package. I haven’t used it myself, but I once went to a workshop by the creator, Bengt Muthen, and he’s amazing. A lot of energy and brilliant. I believe he gives these workshops regularly and I believe there is one on multilevel models. Here’s the web site: http://www.statmodel.com/index.shtml

      Karen

    • Karen says

      Hi Christian,

      I’ve always thought of time series as longitudinal on steroids. You can certainly think of a study with 4-5 time points as longitudinal, but time series seems to imply many, many more time points. But honestly, I don’t know where the transition would be.

      Another term I’ve seen for longitudinal-clustered is panel data.

      Definitely another case of each field making up their own name for the same concepts.
      Karen

