5 Phenotype Harmonization

To increase your sample set, you may need to combine phenotype data from different studies in order to run a cross-study analysis. The studies involved may have collected data in different ways, used different protocols or measurement units, or used different cutpoints to determine case status. The process of manipulating the phenotype data from different studies so that they can be analyzed together is called “phenotype harmonization”.

In this exercise, we assume that you have created a phenotype harmonization plan for height, sent it to members from three studies to perform the harmonization, and received a harmonized phenotype file from each study. We will generate some diagnostic information about the harmonized phenotype.

The exercise uses 1000 Genomes data, with simulated phenotypes for study, age, and height. The example phenotype files shown here are very simplified compared to how actual studies store and organize their their data.

In this exercise, we will be using dplyr for a lot of the data manipulation, so load it now.

library(dplyr)

5.1 Inspect individual study data in R

The first step is to read the files into R for processing. Before we begin, you need to download the data from github so you have access to it.

repo_path <- "https://github.com/UW-GAC/topmed_workshop_2018/raw/master"
pheno_files <- c("data/pheno_data_study_1.txt", "data/pheno_data_study_2.txt", "data/pheno_data_study_3.txt")
if (!dir.exists("data")) dir.create("data")
for (pheno_file in pheno_files) {
  if (!file.exists(pheno_file)) download.file(file.path(repo_path, pheno_file), pheno_file)
}

Next, read the study phenotype files into R. In this case, each file is tab-delimited.

study_1 <- read.table("data/pheno_data_study_1.txt", header = TRUE, sep = "\t", as.is = TRUE)
head(study_1)
##   subject_id sex age height
## 1    HG00096   M  47  165.3
## 2    HG00102   F  49  169.1
## 3    HG00112   M  46  167.9
## 4    HG00114   M  49  169.5
## 5    HG00115   M  35  161.1
## 6    HG00116   M  37  182.2
study_2 <- read.table("data/pheno_data_study_2.txt", header = TRUE, sep = "\t", as.is = TRUE)
head(study_2)
##   subject_id Sex Age Height
## 1    HG00099   F  40  185.5
## 2    HG00103   M  50  190.8
## 3    HG00106   F  51  165.5
## 4    HG00107   M  39  195.8
## 5    HG00109   M  48  181.5
## 6    HG00111   F  42  194.9
study_3 <- read.table("data/pheno_data_study_3.txt", header = TRUE, sep = "\t", as.is = TRUE)
head(study_3)
##   subject_id sex age height
## 1    HG00097   F  47   57.0
## 2    HG00100   F  45   59.3
## 3    HG00101   M  40   70.0
## 4    HG00105   M  34   62.4
## 5    HG00108   M  47   66.3
## 6    HG00110   F  44   62.7

Look carefully at the output and see if anything looks suspicious.

You may have noticed that one of the studies has given their variables slightly different names than the others. Rename them as appropriate.

names(study_2)
## [1] "subject_id" "Sex"        "Age"        "Height"
study_2 <- study_2 %>%
  rename(sex = Sex, age = Age, height = Height)
# Check that they are correct.
names(study_2)
## [1] "subject_id" "sex"        "age"        "height"

You’ll also want to calculate summaries of the data values to see if anything looks very different than what you expect.

summary(study_1$height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   139.6   163.8   169.9   170.2   176.7   200.3
summary(study_2$height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   142.1   171.8   181.3   180.8   190.5   218.6
summary(study_3$height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   47.00   59.60   63.40   63.42   67.10   79.60

Here, the values that study_3 has given you don’t seem to have the same range as those from study_1 and study_2. In cases like this, you’ll want to follow up with whoever provided the harmonized data to see what’s going on. It could represent an error in calculating the harmonized data values, a true property of the study (e.g., a study containing all children), or something else. In this case, the values were measured in inches instead of centimeters, so they will need to be converted to centimeters to be compatible with the other studies.

study_3 <- study_3 %>%
  mutate(height = height * 2.54)

Calculate the summary again and compare it to the other studies above.

summary(study_3$height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   119.4   151.4   161.0   161.1   170.4   202.2

The corrected values look much more similar now.

Note that this sort of error is easy to correct, but it is not uncommon to have more subtle issues that need to be addressed when working with phenotype data. Knowledge of the study design as well as the phenotype area of interest is essential to address them properly. Additionally, different decisions may need to be made for different analyses based on the specific questions they are trying to answer.

5.2 Compare study values

Next we will make some more direct comparisons between the three studies, so we will combine the data into one data frame.

First, add a study identifier to the data frame for organizational purposes.

study_1$study <- "study_1"
study_2$study <- "study_2"
study_3$study <- "study_3"

Combine the three different study data frames into one large data frame for joint analysis. Double check that all column names are the same.

all.equal(names(study_1), names(study_2))
## [1] TRUE
all.equal(names(study_1), names(study_3))
## [1] TRUE
phen <- dplyr::bind_rows(study_1, study_2, study_3)

We can look at the distribution of phenotype data with text-based reports or with plots.

First, inspect distributions with table for categorical traits and with summary for quantitative traits. The commads are shown here for study_1, but you should run them for study_2 and study_3 as well to see if you can see any differences.

table(study_1$sex)
## 
##   F   M 
## 190 185
summary(study_1$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   32.00   41.00   45.00   45.17   49.00   62.00
summary(study_1$height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   139.6   163.8   169.9   170.2   176.7   200.3

It is also helpful to use plots to inspect the distributions of phenotype data. Here, we will look at boxplots of height by study.

library(ggplot2)
ggplot(phen, aes(x = study, y = height)) + geom_boxplot()

You may also want to see the difference in height when you include both study and sex:

ggplot(phen, aes(x = study, fill = sex, y = height)) + geom_boxplot()

These diagnostics are helpful to get a feel for the data. They can help you see if one study is vastly different from the others or detect outlier values that you may want to look into further. Some of the differences could also be accounted for by covariates.

5.3 Using mixed models to compare studies

The quick diagnostics in the previous section let you see if the data from one study are completely different from the others, but such differences could be due to other factors that could be adjusted for in analysis. To account for these other factors, we need to fit a statistical model to the data. Because some of the studies in TOPMed have related individuals, we need to fit a mixed model to account for the correlation in the data. In this case, because the phenotype is quantitative, we will use a linear mixed model. More information about mixed models will be given during presentations tomorrow.

We use the GENESIS R package for fitting the mixed model. This package can accept a correlation matrix as a random effect in the mixed model, instead of requiring a categorical or indicator variable. It therefore can account for the observed genetic relatedness between subjects. It is also the same package that we use for the association analyses, so this exercise provides a brief introduction to the package and some of the associated data structures.

5.3.1 Create an Annotated Data Frame

The first step in fitting the mixed model is to create an Annotated Data Frame. This data structure is provided by the Bioconductor Biobase package, and it contains both the data and metadata.

Next, create the Annotated Data Frame. You should include a description of each variable in the metadata.

library(Biobase)

metadata <- data.frame(labelDescription = c(
  "subject identifier",
  "subject's sex",
  "age at measurement of height",
  "subject's height in cm",
  "study identifier"
))

annot <- AnnotatedDataFrame(phen, metadata)

# access the data with the pData() function
head(pData(annot))
##   subject_id sex age height   study
## 1    HG00096   M  47  165.3 study_1
## 2    HG00102   F  49  169.1 study_1
## 3    HG00112   M  46  167.9 study_1
## 4    HG00114   M  49  169.5 study_1
## 5    HG00115   M  35  161.1 study_1
## 6    HG00116   M  37  182.2 study_1
# access the metadata with the varMetadata() function
varMetadata(annot)
##                        labelDescription
## subject_id           subject identifier
## sex                       subject's sex
## age        age at measurement of height
## height           subject's height in cm
## study                  study identifier

Save the AnnotatedDataFrame for future use.

save(annot, file = "data/phenotype_annotation.RData")

5.3.2 Obtain the genetic relatedness matrix

Becase it is an input to the mixed model, we next need to download the genetic relatedness matrix calculated for these subjects.

data_path <- "https://github.com/UW-GAC/analysis_pipeline/raw/master/testdata"
grmfile <- "data/grm.RData"
if (!file.exists(grmfile)) download.file(file.path(data_path, basename(grmfile)), grmfile)
grm <- TopmedPipeline::getobj(grmfile)
rownames(grm$grm) <- colnames(grm$grm) <- grm$sample.id

The GENESIS code to fit the mixed model also requires a sample.id column. Typically the sample.id column represents a sample identifier, not a subject id. In this case, we are only working with subject-level data, so we can use the subject identifier as the sample identifier for model-fitting purposes.

annot$sample.id <- annot$subject_id

It also requires that the AnnotatedDataFrame sample.id is in the same order as the samples in the genetic relatedness matrix.

# put the phenotype data in the same order as the GRM
annot <- annot[match(grm$sample.id, annot$sample.id), ]

5.3.3 Fit a mixed model without study

We will first fit a mixed model that allows us to see if the mean of the height phenotype is different by study after adjusting for other covariates. In this case, we will adjust for age and sex, but not for study, because we are interested in seeing differences in mean height by study. We will also include the genetic relatedness matrix as a random effect to account for relatedness between the participants.

outcome <- "height"
covars <- c("sex", "age")

covariance_matrices <- grm$grm

library(GENESIS)
mod_1 <- GENESIS::fitNullModel(annot, outcome = outcome, covars = covars,
                     cov.mat = covariance_matrices)
## [1]   107.6619072   107.6619072 -4584.3333768     0.9797637
## [1]   104.9674150   108.1835219 -4584.0640858     0.9879716
## [1]    54.204363   145.886059 -4580.563495     1.014608
## [1]    59.010099   144.334151 -4580.528313     1.000327
## [1]    59.10315   144.31520 -4580.52830     1.00000
## [1]    59.10171   144.31651 -4580.52830     1.00000
## [1]    59.10174   144.31649 -4580.52830     1.00000

The output of fitNullModel is a list with a number of named elements

names(mod_1)
##  [1] "family"            "hetResid"          "varComp"          
##  [4] "varCompCov"        "fixef"             "betaCov"          
##  [7] "fitted.values"     "resid.marginal"    "resid.conditional"
## [10] "logLik"            "logLikR"           "AIC"              
## [13] "workingY"          "outcome"           "model.matrix"     
## [16] "group.idx"         "cholSigmaInv"      "converged"        
## [19] "zeroFLAG"          "RSS"               "Ytilde"           
## [22] "resid"             "CX"                "CXCXI"            
## [25] "sample.id"

The elements that we will work with in this exercise are:

  • converged: an indicator of whether the model successfully converged
  • model.matrix: The matrix of subject-covariate values used to fit the model
  • fixef: The fitted fixed effects
  • betaCov: The covariance of the fitted fixed effects
  • resid.marginal: The (marginal) residuals from the model, which have been adjusted for the fixed effects but not for the covariance structure
  • varComp: The fitted variance component for each input covariance matrix

Make sure the model converged.

mod_1$converged
## [1] TRUE

Now, add the residuals to the phenotype data frame for plotting. We need to make sure that we are matching each residual value to the correct subject. In this case, we already ordered the AnnotatedDataFrame to match the genetic relatedness marix, but this may not always be the case (for example, if subjects are excluded due to missing phentoype data). To match the same subject’s values together, we use the row names of the model.matrix element of the output, which are in the same order as the residual matrix, and the subject_id column of the annotated data frame. We then match the row names (and therefore the residuals) to the subject identifier in the phenotype file using the base R function match.

j <- match(annot$subject_id, rownames(mod_1$model.matrix))
annot$residuals <- mod_1$resid.marginal[j]

Next, we want to check if the different studies have the same mean height after adjustment for other covariates (here, age and sex). We will first do this qualitatively by making a boxplot of the residuals by study.

ggplot(pData(annot), aes(x = study, y = residuals)) + geom_boxplot()

From the boxplot, it is clear that the different studies have different mean heights, even after adjustment for sex and age. At this point, you would need to determine if the differences are acceptable for use in a combined analysis.

5.3.4 Fit a model with study

Next, we can look at a model that adjusts for other covariates as well as study. This model allows us to run a statistical test on the fitted study means and to qualitatively check if the variances are the same after adjusting for mean effects. The outcome is the same, but we now add the study as a covariate. We also allow for group-specific residual variance by study using the group.var argument to fitNullModel.

# include the study in the covariates
covars <- c("age", "sex", "study")

mod_2 <- GENESIS::fitNullModel(annot, outcome = outcome, covars = covars,
                            group.var = "study", cov.mat = covariance_matrices)

The fixef element now includes effects for study:

mod_2$fixef
##                       Est         SE         Stat         pval
## (Intercept)  164.27654459 3.04753893 2905.7121853 0.000000e+00
## age            0.06044813 0.06642284    0.8281916 3.627960e-01
## sexM           6.43111542 0.65686823   95.8552913 1.235952e-22
## studystudy_2  10.59046078 0.78904348  180.1473364 4.500398e-41
## studystudy_3  -8.94782432 0.82087372  118.8179863 1.147942e-27

The mixed model also shows the differences in mean height by study.

Finally, we want to check if the height distributions from the different studies have the same variance. Start by looking at the variance components (varComp) element of the model.

mod_2$varComp
##       V_A V_study_1 V_study_3 V_study_2 
##  54.25789  42.80994 116.52075  98.09390

V_A represents the variance in height due to genetic relatedness. The other variance components (V_study_1, V_study_2, and V_study_3) represent the residual variance in each study. The fitted values of the variance components are different for the different studies, indicating that the distributions of height in the three studies have different variance even after accounting for the other covariates.

We can also show the same information by plotting the residuals by study. We first have to add the residuals from this model to the AnnotatedDataFrame, again making sure to match them correctly by subject.

j <- match(annot$subject_id, rownames(mod_2$model.matrix))
annot$residuals <- mod_2$resid.marginal[j]

Next make a boxplot of the residuals by study.

ggplot(pData(annot), aes(x = study, y = residuals)) +
  geom_boxplot()

Both methods of looking at the variance components indicate that study 1 has a smaller residual variance than the others.

5.4 Final considerations

We have determined that the different studies have both different mean and different variance by study for height. Before performing genotype-phenotype association tests with these data, you would need to think carefully about whether the phenotype is homogeneous enough to be analyzed together. In some cases, there may be a valid reason for different means or variances, for example:

  • different heights in different study populations, such as a study composed primarily of Asian participants vs. a study with primarily European participants or a study of all men vs. a study of all women;
  • possible secular trends in height, such as comparing the Framingham Original cohort from ~1950 to a cohort from the present day.

In other cases, there may be good reasons to exclude one or more studies, for example:

  • a systematic measurement error in one study
  • miscalculation or misinterpretation of the harmonization algorithm
  • study populations that are too different to be compared, such as trying to include a study composed primarily of children with one composed of adults in a height analysis

It may be necessary to look at other variables that you had not previously considered. Studies may have used different measurement equipment or calibrated their data differently. There might also be other batch effects due to lab procedures or assays that could result in differences in the variance or mean by study. The other variables that you may need to consider are highly dependent both on the phenotype being harmonized and on how a given study has been designed.

Unfortunately there is no single set of guidelines you can use to decide how to proceed with analysis of a phenotype. It is necessary to involve both domain experts and study experts to determine whether the phenotype is homogeneous enough to use in cross-study analysis.