12 Aggregate tests
12.1 Aggregate unit for association testing exercise
Now you can proceed to an assocation testing exercise. You will be using a slightly different gene-based aggregation unit for the assocation testing exercise. In this exercise, the genic units include SNP variants from all chromosomes (no indels, and not just chromosome 22 as before), each genic unit is expanded to include the set of SNPs falling within a GENCODE-defined gene along with 20 kb flanking regions upstream and downstream of that range, and the positions are in genome build hg19 (so that the annotation positions are consistent with the build used for genotyping data in the workshop). This set of aggregation units is not filtered by CADD score or consequence.
As before, the aggregation units are defined in an R dataframe. Each row of the dataframe specifies a variant (chr, pos, ref, alt) and the group identifier (group_id) it is a part of. Mutiple rows with different group identifiers can be specified to assign a variant to different groups (a variant can be assigned to mutiple genes).
Begin by loading the aggregation units using TopmedPipeline::getobj()
:
library(dplyr)
repo_path <- "https://github.com/UW-GAC/SISG_2019/raw/master"
if (!dir.exists("data")) dir.create("data")
aggfile <- "data/variants_by_gene.RData"
if (!file.exists(aggfile)) download.file(file.path(repo_path, aggfile), aggfile)
aggunit <- TopmedPipeline::getobj(aggfile)
names(aggunit)
## [1] "group_id" "chr" "pos" "ref" "alt"
head(aggunit)
## # A tibble: 6 x 5
## group_id chr pos ref alt
## <chr> <fct> <int> <chr> <chr>
## 1 ENSG00000131591.13 1 1025045 C T
## 2 ENSG00000169962.4 1 1265550 C T
## 3 ENSG00000205090.4 1 1472676 T C
## 4 ENSG00000171603.12 1 9788518 G A
## 5 ENSG00000204624.6 1 11593461 C T
## 6 ENSG00000270914.1 1 12068870 G A
# an example of variant that is present in mutiple groups
mult <- aggunit %>%
group_by(chr, pos) %>%
summarise(n=n()) %>%
filter(n > 1)
inner_join(aggunit, mult[2,1:2])
## # A tibble: 2 x 5
## group_id chr pos ref alt
## <chr> <fct> <int> <chr> <chr>
## 1 ENSG00000187952.8 1 21742183 G A
## 2 ENSG00000227001.2 1 21742183 G A
12.2 Association testing with aggregate units
We can run a burden test or SKAT on each of these units using assocTestAggregate
. We define a SeqVarListIterator
object where each list element is an aggregate unit. The constructor expects a GRangesList
, so we use the TopmedPipeline function aggregateGRangesList
to quickly convert our single dataframe to the required format. This function can account for multiallelic variants (the same chromosome, position, and ref, but different alt alleles).
library(TopmedPipeline)
library(SeqVarTools)
library(GENESIS)
gdsfile <- "data/1KG_phase3_subset_chr1.gds"
if (!file.exists(gdsfile)) download.file(file.path(repo_path, gdsfile), gdsfile)
gdsfmt::showfile.gds(closeall=TRUE) # make sure file is not already open
gds <- seqOpen(gdsfile)
annotfile <- "data/sample_phenotype_pcs.RData"
if (!file.exists(annotfile)) download.file(file.path(repo_path, annotfile), aggfile)
annot <- getobj(annotfile)
seqData <- SeqVarData(gds, sampleData=annot)
# subset to chromosome 1
aggunit1 <- filter(aggunit, chr == 1)
aggVarList <- aggregateGRangesList(aggunit1)
length(aggVarList)
## [1] 127
head(names(aggVarList))
## [1] "ENSG00000131591.13" "ENSG00000169962.4" "ENSG00000205090.4"
## [4] "ENSG00000171603.12" "ENSG00000204624.6" "ENSG00000270914.1"
aggVarList[[1]]
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | ref alt
## <Rle> <IRanges> <Rle> | <character> <character>
## [1] 1 1025045 * | C T
## -------
## seqinfo: 23 sequences from an unspecified genome; no seqlengths
iterator <- SeqVarListIterator(seqData, variantRanges=aggVarList, verbose=FALSE)
As in the previous section, we must load the null model before running the association test.
if (!exists("nullmod")) {
nmfile <- "data/null_mixed_model.RData"
if (!file.exists(nmfile)) download.file(file.path(repo_path, nmfile), nmfile)
nullmod <- getobj(nmfile)
}
assoc <- assocTestAggregate(iterator, nullmod, test="Burden", AF.max=0.1, weight.beta=c(1,1))
## # of selected samples: 100
names(assoc)
## [1] "results" "variantInfo"
head(assoc$results)
## n.site n.alt n.sample.alt Score Score.SE
## ENSG00000131591.13 0 0 0 NA NA
## ENSG00000169962.4 0 0 0 NA NA
## ENSG00000205090.4 1 1 1 -0.08038064 0.08682388
## ENSG00000171603.12 0 0 0 NA NA
## ENSG00000204624.6 0 0 0 NA NA
## ENSG00000270914.1 1 1 1 -0.05287495 0.08051531
## Score.Stat Score.pval
## ENSG00000131591.13 NA NA
## ENSG00000169962.4 NA NA
## ENSG00000205090.4 -0.9257895 0.3545554
## ENSG00000171603.12 NA NA
## ENSG00000204624.6 NA NA
## ENSG00000270914.1 -0.6567068 0.5113695
head(names(assoc$variantInfo))
## [1] "ENSG00000131591.13" "ENSG00000169962.4" "ENSG00000205090.4"
## [4] "ENSG00000171603.12" "ENSG00000204624.6" "ENSG00000270914.1"
assoc$variantInfo[[3]]
## variant.id chr pos ref alt allele.index n.obs freq weight
## 1 5 1 1472676 T C 1 100 0.005 1
qqPlot(assoc$results$Score.pval)
12.3 Exercise
Since we are working with a subset of the data, many of the genes listed in group_id
have a very small number of variants. Create a new set of units based on position rather than gene name, using the TopmedPipeline function aggregateGRanges
. Then run SKAT using those units and a SeqVarRangeIterator
.