12 Aggregate tests

12.1 Aggregate unit for association testing exercise

Now you can proceed to an association testing exercise, which uses a slightly different gene-based aggregation unit. In this exercise, the genic units include SNP variants from all chromosomes (no indels, and not just chromosome 22 as before). Each genic unit is expanded to include the set of SNPs falling within a GENCODE-defined gene along with 20 kb flanking regions upstream and downstream of that range. The positions are in genome build hg19, so that the annotation positions are consistent with the build used for the genotyping data in the workshop. This set of aggregation units is not filtered by CADD score or consequence.

As before, the aggregation units are defined in an R data frame. Each row of the data frame specifies a variant (chr, pos, ref, alt) and the group identifier (group_id) it belongs to. Multiple rows with different group identifiers can be specified to assign a variant to different groups (a variant can be assigned to multiple genes).
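To make the format concrete, here is a minimal base R sketch of a data frame with the same five columns. The gene names and variants are made up for illustration and are not part of the workshop data; the variant at position 1500 is deliberately listed under two groups.

```r
# Toy aggregation units (hypothetical gene IDs and variants, for illustration only)
toy_units <- data.frame(
    group_id = c("geneA", "geneA", "geneB", "geneB"),
    chr      = factor(c("1", "1", "1", "1")),
    pos      = c(1000L, 1500L, 1500L, 2200L),
    ref      = c("C", "G", "G", "T"),
    alt      = c("T", "A", "A", "C"),
    stringsAsFactors = FALSE
)

# Number of variants assigned to each unit
table(toy_units$group_id)

# Variants that appear under more than one group_id
key <- paste(toy_units$chr, toy_units$pos, toy_units$ref, toy_units$alt, sep = ":")
shared <- toy_units[key %in% key[duplicated(key)], ]
shared
```

Each variant contributes one row per group, so a variant shared by two genes appears twice with different group_id values.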

Begin by loading the aggregation units using TopmedPipeline::getobj():

library(dplyr)
repo_path <- "https://github.com/UW-GAC/SISG_2019/raw/master"
if (!dir.exists("data")) dir.create("data")
aggfile <- "data/variants_by_gene.RData"
if (!file.exists(aggfile)) download.file(file.path(repo_path, aggfile), aggfile)
aggunit <- TopmedPipeline::getobj(aggfile)
names(aggunit)

## [1] "group_id" "chr"      "pos"      "ref"      "alt"

head(aggunit)

## # A tibble: 6 x 5
##   group_id           chr        pos ref   alt  
##   <chr>              <fct>    <int> <chr> <chr>
## 1 ENSG00000131591.13 1      1025045 C     T    
## 2 ENSG00000169962.4  1      1265550 C     T    
## 3 ENSG00000205090.4  1      1472676 T     C    
## 4 ENSG00000171603.12 1      9788518 G     A    
## 5 ENSG00000204624.6  1     11593461 C     T    
## 6 ENSG00000270914.1  1     12068870 G     A

# an example of a variant that is present in multiple groups
mult <- aggunit %>%
    group_by(chr, pos) %>%
    summarise(n=n()) %>%
    filter(n > 1)
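# join on chr and pos explicitly rather than relying on the default column matching
inner_join(aggunit, mult[2, c("chr", "pos")], by = c("chr", "pos"))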

## # A tibble: 2 x 5
##   group_id          chr        pos ref   alt  
##   <chr>             <fct>    <int> <chr> <chr>
## 1 ENSG00000187952.8 1     21742183 G     A    
## 2 ENSG00000227001.2 1     21742183 G     A

12.2 Association testing with aggregate units

We can run a burden test or SKAT on each of these units using assocTestAggregate. We define a SeqVarListIterator object where each list element is an aggregate unit. The constructor expects a GRangesList, so we use the TopmedPipeline function aggregateGRangesList to quickly convert our single dataframe to the required format. This function can account for multiallelic variants (the same chromosome, position, and ref, but different alt alleles).
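As a rough illustration of the target format, the sketch below builds a GRangesList from a toy data frame using GenomicRanges directly. It is not the implementation of aggregateGRangesList, which may differ in detail; the gene names and variants are hypothetical, and the sketch assumes the Bioconductor package GenomicRanges is installed.

```r
library(GenomicRanges)

# Toy aggregation units (hypothetical gene IDs and variants, for illustration only)
toy_units <- data.frame(
    group_id = c("geneA", "geneA", "geneB", "geneB"),
    chr      = c("1", "1", "1", "1"),
    pos      = c(1000L, 1500L, 1500L, 2200L),
    ref      = c("C", "G", "G", "T"),
    alt      = c("T", "A", "A", "C"),
    stringsAsFactors = FALSE
)

# One width-1 range per variant, carrying ref and alt as metadata columns
gr <- GRanges(
    seqnames = toy_units$chr,
    ranges   = IRanges(start = toy_units$pos, width = 1),
    ref      = toy_units$ref,
    alt      = toy_units$alt
)

# Splitting by group gives one list element per aggregation unit
grl <- split(gr, toy_units$group_id)
grl[["geneB"]]
```

Keeping ref and alt alongside each range is what allows alleles at the same position to be told apart.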

library(TopmedPipeline)
library(SeqVarTools)
library(GENESIS)
gdsfile <- "data/1KG_phase3_subset_chr1.gds"
if (!file.exists(gdsfile)) download.file(file.path(repo_path, gdsfile), gdsfile)
gdsfmt::showfile.gds(closeall=TRUE) # make sure file is not already open
gds <- seqOpen(gdsfile)
annotfile <- "data/sample_phenotype_pcs.RData"
if (!file.exists(annotfile)) download.file(file.path(repo_path, annotfile), annotfile)
annot <- getobj(annotfile)
seqData <- SeqVarData(gds, sampleData=annot)
    
# subset to chromosome 1
aggunit1 <- filter(aggunit, chr == 1)
aggVarList <- aggregateGRangesList(aggunit1)
length(aggVarList)

## [1] 127

head(names(aggVarList))

## [1] "ENSG00000131591.13" "ENSG00000169962.4"  "ENSG00000205090.4" 
## [4] "ENSG00000171603.12" "ENSG00000204624.6"  "ENSG00000270914.1"

aggVarList[[1]]

## GRanges object with 1 range and 2 metadata columns:
##       seqnames    ranges strand |         ref         alt
##          <Rle> <IRanges>  <Rle> | <character> <character>
##   [1]        1   1025045      * |           C           T
##   -------
##   seqinfo: 23 sequences from an unspecified genome; no seqlengths

iterator <- SeqVarListIterator(seqData, variantRanges=aggVarList, verbose=FALSE)

As in the previous section, we must load the null model before running the association test.

if (!exists("nullmod")) {
    nmfile <- "data/null_mixed_model.RData"
    if (!file.exists(nmfile)) download.file(file.path(repo_path, nmfile), nmfile)
    nullmod <- getobj(nmfile)
}

assoc <- assocTestAggregate(iterator, nullmod, test="Burden", AF.max=0.1, weight.beta=c(1,1))

## # of selected samples: 100

names(assoc)

## [1] "results"     "variantInfo"

head(assoc$results)

##                    n.site n.alt n.sample.alt       Score   Score.SE
## ENSG00000131591.13      0     0            0          NA         NA
## ENSG00000169962.4       0     0            0          NA         NA
## ENSG00000205090.4       1     1            1 -0.08038064 0.08682388
## ENSG00000171603.12      0     0            0          NA         NA
## ENSG00000204624.6       0     0            0          NA         NA
## ENSG00000270914.1       1     1            1 -0.05287495 0.08051531
##                    Score.Stat Score.pval
## ENSG00000131591.13         NA         NA
## ENSG00000169962.4          NA         NA
## ENSG00000205090.4  -0.9257895  0.3545554
## ENSG00000171603.12         NA         NA
## ENSG00000204624.6          NA         NA
## ENSG00000270914.1  -0.6567068  0.5113695

head(names(assoc$variantInfo))

## [1] "ENSG00000131591.13" "ENSG00000169962.4"  "ENSG00000205090.4" 
## [4] "ENSG00000171603.12" "ENSG00000204624.6"  "ENSG00000270914.1"

assoc$variantInfo[[3]]

##   variant.id chr     pos ref alt allele.index n.obs  freq weight
## 1          5   1 1472676   T   C            1   100 0.005      1
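The freq and weight columns above relate to the AF.max and weight.beta arguments. The following base R sketch shows, conceptually, how a burden score could be formed from a genotype matrix: filter variants by allele frequency, derive weights from a beta density, and sum the weighted alternate allele counts per sample. It is a simplified illustration with made-up genotypes, not the GENESIS implementation; in particular, applying the beta density to the minor allele frequency and treating AF.max as an inclusive upper bound are assumptions made here.

```r
# Hypothetical genotypes: 10 samples (rows) by 3 variants (columns), alt allele counts
geno <- cbind(
    v1 = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0),
    v2 = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0),
    v3 = c(1, 1, 2, 1, 0, 1, 0, 0, 0, 0)
)

# Alternate allele frequency per variant (two alleles per sample)
freq <- colMeans(geno) / 2

# Keep only variants at or below the frequency threshold (assumed inclusive)
AF.max <- 0.1
keep <- freq <= AF.max

# Beta-density weights on the minor allele frequency (assumed convention)
maf <- pmin(freq, 1 - freq)
w_flat <- dbeta(maf[keep], 1, 1)   # flat weights, as with weight.beta=c(1,1)
w_wu   <- dbeta(maf[keep], 1, 25)  # upweights rarer variants

# Burden score per sample: weighted sum of alt allele counts over kept variants
burden <- as.vector(geno[, keep, drop = FALSE] %*% w_flat)
burden
```

With beta parameters (1, 1) the density is flat, so every retained variant gets weight 1, which matches the weight column shown in variantInfo.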

qqPlot(assoc$results$Score.pval)

12.3 Exercise

Since we are working with a subset of the data, many of the genes listed in group_id have a very small number of variants. Create a new set of units based on position rather than gene name, using the TopmedPipeline function aggregateGRanges. Then run SKAT using those units and a SeqVarRangeIterator.
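As a starting hint, position-based units can be described by a data frame of windows. The base R sketch below tiles 1 Mb windows across the six chromosome 1 positions shown in the head(aggunit) output earlier and counts the variants in each. The window size is an arbitrary choice, and the column names (group_id, chr, start, end) are an assumption about what aggregateGRanges expects, so check the function's documentation before passing a data frame like this to it.

```r
# Chromosome 1 positions taken from the head(aggunit) output shown earlier
pos <- c(1025045L, 1265550L, 1472676L, 9788518L, 11593461L, 12068870L)

# Assumed window width in base pairs (an arbitrary choice for this sketch)
window_size <- 1e6

# Window starts, aligned to multiples of the window size
starts <- seq(from = floor(min(pos) / window_size) * window_size + 1,
              to   = max(pos),
              by   = window_size)

# One row per window; column names are assumed, see the lead-in
windows <- data.frame(
    group_id = paste0("window", seq_along(starts)),
    chr      = "1",
    start    = starts,
    end      = starts + window_size - 1,
    stringsAsFactors = FALSE
)

# Number of variants falling in each window
n_variants <- vapply(
    seq_len(nrow(windows)),
    function(i) sum(pos >= windows$start[i] & pos <= windows$end[i]),
    integer(1)
)

# Windows that contain at least one variant
cbind(windows, n_variants)[n_variants > 0, ]
```

Windows with no variants contribute nothing to an aggregate test, so it can be useful to inspect the counts before choosing a window size.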