9 Analysis Commons

9.1 Outline

Introduction to web-interface
Running a single variant analysis
Workflows and monitoring jobs
Running aggregate tests (SKAT)
Run batch jobs from the command line
Writing your own Apps

9.2 Web Interface and Running an Analysis Application

9.2.1 Exercise 1) Run a single variant analysis.

Log into http://dnanexus.com using the Analysis Commons user name and password listed on the website.
Should be in the form of Username:topmed_# and Password:TOPMed_#. Ignore warning about default billing account.

9.2.1.1 Part 1: Run null model

Navigate to and select (wgs:tools/genesis_nullmodel)

File inputs:
* phenofile -> phenotype/1KG_pheno.csv
* kinship -> kinship/1KG_kins.Rda

Parameter inputs:
* output folder: /output/YOURFOLDERNAME
* outcome (Column name of the outcome variable): outcome
* covariates (case sepecific): Population,sex
* prefix for output filename: nullmodel_outcome
* pheno_id: sample.id
* Note: Other options can be left as their defaults * Note: The job may finish instantaneously if you don’t change the output file name. It knows that you are running the exact same job and will just reuse results from previous analyses.

9.2.1.2 Part 2: Run association tests

Navigate to and select (wgs:tools/genesis_tests)

Parameter inputs:
* output folder: output/YOURFOLDERNAME
* prefix for output filename: chr1_single
* test_type: Single
* Note: Other options can be left as their defaults

9.2.2 Exercise 2) Run SKAT test grouping variants into gene transcript regions and limit the variants to those with a CADD phred score > 2 and MAF <= 5%.

File inputs:
* null_model -> /output/YOURFOLDERNAME/nullmodel_outcome.Rda (output from part1. If yours has not completed, you can select output/DEMO/nullmodel_outcome.Rda) * genotypefile -> genotypes/1KG_phase3_subset_chr1.gds
* annotation -> annotation/1KG_annotation_CHR1.txt
* genefile -> aggregation/AggUnit_CHR1_ucscgene.csv

Parameter inputs:
* output folder: output/YOURFOLDERNAME
* outputfilename: skat_chr1_geneBased_CADDgt2
* test_type: SKAT
* snp_filter: CADD_phred>2
* min_mac:0
* top_maf: 0.05
* weights: c(1,25)

9.3 Command line interface

References:
* Command Line Interface Quickstart
* Index of dx commands

9.3.1 Log in to AWI

Replace topmed_## with the user ID from your handout

$ ssh topmed_##@34.208.147.133
You will be prompted for your password, e.g. TOPMed_## (Note capitalization)
_Please ignore login warnings

$ source /usr/local/dx-toolkit/environment

$ dx login  --timeout 2h
    Enter the following at the prompts
        username: topmed_##
        password: TOPMed_##
        project:wgs ( type 0 to select wgs )

You can select or change project once you are logged in
$ dx select wgs

9.3.2 Exercise 3) Navigate directories, make output directory, examine files

File paths: <project>:/path/to/file.txt
Example: wgs:/phenotypes/1KG_pheno.csv

List directory contents:

$ dx select wgs
$ dx ls
$ dx ls /tools
$ dx ls wgs:/tools

Get results from project

$ dx download wgs:/phenotype/1KG_pheno.csv
$ ls
$ head 1KG_pheno.csv

9.3.3 Exercise 4) Run single variant analysis from command line using bash script

Open the single_multichrom.sh bash script and edit to replace the output directory “YOURNAME” to your folder

$ dx describe tools/genesis_tests

Either edit using nano

$ nano single_multichrom.sh 

Run the App.  Will loop over 2 chromosomes running the single variant analyses
$ ./single_multichrom.sh

9.4 Writing your own Apps

9.4.1 Exercise 5) Write an App that creates phenotype residuals and performs an inverse normal transform

Use app wizard to create template

$ dx-app-wizard

App Name: make_residuals
Title []: Create inverse normal transformed residuals

1st input name (<ENTER> to finish): phenofile
Label (optional human-readable name) []: CSV phenotype file
Choose a class (<TAB> twice for choices): file
This is an optional parameter [y/n]: n

2nd input name (<ENTER> to finish): model
Label (optional human-readable name) []: model for creating residuals (e.g. outcome~sex+Population )
Choose a class (<TAB> twice for choices): string
This is an optional parameter [y/n]: n

3rd input name (<ENTER> to finish): prefix
Label (optional human-readable name) []: Output filename prefix
Choose a class (<TAB> twice for choices): string
This is an optional parameter [y/n]: n

4th input name (<ENTER> to finish): <ENTER>

1st output name (<ENTER> to finish): output
Label (optional human-readable name) []: 
Choose a class (<TAB> twice for choices): file

Timeout policy [48h]: 1h
Programming language: bash

*Use defaults for other options*

Look at the files created by the wizard

cd make_residuals/
ls
more dxapp.json

Edit App executable to run an R script

$ vi src/make_residuals.sh
main() {

    echo "Value of phenofile: '$phenofile'"
    echo "Value of model: '$model'"
    echo "Value of prefix: '$prefix'"

    dx download "$phenofile" -o phenofile

    ## ADD THIS LINE ##
    Rscript /make_resid.R $model

    output=$(dx upload output --brief)

    ## ADD THIS LINE ##
    dx mv ${output} ${prefix}.csv
    
    dx-jobutil-add-output output "$output" --class=file
    
}

Create an R script that does the ‘work’ $ vi resources/make_resid.R

args<-commandArgs(TRUE)
model <- as.formula(args[1])
print(model)
pheno = read.csv("phenofile",as.is=T)
pheno$resid = residuals(lm(model,data=pheno))
pheno$invnt_resid =  with(pheno,qnorm((rank(resid,na.last="keep")-0.5)/sum(!is.na(resid))))

write.csv(pheno,file="output",row.names=F)

Build the App

$ dx build -f make_residuals --destination=output/YOURNAME/make_residuals

Run the App

$ dx run output/YOURNAME/make_residuals -iphenofile=phenotype/1KG_pheno.csv \
-imodel=outcome~sex+Population -iprefix=1kg_pheno_invnt \
--destination=output/YOURNAME --yes

Monitor Progress

$ dx watch jobid

9.4.2 Optional Exercise 6) Make QQ plot

Make QQ plot of your single variant results.
Select results from the multiple chromosome run (chr21 and chr22).

You will need to identify the p-value column name. To view the results file try these options:

dx download to download the results for viewing.
View file through web interface using Visualize ( next to Monitor near top of the page ) and select Gzipped File Previewer
Pipe zipped file though regular linux commands dx cat to view column names

$ dx cat output/folder/file | gunzip | head

Once you know the name of the p-value column, run qqplot first through web interface and then try running interactivly from the web interface then from the command line.

$ dx run tools/qqplot

Note: the plot label must not contain spaces.

9.4.3 Optional Exercise 8) Create a regional association plot using LD extracted from your data set

This process requires two steps, one to extract the LD for all variants in the region and one to create the plot. Sequencing data sets often contain variants not in external refernce panels, so it is helpful to create your own LD reference.

Step 1: Run GILD (GDS Into LD) App (tools/gild_v1)

File inputs: * gds_file -> genotypes/1KG_phase3_subset_chr22.gds

Parameter inputs:

lead_snp -> 22:17105517
start_pos -> 1
stop_pos -> 51237069
label for results file -> “LD_chr22” output_LD_filename
output/YOURNAME

Note: this can take 10-15 mins to complete

Step 2: Run AssocPlot (tools/assocplot)

File inputs:

datafile -> single variant association results output for chr22
ldfile -> Output file from Step 1 with .ld suffix

Parameter inputs (Minimum required to have the App run successfully with GENESIS output):

Output folder -> output/YOURNAME
Marker Column Name -> snpID
P value Column Name -> Score.pval
Index SNP -> 22:17105517