14 Analysis Pipeline

The DCC’s analysis pipeline is hosted on GitHub: https://github.com/UW-GAC/analysis_pipeline

14.1 Running on a local cluster

To run a burden test on our local SGE cluster, we first create a config file named assoc_window_burden.config in the testdata directory:

out_prefix "test"
gds_file "testdata/1KG_phase3_subset_chr .gds"
phenotype_file "testdata/1KG_phase3_subset_annot.RData"
null_model_file "testdata/null_model.RData"
null_model_params "testdata/null_model.params"
variant_include_file "testdata/variant_include_chr .RData"
alt_freq_max "0.1"
test "burden"
test_type "score"
genome_build "hg19"
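Note the space in the gds_file and variant_include_file values: it appears to mark where the chromosome number goes, so that one config entry can describe a set of per-chromosome files. The following is a minimal Python sketch of that idea, not the pipeline's own code. The function names parse_config and insert_chromosome are hypothetical, and the substitution rule (replace the single space with the chromosome number) is an assumption based on the file names above.

```python
# Illustrative sketch only: parse_config and insert_chromosome are
# hypothetical helpers, not functions from the analysis pipeline.

CONFIG_TEXT = '''\
out_prefix "test"
gds_file "testdata/1KG_phase3_subset_chr .gds"
variant_include_file "testdata/variant_include_chr .RData"
alt_freq_max "0.1"
'''


def parse_config(text):
    """Parse lines of the form: key "value" into a dict of strings."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first run of whitespace only, so that a space
        # inside the quoted value is preserved.
        key, value = line.split(None, 1)
        params[key] = value.strip().strip('"')
    return params


def insert_chromosome(path, chromosome):
    """Replace the single space placeholder with a chromosome label.

    Assumption: exactly one space in the path marks the chromosome slot.
    """
    if path.count(" ") != 1:
        raise ValueError("expected exactly one space placeholder: %r" % path)
    return path.replace(" ", str(chromosome))


params = parse_config(CONFIG_TEXT)
# Expand the template for chromosomes 1-10
gds_files = [insert_chromosome(params["gds_file"], c) for c in range(1, 11)]
print(gds_files[0])
print(gds_files[-1])
```

Under these assumptions, the template expands to testdata/1KG_phase3_subset_chr1.gds through testdata/1KG_phase3_subset_chr10.gds for the ten chromosomes.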

We will use the Python script assoc.py to submit all jobs. First we look at the available options:

setenv PIPELINE /projects/topmed/working_code/analysis_pipeline_devel
$PIPELINE/assoc.py --help

Let’s run a sliding window test on chromosomes 1-10. We also specify the cluster type, although UW_Cluster is the default. The cluster file is a JSON file that overrides default values in the cluster configuration; in this case, it lowers the memory requirement so that each job reserves only a small amount of memory on its cluster node. The last argument is our config file.

First, we print the commands that will be run without actually submitting jobs:

$PIPELINE/assoc.py \
    --chromosomes 1-10 \
    --cluster_type UW_Cluster \
    --cluster_file test_cluster_cfg.json \
    --print_only \
    window \
    testdata/assoc_window_burden.config

The default segment length is 10,000 kb, but we can change that to 50,000 kb when we submit:

$PIPELINE/assoc.py \
    --chromosomes 1-10 \
    --cluster_type UW_Cluster \
    --cluster_file test_cluster_cfg.json \
    --segment_length 50000 \
    window \
    testdata/assoc_window_burden.config
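To get a feel for what the segment length means for the number of jobs, the arithmetic can be sketched in Python. This is a rough illustration only: it assumes segments simply tile each chromosome from start to end, which may not match how the pipeline actually defines its segments. The function n_segments is hypothetical, and the chromosome lengths are the hg19 values for chromosomes 1 and 10.

```python
import math

# hg19 (GRCh37) chromosome lengths in base pairs
CHROM_LENGTH_BP = {
    "1": 249250621,
    "10": 135534747,
}


def n_segments(chrom_length_bp, segment_length_kb):
    """Number of fixed-length segments needed to tile a chromosome.

    Assumption: segments tile the chromosome end to end, and the last
    segment may be shorter than the others.
    """
    return math.ceil(chrom_length_bp / (segment_length_kb * 1000))


for chrom, length in CHROM_LENGTH_BP.items():
    default_count = n_segments(length, 10000)   # default: 10,000 kb
    larger_count = n_segments(length, 50000)    # --segment_length 50000
    print("chr%s: %d segments at 10,000 kb, %d at 50,000 kb"
          % (chrom, default_count, larger_count))
```

Under this simplification, chromosome 1 drops from 25 segments to 5, and chromosome 10 from 14 to 3, so a larger segment length means fewer but longer-running jobs.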

We can use the qstat command to check the status of our jobs.