Import data to workspace: GSR example

We have uploaded GIANT Genomic Summary Results (GSR) to this workspace. In this notebook, we prepare the metadata about the analysis for import into an AnVIL workspace.

Install and load R packages

# get the latest version of AnvilDataModels from github
#remotes::install_github("UW-GAC/AnvilDataModels")
library(AnVIL)
library(AnvilDataModels)
library(tidyverse)

Prepare tables according to data model

Metadata describing the analysis is stored in the association_analysis table. We save this as a set of “field” and “value” pairs for input to the workflow that assigns a unique identifier for each analysis.

fields <- list(
    gsr_source = "https://portals.broadinstitute.org/collaboration/giant/index.php/GIANT_consortium_data_files",
    pubmed_id = "29273807",
    first_author = "V Turcot",
    gsr_source_url = "https://doi.org/10.1038%2Fs41588-017-0011-x",
    consent_code = "NRES",
    upload_date = "2023-01-03",
    contributor_contact = "sdmorris@uw.edu",
    trait = "BMI",
    trait_type = "quantitative",
    trait_unit = "kg / m^2",
    trait_transformation = "inverse normal",
    trait_definition = "weight / height",
    covariates = "age | age^2 | ancestry PCs",
    concept_id = "3038553",
    reference_assembly = "GRCh37",
    n_variants = "246328",
    genotyping_technology = "exome array",
    genotyping_platform = "Illumina",
    is_imputed = "FALSE",
    n_samp = "526508",
    n_effective = "526508",
    age_min = "18",
    cohorts = "GIANT",
    is_meta_analysis = "TRUE",
    population_descriptor = "reported ancestry",
    population_labels = "European | South Asian | African | East Asian | Hispanic",
    population_proportions = "85 | 6 | 5 | 2 | 2",
    countries_of_recruitment = "Australia | Bangladesh | China | Denmark | Estonia | Finnland | Germany | Greece | Iceland | Ireland | Italy | Netherlands | Norway | Philippines | Sweden | Taiwan | UK | USA",
    analysis_method = "score-statistics-based association analysis",
    analysis_software = "RAREMETALWORKER | RVTEST"
)
analysis <- tibble(field=names(fields),
                   value=unlist(fields))

The GIANT data is distributed as a single file per analysis. We read in the data, update the column names so it conforms to the PRIMED data model, and write each chromosome to a separate file.

bucket <- avbucket()
bmi_file <- "BMI_All_ancestry.txt"
gsutil_cp(file.path(bucket, bmi_file), ".")
Copying gs://fc-19bbd8d9-a925-4192-b60e-f5966e7d397c/BMI_All_ancestry.txt...
/ [0/1 files][    0.0 B/ 14.9 MiB]   0% Done                                    
/ [1/1 files][ 14.9 MiB/ 14.9 MiB] 100% Done                                    
-

Operation completed over 1 objects/14.9 MiB.                                     
dat <- read_tsv(bmi_file)
Rows: 246328 Columns: 10
── Column specification ──────────────────────────────────────────────────
Delimiter: "\t"
chr (6): CHR, REF, ALT, SNPNAME, GMAF, ExAC_MAF
dbl (4): POS, beta, se, Pvalue

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(dat)
dat2 <- dat %>%
  rename(chromosome = CHR,
         position = POS,
         rsID = SNPNAME,
         ref_allele = REF,
         alt_allele = ALT,
         p_value = Pvalue) %>%
  mutate(strand = "+",
         effect_allele = ref_allele,
         other_allele = alt_allele,
         effect_allele_freq = NA,
         n_samp = NA,
         is_imputed = FALSE,
         direction_of_effect = NA,
         heterogeneity_p_value = NA,
         heterogeneity_I2 = NA)

Files are linked to analyses. The md5 hash of each file is used to generate the primary key for the ‘file’ table. The md5 should be computed before uploading to the workspace and checked after upload to make sure the upload was successful.

chrs <- unique(dat2$chromosome)
chr_file <- lapply(chrs, function(c) {
    # select only this chromosome
    dat_chr <- filter(dat2, chromosome == c)
    chr_file <- paste0("BMI_All_ancestry_chr", c, ".txt")
    
    # write tsv file
    write_tsv(dat_chr, file=chr_file, na="")
    
    # get md5sum for file table
    md5 <- tools::md5sum(chr_file)
    
    # copy file to google bucket
    gsutil_cp(chr_file, bucket)
    
    ## return row for file table
    tibble(
        md5sum = md5,
        n_variants = nrow(dat_chr),
        file_path = file.path(bucket, chr_file),
        file_type = "data",
        chromosome = c
    )
})

Write tables as files to workspace bucket

To check the tables using a workflow, they must be written as files to the workspace bucket.

outfile <- "GIANT_BMI_file_table.tsv"
write_tsv(file_table, outfile)
gsutil_cp(outfile, bucket)

outfile <- "GIANT_BMI_analysis_table.tsv"
write_tsv(analysis, outfile)
gsutil_cp(outfile, bucket)

Check tables against data model

Once all tables have been created, we can check that they conform to the data model. This is most easily accomplished by providing the paths to the tables in TSV format as input to the validate_gsr_model workflow. This workflow will also check that md5sums in the association_file table match the files in the google bucket, and that all data files match the data dictionary.

LS0tCnRpdGxlOiAiUFJJTUVEIGV4YW1wbGUgbm90ZWJvb2s6IEdlbm9taWMgU3VtbWFyeSBSZXN1bHRzIgpvdXRwdXQ6IGh0bWxfbm90ZWJvb2sKLS0tCgojIEltcG9ydCBkYXRhIHRvIHdvcmtzcGFjZTogR1NSIGV4YW1wbGUKCldlIGhhdmUgdXBsb2FkZWQgR0lBTlQgR2Vub21pYyBTdW1tYXJ5IFJlc3VsdHMgKEdTUikgdG8gdGhpcyB3b3Jrc3BhY2UuIEluIHRoaXMgbm90ZWJvb2ssIHdlIHByZXBhcmUgdGhlIG1ldGFkYXRhIGFib3V0IHRoZSBhbmFseXNpcyBmb3IgaW1wb3J0IGludG8gYW4gQW5WSUwgd29ya3NwYWNlLgoKIyMgSW5zdGFsbCBhbmQgbG9hZCBSIHBhY2thZ2VzCgpgYGB7cn0KIyBnZXQgdGhlIGxhdGVzdCB2ZXJzaW9uIG9mIEFudmlsRGF0YU1vZGVscyBmcm9tIGdpdGh1YgojcmVtb3Rlczo6aW5zdGFsbF9naXRodWIoIlVXLUdBQy9BbnZpbERhdGFNb2RlbHMiKQpsaWJyYXJ5KEFuVklMKQpsaWJyYXJ5KEFudmlsRGF0YU1vZGVscykKbGlicmFyeSh0aWR5dmVyc2UpCmBgYAoKIyMgUHJlcGFyZSB0YWJsZXMgYWNjb3JkaW5nIHRvIGRhdGEgbW9kZWwKCk1ldGFkYXRhIGRlc2NyaWJpbmcgdGhlIGFuYWx5c2lzIGlzIHN0b3JlZCBpbiB0aGUgYXNzb2NpYXRpb25fYW5hbHlzaXMgdGFibGUuIFdlIHNhdmUgdGhpcyBhcyBhIHNldCBvZiAiZmllbGQiIGFuZCAidmFsdWUiIHBhaXJzIGZvciBpbnB1dCB0byB0aGUgd29ya2Zsb3cgdGhhdCBhc3NpZ25zIGEgdW5pcXVlIGlkZW50aWZpZXIgZm9yIGVhY2ggYW5hbHlzaXMuCgpgYGB7cn0KZmllbGRzIDwtIGxpc3QoCiAgICBnc3Jfc291cmNlID0gImh0dHBzOi8vcG9ydGFscy5icm9hZGluc3RpdHV0ZS5vcmcvY29sbGFib3JhdGlvbi9naWFudC9pbmRleC5waHAvR0lBTlRfY29uc29ydGl1bV9kYXRhX2ZpbGVzIiwKICAgIHB1Ym1lZF9pZCA9ICIyOTI3MzgwNyIsCiAgICBmaXJzdF9hdXRob3IgPSAiViBUdXJjb3QiLAogICAgZ3NyX3NvdXJjZV91cmwgPSAiaHR0cHM6Ly9kb2kub3JnLzEwLjEwMzglMkZzNDE1ODgtMDE3LTAwMTEteCIsCiAgICBjb25zZW50X2NvZGUgPSAiTlJFUyIsCiAgICB1cGxvYWRfZGF0ZSA9ICIyMDIzLTAxLTAzIiwKICAgIGNvbnRyaWJ1dG9yX2NvbnRhY3QgPSAic2Rtb3JyaXNAdXcuZWR1IiwKICAgIHRyYWl0ID0gIkJNSSIsCiAgICB0cmFpdF90eXBlID0gInF1YW50aXRhdGl2ZSIsCiAgICB0cmFpdF91bml0ID0gImtnIC8gbV4yIiwKICAgIHRyYWl0X3RyYW5zZm9ybWF0aW9uID0gImludmVyc2Ugbm9ybWFsIiwKICAgIHRyYWl0X2RlZmluaXRpb24gPSAid2VpZ2h0IC8gaGVpZ2h0IiwKICAgIGNvdmFyaWF0ZXMgPSAiYWdlIHwgYWdlXjIgfCBhbmNlc3RyeSBQQ3MiLAogICAgY29uY2VwdF9pZCA9ICIzMDM4NTUzIiwKICAgIHJlZmVyZW5jZV9hc3NlbWJseSA9ICJHUkNoMzciLAogICAgbl92YXJpYW50cyA9ICIyNDYzMjgiLAogICAgZ2Vub3R5cGluZ190ZWNobm9sb2d5ID0gImV4b21lIGFycmF5IiwKICAgIGdlbm90eXBpbmdfcGxhdGZvcm0gPSAiSWxsdW1pbmEiLAogICAgaXNfaW1wdXRlZCA9ICJGQUxTRSIsCiAgICBuX3NhbXAgPSAiNTI2NTA4IiwKICAgIG5fZWZmZWN0aXZlID0gIjUyNjUwOCIsCiAgICBhZ2VfbWluID0gIjE4IiwKICAgIGNvaG9ydHMgPSAiR0lBTlQiLAogICAgaXNfbWV0YV9hbmFseXNpcyA9ICJUUlVFIiwKICAgIHBvcHVsYXRpb25fZGVzY3JpcHRvciA9ICJyZXBvcnRlZCBhbmNlc3RyeSIsCiAgICBwb3B1bGF0aW9uX2xhYmVscyA9ICJFdXJvcGVhbiB8IFNvdXRoIEFzaWFuIHwgQWZyaWNhbiB8IEVhc3QgQXNpYW4gfCBIaXNwYW5pYyIsCiAgICBwb3B1bGF0aW9uX3Byb3BvcnRpb25zID0gIjg1IHwgNiB8IDUgfCAyIHwgMiIsCiAgICBjb3VudHJpZXNfb2ZfcmVjcnVpdG1lbnQgPSAiQXVzdHJhbGlhIHwgQmFuZ2xhZGVzaCB8IENoaW5hIHwgRGVubWFyayB8IEVzdG9uaWEgfCBGaW5ubGFuZCB8IEdlcm1hbnkgfCBHcmVlY2UgfCBJY2VsYW5kIHwgSXJlbGFuZCB8IEl0YWx5IHwgTmV0aGVybGFuZHMgfCBOb3J3YXkgfCBQaGlsaXBwaW5lcyB8IFN3ZWRlbiB8IFRhaXdhbiB8IFVLIHwgVVNBIiwKICAgIGFuYWx5c2lzX21ldGhvZCA9ICJzY29yZS1zdGF0aXN0aWNzLWJhc2VkIGFzc29jaWF0aW9uIGFuYWx5c2lzIiwKICAgIGFuYWx5c2lzX3NvZnR3YXJlID0gIlJBUkVNRVRBTFdPUktFUiB8IFJWVEVTVCIKKQphbmFseXNpcyA8LSB0aWJibGUoZmllbGQ9bmFtZXMoZmllbGRzKSwKICAgICAgICAgICAgICAgICAgIHZhbHVlPXVubGlzdChmaWVsZHMpKQpgYGAKCgpUaGUgR0lBTlQgZGF0YSBpcyBkaXN0cmlidXRlZCBhcyBhIHNpbmdsZSBmaWxlIHBlciBhbmFseXNpcy4gV2UgcmVhZCBpbiB0aGUgZGF0YSwgdXBkYXRlIHRoZSBjb2x1bW4gbmFtZXMgc28gaXQgY29uZm9ybXMgdG8gdGhlIFBSSU1FRCBkYXRhIG1vZGVsLCBhbmQgd3JpdGUgZWFjaCBjaHJvbW9zb21lIHRvIGEgc2VwYXJhdGUgZmlsZS4KCmBgYHtyfQpidWNrZXQgPC0gYXZidWNrZXQoKQpibWlfZmlsZSA8LSAiQk1JX0FsbF9hbmNlc3RyeS50eHQiCmdzdXRpbF9jcChmaWxlLnBhdGgoYnVja2V0LCBibWlfZmlsZSksICIuIikKYGBgCgpgYGB7cn0KZGF0IDwtIHJlYWRfdHN2KGJtaV9maWxlKQpoZWFkKGRhdCkKYGBgCgpgYGB7cn0KZGF0MiA8LSBkYXQgJT4lCiAgcmVuYW1lKGNocm9tb3NvbWUgPSBDSFIsCiAgICAgICAgIHBvc2l0aW9uID0gUE9TLAogICAgICAgICByc0lEID0gU05QTkFNRSwKICAgICAgICAgcmVmX2FsbGVsZSA9IFJFRiwKICAgICAgICAgYWx0X2FsbGVsZSA9IEFMVCwKICAgICAgICAgcF92YWx1ZSA9IFB2YWx1ZSkgJT4lCiAgbXV0YXRlKHN0cmFuZCA9ICIrIiwKICAgICAgICAgZWZmZWN0X2FsbGVsZSA9IHJlZl9hbGxlbGUsCiAgICAgICAgIG90aGVyX2FsbGVsZSA9IGFsdF9hbGxlbGUsCiAgICAgICAgIGlzX2ltcHV0ZWQgPSBGQUxTRSkKYGBgCgpGaWxlcyBhcmUgbGlua2VkIHRvIGFuYWx5c2VzLiBUaGUgbWQ1IGhhc2ggb2YgZWFjaCBmaWxlIGlzIHVzZWQgdG8gZ2VuZXJhdGUgdGhlIHByaW1hcnkga2V5IGZvciB0aGUgJ2ZpbGUnIHRhYmxlLiBUaGUgbWQ1IHNob3VsZCBiZSBjb21wdXRlZCBiZWZvcmUgdXBsb2FkaW5nIHRvIHRoZSB3b3Jrc3BhY2UgYW5kIGNoZWNrZWQgYWZ0ZXIgdXBsb2FkIHRvIG1ha2Ugc3VyZSB0aGUgdXBsb2FkIHdhcyBzdWNjZXNzZnVsLiAKCmBgYHtyfQpjaHJzIDwtIHVuaXF1ZShkYXQyJGNocm9tb3NvbWUpCmNocl9maWxlIDwtIGxhcHBseShjaHJzLCBmdW5jdGlvbihjKSB7CiAgICAjIHNlbGVjdCBvbmx5IHRoaXMgY2hyb21vc29tZQogICAgZGF0X2NociA8LSBmaWx0ZXIoZGF0MiwgY2hyb21vc29tZSA9PSBjKQogICAgY2hyX2ZpbGUgPC0gcGFzdGUwKCJCTUlfQWxsX2FuY2VzdHJ5X2NociIsIGMsICIudHh0IikKICAgIAogICAgIyB3cml0ZSB0c3YgZmlsZQogICAgd3JpdGVfdHN2KGRhdF9jaHIsIGZpbGU9Y2hyX2ZpbGUsIG5hPSIiKQogICAgCiAgICAjIGdldCBtZDVzdW0gZm9yIGZpbGUgdGFibGUKICAgIG1kNSA8LSB0b29sczo6bWQ1c3VtKGNocl9maWxlKQogICAgCiAgICAjIGNvcHkgZmlsZSB0byBnb29nbGUgYnVja2V0CiAgICBnc3V0aWxfY3AoY2hyX2ZpbGUsIGJ1Y2tldCkKICAgIAogICAgIyMgcmV0dXJuIHJvdyBmb3IgZmlsZSB0YWJsZQogICAgdGliYmxlKAogICAgICAgIG1kNXN1bSA9IG1kNSwKICAgICAgICBuX3ZhcmlhbnRzID0gbnJvdyhkYXRfY2hyKSwKICAgICAgICBmaWxlX3BhdGggPSBmaWxlLnBhdGgoYnVja2V0LCBjaHJfZmlsZSksCiAgICAgICAgZmlsZV90eXBlID0gImRhdGEiLAogICAgICAgIGNocm9tb3NvbWUgPSBjCiAgICApCn0pCmZpbGVfdGFibGUgPC0gYmluZF9yb3dzKGNocl9maWxlKQpgYGAKCgojIyBXcml0ZSB0YWJsZXMgYXMgZmlsZXMgdG8gd29ya3NwYWNlIGJ1Y2tldAoKVG8gY2hlY2sgdGhlIHRhYmxlcyB1c2luZyBhIHdvcmtmbG93LCB0aGV5IG11c3QgYmUgd3JpdHRlbiBhcyBmaWxlcyB0byB0aGUgd29ya3NwYWNlIGJ1Y2tldC4KCmBgYHtyfQpvdXRmaWxlIDwtICJHSUFOVF9CTUlfZmlsZV90YWJsZS50c3YiCndyaXRlX3RzdihmaWxlX3RhYmxlLCBvdXRmaWxlKQpnc3V0aWxfY3Aob3V0ZmlsZSwgYnVja2V0KQoKb3V0ZmlsZSA8LSAiR0lBTlRfQk1JX2FuYWx5c2lzX3RhYmxlLnRzdiIKd3JpdGVfdHN2KGFuYWx5c2lzLCBvdXRmaWxlKQpnc3V0aWxfY3Aob3V0ZmlsZSwgYnVja2V0KQpgYGAKCgojIyBDaGVjayB0YWJsZXMgYWdhaW5zdCBkYXRhIG1vZGVsCgpPbmNlIGFsbCB0YWJsZXMgaGF2ZSBiZWVuIGNyZWF0ZWQsIHdlIGNhbiBjaGVjayB0aGF0IHRoZXkgY29uZm9ybSB0byB0aGUgZGF0YSBtb2RlbC4gVGhpcyBpcyBtb3N0IGVhc2lseSBhY2NvbXBsaXNoZWQgYnkgcHJvdmlkaW5nIHRoZSBwYXRocyB0byB0aGUgdGFibGVzIGluIFRTViBmb3JtYXQgYXMgaW5wdXQgdG8gdGhlIGB2YWxpZGF0ZV9nc3JfbW9kZWxgIHdvcmtmbG93LgpUaGlzIHdvcmtmbG93IHdpbGwgYWxzbyBjaGVjayB0aGF0IG1kNXN1bXMgaW4gdGhlIGBhc3NvY2lhdGlvbl9maWxlYCB0YWJsZSBtYXRjaCB0aGUgZmlsZXMgaW4gdGhlIGdvb2dsZSBidWNrZXQsCiBhbmQgdGhhdCBhbGwgZGF0YSBmaWxlcyBtYXRjaCB0aGUgZGF0YSBkaWN0aW9uYXJ5Lgo=