We have uploaded HapMap data in PLINK format (bed/bim/fam) to this workspace. We then ran the plink2_bed2vcf workflow to convert the PLINK genotypes to a VCF file. In this notebook, we prepare the data from these files for importing into data tables.
# get the latest version of AnvilDataModels from github
#remotes::install_github("UW-GAC/AnvilDataModels")
library(AnVIL)
library(AnvilDataModels)
library(tidyverse)
To access files in this workspace, we need the google bucket ID, which is returned by the avbucket function. gsutil_ls(bucket) would list all the files in the bucket.
(bucket <- avbucket())
[1] "gs://fc-d57f9c4e-210f-46d6-a9a9-1a3071fa5fff"
To read from a google bucket, we use the gsutil_pipe function. We need to specify "rb" to open the file for reading in binary mode. An alternative is to copy the file to the local instance with gsutil_cp, as sketched after the code below.
prefix <- "hapmap3_r3_b37_fwd.consensus.qc.poly_Ilmn1M"
famfile <- paste0(bucket, "/", prefix, ".fam")
famfile %in% gsutil_ls(bucket)
[1] TRUE
fampipe <- gsutil_pipe(famfile, "rb")
fam <- read_table(fampipe, col_names=c("family", "indiv", "father", "mother", "sex", "phen"), col_types="cccccc")
head(fam)
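For comparison, here is a minimal sketch of the gsutil_cp alternative mentioned above: copy the file to the local instance, then read it from disk with the same column specification (fam_local is an illustrative name).
# copy the .fam file from the bucket to the current working directory
gsutil_cp(famfile, ".")
# basename() strips the bucket prefix, leaving the local file name
fam_local <- read_table(basename(famfile),
                        col_names=c("family", "indiv", "father", "mother", "sex", "phen"),
                        col_types="cccccc")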
Create the subject table.
subject <- fam %>%
    mutate(reported_sex=c("1"="Male", "2"="Female")[sex]) %>%
    select(subject_id=indiv, reported_sex) %>%
    mutate(consent_code="NRUP",
           study_nickname="HapMap",
           dbgap_submission=FALSE)
head(subject)
Create the sample table. In this example we use the same identifiers for subject and sample, but in practice distinct values for each are preferred.
sample <- fam %>%
    select(sample_id=indiv) %>%
    mutate(subject_id=sample_id,
           tissue_source="cell line")
head(sample)
Define sample sets to link to genotype data. We will create two sets: one with all samples (recommended for inclusion in every workspace), and one with 100 samples that we will call “set1”.
sample_set <- create_set_all(sample, table_name="sample")
sample_set_100 <- tibble(sample_set_id="set1", sample_id=sample$sample_id[1:100])
sample_set <- bind_rows(sample_set, sample_set_100)
head(sample_set)
tail(sample_set)
count(sample_set, sample_set_id)
Each dataset is linked to a sample_set, but the same sample set may correspond to multiple datasets (such as array data and imputed data).
Metadata describing the array is stored in the array_dataset table. We save this as a set of “field” and “value” pairs for input to the workflow that assigns a unique identifier for each dataset.
array_fields <- list(
    sample_set_id = "all",
    genotyping_center = "Wellcome Trust Sanger Institute",
    array_manufacturer = "Illumina",
    array_name = "Human 1M",
    genotype_calling_software = "BeadStudio",
    reference_assembly = "GRCh37"
)
array_dataset <- tibble(field=names(array_fields),
                        value=unlist(array_fields))
Files are linked to datasets. The md5 hash of each file is used to generate the primary key for the ‘file’ table. The md5 should be computed before uploading to the workspace. Later, we will use the check_md5 workflow to make sure the upload was successful.
files <- paste0(bucket, "/", prefix, c(".bed", ".bim", ".fam"))
md5 <- c("ec6096edea0d6f46191a0275577b3f02",
         "5a1e4276783afa0a235f907edae1dae3",
         "4d9651bb9e45054dc8ed8c1c59cba19d")
array_file <- tibble(md5sum = md5,
                     file_path = files,
                     file_type = c("PLINK bed", "PLINK bim", "PLINK fam"))
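The md5 values above were computed before upload; as a sketch, such a computation could use tools::md5sum from base R (this assumes local copies of the PLINK files sit in the working directory):
# compute md5 sums for the local PLINK files prior to upload
local_files <- paste0(prefix, c(".bed", ".bim", ".fam"))
unname(tools::md5sum(local_files))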
In addition to the PLINK files, we add the converted and lifted-over VCFs to the file table. The workflows plink2_bed2vcf and liftover_vcf output the md5sum along with the VCF files.
array_file <- array_file %>%
    bind_rows(tibble(md5sum = c("644afbb696822d378c2493fb4d06e389",
                                "bc906202e47eef92ec5df7939f30c189"),
                     file_path = paste0(bucket, c(
                         "/submissions/bcf1e11b-4836-4f27-9055-f91e7bb579b1/plink2_bed2vcf/393c7eb0-5fbf-4d35-9815-cd800b0c0793/call-results/cacheCopy/hapmap3_r3_Ilmn1M_hg19.vcf.gz",
                         "/submissions/2369a243-179e-4af6-a518-c4c29900008b/liftover_vcf/50004a72-e76e-4899-a5b6-9c927b957c2a/call-merge_vcf/hapmap3_r3_Ilmn1M_hg38.vcf.gz")),
                     file_type = "VCF"))
To check the tables using a workflow, they must be written as files to the workspace bucket.
table_names <- c("subject", "sample", "sample_set", "array_dataset", "array_file")
for (t in table_names) {
    outfile <- paste0("HapMap_", t, "_table.tsv")
    write_tsv(get(t), outfile)
    gsutil_cp(outfile, bucket)
}
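As a quick sanity check that the TSV files landed in the bucket, we can filter a fresh listing (a sketch using base grepl; tsvs is an illustrative name):
# keep only the table TSVs we just copied
tsvs <- gsutil_ls(bucket)
tsvs[grepl("_table\\.tsv$", tsvs)]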
Once all tables have been created, we can check that they conform to the data model. This is most easily accomplished by providing the paths to the tables in TSV format as input to the validate_genotype_model workflow.
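Those inputs are simply the bucket paths of the TSV files written above; a sketch of constructing them from the same table_names vector:
# bucket paths to the table TSVs, for pasting into the workflow inputs
paste0(bucket, "/HapMap_", table_names, "_table.tsv")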