Integrating Variants with Annotations using Glow(Python)
Loading...

Integrating Variants with Annotations using Glow

Glow addeds support for loading GFF3 files, which are commonly used to store annotations on genomic regions. When combined with variant data, genomic annotations provide context for each change in the genome. Does this mutation cause a change in the protein coding sequence of gene? If so, how does it change the protein? Or is the mutation in a low information part of the genome, also known as “junk DNA”. And everything in between. In this notebook, we demonstrate how to use Glow's APIs to work with annotations from the RefSeq database alongside genomic variants from the 1000 Genomes project.

To start, we will download annotations for the GRCh38 human reference genome from RefSeq and define the paths that we will load our data from.

from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql.dataframe import *

# Human genome annotations in GFF3 are available at https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/reference/GCF_000001405.39_GRCh38.p13/
gff_path = "/databricks-datasets/genomics/gffs/GCF_000001405.39_GRCh38.p13_genomic.gff.bgz"

vcf_path = "/databricks-datasets/genomics/1kg-vcfs/ALL.chr22.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz"

Load in GFF file

Here, we load in the GFF file as an Apache Spark dataframe, using Glow's GFF reader. We can then filter down to all annotations that are on NC_000022.11, which is the accession for chromosome 22 in the GRCh38 patch 13 human genome reference build.

annotations_df = spark.read \
  .format('gff') \
  .load(gff_path) \
  .filter("seqid = 'NC_000022.11'") \
  .alias('annotations_df')

annotations_df.printSchema()
root |-- seqId: string (nullable = true) |-- source: string (nullable = true) |-- type: string (nullable = true) |-- start: long (nullable = true) |-- end: long (nullable = true) |-- score: double (nullable = true) |-- strand: string (nullable = true) |-- phase: integer (nullable = true) |-- ID: string (nullable = true) |-- Name: string (nullable = true) |-- Parent: array (nullable = true) | |-- element: string (containsNull = true) |-- Target: string (nullable = true) |-- Gap: string (nullable = true) |-- Note: array (nullable = true) | |-- element: string (containsNull = true) |-- Dbxref: array (nullable = true) | |-- element: string (containsNull = true) |-- Is_circular: boolean (nullable = true) |-- align_id: string (nullable = true) |-- allele: string (nullable = true) |-- anticodon: string (nullable = true) |-- assembly_bases_aln: string (nullable = true) |-- assembly_bases_seq: string (nullable = true) |-- batch_id: string (nullable = true) |-- bit_score: string (nullable = true) |-- blast_aligner: string (nullable = true) |-- blast_score: string (nullable = true) |-- bound_moiety: string (nullable = true) |-- chromosome: string (nullable = true) |-- codons: string (nullable = true) |-- common_component: string (nullable = true) |-- consensus_splices: string (nullable = true) |-- country: string (nullable = true) |-- crc32: string (nullable = true) |-- curated_alignment: string (nullable = true) |-- description: string (nullable = true) |-- direction: string (nullable = true) |-- e_value: string (nullable = true) |-- end_range: string (nullable = true) |-- exception: string (nullable = true) |-- exon_identity: string (nullable = true) |-- exon_number: string (nullable = true) |-- experiment: string (nullable = true) |-- feat_class: string (nullable = true) |-- filter_score: string (nullable = true) |-- for_remapping: string (nullable = true) |-- function: string (nullable = true) |-- gap_count: string (nullable = true) |-- gbkey: string (nullable = true) |-- gene: string (nullable = true) |-- gene_biotype: string (nullable = true) |-- gene_synonym: string (nullable = true) |-- genome: string (nullable = true) |-- hsp_percent_coverage: string (nullable = true) |-- identity: string (nullable = true) |-- idty: string (nullable = true) |-- inference: string (nullable = true) |-- isolation-source: string (nullable = true) |-- lxr_locAcc_currStat_120: string (nullable = true) |-- lxr_locAcc_currStat_35: string (nullable = true) |-- map: string (nullable = true) |-- matchable_bases: string (nullable = true) |-- matched_bases: string (nullable = true) |-- matches: string (nullable = true) |-- merge_aligner: string (nullable = true) |-- mobile_element_type: string (nullable = true) |-- model_evidence: string (nullable = true) |-- mol_type: string (nullable = true) |-- not_for_annotation: string (nullable = true) |-- num_ident: string (nullable = true) |-- num_mismatch: string (nullable = true) |-- number: string (nullable = true) |-- part: string (nullable = true) |-- partial: string (nullable = true) |-- pct_coverage: string (nullable = true) |-- pct_coverage_hiqual: string (nullable = true) |-- pct_identity_gap: string (nullable = true) |-- pct_identity_gapopen_only: string (nullable = true) |-- pct_identity_ungap: string (nullable = true) |-- product: string (nullable = true) |-- product_coverage: string (nullable = true) |-- promoted_rank: string (nullable = true) |-- protein_id: string (nullable = true) |-- pseudo: string (nullable = true) |-- qtaxid: string (nullable = true) |-- rank: string (nullable = true) |-- recombination_class: string (nullable = true) |-- regulatory_class: string (nullable = true) |-- rpt_family: string (nullable = true) |-- rpt_type: string (nullable = true) |-- rpt_unit_range: string (nullable = true) |-- rpt_unit_seq: string (nullable = true) |-- satellite: string (nullable = true) |-- splices: string (nullable = true) |-- standard_name: string (nullable = true) |-- start_range: string (nullable = true) |-- tag: string (nullable = true) |-- tissue-type: string (nullable = true) |-- transcript_id: string (nullable = true) |-- transl_except: string (nullable = true) |-- transl_table: string (nullable = true) |-- weighted_identity: string (nullable = true)
display(annotations_df)
 
seqId
source
type
start
end
score
strand
phase
ID
Name
Parent
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
NC_000022.11
RefSeq
region
0
50818468
null
+
null
NC_000022.11:1..50818468
22
null
NC_000022.11
Gnomon
pseudogene
10580456
10580988
null
+
null
gene-LOC100996364
LOC100996364
null
NC_000022.11
Gnomon
exon
10580456
10580988
null
+
null
id-LOC100996364
null
["gene-LOC100996364"]
NC_000022.11
Gnomon
gene
10742023
10753062
null
+
null
gene-LOC105379418
LOC105379418
null
NC_000022.11
Gnomon
lnc_RNA
10742023
10753062
null
+
null
rna-XR_950597.3
XR_950597.3
["gene-LOC105379418"]
NC_000022.11
Gnomon
exon
10742023
10742191
null
+
null
exon-XR_950597.3-1
null
["rna-XR_950597.3"]
NC_000022.11
Gnomon
exon
10752759
10753062
null
+
null
exon-XR_950597.3-2
null
["rna-XR_950597.3"]
NC_000022.11
Gnomon
lnc_RNA
10742023
10753053
null
+
null
rna-XR_950596.3
XR_950596.3
["gene-LOC105379418"]
NC_000022.11
Gnomon
exon
10742023
10742191
null
+
null
exon-XR_950596.3-1
null
["rna-XR_950596.3"]
NC_000022.11
Gnomon
exon
10749497
10749658
null
+
null
exon-XR_950596.3-2
null
["rna-XR_950596.3"]
NC_000022.11
Gnomon
exon
10752759
10753053
null
+
null
exon-XR_950596.3-3
null
["rna-XR_950596.3"]
NC_000022.11
Gnomon
pseudogene
10858994
10864475
null
+
null
gene-LOC100289194
LOC100289194
null
NC_000022.11
Gnomon
exon
10858994
10859105
null
+
null
id-LOC100289194
null
["gene-LOC100289194"]
NC_000022.11
Gnomon
exon
10859694
10859832
null
+
null
id-LOC100289194-2
null
["gene-LOC100289194"]
NC_000022.11
Gnomon
exon
10861764
10861890
null
+
null
id-LOC100289194-3
null
["gene-LOC100289194"]
NC_000022.11
Gnomon
exon
10863367
10863581
null
+
null
id-LOC100289194-4
null
["gene-LOC100289194"]
NC_000022.11
Gnomon
exon
10863721
10864475
null
+
null
id-LOC100289194-5
null
["gene-LOC100289194"]
NC_000022.11
BestRefSeq
pseudogene
10940596
10961529
null
-
null
gene-FRG1FP
FRG1FP
null
NC_000022.11
BestRefSeq
transcript
10940596
10961529
null
-
null
rna-NR_132320.1
NR_132320.1
["gene-FRG1FP"]
NC_000022.11
BestRefSeq
exon
10961282
10961529
null
-
null
exon-NR_132320.1-1
null
["rna-NR_132320.1"]
NC_000022.11
BestRefSeq
exon
10960331
10960431
null
-
null
exon-NR_132320.1-2
null
["rna-NR_132320.1"]
NC_000022.11
BestRefSeq
exon
10959066
10959136
null
-
null
exon-NR_132320.1-3
null
["rna-NR_132320.1"]
NC_000022.11
BestRefSeq
exon
10950048
10950174
null
-
null
exon-NR_132320.1-4
null
["rna-NR_132320.1"]
NC_000022.11
BestRefSeq
exon
10949211
10949269
null
-
null
exon-NR_132320.1-5
null
["rna-NR_132320.1"]
NC_000022.11
BestRefSeq
exon
10947303
10947418
null
-
null
exon-NR_132320.1-6
null
["rna-NR_132320.1"]

Showing the first 1000 rows.

Load in variants

Now, we'll load in genotype data from 1000 Genomes. We will combine this data later with our annotations, so we can connect variants and gene transcripts.

variants_df = spark.read \
  .format("vcf") \
  .load(vcf_path) \
  .drop('genotypes') \
  .alias('variants_df')

display(variants_df)
 
contigName
start
end
names
referenceAllele
alternateAlleles
qual
filters
splitFromMultiAllelic
INFO_AC
INFO_NS
INFO_AFR_AF
INFO_VT
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
10516172
10516173
[]
A
["G"]
null
["PASS"]
false
[121]
2548
[0.06]
["SNP"]
22
10522216
10522217
[]
G
["A"]
null
["PASS"]
false
[89]
2548
[0.07]
["SNP"]
22
10526444
10526445
[]
A
["G"]
null
["PASS"]
false
[4948]
2548
[0.93]
["SNP"]
22
10527033
10527034
[]
G
["T"]
null
["PASS"]
false
[271]
2548
[0.11]
["SNP"]
22
10527037
10527038
[]
C
["G"]
null
["PASS"]
false
[267]
2548
[0.01]
["SNP"]
22
10527073
10527074
[]
A
["G"]
null
["PASS"]
false
[104]
2548
[0]
["SNP"]
22
10530661
10530662
[]
A
["G"]
null
["PASS"]
false
[30]
2548
[0]
["SNP"]
22
10530666
10530667
[]
G
["A"]
null
["PASS"]
false
[1]
2548
[0]
["SNP"]
22
10530666
10530668
[]
GA
["G"]
null
["PASS"]
false
[4]
2548
[0]
["INDEL"]
22
10530698
10530699
[]
A
["G"]
null
["PASS"]
false
[131]
2548
[0.09]
["SNP"]
22
10530753
10530754
[]
A
["C"]
null
["PASS"]
false
[50]
2548
[0]
["SNP"]
22
10530758
10530759
[]
A
["G"]
null
["PASS"]
false
[7]
2548
[0.01]
["SNP"]
22
10531287
10531288
[]
A
["C"]
null
["PASS"]
false
[50]
2548
[0.03]
["SNP"]
22
10531733
10531734
[]
G
["T"]
null
["PASS"]
false
[36]
2548
[0.03]
["SNP"]
22
10531783
10531785
[]
GA
["G"]
null
["PASS"]
false
[17]
2548
[0]
["INDEL"]
22
10531788
10531789
[]
G
["GATGAA"]
null
["PASS"]
false
[1754]
2548
[0.42]
["INDEL"]
22
10561434
10561435
[]
A
["T"]
null
["PASS"]
false
[31]
2548
[0]
["SNP"]
22
10561461
10561462
[]
A
["T"]
null
["PASS"]
false
[9]
2548
[0.01]
["SNP"]
22
10561493
10561494
[]
A
["T"]
null
["PASS"]
false
[2]
2548
[0]
["SNP"]
22
10561502
10561503
[]
G
["A"]
null
["PASS"]
false
[51]
2548
[0]
["SNP"]
22
10561515
10561516
[]
G
["T"]
null
["PASS"]
false
[45]
2548
[0.03]
["SNP"]

Showing the first 1000 rows.

Resolving parent/child relationships in annotation data

Annotation data often contains parent/child relationships, for example, an exon (child) is a part of a gene (parent). To resolve these relationships, we can use a self-join in Spark SQL.

from pyspark.sql.functions import *

parent_child_df = annotations_df \
.join(
  annotations_df.select('id', 'type', 'name', 'start', 'end').alias('parent_df'),
  col('annotations_df.parent')[0] == col('parent_df.id') # each row in annotation_df has at most one parent
) \
.orderBy('annotations_df.start', 'annotations_df.end') \
.select(
  'annotations_df.seqid',
  'annotations_df.type',
  'annotations_df.start',
  'annotations_df.end',
  'annotations_df.id',
  'annotations_df.name',
  col('annotations_df.parent')[0].alias('parent_id'),
  col('parent_df.Name').alias('parent_name'),
  col('parent_df.type').alias('parent_type'),
  col('parent_df.start').alias('parent_start'),
  col('parent_df.end').alias('parent_end')
) \
.alias('parent_child_df')

display(parent_child_df)                             
 
seqid
type
start
end
id
name
parent_id
parent_name
parent_type
parent_start
parent_end
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
NC_000022.11
exon
10580456
10580988
id-LOC100996364
null
gene-LOC100996364
LOC100996364
pseudogene
10580456
10580988
NC_000022.11
exon
10742023
10742191
exon-XR_950596.3-1
null
rna-XR_950596.3
XR_950596.3
lnc_RNA
10742023
10753053
NC_000022.11
exon
10742023
10742191
exon-XR_950597.3-1
null
rna-XR_950597.3
XR_950597.3
lnc_RNA
10742023
10753062
NC_000022.11
lnc_RNA
10742023
10753053
rna-XR_950596.3
XR_950596.3
gene-LOC105379418
LOC105379418
gene
10742023
10753062
NC_000022.11
lnc_RNA
10742023
10753062
rna-XR_950597.3
XR_950597.3
gene-LOC105379418
LOC105379418
gene
10742023
10753062
NC_000022.11
exon
10749497
10749658
exon-XR_950596.3-2
null
rna-XR_950596.3
XR_950596.3
lnc_RNA
10742023
10753053
NC_000022.11
exon
10752759
10753053
exon-XR_950596.3-3
null
rna-XR_950596.3
XR_950596.3
lnc_RNA
10742023
10753053
NC_000022.11
exon
10752759
10753062
exon-XR_950597.3-2
null
rna-XR_950597.3
XR_950597.3
lnc_RNA
10742023
10753062
NC_000022.11
exon
10858994
10859105
id-LOC100289194
null
gene-LOC100289194
LOC100289194
pseudogene
10858994
10864475
NC_000022.11
exon
10859694
10859832
id-LOC100289194-2
null
gene-LOC100289194
LOC100289194
pseudogene
10858994
10864475
NC_000022.11
exon
10861764
10861890
id-LOC100289194-3
null
gene-LOC100289194
LOC100289194
pseudogene
10858994
10864475
NC_000022.11
exon
10863367
10863581
id-LOC100289194-4
null
gene-LOC100289194
LOC100289194
pseudogene
10858994
10864475
NC_000022.11
exon
10863721
10864475
id-LOC100289194-5
null
gene-LOC100289194
LOC100289194
pseudogene
10858994
10864475
NC_000022.11
exon
10940596
10940707
exon-NR_132320.1-9
null
rna-NR_132320.1
NR_132320.1
transcript
10940596
10961529
NC_000022.11
transcript
10940596
10961529
rna-NR_132320.1
NR_132320.1
gene-FRG1FP
FRG1FP
pseudogene
10940596
10961529
NC_000022.11
exon
10941688
10941780
exon-NR_132320.1-8
null
rna-NR_132320.1
NR_132320.1
transcript
10940596
10961529
NC_000022.11
exon
10944966
10945053
exon-NR_132320.1-7
null
rna-NR_132320.1
NR_132320.1
transcript
10940596
10961529
NC_000022.11
exon
10947303
10947418
exon-NR_132320.1-6
null
rna-NR_132320.1
NR_132320.1
transcript
10940596
10961529
NC_000022.11
exon
10949211
10949269
exon-NR_132320.1-5
null
rna-NR_132320.1
NR_132320.1
transcript
10940596
10961529
NC_000022.11
exon
10950048
10950174
exon-NR_132320.1-4
null
rna-NR_132320.1
NR_132320.1
transcript
10940596
10961529
NC_000022.11
exon
10959066
10959136
exon-NR_132320.1-3
null
rna-NR_132320.1
NR_132320.1
transcript
10940596
10961529
NC_000022.11
exon
10960331
10960431
exon-NR_132320.1-2
null
rna-NR_132320.1
NR_132320.1
transcript
10940596
10961529
NC_000022.11
exon
10961282
10961529
exon-NR_132320.1-1
null
rna-NR_132320.1
NR_132320.1
transcript
10940596
10961529
NC_000022.11
exon
11250651
11250945
id-LOC102723688
null
gene-LOC102723688
LOC102723688
pseudogene
11250651
11256506
NC_000022.11
exon
11252853
11253227
id-LOC102723688-2
null
gene-LOC102723688
LOC102723688
pseudogene
11250651
11256506
NC_000022.11
exon
11253267
11254142
id-LOC102723688-3
null
gene-LOC102723688
LOC102723688
pseudogene
11250651
11256506
NC_000022.11
exon
11255992
11256257
id-LOC102723688-4
null
gene-LOC102723688
LOC102723688
pseudogene
11250651
11256506

Showing the first 1000 rows.

We can generalize this process with a function that runs the self-join, filters down to a specified parent/child relationship. To link all of the children back to the parent feature, we finally run a group-by.