Normalized variant representation
The flexibility of the VCF format means the same variant can be represented in multiple ways. When comparing calls between multiple callers or sequencing technologies, it is critical to ensure uniform variant representation, so that differences in representation are not reported as discordant calls.
For additional background on normalization, Adrian Tan wrote up the approaches used for normalization within vt.
We resolve these issues via a normalization process which does the following:
-
Converts all naming and coordinates into the standard NCBI/Ensembl convention from UCSC (chr1 to 1). In addition to flexibly remapping chromosome names, this handles reordering to standard conventions used within GATK.
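As a rough illustration of the renaming and reordering described above, the sketch below maps UCSC-style names to NCBI/Ensembl names and sorts them. The function names and the fixed ordering (autosomes, then X, Y, MT) are assumptions made for this example; the actual process orders contigs to match the reference used with GATK, and this sketch does not attempt to rename unplaced or alternate contigs.

```python
# Hypothetical helpers for illustration only; not taken from the tool itself.

# Assumed ordering for non-numeric chromosomes: after the autosomes.
_SPECIAL_ORDER = {"X": 23, "Y": 24, "MT": 25}


def ucsc_to_ensembl(name):
    """Map a UCSC-style name (chr1, chrX, chrM) to NCBI/Ensembl style (1, X, MT)."""
    if name == "chrM":
        return "MT"
    if name.startswith("chr"):
        return name[len("chr"):]
    return name  # already in the target convention


def karyotype_key(name):
    """Sort key: autosomes numerically, then X, Y, MT, then anything else by name."""
    if name.isdigit():
        return (int(name), "")
    return (_SPECIAL_ORDER.get(name, 26), name)


names = ["chrM", "chr10", "chr2", "chrX", "chr1"]
converted = sorted((ucsc_to_ensembl(n) for n in names), key=karyotype_key)
print(converted)  # ['1', '2', '10', 'X', 'MT']
```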
-
Reduces all MNPs and complex variants into individual phased variants. Multiple nucleotide polymorphisms (MNPs) place multiple phased variants together into a single call representation. For example, this MNP:
MT 150 . TCT CCC . PASS . GT 1/1
can be equivalently expressed as:
MT 150 . T C . PASS . GT 1/1
MT 152 . T C . PASS . GT 1|1
The normalization process converts all MNPs into the latter form.
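The decomposition can be sketched in a few lines. This is a minimal example covering only the alleles and positions; `split_mnp` is a hypothetical name, and genotype and phasing handling is left out.

```python
def split_mnp(chrom, pos, ref, alt):
    """Split an MNP into one (chrom, pos, ref_base, alt_base) tuple per changed base.

    Positions where REF and ALT agree produce no record.
    """
    if len(ref) != len(alt):
        raise ValueError("not an MNP: REF and ALT differ in length")
    return [
        (chrom, pos + offset, ref_base, alt_base)
        for offset, (ref_base, alt_base) in enumerate(zip(ref, alt))
        if ref_base != alt_base
    ]


# The MNP from the example: the middle base (C at 151) is unchanged and dropped.
print(split_mnp("MT", 150, "TCT", "CCC"))
# [('MT', 150, 'T', 'C'), ('MT', 152, 'T', 'C')]
```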
-
Indels next to variants or in repetitive regions have multiple correct representations. For instance, an AG -> C change can be aligned as:
TAG
TC-
or:
TAG
T-C
We follow the convention of left-aligning these variants and convert all of them to the second representation. As with MNP normalization, the changes are treated as two separate variants for comparison: in this example, a TA -> T deletion and a G -> C substitution, rather than a combined AG -> C deletion and nucleotide change.
This process also handles left-aligning indels in repetitive regions using GATK's LeftAlignVariants tool.
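To show what left-aligning an indel in a repeat involves, here is a minimal sketch of the general trim-and-extend approach, in the spirit of the normalization write-up for vt. It is not GATK's implementation. The function name, the in-memory reference string, and the 1-based `pos` convention are assumptions for this example, and it handles a single REF/ALT pair only.

```python
def left_align(reference, pos, ref, alt):
    """Shift an indel as far left as possible.

    reference: reference sequence as a string (assumed to fit in memory)
    pos: 1-based position of the first REF base
    Returns (pos, ref, alt) for the left-aligned representation.
    """
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            if min(len(ref), len(alt)) == 1 and pos == 1:
                break  # nothing to the left to borrow from
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, borrow the preceding reference base.
        if not ref or not alt:
            pos -= 1
            base = reference[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Remove redundant shared leading bases, keeping one anchor base.
    while len(ref) >= 2 and len(alt) >= 2 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# A CA deletion written at the right end of a CA repeat moves to its left end.
print(left_align("GGGCACACACAGGG", 9, "ACA", "A"))  # (3, 'GCA', 'G')
```

An already left-aligned variant passes through unchanged, so the function can be applied to every indel without checking first.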
-
Trims extra reference base padding in indels. Some callers will add extra padded reference bases on indels. We remove these extra bases and adjust coordinates correctly to keep a single matching reference base. So a padded variant like:
1 237528 . CAAAAAAAAAAAAAAAA CAAAAAAAAAAAAAAAAAAAAA
will be:
1 237544 . A AAAAAA
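The trimming step can be sketched as follows. This is a minimal example that reproduces the case above by stripping shared leading bases; `trim_padding` is a hypothetical name, and the rule of stopping while a shared base is still left at the front is an assumption about how "a single matching reference base" is kept.

```python
def trim_padding(pos, ref, alt):
    """Strip shared leading padding from an indel, keeping one shared anchor base.

    A leading base is removed only while the base after it is also shared,
    so the base left at the front of REF and ALT still matches.
    """
    while len(ref) > 1 and len(alt) > 1 and ref[:2] == alt[:2]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# The padded insertion from the example: C plus 16 As becomes C plus 21 As.
padded_ref = "C" + "A" * 16
padded_alt = "C" + "A" * 21
print(trim_padding(237528, padded_ref, padded_alt))  # (237544, 'A', 'AAAAAA')
```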