📄️ Overview of the Harmonisation Pipeline
The gwas-sumstats-harmoniser is a pipeline designed to standardise variant data across different genome assemblies, ensuring consistency for downstream analysis. This process involves four key steps:
📄️ Genome Build Mapping
The first step in harmonising variant data is updating the genomic coordinates to the desired assembly (GRCh38). The pipeline follows a systematic approach to ensure high-confidence mapping of each variant's position:
📄️ Orientation of palindromic variants
Palindromic variants are genetic variants (such as A/T or G/C SNPs) whose alleles are complementary to each other, so they appear identical on the forward and reverse DNA strands and their true orientation cannot be read from the alleles alone. To ensure that these variants are correctly aligned during harmonisation, the pipeline infers their strand orientation using a strand consensus approach.
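As a minimal illustration of the definition above, the following Python sketch flags a biallelic SNP as palindromic when its two alleles are complements of each other. The function name `is_palindromic` is hypothetical and is not taken from the pipeline's code. The sketch only detects palindromic variants and does not implement the strand consensus step.

```python
# Complementary base pairs on double-stranded DNA.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_palindromic(ref: str, alt: str) -> bool:
    """Return True if a single-base variant has complementary alleles
    (A/T, T/A, C/G or G/C). Hypothetical helper for illustration only."""
    ref, alt = ref.upper(), alt.upper()
    # Only single-nucleotide alleles are considered here; indels return False.
    if len(ref) != 1 or len(alt) != 1:
        return False
    # Unknown bases (e.g. "N") have no entry in COMPLEMENT and return False.
    return COMPLEMENT.get(ref) == alt


if __name__ == "__main__":
    for ref, alt in [("A", "T"), ("G", "C"), ("A", "G")]:
        print(ref, alt, is_palindromic(ref, alt))
```

An A/G SNP is not palindromic because its reverse-strand form (T/C) is distinguishable from the forward-strand form, whereas an A/T SNP reads as T/A on the reverse strand, which is indistinguishable from the same alleles in swapped order.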
📄️ Harmonising the variants
Each variant is harmonised by aligning it with a reference dataset, specifically the Ensembl VCF reference. This process ensures allele consistency and corrects the orientation of alleles to match the forward strand. Proper harmonisation ensures that all variants are aligned correctly and ready for downstream analysis.
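The idea of comparing a variant's alleles against a reference record can be sketched as follows. This is an assumption-laden illustration, not the pipeline's implementation: the function names `reverse_complement` and `classify_orientation`, the argument names, and the returned labels are all hypothetical.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(allele: str) -> str:
    """Return the reverse complement of an allele, e.g. 'AC' -> 'GT'."""
    return allele.upper().translate(_COMPLEMENT)[::-1]


def classify_orientation(other: str, effect: str, ref: str, alt: str) -> str:
    """Compare a variant's (other, effect) alleles with a reference (ref, alt).

    Hypothetical labels:
      forward          alleles match the reference as given
      forward_swapped  alleles match once other/effect are swapped
      reverse          alleles match after reverse complementing
      reverse_swapped  alleles match after reverse complementing and swapping
      no_match         no combination matches the reference
    """
    other, effect, ref, alt = (a.upper() for a in (other, effect, ref, alt))
    if (other, effect) == (ref, alt):
        return "forward"
    if (other, effect) == (alt, ref):
        return "forward_swapped"
    rc = (reverse_complement(other), reverse_complement(effect))
    if rc == (ref, alt):
        return "reverse"
    if rc == (alt, ref):
        return "reverse_swapped"
    return "no_match"


if __name__ == "__main__":
    print(classify_orientation("T", "C", "A", "G"))
```

Because the forward checks run first, a palindromic variant such as A/T against an A/T reference is always labelled `forward` by this sketch, even if it was actually reported on the reverse strand. Allele comparison alone cannot resolve that case, which is why the orientation of palindromic variants is inferred separately.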
📄️ Quality control
The final step of the pipeline is quality control (QC), which ensures that only complete and reliable records are retained in the harmonised result. The QC process filters out variants that lack valid values for the following fields: