Preparing Reference Files
A reference file is a dataset that contains detailed information about known genetic variants, including at least their genomic positions, reference alleles, and alternative alleles. These reference files are crucial for variant harmonisation, as they determine both whether and how your variants will be harmonised.
You can prepare the reference files needed in the harmonisation pipeline by either downloading them from our FTP server OR creating your own custom reference files.
Download from FTP
The current reference files used for harmonisation by the GWAS Catalog are variations in VCF format from Ensembl release 95 (released in January 2019). To run the pipeline, you need both the VCF
and corresponding tabix
files, as well as parquet
files.
Example of reference files
./resources
|-- homo_sapiens-chr1.parquet
|-- homo_sapiens-chr1.vcf.gz
|-- homo_sapiens-chr1.vcf.gz.tbi
These files can be directly downloaded from our FTP server.
wget -r --no-parent --no-directories --continue ftp://ftp.ebi.ac.uk/pub/databases/gwas/harmonisation_resources/
Prepare your reference files
If you prefer to use your own VCF files as a reference, we recommend chunking your large VCF by chromosome to facilitate parallel processing in subsequent steps (you can use bcftools
for this) and should name your VCF files using the pattern homo_sapiens-chr${chr}.vcf.gz
. You can then prepare your own reference by running our pipeline.
For example, to use the Ensembl release 109, run the following command:
nextflow run EBISPOT/gwas-sumstats-harmoniser -r $version \
--reference \
--ref 'full_path_to_save_reference' \
--remote_vcf_location 'ftp://ftp.ensembl.org/pub/release-109/variation/vcf/homo_sapiens/' \
-profile executor,singularity
Parameters Explained:
Parameter | Description |
---|---|
--reference | Prepares the reference model. Downloads VCF files matching homo_sapiens-${chr}.vcf.gz from --remote_vcf_location , then generates tabix and parquet files. |
--ref | Specifies the directory where reference files will be stored. |
--remote_vcf_location | Defines the source of the reference VCF files. Default is ftp://ftp.ensembl.org/pub/current_variation/vcf/homo_sapiens . In this case, it points to Ensembl release 109. |
-profile | A profile is a set of configuration attributes that can be selected during pipeline execution. Available profiles include test (quick run on local), standard (local executor), and executor (based on ./config/executor.config with default SLURM). Container options are conda , docker , and singularity . |
-chrom | Runs the pipeline for a specific chromosome. For example, to download only homo_sapiens-chr22.vcf.gz , use --chrom 22 |
-chromlist | Runs the pipeline for multiple chromosomes. For example, use --chromlist 22,X,Y to prepare reference for chr22, chrX and chrY |
Preparing these references generally requires a large amount of memory. If you choose to run the pipeline with your own reference files, we recommend using an HPC environment. For example, preparing references for Ensembl release 95 requires around 50 GB of memory.