Skip to main content

Preparing Reference Files

A reference file is a dataset that contains detailed information about known genetic variants, including at least their genomic positions, reference alleles, and alternative alleles. These reference files are crucial for variant harmonisation, as they determine both whether and how your variants will be harmonised.

You can prepare the reference files needed in the harmonisation pipeline by either downloading them from our FTP server OR creating your own custom reference files.

Download from FTP

The current reference files used for harmonisation by the GWAS Catalog are variations in VCF format from Ensembl release 95 (released in January 2019). To run the pipeline, you need both the VCF and corresponding tabix files, as well as parquet files.

Example of reference files
./resources
|-- homo_sapiens-chr1.parquet
|-- homo_sapiens-chr1.vcf.gz
|-- homo_sapiens-chr1.vcf.gz.tbi

These files can be directly downloaded from our FTP server.

wget -r --no-parent --no-directories --continue ftp://ftp.ebi.ac.uk/pub/databases/gwas/harmonisation_resources/

Prepare your reference files

If you prefer to use your own VCF files as a reference, we recommend chunking your large VCF by chromosome to facilitate parallel processing in subsequent steps (you can use bcftools for this) and should name your VCF files using the pattern homo_sapiens-chr${chr}.vcf.gz. You can then prepare your own reference by running our pipeline.

For example, to use the Ensembl release 109, run the following command:

nextflow run  EBISPOT/gwas-sumstats-harmoniser -r $version \
--reference \
--ref 'full_path_to_save_reference' \
--remote_vcf_location 'ftp://ftp.ensembl.org/pub/release-109/variation/vcf/homo_sapiens/' \
-profile executor,singularity

Parameters Explained:

ParameterDescription
--referencePrepares the reference model. Downloads VCF files matching homo_sapiens-${chr}.vcf.gz from --remote_vcf_location, then generates tabix and parquet files.
--refSpecifies the directory where reference files will be stored.
--remote_vcf_locationDefines the source of the reference VCF files. Default is ftp://ftp.ensembl.org/pub/current_variation/vcf/homo_sapiens. In this case, it points to Ensembl release 109.
-profileA profile is a set of configuration attributes that can be selected during pipeline execution. Available profiles include test (quick run on local), standard (local executor), and executor (based on ./config/executor.config with default SLURM). Container options are conda, docker, and singularity.
-chromRuns the pipeline for a specific chromosome. For example, to download only homo_sapiens-chr22.vcf.gz, use --chrom 22
-chromlistRuns the pipeline for multiple chromosomes. For example, use --chromlist 22,X,Y to prepare reference for chr22, chrX and chrY

Preparing these references generally requires a large amount of memory. If you choose to run the pipeline with your own reference files, we recommend using an HPC environment. For example, preparing references for Ensembl release 95 requires around 50 GB of memory.