Harmonising a large number of sumstats


The start_harmonisation.sh script demonstrates how to run the harmonisation pipeline for a single summary statistics file. To harmonise a large number of files, you have three options: run each sumstat as an independent Nextflow job in a loop, use the --harm and --list options to harmonise all files listed in a text file, or use the --gwascatalog and --all_harm_folder options to harmonise all files in a folder.
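The loop approach can be sketched as follows. The single-file invocation used by start_harmonisation.sh is not shown in this section, so this sketch assumes that a one-line list file passed to --list is an acceptable way to submit a single sumstat. The function name harmonise_each and the NEXTFLOW override variable are hypothetical, and executor remains a placeholder for your own profile.

```shell
# Sketch: one Nextflow run per sumstat, each driven by a temporary one-line list file.
# Assumption: a list containing a single path is valid input for --list.
harmonise_each() {
    local version="$1"
    local ref="$2"
    shift 2
    local f one_list

    for f in "$@"; do
        one_list="$(mktemp)" || return 1
        printf '%s\n' "$f" > "$one_list"

        # ${NEXTFLOW:-nextflow} lets a dry run substitute `echo` for the real binary.
        ${NEXTFLOW:-nextflow} run EBISPOT/gwas-sumstats-harmoniser -r "$version" \
            --ref "$ref" \
            --harm \
            --list "$one_list" \
            -with-trace \
            -profile executor,singularity \
            || printf 'run failed for %s\n' "$f" >&2

        rm -f "$one_list"
    done
}
```

Calling NEXTFLOW=echo harmonise_each "$version" "$ref" file_1 file_2 prints the commands instead of running them, which is a cheap way to check the arguments before submitting real jobs.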

Harmonising files from a list

Preparing the list file

The --list option takes a text file in which each line is the full path to one summary statistics file, for example to_harmonise_files_list.txt. For each entry, the pipeline automatically looks for the corresponding YAML metadata file in the same directory, named after the sumstat with a -meta.yaml suffix (e.g., full_path_to_gwas_sumstat_1-meta.yaml).

Example of to_harmonise_files_list.txt
full_path_to_gwas_sumstat_1
full_path_to_gwas_sumstat_2
full_path_to_gwas_sumstat_3
full_path_to_gwas_sumstat_4
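A list like this can be generated rather than written by hand. The following is a minimal sketch, not part of the pipeline: the function name build_harm_list and the default *.tsv.gz pattern are assumptions for illustration. It keeps only files that have a companion -meta.yaml next to them, since the pipeline expects to find one.

```shell
# Sketch: write the absolute path of every sumstat in a directory to a list file,
# keeping only files whose companion "<file>-meta.yaml" sits alongside them.
# The function name and the "*.tsv.gz" default pattern are assumptions.
build_harm_list() {
    local input_dir="$1"            # directory holding the sumstats
    local list_file="$2"            # e.g. to_harmonise_files_list.txt
    local pattern="${3:-*.tsv.gz}"  # assumed file-name pattern
    local abs_dir f

    abs_dir="$(cd "$input_dir" && pwd)" || return 1
    : > "$list_file"                # start from an empty list

    for f in "$abs_dir"/$pattern; do
        [ -f "$f" ] || continue     # the pattern matched nothing
        if [ -f "${f}-meta.yaml" ]; then
            printf '%s\n' "$f" >> "$list_file"
        else
            printf 'skipping %s: no -meta.yaml found\n' "$f" >&2
        fi
    done
}
```

For example, build_harm_list /path/to/sumstats to_harmonise_files_list.txt writes one full path per line and reports skipped files on standard error.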

Running the pipeline using the --list option

To harmonise every file listed in to_harmonise_files_list.txt, pass the full path of the list file to the --list option.

nextflow run EBISPOT/gwas-sumstats-harmoniser -r $version \
--ref 'full path to store reference' \
--harm \
--list to_harmonise_files_list.txt \
-with-trace \
-profile executor,singularity

Harmonising files from a folder

The --gwascatalog mode is designed for the daily harmonisation of GWAS Catalog data. It includes an additional step that moves the final results to the FTP folder specified by the --ftp option. If all variants in a file fail to harmonise, the file is flagged as a "failing harmonisation file", and only its log file is moved to the FTP folder.

This mode takes its input from the --all_harm_folder option and automatically harmonises every file in the specified folder.

nextflow run EBISPOT/gwas-sumstats-harmoniser -r $version \
--ref $ref \
--gwascatalog \
--all_harm_folder $all_harm_folder \
--ftp $path_to_store_final_result \
-with-trace \
-profile executor,singularity

Warning

The --gwascatalog option automatically deletes the input files and any intermediate files, retaining only the final result in the FTP folder. Make sure the files in all_harm_folder are copies.
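One way to follow this advice is to stage copies in a scratch folder before each run. This is a minimal sketch under stated assumptions: the function name stage_copies and the *.tsv.gz pattern are hypothetical. Copying the -meta.yaml files alongside the sumstats is also an assumption, based on the list mode's requirement that the YAML file sits in the same directory.

```shell
# Sketch: copy sumstats (and their metadata, if present) into a scratch folder
# so that --gwascatalog only ever deletes the copies, never the originals.
# The function name and the "*.tsv.gz" default pattern are assumptions.
stage_copies() {
    local source_dir="$1"           # where the originals live
    local all_harm_folder="$2"      # folder later passed to --all_harm_folder
    local pattern="${3:-*.tsv.gz}"  # assumed file-name pattern
    local f

    mkdir -p "$all_harm_folder" || return 1
    for f in "$source_dir"/$pattern; do
        [ -f "$f" ] || continue     # the pattern matched nothing
        cp -p "$f" "$all_harm_folder"/ || return 1
        if [ -f "${f}-meta.yaml" ]; then
            cp -p "${f}-meta.yaml" "$all_harm_folder"/ || return 1
        fi
    done
}
```

After stage_copies /path/to/originals "$all_harm_folder", the originals remain untouched whatever the pipeline deletes.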

Handling multiple files in one job

When running multiple files in one Nextflow job, the pipeline is designed to ignore errors generated by individual files. If a file encounters an error, its process will halt, but it will not affect the progress of other summary statistics in the same job. Since different files may progress at different rates, we recommend using the -with-trace option to generate a trace file. This allows you to monitor the progress of each file individually.

Example of a trace file: trace-20241018-61875575.txt
| task_id | hash | native_id | name | status | exit | submit | duration | realtime | %cpu | peak_rss | peak_vmem | rchar | wchar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | a5/4abfe2 | 23589 | NFCORE_GWASCATALOGHARM:GWASCATALOGHARM:major_direction:map_to_build (random_name) | COMPLETED | 0 | 2024-10-18 17:11:27.279 | 5.8s | 4.7s | 37.1% | 131.5 MB | 593.1 MB | 11.7 MB | 1.6 KB |
| 2 | 51/6248c5 | 24169 | NFCORE_GWASCATALOGHARM:GWASCATALOGHARM:major_direction:ten_percent_counts (random_name_chr22) | COMPLETED | 0 | 2024-10-18 17:11:33.282 | 4.2s | 3.5s | 43.2% | 111.1 MB | 668.3 MB | 20.7 MB | 1.2 KB |

Each row in the table represents a single process executed on a specific summary statistics (sumstat) file. The name column provides details about the specific process and the sumstat being processed, allowing you to track the progress and performance of each step in the harmonisation workflow for individual sumstat files.

For more details on trace files, please refer to the Nextflow documentation.
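Because the sumstat name appears in parentheses in the name column, a trace file can be condensed into one line per sumstat. The sketch below assumes the trace file is tab-separated with a header row, as Nextflow writes it. It also assumes that per-chromosome tasks carry a _chr suffix on the sumstat name, as in the example above, so a sumstat whose own name ends in _chr followed by some text would be mis-grouped. The function name summarise_trace is hypothetical.

```shell
# Sketch: count tasks and non-COMPLETED tasks per sumstat in a Nextflow trace file.
# Assumptions: tab-separated file with a header row; the sumstat name is the text
# in parentheses in the "name" column, optionally followed by a "_chr<N>" suffix.
summarise_trace() {
    awk -F'\t' '
        NR == 1 {
            # Map column names to positions so column order does not matter.
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        {
            tag = $(col["name"])
            if (tag !~ /\(.*\)/) next       # no "(sumstat)" tag on this task
            sub(/^.*\(/, "", tag)           # drop everything up to "("
            sub(/\).*$/, "", tag)           # drop ")" and anything after it
            sub(/_chr[^_]*$/, "", tag)      # fold per-chromosome tasks together
            total[tag]++
            if ($(col["status"]) != "COMPLETED") bad[tag]++
        }
        END {
            for (t in total)
                printf "%s\ttasks=%d\tnot_completed=%d\n", t, total[t], bad[t] + 0
        }
    ' "$1" | sort
}
```

Running summarise_trace trace-20241018-61875575.txt on the example file would list random_name once, with every chromosome-level task counted under it.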

Example of a full trace table for one sumstat:

| task_id | hash | native_id | name | status | exit | submit | duration | realtime | %cpu | peak_rss | peak_vmem | rchar | wchar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | a5/4abfe2 | 23589 | NFCORE_GWASCATALOGHARM:GWASCATALOGHARM:major_direction:map_to_build (random_name) | COMPLETED | 0 | 2024-10-18 17:11:27.279 | 5.8s | 4.7s | 37.1% | 131.5 MB | 593.1 MB | 11.7 MB | 1.6 KB |
| 3 | ed/c33b4b | 24192 | NFCORE_GWASCATALOGHARM:GWASCATALOGHARM:major_direction:ten_percent_counts (random_name_chr1) | COMPLETED | 0 | 2024-10-18 17:11:33.383 | 4s | 3.4s | 40.9% | 112 MB | 668.3 MB | 20.7 MB | 796 B |
| 2 | 51/6248c5 | 24169 | NFCORE_GWASCATALOGHARM:GWASCATALOGHARM:major_direction:ten_percent_counts (random_name_chr22) | COMPLETED | 0 | 2024-10-18 17:11:33.282 | 4.2s | 3.5s | 43.2% | 111.1 MB | 668.3 MB | 20.7 MB | 1.2 KB |
| 4 | 8b/e5118d | 24883 | NFCORE_GWASCATALOGHARM:GWASCATALOGHARM:major_direction:ten_percent_counts_sum (random_name) | COMPLETED | 0 | 2024-10-18 17:11:37.579 | 1.3s | 953ms | 88.4% | 9 MB | 14.8 MB | 9.9 MB | 666 B |
| 5 | 25/1f2003 | 25055 | NFCORE_GWASCATALOGHARM:GWASCATALOGHARM:main_harm:harmonization (random_name_chr1) | COMPLETED | 0 | 2024-10-18 17:11:38.985 | 2.1s | 1.7s | 86.1% | 18.8 MB | 89.5 MB | 20.7 MB | 2.2 KB |
| 6 | 1a/d1d23e | 25305 | NFCORE_GWASCATALOGHARM:GWASCATALOGHARM:main_harm:harmonization (random_name_chr22) | COMPLETED | 0 | 2024-10-18 17:11:41.136 | 1.9s | 1.7s | 88.6% | 16.7 MB | 87.8 MB | 20.7 MB | 1.3 KB |
| 7 | f6/d879a1 | 25551 | NFCORE_GWASCATALOGHARM:GWASCATALOGHARM:main_harm:concatenate_chr_splits (random_name) | COMPLETED | 0 | 2024-10-18 17:11:43.113 | 438ms | 23ms | 57.1% | 3.2 MB | 5.4 MB | 64.1 KB | 1.3 KB |
| 8 | c1/cba6ce | 25688 | NFCORE_GWASCATALOGHARM:GWASCATALOGHARM:quality_control:qc (random_name) | COMPLETED | 0 | 2024-10-18 17:11:43.605 | 948ms | 570ms | 51.9% | 8.9 MB | 14.8 MB | 2.9 MB | 2.1 KB |
| 9 | 73/4fcc1a | 25922 | NFCORE_GWASCATALOGHARM:GWASCATALOGHARM:quality_control:harmonization_log (random_name) | COMPLETED | 0 | 2024-10-18 17:11:44.583 | 2.2s | 1.8s | 61.8% | 97.1 MB | 574.5 MB | 11.5 MB | 25.5 KB |