Harmonising a large number of sumstats
The start_harmonisation.sh script demonstrates how to run the harmonisation pipeline for a single summary statistics file. To harmonise a large number of files, you can run each sumstat as an independent Nextflow job in a loop, use the --harm and --list options to harmonise all files listed in a text file, or use the --gwascatalog and --all_harm_folder options to harmonise all files in a folder.
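The first approach, one independent Nextflow job per file, could be sketched roughly as follows. This is an illustration rather than the project's own script: the function name harmonise_each and the NEXTFLOW variable are hypothetical, $version and $ref are assumed to be set as in the commands in this document, and --file is assumed to be the single-file input flag (check it against start_harmonisation.sh before use).

```shell
#!/usr/bin/env bash
# Run one independent Nextflow job per summary statistics file.
# Assumptions: $version and $ref are already set in the environment;
# --file is ASSUMED to be the single-file input flag (verify it in
# start_harmonisation.sh). Set NEXTFLOW=echo to print the commands
# instead of running them.
harmonise_each() {
    local nf="${NEXTFLOW:-nextflow}"
    local failed=0
    local sumstat
    for sumstat in "$@"; do
        if ! "$nf" run EBISPOT/gwas-sumstats-harmoniser -r "$version" \
            --ref "$ref" \
            --harm \
            --file "$sumstat" \
            -with-trace \
            -profile executor,singularity; then
            # Keep going: one failing file should not stop the others.
            echo "FAILED: $sumstat" >&2
            failed=$((failed + 1))
        fi
    done
    # Return non-zero if any job failed.
    [ "$failed" -eq 0 ]
}
```

For example, `NEXTFLOW=echo harmonise_each /data/gwas_1.tsv /data/gwas_2.tsv` prints the two commands without running them. The jobs run one after another, so a failure in one file is reported but does not prevent the remaining files from being submitted.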
Harmonising files from a list
Preparing the list file
You can use a text file containing the full path of each summary statistics file as input for the --list option. For example, in the file to_harmonise_files_list.txt, each line should contain the full path to one summary statistics file. The pipeline automatically looks for the corresponding YAML file in the same directory (e.g., full_path_to_gwas_sumstat_1-meta.yaml).
full_path_to_gwas_sumstat_1
full_path_to_gwas_sumstat_2
full_path_to_gwas_sumstat_3
full_path_to_gwas_sumstat_4
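A list file like the one above could be generated with a small helper such as the sketch below. The function name build_harm_list and the .tsv / .tsv.gz file patterns are assumptions made for illustration, so adjust them to match your own file names. Files without a matching -meta.yaml next to them are skipped and reported.

```shell
#!/usr/bin/env bash
# Write the full path of every sumstat under a directory to a list file,
# skipping files that have no <sumstat>-meta.yaml next to them.
# ASSUMPTION: sumstats end in .tsv or .tsv.gz (adjust the patterns as needed).
build_harm_list() {
    local input_dir list_file="$2"
    # Resolve to an absolute path so every entry in the list is a full path.
    input_dir=$(cd "$1" && pwd) || return 1
    : > "$list_file"
    find "$input_dir" -type f \( -name '*.tsv' -o -name '*.tsv.gz' \) | sort |
    while IFS= read -r sumstat; do
        if [ -f "${sumstat}-meta.yaml" ]; then
            printf '%s\n' "$sumstat" >> "$list_file"
        else
            echo "skipping (no YAML file found): $sumstat" >&2
        fi
    done
}
```

Calling `build_harm_list /path/to/sumstats to_harmonise_files_list.txt` would then produce a file suitable for the --list option.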
Running the pipeline using --list
To harmonise all files listed in to_harmonise_files_list.txt, pass the path of the list file to the --list option.
nextflow run EBISPOT/gwas-sumstats-harmoniser -r $version \
--ref 'full path to store reference' \
--harm \
--list to_harmonise_files_list.txt \
-with-trace \
-profile executor,singularity
Harmonising files from a folder
The --gwascatalog mode is designed for the daily harmonisation of GWAS Catalog data. It includes an additional step that moves the final results to the FTP folder specified by the --ftp option. If all variants in a file fail to harmonise, the file is marked as a "failing harmonisation file", and only the corresponding log file is moved to the FTP folder.
This mode requires the --all_harm_folder option, and it automatically harmonises all files within the specified all_harm_folder folder.
nextflow run EBISPOT/gwas-sumstats-harmoniser -r $version \
--ref $ref \
--gwascatalog \
--all_harm_folder $all_harm_folder \
--ftp $path_to_store_final_result \
-with-trace \
-profile executor,singularity
The --gwascatalog option automatically deletes the input file and any intermediate files, retaining only the final result in the FTP folder. Make sure the files in all_harm_folder are copies, not your only originals.
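Because the inputs are deleted, one cautious way to prepare the folder is to copy the files into it first. The helper below is a minimal sketch; the name stage_copies is hypothetical, and it assumes the sumstats and their YAML files sit directly in one source directory.

```shell
#!/usr/bin/env bash
# Copy every regular file from a source directory into a staging folder,
# leaving the originals untouched. The staging folder can then be passed
# to --all_harm_folder.
stage_copies() {
    local source_dir="$1" staging_dir="$2"
    mkdir -p "$staging_dir"
    # cp -p preserves timestamps and permissions; nothing is moved or deleted.
    find "$source_dir" -maxdepth 1 -type f -exec cp -p {} "$staging_dir"/ \;
}
```

For example, `stage_copies /data/originals "$all_harm_folder"` fills the folder with copies before the pipeline is started.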
Handling multiple files in one job
When running multiple files in one Nextflow job, the pipeline is designed to ignore errors generated by individual files. If a file encounters an error, its process will halt, but it will not affect the progress of other summary statistics in the same job. Since different files may progress at different rates, we recommend using the -with-trace option to generate a trace file. This allows you to monitor the progress of each file individually.
Example of a trace file: trace-20241018-61875575.txt
| task_id | hash | native_id | name | status | exit | submit | duration | realtime | %cpu | peak_rss | peak_vmem | rchar | wchar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | a5/4abfe2 | 23589 | NFCORE_GWASCATALOGHARM:GWASCATALOGHARM:major_direction:map_to_build (random_name) | COMPLETED | 0 | 2024-10-18 17:11:27.279 | 5.8s | 4.7s | 37.1% | 131.5 MB | 593.1 MB | 11.7 MB | 1.6 KB |
| 2 | 51/6248c5 | 24169 | NFCORE_GWASCATALOGHARM:GWASCATALOGHARM:major_direction:ten_percent_counts (random_name_chr22) | COMPLETED | 0 | 2024-10-18 17:11:33.282 | 4.2s | 3.5s | 43.2% | 111.1 MB | 668.3 MB | 20.7 MB | 1.2 KB |
Each row in the table represents a single process executed on a specific summary statistics (sumstat) file. The name column provides details about the specific process and the sumstat being processed, allowing you to track the progress and performance of each step in the harmonisation workflow for individual sumstat files.
For more details, please refer to the Nextflow documentation.
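Since errors in individual files do not stop the job, it can be useful to scan the trace file for tasks that did not finish. The sketch below assumes the trace file is tab-separated with a header row, which is Nextflow's default trace layout; the function name failed_tasks is hypothetical. Columns are looked up by header name, so their order does not matter.

```shell
#!/usr/bin/env bash
# Print name, status and exit code of every task whose status is neither
# COMPLETED nor CACHED (CACHED appears when a run is resumed).
# ASSUMPTION: the trace file is tab-separated with a header row.
failed_tasks() {
    awk -F'\t' '
        # Header row: remember the position of each column by name.
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        $col["status"] != "COMPLETED" && $col["status"] != "CACHED" {
            print $col["name"] "\t" $col["status"] "\t" $col["exit"]
        }
    ' "$1"
}
```

Running `failed_tasks trace-20241018-61875575.txt` on the example above would print nothing, because every task completed. The sumstat name in parentheses at the end of the name column identifies which file a reported task belongs to.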
Example of a full trace table for one sumstat:
| task_id | hash | native_id | name | status | exit | submit | duration | realtime | %cpu | peak_rss | peak_vmem | rchar | wchar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | a5/4abfe2 | 23589 | NFCORE_GWASCATALOGHARM:GWASCATALOGHARM:major_direction:map_to_build (random_name) | COMPLETED | 0 | 2024-10-18 17:11:27.279 | 5.8s | 4.7s | 37.1% | 131.5 MB | 593.1 MB | 11.7 MB | 1.6 KB |
| 3 | ed/c33b4b | 24192 | NFCORE_GWASCATALOGHARM:GWASCATALOGHARM:major_direction:ten_percent_counts (random_name_chr1) | COMPLETED | 0 | 2024-10-18 17:11:33.383 | 4s | 3.4s | 40.9% | 112 MB | 668.3 MB | 20.7 MB | 796 B |
| 2 | 51/6248c5 | 24169 | NFCORE_GWASCATALOGHARM:GWASCATALOGHARM:major_direction:ten_percent_counts (random_name_chr22) | COMPLETED | 0 | 2024-10-18 17:11:33.282 | 4.2s | 3.5s | 43.2% | 111.1 MB | 668.3 MB | 20.7 MB | 1.2 KB |
| 4 | 8b/e5118d | 24883 | NFCORE_GWASCATALOGHARM:GWASCATALOGHARM:major_direction:ten_percent_counts_sum (random_name) | COMPLETED | 0 | 2024-10-18 17:11:37.579 | 1.3s | 953ms | 88.4% | 9 MB | 14.8 MB | 9.9 MB | 666 B |
| 5 | 25/1f2003 | 25055 | NFCORE_GWASCATALOGHARM:GWASCATALOGHARM:main_harm:harmonization (random_name_chr1) | COMPLETED | 0 | 2024-10-18 17:11:38.985 | 2.1s | 1.7s | 86.1% | 18.8 MB | 89.5 MB | 20.7 MB | 2.2 KB |
| 6 | 1a/d1d23e | 25305 | NFCORE_GWASCATALOGHARM:GWASCATALOGHARM:main_harm:harmonization (random_name_chr22) | COMPLETED | 0 | 2024-10-18 17:11:41.136 | 1.9s | 1.7s | 88.6% | 16.7 MB | 87.8 MB | 20.7 MB | 1.3 KB |
| 7 | f6/d879a1 | 25551 | NFCORE_GWASCATALOGHARM:GWASCATALOGHARM:main_harm:concatenate_chr_splits (random_name) | COMPLETED | 0 | 2024-10-18 17:11:43.113 | 438ms | 23ms | 57.1% | 3.2 MB | 5.4 MB | 64.1 KB | 1.3 KB |
| 8 | c1/cba6ce | 25688 | NFCORE_GWASCATALOGHARM:GWASCATALOGHARM:quality_control:qc (random_name) | COMPLETED | 0 | 2024-10-18 17:11:43.605 | 948ms | 570ms | 51.9% | 8.9 MB | 14.8 MB | 2.9 MB | 2.1 KB |
| 9 | 73/4fcc1a | 25922 | NFCORE_GWASCATALOGHARM:GWASCATALOGHARM:quality_control:harmonization_log (random_name) | COMPLETED | 0 | 2024-10-18 17:11:44.583 | 2.2s | 1.8s | 61.8% | 97.1 MB | 574.5 MB | 11.5 MB | 25.5 KB |