Harmonising a large number of sumstats
The start_harmonisation.sh script demonstrates how to run the harmonisation pipeline for a single summary statistics file. To harmonise a large number of files, you can run each sumstat as an independent Nextflow job in a loop, use the --harm and --list options to harmonise all files listed in a text file, or use the --gwascatalog and --all_harm_folder options to harmonise all files in a folder.
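The loop approach can be sketched as a small Bash helper. This is an illustration only: harmonise_each is a hypothetical name, and the per-file command (for example, your own wrapper around start_harmonisation.sh) is something you supply yourself. The sketch assumes that the command accepts the sumstat path as its last argument and that the input is a text file with one path per line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a single-file harmonisation command once per sumstat.
# Usage: harmonise_each LIST_FILE COMMAND [ARGS...]
# COMMAND is invoked as: COMMAND [ARGS...] /full/path/to/sumstat
harmonise_each() {
    local list=$1
    shift
    local sumstat failed=0
    # Read the list on file descriptor 3 so the command cannot consume it via stdin.
    while IFS= read -r sumstat <&3 || [ -n "$sumstat" ]; do
        [ -n "$sumstat" ] || continue      # skip blank lines
        if ! "$@" "$sumstat"; then
            # Keep going: one failing file should not stop the others.
            echo "FAILED: $sumstat" >&2
            failed=$((failed + 1))
        fi
    done 3< "$list"
    # Return 1 if any file failed, 0 otherwise.
    return "$(( failed > 0 ))"
}
```

Jobs run one after another here, and a failure is reported on stderr without stopping the loop. Whether running jobs in parallel is safe depends on your executor and working directories, so that is left out of the sketch.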
Harmonising files from a List
Preparing the list file
You can pass a text file containing the full path of each summary statistics file to the --list option. For example, in the file to_harmonise_files_list.txt, each line should contain the full path to one summary statistics file. The pipeline will automatically find the corresponding YAML file in the same directory (e.g., full_path_to_gwas_sumstat_1-meta.yaml).
full_path_to_gwas_sumstat_1
full_path_to_gwas_sumstat_2
full_path_to_gwas_sumstat_3
full_path_to_gwas_sumstat_4
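If the sumstats all live in one folder, the list file can be generated rather than typed by hand. The following Bash sketch is not part of the pipeline: list_sumstats is a hypothetical helper, and the default *.tsv.gz pattern is an assumption about how your files are named. It prints the full path of each sumstat that has a -meta.yaml file next to it and warns about those that do not.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the full path of every sumstat in a folder that has
# a companion <sumstat>-meta.yaml; warn on stderr about the ones that do not.
# Usage: list_sumstats FOLDER [GLOB_PATTERN]
list_sumstats() {
    local dir pattern=${2:-*.tsv.gz}       # default pattern is an assumption
    dir=$(cd "$1" && pwd) || return 1      # resolve to an absolute path
    local f
    for f in "$dir"/$pattern; do
        [ -f "$f" ] || continue            # also covers a pattern with no matches
        case $f in *-meta.yaml) continue ;; esac   # never list the YAML files
        if [ -f "${f}-meta.yaml" ]; then
            printf '%s\n' "$f"
        else
            echo "WARNING: no ${f}-meta.yaml, skipping" >&2
        fi
    done
}
```

For example, `list_sumstats /path/to/sumstats > to_harmonise_files_list.txt` writes one full path per line, which is the layout shown above.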
Running the pipeline using the --list option
To harmonise all files listed in to_harmonise_files_list.txt, pass the full path of the list file to the --list option.
nextflow run EBISPOT/gwas-sumstats-harmoniser -r $version \
--ref 'full path to store reference' \
--harm \
--list to_harmonise_files_list.txt \
-with-trace \
-profile executor,singularity
Harmonising files from a folder
The --gwascatalog mode is designed for the daily harmonisation of GWAS Catalog data. It includes an additional step that moves the final results to the FTP folder specified by the --ftp option. If all variants in a file fail to harmonise, the file is flagged as a "failing harmonisation file", and only the corresponding log file is moved to the FTP folder.
This mode requires the --all_harm_folder option and automatically harmonises all files within the specified folder.
nextflow run EBISPOT/gwas-sumstats-harmoniser -r $version \
--ref $ref \
--gwascatalog \
--all_harm_folder $all_harm_folder \
--ftp $path_to_store_final_result \
-with-trace \
-profile executor,singularity
The --gwascatalog option automatically deletes the input files and any intermediate files, retaining only the final results in the FTP folder. Please ensure that the files in all_harm_folder are copies.
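Because the inputs are deleted, it can help to stage copies in a dedicated folder before each run. The Bash sketch below shows one way to do that. stage_copies is a hypothetical helper, and copying the -meta.yaml file alongside each sumstat assumes that the naming convention described for --list also applies in this mode.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy sumstats (and their -meta.yaml files, when present)
# into a staging folder, leaving the originals untouched.
# Usage: stage_copies STAGING_FOLDER SUMSTAT [SUMSTAT...]
stage_copies() {
    local staging=$1
    shift
    mkdir -p "$staging" || return 1
    local f
    for f in "$@"; do
        cp -p -- "$f" "$staging"/ || return 1
        # Copy the companion metadata file only if it exists.
        if [ -f "${f}-meta.yaml" ]; then
            cp -p -- "${f}-meta.yaml" "$staging"/ || return 1
        fi
    done
}
```

The staging folder can then be passed as the value of --all_harm_folder, so that only the copies are removed when the run finishes.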
Handling Multiple Files in One Job
When running multiple files in one Nextflow job, the pipeline is designed to ignore errors generated by individual files. If a file encounters an error, its processing halts without affecting the other summary statistics in the same job. Since different files may progress at different rates, we recommend using the -with-trace option to generate a trace file, which lets you monitor the progress of each file individually.
Example of a trace file: trace-20241018-61875575.txt
task_id | hash | native_id | name | status | exit | submit | duration | realtime | %cpu | peak_rss | peak_vmem | rchar | wchar |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | a5/4abfe2 | 23589 | NFCORE_GWASCATALOGHARM:GWASCATALOGHARM:major_direction:map_to_build (random_name) | COMPLETED | 0 | 2024-10-18 17:11:27.279 | 5.8s | 4.7s | 37.1% | 131.5 MB | 593.1 MB | 11.7 MB | 1.6 KB |
2 | 51/6248c5 | 24169 | NFCORE_GWASCATALOGHARM:GWASCATALOGHARM:major_direction:ten_percent_counts (random_name_chr22) | COMPLETED | 0 | 2024-10-18 17:11:33.282 | 4.2s | 3.5s | 43.2% | 111.1 MB | 668.3 MB | 20.7 MB | 1.2 KB |
Each row in the table represents a single process executed on a specific summary statistics (sumstat) file. The name column identifies the process and the sumstat being processed, so you can track the progress and performance of each harmonisation step for individual sumstat files.
For more details, please refer to the Nextflow documentation.
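With many sumstats in one job, scanning the trace file by eye becomes tedious. As a rough sketch, the Bash function below lists every task whose status is neither COMPLETED nor CACHED. It assumes the default tab-separated trace format with a header row (the tables on this page are a rendered view of it). trace_problems is a hypothetical name, not part of the pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list tasks in a Nextflow trace file that did not finish.
# Assumes a tab-separated file whose header row contains "name" and "status".
# Usage: trace_problems TRACE_FILE
trace_problems() {
    awk -F '\t' '
        NR == 1 {
            # Locate columns by header name rather than by position.
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("name" in col) || !("status" in col)) {
                print "trace file lacks name/status columns" > "/dev/stderr"
                exit 2
            }
            next
        }
        # CACHED is treated as finished because it can appear on resumed runs.
        $(col["status"]) != "COMPLETED" && $(col["status"]) != "CACHED" {
            printf "%s\t%s\n", $(col["status"]), $(col["name"])
        }
    ' "$1"
}
```

Each output line holds the status and the task name. Because the name ends with the sumstat identifier in parentheses, the output shows which file stopped and at which step.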
Example of a full trace table for one sumstat:
task_id | hash | native_id | name | status | exit | submit | duration | realtime | %cpu | peak_rss | peak_vmem | rchar | wchar |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | a5/4abfe2 | 23589 | NFCORE_GWASCATALOGHARM:GWASCATALOGHARM:major_direction:map_to_build (random_name) | COMPLETED | 0 | 2024-10-18 17:11:27.279 | 5.8s | 4.7s | 37.1% | 131.5 MB | 593.1 MB | 11.7 MB | 1.6 KB |
3 | ed/c33b4b | 24192 | NFCORE_GWASCATALOGHARM:GWASCATALOGHARM:major_direction:ten_percent_counts (random_name_chr1) | COMPLETED | 0 | 2024-10-18 17:11:33.383 | 4s | 3.4s | 40.9% | 112 MB | 668.3 MB | 20.7 MB | 796 B |
2 | 51/6248c5 | 24169 | NFCORE_GWASCATALOGHARM:GWASCATALOGHARM:major_direction:ten_percent_counts (random_name_chr22) | COMPLETED | 0 | 2024-10-18 17:11:33.282 | 4.2s | 3.5s | 43.2% | 111.1 MB | 668.3 MB | 20.7 MB | 1.2 KB |
4 | 8b/e5118d | 24883 | NFCORE_GWASCATALOGHARM:GWASCATALOGHARM:major_direction:ten_percent_counts_sum (random_name) | COMPLETED | 0 | 2024-10-18 17:11:37.579 | 1.3s | 953ms | 88.4% | 9 MB | 14.8 MB | 9.9 MB | 666 B |
5 | 25/1f2003 | 25055 | NFCORE_GWASCATALOGHARM:GWASCATALOGHARM:main_harm:harmonization (random_name_chr1) | COMPLETED | 0 | 2024-10-18 17:11:38.985 | 2.1s | 1.7s | 86.1% | 18.8 MB | 89.5 MB | 20.7 MB | 2.2 KB |
6 | 1a/d1d23e | 25305 | NFCORE_GWASCATALOGHARM:GWASCATALOGHARM:main_harm:harmonization (random_name_chr22) | COMPLETED | 0 | 2024-10-18 17:11:41.136 | 1.9s | 1.7s | 88.6% | 16.7 MB | 87.8 MB | 20.7 MB | 1.3 KB |
7 | f6/d879a1 | 25551 | NFCORE_GWASCATALOGHARM:GWASCATALOGHARM:main_harm:concatenate_chr_splits (random_name) | COMPLETED | 0 | 2024-10-18 17:11:43.113 | 438ms | 23ms | 57.1% | 3.2 MB | 5.4 MB | 64.1 KB | 1.3 KB |
8 | c1/cba6ce | 25688 | NFCORE_GWASCATALOGHARM:GWASCATALOGHARM:quality_control:qc (random_name) | COMPLETED | 0 | 2024-10-18 17:11:43.605 | 948ms | 570ms | 51.9% | 8.9 MB | 14.8 MB | 2.9 MB | 2.1 KB |
9 | 73/4fcc1a | 25922 | NFCORE_GWASCATALOGHARM:GWASCATALOGHARM:quality_control:harmonization_log (random_name) | COMPLETED | 0 | 2024-10-18 17:11:44.583 | 2.2s | 1.8s | 61.8% | 97.1 MB | 574.5 MB | 11.5 MB | 25.5 KB |