A new standard for GWAS summary statistics
July 20, 2022
By Laura Harris

Today we are very pleased to share our manuscript describing a new standard format for GWAS summary statistics, GWAS-SSF. Work towards this goal started in 2017, when we began to host summary statistics in the GWAS Catalog. It rapidly became apparent that summary statistics were formatted in many different ways, with only a few features in common between them. Little has changed since: a recent analysis of 327 summary statistics files found over 100 unique formats.

At the time, we were about to launch the GWAS Catalog author submission system. We needed to enforce a standard format as QC on submissions and to ensure that submitted data could be passed through our harmonisation pipeline. So we quickly developed a minimal standard format based on the most commonly included fields, with variant ID/genomic location and p-value as mandatory fields, and standard headers for other fields such as beta and CI.

This was enough for our goal of getting data into the GWAS Catalog. But we heard many consumers of the data complain that datasets with only the minimum mandatory fields did not meet their requirements, for example for Mendelian randomisation or for generating polygenic scores. We needed to encourage our submitters to share their data more fully. A second issue is that, while the GWAS Catalog curates metadata from the literature, there is no standard for GWAS metadata reporting, resulting in missing data that limits re-use of the summary statistics.
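To make the idea of a minimal mandatory field check concrete, here is a rough sketch of the kind of QC described above. It is not the GWAS Catalog validator. The column names (`variant_id`, `chromosome`, `base_pair_location`, `p_value`), the tab-separated layout, and the reading of "variant ID/genomic location" as "either one is sufficient" are all assumptions made for illustration.

```python
import csv
import io
import math

# Hypothetical header names, chosen for illustration only.
VARIANT_ID = "variant_id"
LOCATION = ("chromosome", "base_pair_location")
P_VALUE = "p_value"


def cell(row, column):
    """Return a cell as stripped text, treating absent cells as empty."""
    return (row.get(column) or "").strip()


def check_minimal_fields(handle, delimiter="\t"):
    """Return a list of problems found in a delimited summary statistics file."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    header = set(reader.fieldnames or [])
    has_id = VARIANT_ID in header
    has_location = all(column in header for column in LOCATION)

    problems = []
    if not (has_id or has_location):
        problems.append("header: no variant ID or genomic location columns")
    if P_VALUE not in header:
        problems.append("header: no p-value column")
    if problems:
        return problems  # row checks are meaningless without these columns

    for row in reader:
        line_no = reader.line_num
        # Assumption: a row is acceptable if it has an ID or a full location.
        id_ok = has_id and cell(row, VARIANT_ID) != ""
        location_ok = has_location and all(cell(row, c) != "" for c in LOCATION)
        if not (id_ok or location_ok):
            problems.append(f"line {line_no}: no variant ID or genomic location")
        try:
            p = float(cell(row, P_VALUE))
        except ValueError:
            p = math.nan  # non-numeric text such as "NA" fails the range test
        if not 0.0 <= p <= 1.0:
            problems.append(f"line {line_no}: p-value is not a number in [0, 1]")
    return problems


if __name__ == "__main__":
    example = "variant_id\tp_value\tbeta\nrs123\t5e-8\t0.12\nrs456\tNA\t0.30\n"
    for problem in check_minimal_fields(io.StringIO(example)):
        print(problem)
```

Returning a list of problems rather than raising on the first one lets a submitter fix a file in a single pass. Optional fields such as beta are deliberately ignored, which reflects the limitation described above: a file can pass a minimal check like this and still lack what downstream analyses need.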

We started a process of community consultation with our workshop on summary statistics data and sharing in 2020. This was a wide-ranging discussion over 2 days that you can read all about in our workshop proceedings. At the end of that process we had an outline of some mandatory fields for data and metadata, but refining these and ensuring they met all the major use cases would take a more focused discussion. Our working group on summary statistics data content and format, chaired by Ines Barroso, held 3 meetings over the course of 2021. We held a parallel working group to discuss diversity and privacy in data sharing (to be reported elsewhere) and incorporated their recommendations (for example on the sharing of allele frequency data) into our decision-making process.

Both the original workshop and the working group were designed to represent a wide range of stakeholders: industry and academia, developers of other resources and tools, data generators and data consumers, and research groups ranging from those primarily performing wet-lab research to high-throughput bioinformatics users. Our discussions focused on the need to enforce key mandatory fields to maximise the reuse of the data, while recognising that setting the bar too high could discourage data sharing. Similarly, data need to be accessible in a way that suits the high-throughput user who might be piping data in bulk into a computational pipeline, but also users without advanced bioinformatics skills who might be downloading a single file to investigate a locus of interest. We heard that preparing data for submission to databases can be a huge drain for under-resourced labs, so it needs to be as straightforward and user-friendly as possible, and that interoperability between different databases (such as dbGaP, the GWAS Catalog and OpenGWAS) is important to reduce the workload for data generators. You can read the outcome of our discussions in our new manuscript.

Behind the scenes, we are making the necessary changes to the GWAS Catalog infrastructure to validate new submissions against the new standard and, where possible, to make all our existing summary statistics available in the new format. We are working on tools (together with PLINK and others) to make data preparation and formatting as easy as possible. In the meantime, we welcome feedback from the wider community on the standard and will collate this at the end of August, prior to release of the new infrastructure.

The GWAS-SSF is suitable for array-based GWAS and single variant seqGWAS. As data sharing has become more widely accepted, there is appetite from the genomics community to share full p-value datasets from gene-based analyses of sequencing data and CNVs. Whilst we don’t have a standard format for these yet, we hope that gathering data will let us go through a similar process and work towards one. Please share your data, and let us know your thoughts!