Version: 0.0.1

📋 PEGASUS Metadata Standard

💡 To make it easier to prepare your metadata, we provide a pre-prepared Google Sheet template, which guides you through filling in the required fields.

We also provide detailed explanations for each field, together with example data, in the tables below.

Dataset description

The dataset description records the PEG list source, the studied trait, and the GWAS source. If the GWAS is not in the GWAS Catalog, include extra population details such as ancestry and sample size.

PEG list
Trait
GWAS (If it comes from GWAS Catalog)
GWAS (other resources)

Field	Description	Mandatory	Data_format	Example
peg_source	Identifier of the origin of the PEG list (e.g., publication, DOI, preprint, URL).	Mandatory	PMID, DOI, URL	PMID:36357675

Field	Description	Mandatory	Data_format	Example
trait_description	Free-text description of the phenotype under investigation. Should be concise but clear to a non-specialist. Avoid abbreviations.	Mandatory	string	Ascorbic acid 3-sulfate levels
trait_ontology_id	Standard ontology identifier mapped to the trait (e.g., EFO, MONDO, HPO, DOID). Use the most specific term available.	Optional	CURIE (ontology prefix:ID)	EFO_0800173

Field	Description	Mandatory	Data_format	Example
gwas_source	Identifier of the GWAS source. Prefer GWAS Catalog accession (GCST); if not available, use PubMed ID or another recognised accession.	Recommended	GCST[0-9]+, PMID, other accession ID	GCST000001

Field	Description	Mandatory	Data_format	Example
gwas_source	Identifier of the GWAS source. Prefer GWAS Catalog accession (GCST); if not available, use PubMed ID or another recognised accession.	Optional	GCST[0-9]+, PMID, other accession ID	GCST000001
gwas_sample_description	Only required if `gwas_source` is not a GWAS Catalog accession. Detailed description of the GWAS samples (e.g., cohort name, case/control numbers, ancestry).	Optional	string	6,136 Finnish ancestry individuals
gwas_sample_size	Only required if `gwas_source` is not a GWAS Catalog accession. Total number of individuals included in the GWAS analysis.	Optional	integer	6136
gwas_case_control_study	Only required if `gwas_source` is not a GWAS Catalog accession. Indicator of whether the GWAS design is case–control (TRUE) or quantitative/other (FALSE).	Optional	boolean	FALSE
gwas_sample_ancestry	Only required if `gwas_source` is not a GWAS Catalog accession. Free-text description of participant ancestry, as reported in the original study.	Optional	string	Finnish
gwas_sample_ancestry_label	Harmonised ancestry label appropriate for the sample. For label definitions, see Morales et al., 2018 (Table 1). Only required if `gwas_source` is not a GWAS Catalog accession.	Optional	string (controlled vocabulary)	European
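
The conditional rule in the table above (the sample fields are only required when `gwas_source` is not a GWAS Catalog accession) can be sketched as a small check. This is a minimal illustration, not part of the standard: the function name `missing_gwas_fields` is hypothetical, and treating an empty `gwas_source` as a non-Catalog source is an assumption.

```python
import re

# Fields the tables describe as "only required if gwas_source is not
# a GWAS Catalog accession".
CONDITIONAL_FIELDS = [
    "gwas_sample_description",
    "gwas_sample_size",
    "gwas_case_control_study",
    "gwas_sample_ancestry",
    "gwas_sample_ancestry_label",
]

# Accession pattern taken from the Data_format column (GCST[0-9]+).
GCST_PATTERN = re.compile(r"GCST[0-9]+")


def missing_gwas_fields(record: dict) -> list[str]:
    """Return the conditional fields that are absent or empty.

    Assumption: an empty or missing gwas_source is treated like a
    non-Catalog source, so the sample fields are still checked.
    """
    source = str(record.get("gwas_source") or "").strip()
    if GCST_PATTERN.fullmatch(source):
        return []
    # Compare against None and "" only, so that FALSE and 0 count as provided.
    return [f for f in CONDITIONAL_FIELDS if record.get(f) in (None, "")]
```

A record whose `gwas_source` is `GCST000001` yields an empty list, while a record citing only a PMID yields every sample field that has not been filled in.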

Describes how variants are selected, which gene identifier system and version are used, and how loci are defined and named.

Variant
Gene
Locus

Field	Description	Mandatory	Data_format	Example
variant_type	Explanation of how the main variant was selected (e.g., lead, sentinel, index, mixed).	Recommended	string	lead
variant_information	Additional free-text notes about the variant (e.g., selection method, quality thresholds, imputation info).	Optional	string	Select variant with lowest p-value at locus among three GWAS studies
genome_build	Genome assembly used to map variants.	Recommended	GRCh38, GRCh37, NCBI36, NCBI35, NCBI34	GRCh38

Field	Description	Mandatory	Data_format	Example
gene_id_source_version	Version of the gene identifier source (e.g., Ensembl release).	Optional	string	Ensembl v109
gene_symbol_source_version	Version of the gene symbol reference authority (e.g., HGNC release).	Optional	string	HGNC 2025-07-30
gene_info	Additional gene-level metadata that supports interpretation.	Optional	string	/

Field	Description	Mandatory	Data_format	Example
locus_type	Method used to define locus boundaries (e.g., LD region, ±500kb window, fine-mapped credible set).	Optional	string	LD
locus_id	Explanation of how the locus identifier was derived (e.g., lead SNP rsID, cytoband).	Optional	string	lead SNP rsID
locus_info	Additional information supporting locus interpretation (e.g., recombination hotspot boundaries, fine-mapping method).	Optional	string	Defined as ±500 kb around lead SNP

Field	Description	Mandatory	Data_format / Allowed values	Example
Column_name	Unique column name used in the PEG evidence matrix. Should follow a consistent naming convention.	Mandatory	any (suggested: category_stream_[xyz])	QTL_eQTL-pancreas_pvalue
Column_description	Free-text explanation of the content in this column.	Mandatory	string	p-value from eQTL analysis in pancreas tissue
Stream_name	Specific analysis stream within the evidence category.	Optional	string (e.g., eQTL, pQTL, sQTL, TWAS, PWAS)	eQTL
Category	Full evidence category name from the controlled list.	Mandatory	Controlled vocabulary	Molecular QTL
Category_abbreviation	Short label assigned from the controlled list of evidence categories.	Mandatory	Controlled vocabulary	QTL
Class	Indicates whether the evidence originates from variant-level or gene-level analysis.	Mandatory	`variant-centric` or `gene-centric`	variant-centric
Source_tag	Identifier for the data source, created in the `source tab`.	Mandatory	any (preferred: lowercase with underscores)	source_gtex_pancreas
Method_tag	Identifier for the analysis method, created in the `method tab`.	Mandatory	any (preferred: lowercase with underscores)	method_fastqtl
Threshold	Threshold applied to define significance or inclusion criteria.	Optional	logical expression / numeric cutoff	< 0.05
Notes	Additional free text clarifications to aid interpretation.	Optional	any	Adjusted for covariates (age, sex, BMI)
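
The suggested `category_stream_[xyz]` naming convention for `Column_name` can be illustrated with a pair of helpers. This is a sketch under assumptions: the function names are hypothetical, and it assumes that the category abbreviation and the stream part contain no underscores, so that the first two underscores always separate the three parts.

```python
def build_column_name(category_abbreviation: str, stream: str, descriptor: str) -> str:
    """Join the three parts into a category_stream_descriptor column name."""
    for label, part in (("category", category_abbreviation), ("stream", stream)):
        # Assumption: these two parts must not contain the separator itself.
        if not part or "_" in part:
            raise ValueError(f"{label} part must be non-empty and contain no underscore: {part!r}")
    if not descriptor:
        raise ValueError("descriptor must be non-empty")
    return f"{category_abbreviation}_{stream}_{descriptor}"


def split_column_name(column_name: str) -> tuple[str, str, str]:
    """Split a column name back into (category, stream, descriptor)."""
    # Split on the first two underscores only; the descriptor may contain more.
    parts = column_name.split("_", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected category_stream_descriptor, got {column_name!r}")
    return parts[0], parts[1], parts[2]
```

With the example from the table, `build_column_name("QTL", "eQTL-pancreas", "pvalue")` produces `QTL_eQTL-pancreas_pvalue`, and `split_column_name` recovers the three parts.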

Field	Description	Mandatory	Data_format / Allowed values	Example
Column_name	Column name in the PEG evidence matrix. Use a consistent naming convention.	Mandatory	INT_[descriptor]	INT_pops
Column_description	Explanation of the content in this column.	Mandatory	Free text	Integrated score for prioritised gene using PoPS (gene prioritisation method combining GWAS signals, expression, pathways, and PPI data).
Integration_analysis	Author-assigned analysis name that can be cited as evidence in the integrated_analysis_name field.	Mandatory	curated, computational, mixed	computational
Evidence_stream_name	A list of variant-centric or gene-centric evidence stream names combined in the integration.	Optional	List of controlled terms, separated by "|"	FUNC | eQTL | pQTL | FM | 3D | PHEWAS | TWAS
Integrated_analysis_name	A list of Int_tag values from other integration analyses that are used as evidence in this analysis.	Optional	List of controlled terms, separated by "|"	FUNC | eQTL | pQTL | FM | 3D | PHEWAS | TWAS
Method_tag	Identifier for the analysis method (from Method tab).	Mandatory	any (preferred: lowercase with underscores)	soft_pops
Threshold	Threshold applied to define significance or inclusion criteria.	Optional	logical expression / numeric cutoff	> 3
Notes	Extra details to aid interpretation.	Optional	any	Weighted by tissue-specific relevance
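
The pipe-separated list format used by `Evidence_stream_name` and `Integrated_analysis_name` can be read with a few lines of code. The helper below is a hypothetical sketch, not part of the standard; it assumes that whitespace around each term is not significant, and it also tolerates a backslash-escaped separator (`\|`), which some table exports produce.

```python
def parse_pipe_list(value: str) -> list[str]:
    """Split a pipe-separated cell into a list of trimmed, non-empty terms."""
    # Normalise an escaped separator ("\|") to a plain pipe before splitting.
    cleaned = value.replace("\\|", "|")
    return [term.strip() for term in cleaned.split("|") if term.strip()]
```

For the example cell in the table, this returns the seven stream names as separate list items; an empty cell returns an empty list.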

The source metadata file mainly contains the provenance of source files and detailed biosample descriptions.

Provenance
Biosample

Field	Description	Mandatory	Data_format / Allowed values	Example
source_tag	Unique identifier for the source. This tag is referenced in the evidence metadata and integration metadata.	Mandatory	any (preferred: lowercase with underscores)	source_gtex_pancreas
Provenance	Project, database, or lab providing the data.	Mandatory	• Use `lab_internal` for unpublished data generated by your lab. • Use `publication` for published data not deposited in a repository. • Use the official name of the resource (e.g., GTEx, ENCODE, Roadmap, UKB).	GTEx
File_Name	Exact filename of the source dataset.	Optional	any	GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct.gz
Version	Version or release of the dataset.	Optional	any	v8
URL	Web link to the source file used in the analysis. If multiple files are used, list each on a new line; other columns should remain identical.	Optional	URL	https://gtexportal.org/
Accession	Accession identifier if the source file comes from a repository.	Optional	any (e.g., GEO, dbGaP, ENA ID)	phs000424.v8.p2
DOI	DOI for the publication containing the source file.	Optional	DOI string	10.1038/ng.2653

Field	Description	Mandatory	Data_format / Allowed values	Example
Tissue	Primary tissue sampled (broad anatomical source).	Optional	any	pancreas
Sample_origin	Biological origin of the sample.	Optional	primary-tissue, organoid, cell-line, iPSC-derived, etc.	primary-tissue
Cell_type	Specific cell type, if applicable.	Optional	any	alpha cells
Cell_line	Cell line name if sample_origin = "cell-line".	Optional	any	HeLa, K562
Disease	Disease status of the donor or sample.	Optional	healthy or disease name	healthy
Life_stage	Developmental stage of the biosample.	Optional	any (e.g., fetal, adult, embryonic, iPSC)	adult
Treatment	Treatments or perturbations applied prior to or during data generation.	Optional	any	anti-IgM treated
Sex	Sex composition of samples.	Optional	male, female, mixed	mixed
Age	Age of donors (number, range, or developmental notation).	Optional	any	20–70 years
Species	Organism from which the biosample is derived.	Mandatory	Latin species name	Homo sapiens
Description	Brief summary of sample characteristics and intended use.	Optional	any	Bulk pancreas tissue from healthy adult donors in GTEx v8. Used for eQTL discovery. Donors male and female, aged 20–70 years, no treatment.
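
Because each `Source_tag` in the evidence metadata refers to a `source_tag` defined in the source metadata, a simple cross-reference check can catch tags that were never defined. The sketch below assumes each tab has been loaded as a list of dictionaries keyed by field name; the function name and that in-memory layout are illustrative assumptions.

```python
def unresolved_source_tags(evidence_rows: list[dict], source_rows: list[dict]) -> list[str]:
    """Return Source_tag values used in the evidence tab but not defined in the source tab."""
    defined = {row.get("source_tag") for row in source_rows}
    missing: list[str] = []
    for row in evidence_rows:
        tag = row.get("Source_tag")
        # Report each undefined tag once, in order of first appearance.
        if tag and tag not in defined and tag not in missing:
            missing.append(tag)
    return missing
```

An empty result means every evidence column points at a documented source; the same pattern could be applied to `Method_tag` and the method tab.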

Field	Description	Mandatory	Data_format / Allowed values	Example
method_tag	Unique identifier for the method, used in the PEG evidence and integration metadata.	Mandatory	free text (lowercase with underscores)	soft_cadd
method_mode	Specifies whether the method is a published software tool or a manual approach. If `software`, provide name, version, URL, and DOI. If `manual`, describe in `method_description`.	Mandatory	`software`, `manual`	software
software_name	Name of the software used (if `method_mode = software`).	Optional	any	CADD
software_version	Version of the software used.	Optional	any	v1.6
software_url	Link to the official software resource.	Optional	URL	https://cadd.gs.washington.edu/
software_doi	DOI of the software or associated publication.	Optional	DOI string	10.1038/ng.2474
method_description	Detailed description of the method, workflow, or customisation applied.	Optional	any	Custom scoring model combining eQTL and chromatin interaction data.
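
The dependency between `method_mode` and the other method fields can be expressed as a short validation routine. This is a hedged sketch: the function name is hypothetical, and treating all four software fields as expected for `software` mode is one reading of the `method_mode` description above.

```python
# Fields expected for each mode, following the method_mode description.
EXPECTED_BY_MODE = {
    "software": ["software_name", "software_version", "software_url", "software_doi"],
    "manual": ["method_description"],
}


def method_row_problems(row: dict) -> list[str]:
    """Return human-readable problems for one row of the method tab."""
    mode = row.get("method_mode")
    if mode not in EXPECTED_BY_MODE:
        return [f"method_mode must be 'software' or 'manual', got {mode!r}"]
    return [
        f"{field} is expected when method_mode = {mode}"
        for field in EXPECTED_BY_MODE[mode]
        if not row.get(field)
    ]
```

The CADD example row from the table passes with no problems, while a `manual` row without a `method_description` is flagged.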
