📋 PEG Metadata Standard

The metadata consists of four primary components:

Dataset Description: descriptors for the whole PEG matrix (trait, source of the GWAS data and publication reference)
Genomic Identifiers: details about the variants, genes, or loci included in your dataset.
Evidence: explains the evidence columns and their associated categories, and links provenance and analysis methods via source_tag and method_tag.
Integration: describes which streams of evidence are combined and how.

In addition, there are two modular components:

Source: citation and provenance information for each evidence stream, including publications, databases, and biosample details.
Method: a description of the methodology, pipelines, or software used to generate the data.

These modular components can be referenced by multiple evidence entries.
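The reuse of modular components can be pictured as a lookup by tag. The following is a minimal sketch, assuming the metadata has been loaded into plain Python dictionaries; the in-memory layout, the `sQTL-pancreas` stream, and the helper name `unresolved_tags` are illustrative assumptions, not part of the standard.

```python
# Hypothetical in-memory layout; the standard does not prescribe one.
sources = {"source_gtex_pancreas": {"provenance": "GTEx", "version": "v8"}}
methods = {"method_fastqtl": {"method_mode": "software"}}

# Two evidence entries reuse the same source and method entries.
evidence = [
    {"evidence_stream_tag": "eQTL-pancreas",
     "source_tag": "source_gtex_pancreas", "method_tag": "method_fastqtl"},
    {"evidence_stream_tag": "sQTL-pancreas",  # illustrative second stream
     "source_tag": "source_gtex_pancreas", "method_tag": "method_fastqtl"},
]


def unresolved_tags(evidence, sources, methods):
    """Return (stream, field, tag) triples whose tag has no matching entry."""
    missing = []
    for entry in evidence:
        for field, table in (("source_tag", sources), ("method_tag", methods)):
            tag = entry.get(field)  # both tags are optional, so may be absent
            if tag and tag not in table:
                missing.append((entry["evidence_stream_tag"], field, tag))
    return missing
```

An empty result means every tag used by an evidence entry resolves to a source or method entry.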

Detailed descriptions of each component are provided in the tables below, in the order listed above:

Standard Content

Field	Description	Requirement	Data_format	Example
peg_source	Identifier of the origin of the PEG list (e.g., publication, DOI, preprint, URL).	Recommended	string (PMID, DOI, URL)	PMID:36357675
trait_description	Free-text description of the phenotype under investigation. Should be concise but clear to a non-specialist. Avoid abbreviations.	Mandatory	string	Ascorbic acid 3-sulfate levels
trait_ontology_id	Standard ontology identifier mapped to the trait (e.g., EFO, MONDO, HPO, DOID). Use the most specific term available.	Optional	string	EFO_0800173
gwas_source	Identifier of the GWAS source. Prefer GWAS Catalog accession (GCST); if not available, use PubMed ID or another recognised accession.	Recommended	string (GCST[0-9]+, PMID, other accession ID)	GCST000001
gwas_sample_description	Only required if `gwas_source` is not a GWAS Catalog accession. Detailed description of the GWAS samples (e.g., cohort name, case/control numbers, ancestry).	Optional	string	6,136 Finnish ancestry individuals
gwas_sample_size	Only required if `gwas_source` is not a GWAS Catalog accession. Total number of individuals included in the GWAS analysis.	Optional	integer	6136
gwas_case_control_study	Only required if `gwas_source` is not a GWAS Catalog accession. Indicator of whether the GWAS design is case–control (TRUE) or quantitative/other (FALSE).	Optional	boolean	FALSE
gwas_sample_ancestry	Only required if `gwas_source` is not a GWAS Catalog accession. Free-text description of participant ancestry, as reported in the original study.	Optional	string	Finnish
gwas_sample_ancestry_label	Harmonised ancestry label appropriate for the sample. For label definitions, see Morales et al., 2018 (Table 1). Only required if `gwas_source` is not a GWAS Catalog accession.	Optional	string (controlled vocabulary)	European
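The conditional rule in this table (the `gwas_sample_*` fields are only required when `gwas_source` is not a GWAS Catalog accession) can be checked mechanically. This is a rough sketch, assuming the dataset description is held as a dictionary keyed by field name; the function name `missing_dataset_fields` is a hypothetical helper, not part of the standard.

```python
import re

# Pattern taken from the Data_format column of gwas_source.
GCST_PATTERN = re.compile(r"GCST[0-9]+")

SAMPLE_FIELDS = (
    "gwas_sample_description",
    "gwas_sample_size",
    "gwas_case_control_study",
    "gwas_sample_ancestry",
    "gwas_sample_ancestry_label",
)


def missing_dataset_fields(meta):
    """List fields that are required for this record but absent or empty."""
    missing = []
    if not meta.get("trait_description"):  # the only unconditionally mandatory field
        missing.append("trait_description")
    source = meta.get("gwas_source") or ""
    if not GCST_PATTERN.fullmatch(source):
        # Compare against None/"" explicitly so that FALSE and 0 count as provided.
        missing.extend(f for f in SAMPLE_FIELDS if meta.get(f) in (None, ""))
    return missing
```

For a record whose `gwas_source` is `GCST000001`, only `trait_description` is checked; for one that cites a PubMed ID instead, the five sample fields are checked as well.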

Field	Description	Requirement	Data_format	Example
variant_type	Explanation of how the main variant was selected (e.g., lead, sentinel, index, mixed).	Recommended	string	lead
variant_information	Additional free-text notes about the variant (e.g., selection method, quality thresholds, imputation info).	Optional	string	Select variant with lowest p-value at locus among three GWAS studies
genome_build	Genome assembly used to map variants.	Mandatory	GRCh38, GRCh37, NCBI36, NCBI35, NCBI34	GRCh38
gene_id_source_version	Version of the gene identifier source (e.g., Ensembl release).	Optional	string	Ensembl v109
gene_symbol_source_version	Version of the gene symbol reference authority (e.g., HGNC release).	Optional	string	HGNC 2025-07-30
gene_info	Additional gene-level metadata that supports interpretation.	Optional	string	/
locus_type	Method used to define locus boundaries (e.g., LD region, ±500kb window, fine-mapped credible set).	Optional	string	LD
locus_id	Explanation of how the locus identifier was derived (e.g., lead SNP rsID, cytoband).	Optional	string	lead SNP rsID
locus_info	Additional information supporting locus interpretation (e.g., recombination hotspot boundaries, fine-mapping method).	Optional	string	Defined as ±500 kb around lead SNP

Field	Description	Requirement	Data_format / Allowed values	Example
evidence_stream_tag	Specific analysis stream within the evidence category.	Mandatory	string (e.g. eQTL, pQTL, sQTL, TWAS, PWAS etc.)	eQTL-pancreas
evidence_category	Full evidence category name from the controlled list.	Mandatory	Controlled vocabulary	Molecular QTL
evidence_category_abbreviation	Short label assigned from the controlled list of evidence categories.	Mandatory	Controlled vocabulary	QTL
variant_or_gene_centric	Indicates whether the evidence originates from variant-level or gene-level analysis.	Mandatory	`variant-centric` or `gene-centric`	variant-centric
source_tag	Identifier for the data source, created in the `source tab`.	Optional	string (preferred: lowercase with underscores)	source_gtex_pancreas
method_tag	Identifier for the analysis method, created in the `method tab`.	Optional	string (preferred: lowercase with underscores)	method_fastqtl
threshold	Threshold applied to define significance or inclusion criteria.	Optional	logical expression / numeric cutoff	< 0.05
note	Additional free text clarifications to aid interpretation.	Optional	string	Adjusted for covariates (age, sex, BMI)
column_header	Unique column name used in the PEG evidence matrix. Should follow a consistent naming convention.	Mandatory	any (suggested pattern: category_stream_[xyz])	QTL_eQTL-pancreas_pvalue
column_description	Free text explanation of the content in this column.	Mandatory	string	p-value from eQTL analysis in pancreas tissue
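As an illustration of the suggested `column_header` pattern, the sketch below joins the category abbreviation, the evidence stream tag, and a descriptor with underscores, and reports headers that are not unique. Both helper names are hypothetical, and the joining rule is one reading of the suggested pattern rather than a requirement of the standard.

```python
from collections import Counter


def suggest_column_header(category_abbreviation, evidence_stream_tag, descriptor):
    """Build a header following the suggested category_stream_[xyz] pattern."""
    return "_".join((category_abbreviation, evidence_stream_tag, descriptor))


def duplicated_headers(headers):
    """Return headers that occur more than once; column_header must be unique."""
    counts = Counter(headers)
    return sorted(h for h, n in counts.items() if n > 1)
```

Calling `suggest_column_header("QTL", "eQTL-pancreas", "pvalue")` reproduces the example header in the table above.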

Field	Description	Requirement	Data_format / Allowed values	Example
integration_tag	Author-assigned integration analysis name that can be cited as evidence in the integrated_analysis_name field.	Mandatory	string	pops
evidence_streams_included	A list of variant-centric or gene-centric evidence stream names combined in the integration.	Optional	List of controlled terms, separated by “|”	FUNC | eQTL | pQTL | FM | 3D | PHEWAS | TWAS
integrations_included	A list of integration_tag values from other integration analyses that are included in this integration analysis.	Optional	List of integration tags, separated by “|”	pops | flame
method_tag	Identifier for the analysis method (from Method tab).	Mandatory	string (preferred: lowercase with underscores)	soft_pops
threshold	Threshold applied to define significance or inclusion criteria.	Optional	logical expression / numeric cutoff	> 3
note	Extra details to aid interpretation.	Optional	string	Weighted by tissue-specific relevance
column_header	Column name in the PEG evidence matrix.	Mandatory	string (format: INT_[integration_tag]_[descriptor])	INT_pops
column_description	Explanation of the content in this column.	Mandatory	string	Integrated score for prioritised gene using PoPS (gene prioritisation method combining GWAS signals, expression, pathways, and PPI data).
author_conclusion	Indicates whether values in this column reflect the authors’ conclusions for defining the PEG list. NOTE: only ONE column per matrix can be assigned `True`. PEGASUS recommends including the string 'author_conclusion' in the appropriate column header.	Mandatory	Boolean (True / False)	True
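Two rules from the integration table lend themselves to a quick check: the pipe-separated list fields, and the limit of one `author_conclusion` column per matrix. The following is a minimal sketch under the assumption that integration rows are dictionaries keyed by field name and that `author_conclusion` has already been parsed to a Python boolean; both function names are hypothetical.

```python
def split_pipe_list(value):
    """Split a pipe-separated field such as evidence_streams_included."""
    # Tolerate markdown-escaped pipes ("\|") that can survive copy and paste.
    cleaned = value.replace("\\|", "|")
    return [item.strip() for item in cleaned.split("|") if item.strip()]


def author_conclusion_columns(integration_rows):
    """Return the headers flagged as author conclusion; fail if more than one."""
    flagged = [
        row["column_header"]
        for row in integration_rows
        if row.get("author_conclusion") is True
    ]
    if len(flagged) > 1:
        raise ValueError(f"only one author_conclusion column is allowed, got {flagged}")
    return flagged
```

Applied to the `integrations_included` example, `split_pipe_list("pops | flame")` yields the two integration tags, which can then be checked against the `integration_tag` values defined elsewhere in the table.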

If a source_tag is used for any evidence, complete the source entity for that tag, following the requirement level of each attribute below. If no source_tag is used, this entity can be skipped.

Provenance
biosample

Field	Description	Requirement	Data_format / Allowed values	Example
source_tag	Unique identifier for the source. This tag is referenced in the evidence metadata and integration metadata.	Mandatory	any (preferred: lowercase with underscores)	source_gtex_pancreas
provenance	Project, database, or lab providing the data.	Mandatory	string	• Use `lab_internal` for unpublished data generated by your lab. • Use `publication` for published data not deposited in a repository. • Use the official name of the resource (e.g., GTEx, ENCODE, Roadmap, UKB).
file_name	Exact filename of the source dataset.	Optional	string	GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct.gz
version	Version or release of the dataset.	Optional	string	v8
url	Web link to the source file used in the analysis. If multiple files are used, list each on a new line; other columns should remain identical.	Optional	URL	https://gtexportal.org/
accession_id	Accession identifier if the source file comes from the repository.	Optional	string	phs000424.v8.p2
doi	DOI for the publication containing the source file.	Optional	DOI string	10.1038/ng.2653
note	Extra details to aid interpretation of the source.	Optional	string	The analysis includes only samples from individuals aged 20–29.

Field	Description	Requirement	Data_format / Allowed values	Example
Tissue	Primary tissue sampled (broad anatomical source).	Optional	string	pancreas
sample_origin	Biological origin of the sample.	Optional	primary-tissue, organoid, cell-line, iPSC-derived, etc.	primary-tissue
Cell_type	Specific cell type, if applicable.	Optional	string	alpha cells
cell_line	Cell line name if sample_origin = "cell-line".	Optional	string	HeLa, K562
disease	Disease status of the donor or sample.	Optional	`healthy` or `disease`	healthy
life_stage	Developmental stage of the biosample.	Optional	string (e.g., fetal, adult, embryonic, iPSC)	adult
treatment	Treatments or perturbations applied prior to or during data generation.	Optional	string	anti-IgM treated
sex	Sex composition of samples.	Optional	male, female, mixed	mixed
species	Organism from which the biosample is derived (full Latin name).	Optional	string	Homo sapiens
description	Brief summary of sample characteristics and intended use.	Optional	any	Bulk pancreas tissue from healthy adult donors in GTEx v8. Used for eQTL discovery. Donors male and female, aged 20–70 years, no treatment.

If a method_tag is used for any evidence, complete the method entity for that tag, following the requirement level of each attribute below. If no method_tag is used, this entity can be skipped.

Field	Description	Requirement	Data_format / Allowed values	Example
method_tag	Unique identifier for the method, used in the PEG evidence and integration metadata.	Mandatory	string (lowercase with underscores)	soft_cadd
method_mode	Specifies whether the method is a published software tool or a manual approach. If `software`, provide name, version, URL, and DOI. If `manual`, describe in `method_description`.	Mandatory	`software`, `manual`	software
method_mode_ontology_term_id	For evidence: • Manual assertion (ECO:0000218) • Automatic assertion (ECO:0000203) For integration: • Manually integrated combinatorial evidence (ECO:0007674) • Automatically integrated combinatorial evidence (ECO:0007673)	Mandatory	string	ECO_0007673
software_name	Name of the software used (if `method_mode = software`).	Optional	string	CADD
software_version	Version of the software used.	Optional	string	v1.6
software_url	Link to the official software resource.	Optional	URL	https://cadd.gs.washington.edu/
software_doi	DOI of the software or associated publication.	Optional	DOI string	10.1038/ng.2474
method_description	Detailed description of the method, workflow, or customisation applied.	Optional	string	Custom scoring model combining eQTL and chromatin interaction data.
note	Extra details to aid interpretation of the method.	Optional	string	CADD scores were used for variant annotation. Variants with low predicted impact were filtered prior to annotation.
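The `method_mode` field determines which of the otherwise optional method fields are expected: name, version, URL, and DOI for `software`, and `method_description` for `manual`. Here is a small sketch of that rule, assuming a method entry is a dictionary keyed by field name; `REQUIRED_BY_MODE` and `missing_method_fields` are illustrative names, not part of the standard.

```python
# Fields expected for each method_mode, as stated in the method_mode description.
REQUIRED_BY_MODE = {
    "software": ("software_name", "software_version", "software_url", "software_doi"),
    "manual": ("method_description",),
}


def missing_method_fields(method):
    """List the fields that the entry's method_mode calls for but that are empty."""
    mode = method.get("method_mode")
    if mode not in REQUIRED_BY_MODE:
        raise ValueError(
            f"method_mode must be one of {sorted(REQUIRED_BY_MODE)}, got {mode!r}"
        )
    return [field for field in REQUIRED_BY_MODE[mode] if not method.get(field)]
```

With the CADD example values from the table filled in, the function returns an empty list; a `manual` entry without a description returns `["method_description"]`.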
