PEG Metadata Standard
The metadata consists of four primary components:
- Dataset Description: descriptors for the whole PEG matrix (trait, source of the GWAS data and publication reference)
- Genomic Identifiers: details about the variants, genes, or loci included in your dataset.
- Evidence: explains the evidence columns and their associated categories, and links provenance and analysis methods via source_tag and method_tag.
- Integration: information about which streams of evidence are combined and how.
In addition, there are two modular components:
- Source: citation and provenance information for each evidence stream, including publications, databases, and biosample details.
- Method: a description of the methodology, pipelines, or software used to generate the data.
These modular components can be referenced by multiple evidence entries.
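
Because several evidence entries can point at the same Source or Method entry, it can help to confirm that every tag in use is actually defined. The sketch below is one way to do that, assuming each tab has been loaded as a list of dicts keyed by field name. The function name `unresolved_tags` and the in-memory layout are illustrative assumptions, not part of the standard.

```python
def unresolved_tags(referencing_rows, source_rows, method_rows):
    """Return tags used by evidence/integration rows that no Source or Method row defines."""
    # Tags that are defined in the Source and Method tabs.
    defined = {
        "source_tag": {row["source_tag"] for row in source_rows},
        "method_tag": {row["method_tag"] for row in method_rows},
    }
    unresolved = {"source_tag": set(), "method_tag": set()}
    for row in referencing_rows:
        for key in unresolved:
            tag = row.get(key)  # both tags may be absent on a given row
            if tag and tag not in defined[key]:
                unresolved[key].add(tag)
    return {key: sorted(tags) for key, tags in unresolved.items()}
```

An empty list under both keys means every referenced tag resolves to an entry.
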
Detailed descriptions of each component are provided in the corresponding tabs below:
Standard Content
- Dataset description
- Genomic Identifier
- Evidence
- Integration
- Source
- Method
| Field | Description | Requirement | Data_format | Example |
|---|---|---|---|---|
| peg_source | Identifier of the origin of the PEG list (e.g., publication, DOI, preprint, URL). | Recommended | string (PMID, DOI, URL) | PMID:36357675 |
| trait_description | Free-text description of the phenotype under investigation. Should be concise but clear to a non-specialist. Avoid abbreviations. | Mandatory | string | Ascorbic acid 3-sulfate levels |
| trait_ontology_id | Standard ontology identifier mapped to the trait (e.g., EFO, MONDO, HPO, DOID). Use the most specific term available. | Optional | string | EFO_0800173 |
| gwas_source | Identifier of the GWAS source. Prefer GWAS Catalog accession (GCST); if not available, use PubMed ID or another recognised accession. | Recommended | string (GCST[0-9]+, PMID, other accession ID) | GCST000001 |
| gwas_sample_description | Only required if gwas_source is not a GWAS Catalog accession. Detailed description of the GWAS samples (e.g., cohort name, case/control numbers, ancestry). | Optional | string | 6,136 Finnish ancestry individuals |
| gwas_sample_size | Only required if gwas_source is not a GWAS Catalog accession. Total number of individuals included in the GWAS analysis. | Optional | integer | 6136 |
| gwas_case_control_study | Only required if gwas_source is not a GWAS Catalog accession. Indicator of whether the GWAS design is case-control (TRUE) or quantitative/other (FALSE). | Optional | boolean | FALSE |
| gwas_sample_ancestry | Only required if gwas_source is not a GWAS Catalog accession. Free-text description of participant ancestry, as reported in the original study. | Optional | string | Finnish |
| gwas_sample_ancestry_label | Harmonised ancestry label appropriate for the sample. For label definitions, see Morales et al., 2018 (Table 1). Only required if gwas_source is not a GWAS Catalog accession. | Optional | string (controlled vocabulary) | European |
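
The conditional rule in this table (the gwas_sample_* fields are only required when gwas_source is not a GWAS Catalog accession) can be checked mechanically. The following is a minimal sketch, assuming the dataset description is held in a dict keyed by field name. The helper name `missing_dataset_fields` is hypothetical, and treating an absent gwas_source as "not a GWAS Catalog accession" is an assumption of this sketch.

```python
import re

# Pattern taken from the Data_format column of gwas_source.
GCST_PATTERN = re.compile(r"GCST[0-9]+")

# Fields described as "Only required if gwas_source is not a GWAS Catalog accession".
CONDITIONAL_GWAS_FIELDS = [
    "gwas_sample_description",
    "gwas_sample_size",
    "gwas_case_control_study",
    "gwas_sample_ancestry",
    "gwas_sample_ancestry_label",
]


def missing_dataset_fields(meta):
    """Return names of dataset-description fields that are absent or empty."""
    missing = []
    if not meta.get("trait_description"):  # marked Mandatory in the table
        missing.append("trait_description")
    gwas_source = str(meta.get("gwas_source") or "")
    if not GCST_PATTERN.fullmatch(gwas_source):
        for field in CONDITIONAL_GWAS_FIELDS:
            value = meta.get(field)
            # Compare against None/"" so that a legitimate FALSE is not flagged.
            if value is None or value == "":
                missing.append(field)
    return missing
```
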
| Field | Description | Requirement | Data_format | Example |
|---|---|---|---|---|
| variant_type | Explanation of how the main variant was selected (e.g., lead, sentinel, index, mixed). | Recommended | string | lead |
| variant_information | Additional free-text notes about the variant (e.g., selection method, quality thresholds, imputation info). | Optional | string | Select variant with lowest p-value at locus among three GWAS studies |
| genome_build | Genome assembly used to map variants. | Mandatory | GRCh38, GRCh37, NCBI36, NCBI35, NCBI34 | GRCh38 |
| gene_id_source_version | Version of the gene identifier source (e.g., Ensembl release). | Optional | string | Ensembl v109 |
| gene_symbol_source_version | Version of the gene symbol reference authority (e.g., HGNC release). | Optional | string | HGNC 2025-07-30 |
| gene_info | Additional gene-level metadata that supports interpretation. | Optional | string | / |
| locus_type | Method used to define locus boundaries (e.g., LD region, ±500 kb window, fine-mapped credible set). | Optional | string | LD |
| locus_id | Explanation of how the locus identifier was derived (e.g., lead SNP rsID, cytoband). | Optional | string | lead SNP rsID |
| locus_info | Additional information supporting locus interpretation (e.g., recombination hotspot boundaries, fine-mapping method). | Optional | string | Defined as ±500 kb around lead SNP |
| Field | Description | Requirement | Data_format / Allowed values | Example |
|---|---|---|---|---|
| evidence_stream_tag | Specific analysis stream within the evidence category. | Mandatory | string (e.g. eQTL, pQTL, sQTL, TWAS, PWAS etc.) | eQTL-pancreas |
| evidence_category | Full evidence category name from the controlled list. | Mandatory | Controlled vocabulary | Molecular QTL |
| evidence_category_abbreviation | Short label assigned from the controlled list of evidence categories. | Mandatory | Controlled vocabulary | QTL |
| variant_or_gene_centric | Indicates whether the evidence originates from variant-level or gene-level analysis. | Mandatory | variant-centric or gene-centric | variant-centric |
| source_tag | Identifier for the data source, created in the source tab. | Optional | string (preferred: lowercase with underscores) | source_gtex_pancreas |
| method_tag | Identifier for the analysis method, created in the method tab. | Optional | string (preferred: lowercase with underscores) | method_fastqtl |
| threshold | Threshold applied to define significance or inclusion criteria. | Optional | logical expression / numeric cutoff | < 0.05 |
| note | Additional free text clarifications to aid interpretation. | Optional | string | Adjusted for covariates (age, sex, BMI) |
| column_header | Unique column name used in the PEG evidence matrix. Should follow a consistent naming convention. | Mandatory | any, suggest category_stream_[xyz] | QTL_eQTL-pancreas_pvalue |
| column_description | Free text explanation of the content in this column. | Mandatory | string | p-value from eQTL analysis in pancreas tissue |
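
The suggested column_header convention (category, stream, descriptor) and the uniqueness requirement can be illustrated with a short sketch. It assumes the three parts are simply joined with underscores, as in the example QTL_eQTL-pancreas_pvalue. Both function names are hypothetical.

```python
from collections import Counter


def build_column_header(category_abbreviation, stream_tag, descriptor):
    """Join the three parts with underscores, e.g. QTL_eQTL-pancreas_pvalue."""
    parts = [category_abbreviation, stream_tag, descriptor]
    if any(not part for part in parts):
        raise ValueError("all three header parts must be non-empty")
    return "_".join(parts)


def duplicated_headers(headers):
    """Return headers that occur more than once, in first-seen order."""
    counts = Counter(headers)
    return [header for header, n in counts.items() if n > 1]
```

Running `duplicated_headers` over all column_header values before building the matrix is one way to catch accidental name collisions early.
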
| Field | Description | Requirement | Data_format / Allowed values | Example |
|---|---|---|---|---|
| integration_tag | Author-assigned integration analysis name that can be cited as evidence in the integrated_analysis_name field. | Mandatory | string | pops |
| evidence_streams_included | A list of variant-centric or gene-centric evidence stream names combined in the integration. | Optional | List of controlled terms, separated by "\|" | FUNC \| eQTL \| pQTL \| FM \| 3D \| PHEWAS \| TWAS |
| integrations_included | A list of integration_tag values from other integration analyses that are included in this integration analysis. | Optional | List of integration tags, separated by "\|" | pops \| flame |
| method_tag | Identifier for the analysis method (from Method tab). | Mandatory | string (preferred: lowercase with underscores) | soft_pops |
| threshold | Threshold applied to define significance or inclusion criteria. | Optional | logical expression / numeric cutoff | > 3 |
| note | Extra details to aid interpretation. | Optional | string | Weighted by tissue-specific relevance |
| column_header | Column name in the PEG evidence matrix. | Mandatory | string (format: INT_[integration_tag]_[descriptor]) | INT_pops |
| column_description | Explanation of the content in this column. | Mandatory | string | Integrated score for prioritised gene using PoPS (gene prioritisation method combining GWAS signals, expression, pathways, and PPI data). |
| author_conclusion | Indicates whether values in this column reflect the authors' conclusions for defining the PEG list. NOTE: only ONE column per matrix can be assigned True. PEGASUS recommends including the string 'author_conclusion' in the appropriate column header. | Mandatory | Boolean (True / False) | True |
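
Two rules in this table lend themselves to a quick check: the pipe-separated list fields and the limit of one author_conclusion column per matrix. The sketch below assumes integration metadata rows are dicts keyed by field name and that author_conclusion may arrive either as a boolean or as the string True/False. The helper names are illustrative only.

```python
def split_pipe_list(value):
    """Split a "|"-separated metadata cell into a list of trimmed terms."""
    if not value:
        return []
    return [term.strip() for term in value.split("|") if term.strip()]


def author_conclusion_headers(integration_rows):
    """Return column_header values whose author_conclusion is True."""
    flagged = []
    for row in integration_rows:
        value = row.get("author_conclusion")
        # str() lets the same comparison handle booleans and "True"/"False" strings.
        if str(value).strip().lower() == "true":
            flagged.append(row["column_header"])
    return flagged
```

A matrix respects the rule when `author_conclusion_headers(rows)` returns no more than one entry.
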
If a source_tag is used for any evidence, please follow the requirements for each attribute in the source entity to provide more details. If that's not applicable, feel free to skip this entity.
- Provenance
- Biosample
| Field | Description | Requirement | Data_format / Allowed values | Example |
|---|---|---|---|---|
| source_tag | Unique identifier for the source; this tag is referenced in the evidence metadata and integration metadata. | Mandatory | any (preferred: lowercase with underscores) | source_gtex_pancreas |
| provenance | Project, database, or lab providing the data. | Mandatory | string | • Use |
| file_name | Exact filename of the source dataset. | Optional | string | GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct.gz |
| version | Version or release of the dataset. | Optional | string | v8 |
| url | Web link to the source file used in the analysis. If multiple files are used, list each on a new line; other columns should remain identical. | Optional | URL | https://gtexportal.org/ |
| accession_id | Accession identifier if the source file comes from the repository. | Optional | string | phs000424.v8.p2 |
| doi | DOI for the publication containing the source file. | Optional | DOI string | 10.1038/ng.2653 |
| note | Extra details to aid interpretation of the source. | Optional | string | The analysis includes only samples from individuals aged 20-29. |
| Field | Description | Requirement | Data_format / Allowed values | Example |
|---|---|---|---|---|
| Tissue | Primary tissue sampled (broad anatomical source). | Optional | string | pancreas |
| sample_origin | Biological origin of the sample. | Optional | primary-tissue, organoid, cell-line, iPSC-derived, etc. | primary-tissue |
| Cell_type | Specific cell type, if applicable. | Optional | string | alpha cells |
| cell_line | Cell line name if sample_origin = "cell-line". | Optional | string | HeLa, K562 |
| disease | Disease status of the donor or sample. | Optional | healthy or disease | healthy |
| life_stage | Developmental stage of the biosample. | Optional | string (e.g., fetal, adult, embryonic, iPSC) | adult |
| treatment | Treatments or perturbations applied prior to or during data generation. | Optional | string | anti-IgM treated |
| sex | Sex composition of samples. | Optional | male, female, mixed | mixed |
| species | Organism from which the biosample is derived (full Latin name). | Optional | string | Homo sapiens |
| description | Brief summary of sample characteristics and intended use. | Optional | any | Bulk pancreas tissue from healthy adult donors in GTEx v8. Used for eQTL discovery. Donors male and female, aged 20-70 years, no treatment. |
If a method_tag is used for any evidence, please follow the requirements for each attribute in the method entity to provide more details. If that's not applicable, feel free to skip this entity.
| Field | Description | Requirement | Data_format / Allowed values | Example |
|---|---|---|---|---|
| method_tag | Unique identifier for the method, used in the PEG evidence and integration metadata. | Mandatory | string (lowercase with underscores) | soft_cadd |
| method_mode | Specifies whether the method is a published software tool or a manual approach. If software, provide name, version, URL, and DOI. If manual, describe in method_description. | Mandatory | software, manual | software |
| method_mode_ontology_term_id | For evidence: | Mandatory | string | ECO_0007673 |
| software_name | Name of the software used (if method_mode = software). | Optional | string | CADD |
| software_version | Version of the software used. | Optional | string | v1.6 |
| software_url | Link to the official software resource. | Optional | URL | https://cadd.gs.washington.edu/ |
| software_doi | DOI of the software or associated publication. | Optional | DOI string | 10.1038/ng.2474 |
| method_description | Detailed description of the method, workflow, or customisation applied. | Optional | string | Custom scoring model combining eQTL and chromatin interaction data. |
| note | Extra details to aid interpretation of the method. | Optional | string | CADD scores were used for variant annotation. Variants with low predicted impact were filtered prior to annotation. |
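
The method_mode description asks for name, version, URL, and DOI when the mode is software, and for method_description when it is manual, although the individual fields are listed as Optional. A minimal sketch of a reviewer-style check follows. It only reports fields worth a second look rather than enforcing them, and the function name `method_fields_to_review` is an assumption of this example.

```python
# Fields the method_mode description asks for under each mode.
FIELDS_BY_MODE = {
    "software": ["software_name", "software_version", "software_url", "software_doi"],
    "manual": ["method_description"],
}


def method_fields_to_review(method):
    """Return fields that method_mode suggests should be filled but are empty."""
    mode = method.get("method_mode")
    if mode not in FIELDS_BY_MODE:
        raise ValueError("method_mode must be 'software' or 'manual'")
    return [field for field in FIELDS_BY_MODE[mode] if not method.get(field)]
```

For the example row above (CADD v1.6 with URL and DOI), the function returns an empty list.
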