PEG Metadata Standard
To make it easier to prepare your metadata, we provide a Google Sheet template that guides you through filling in the required fields.
We also provide detailed explanations for each field, together with example data, in the tables below.
- Dataset description
- Genomic Identifier
- Evidence
- Integration
- Source
- Method
Dataset description
The dataset description records the PEG list source, the studied trait, and the GWAS source. If the GWAS is not in the GWAS Catalog, include extra population details such as ancestry and sample size.
- PEG list
- Trait
- GWAS (If it comes from GWAS Catalog)
- GWAS (other resources)
Field | Description | Mandatory | Data_format | Example |
---|---|---|---|---|
peg_source | Identifier of the origin of the PEG list (e.g., publication, DOI, preprint, URL). | Mandatory | PMID, DOI, URL | PMID:36357675 |
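As a rough illustration, the three accepted `peg_source` formats could be told apart with a check like the one below. The function name `classify_peg_source` and the exact patterns (a `PMID:` prefix followed by digits, a bare DOI starting with `10.`, an `http`/`https` URL) are assumptions made for this sketch, not rules stated by the standard.

```python
import re
from urllib.parse import urlparse


def classify_peg_source(value: str) -> str:
    """Guess which of the accepted formats a peg_source value uses.

    Hypothetical helper: the patterns below are illustrative assumptions.
    """
    value = value.strip()
    # PubMed identifier written as in the example, e.g. "PMID:36357675"
    if re.fullmatch(r"PMID:\d+", value):
        return "PMID"
    # Bare DOI: "10." + registrant code + "/" + suffix
    if re.fullmatch(r"10\.\d{4,9}/\S+", value):
        return "DOI"
    # Anything with an http(s) scheme and a host is treated as a URL
    parsed = urlparse(value)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "URL"
    return "unknown"
```

A value classified as `unknown` would be a candidate for manual review rather than automatic rejection.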
Field | Description | Mandatory | Data_format | Example |
---|---|---|---|---|
trait_description | Free-text description of the phenotype under investigation. Should be concise but clear to a non-specialist. Avoid abbreviations. | Mandatory | string | Ascorbic acid 3-sulfate levels |
trait_ontology_id | Standard ontology identifier mapped to the trait (e.g., EFO, MONDO, HPO, DOID). Use the most specific term available. | Optional | CURIE (ontology prefix:ID) | EFO_0800173 |
Field | Description | Mandatory | Data_format | Example |
---|---|---|---|---|
gwas_source | Identifier of the GWAS source. Prefer GWAS Catalog accession (GCST); if not available, use PubMed ID or another recognised accession. | Recommended | GCST[0-9]+, PMID, other accession ID | GCST000001 |
Field | Description | Mandatory | Data_format | Example |
---|---|---|---|---|
gwas_source | Identifier of the GWAS source. Prefer GWAS Catalog accession (GCST); if not available, use PubMed ID or another recognised accession. | Optional | GCST[0-9]+, PMID, other accession ID | GCST000001 |
gwas_sample_description | Only required if gwas_source is not a GWAS Catalog accession. Detailed description of the GWAS samples (e.g., cohort name, case/control numbers, ancestry). | Optional | string | 6,136 Finnish ancestry individuals |
gwas_sample_size | Only required if gwas_source is not a GWAS Catalog accession. Total number of individuals included in the GWAS analysis. | Optional | integer | 6136 |
gwas_case_control_study | Only required if gwas_source is not a GWAS Catalog accession. Indicator of whether the GWAS design is case–control (TRUE) or quantitative/other (FALSE). | Optional | boolean | FALSE |
gwas_sample_ancestry | Only required if gwas_source is not a GWAS Catalog accession. Free-text description of participant ancestry, as reported in the original study. | Optional | string | Finnish |
gwas_sample_ancestry_label | Harmonised ancestry label appropriate for the sample. For label definitions, see Morales et al., 2018 (Table 1). Only required if gwas_source is not a GWAS Catalog accession. | Optional | string (controlled vocabulary) | European |
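The conditional rule in this table (sample fields are only required when `gwas_source` is not a GWAS Catalog accession) can be checked mechanically. The following is a minimal sketch, assuming each dataset description is held as a Python dictionary keyed by field name; the function name `missing_gwas_fields` is hypothetical.

```python
import re

# Fields the table marks as "only required if gwas_source is not a GWAS Catalog accession"
CONDITIONAL_FIELDS = [
    "gwas_sample_description",
    "gwas_sample_size",
    "gwas_case_control_study",
    "gwas_sample_ancestry",
    "gwas_sample_ancestry_label",
]

# GWAS Catalog accession format given in the Data_format column
GCST_PATTERN = re.compile(r"GCST[0-9]+")


def missing_gwas_fields(record: dict) -> list[str]:
    """Return conditional fields that are absent or empty for a non-GCST source."""
    source = str(record.get("gwas_source", "")).strip()
    if GCST_PATTERN.fullmatch(source):
        # GWAS Catalog already holds the sample details, so nothing more is needed
        return []
    # Compare against None and "" explicitly so that FALSE and 0 count as filled in
    return [name for name in CONDITIONAL_FIELDS if record.get(name) in (None, "")]
```

Comparing against `None` and the empty string, rather than testing truthiness, keeps a legitimate `FALSE` in `gwas_case_control_study` from being reported as missing.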
Genomic Identifier
This section describes how variants are selected, which gene identifier system and version are used, and how loci are defined and named.
- Variant
- Gene
- Locus
Field | Description | Mandatory | Data_format | Example |
---|---|---|---|---|
variant_type | Explanation of how the main variant was selected (e.g., lead, sentinel, index, mixed). | Recommended | string | lead |
variant_information | Additional free-text notes about the variant (e.g., selection method, quality thresholds, imputation info). | Optional | string | Select variant with lowest p-value at locus among three GWAS studies |
genome_build | Genome assembly used to map variants. | Recommended | GRCh38, GRCh37, NCBI36, NCBI35, NCBI34 | GRCh38 |
Field | Description | Mandatory | Data_format | Example |
---|---|---|---|---|
gene_id_source_version | Version of the gene identifier source (e.g., Ensembl release). | Optional | string | Ensembl v109 |
gene_symbol_source_version | Version of the gene symbol reference authority (e.g., HGNC release). | Optional | string | HGNC 2025-07-30 |
gene_info | Additional gene-level metadata that supports interpretation. | Optional | string | / |
Field | Description | Mandatory | Data_format | Example |
---|---|---|---|---|
locus_type | Method used to define locus boundaries (e.g., LD region, ±500 kb window, fine-mapped credible set). | Optional | string | LD |
locus_id | Explanation of how the locus identifier was derived (e.g., lead SNP rsID, cytoband). | Optional | string | lead SNP rsID |
locus_info | Additional information supporting locus interpretation (e.g., recombination hotspot boundaries, fine-mapping method). | Optional | string | Defined as ±500 kb around lead SNP |
Evidence
Field | Description | Mandatory | Data_format / Allowed values | Example |
---|---|---|---|---|
Column_name | Unique column name used in the PEG evidence matrix. Should follow a consistent naming convention. | Mandatory | any; suggested pattern: category_stream_[xyz] | QTL_eQTL-pancreas_pvalue |
Column_description | Free text explanation of the content in this column. | Mandatory | string | p-value from eQTL analysis in pancreas tissue |
Stream_name | Specific analysis stream within the evidence category. | Optional | string (e.g., eQTL, pQTL, sQTL, TWAS, PWAS) | eQTL |
Category | Full evidence category name from the controlled list. | Mandatory | Controlled vocabulary | Molecular QTL |
Category_abbreviation | Short label assigned from the controlled list of evidence categories. | Mandatory | Controlled vocabulary | QTL |
Class | Indicates whether the evidence originates from variant-level or gene-level analysis. | Mandatory | variant-centric or gene-centric | variant-centric |
Source_tag | Identifier for the data source, created in the source tab. | Mandatory | any (preferred: lowercase with underscores) | source_gtex_pancreas |
Method_tag | Identifier for the analysis method, created in the method tab. | Mandatory | any (preferred: lowercase with underscores) | method_fastqtl |
Threshold | Threshold applied to define significance or inclusion criteria. | Optional | logical expression / numeric cutoff | < 0.05 |
Notes | Additional free text clarifications to aid interpretation. | Optional | any | Adjusted for covariates (age, sex, BMI) |
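The suggested `category_stream_[xyz]` naming pattern for evidence columns can be split back into its parts. This is a sketch under two assumptions that the standard does not state: that underscores separate exactly three parts, and that the stream part itself contains no underscore (as in the example, which uses a hyphen in `eQTL-pancreas`). The names `EvidenceColumn` and `parse_column_name` are hypothetical.

```python
from typing import NamedTuple


class EvidenceColumn(NamedTuple):
    category_abbreviation: str  # e.g. "QTL"
    stream: str                 # e.g. "eQTL-pancreas"
    descriptor: str             # e.g. "pvalue"


def parse_column_name(name: str) -> EvidenceColumn:
    """Split a column name of the form category_stream_descriptor."""
    # Split on the first two underscores only, so the descriptor may contain more
    parts = name.split("_", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"{name!r} does not follow category_stream_[xyz]")
    return EvidenceColumn(*parts)
```

Because the naming pattern is only a suggestion, a failed parse is better treated as a warning than as a validation error.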
Integration
Field | Description | Mandatory | Data_format / Allowed values | Example |
---|---|---|---|---|
Column_name | Column name in the PEG evidence matrix. Use a consistent naming convention. | Mandatory | INT_[descriptor] | INT_pops |
Column_description | Explanation of the content in this column. | Mandatory | Free text | Integrated score for prioritised gene using PoPS (gene prioritisation method combining GWAS signals, expression, pathways, and PPI data). |
Integration_analysis | Author-assigned analysis name that can be cited as evidence in the integrated_analysis_name field. | Mandatory | curated, computational, mixed | computational |
Evidence_stream_name | A list of variant-centric or gene-centric evidence stream names combined in the integration. | Optional | List of controlled terms, separated by "\|" | FUNC \| eQTL \| pQTL \| FM \| 3D \| PHEWAS \| TWAS |
Integrated_analysis_name | A list of Int_tag values from other integration analyses that are used as evidence in this analysis. | Optional | List of controlled terms, separated by "\|" | FUNC \| eQTL \| pQTL \| FM \| 3D \| PHEWAS \| TWAS |
Method_tag | Identifier for the analysis method (from Method tab). | Mandatory | any (preferred: lowercase with underscores) | soft_pops |
Threshold | Threshold applied to define significance or inclusion criteria. | Optional | logical expression / numeric cutoff | > 3 |
Notes | Extra details to aid interpretation. | Optional | any | Weighted by tissue-specific relevance |
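A few of the constraints in the integration table lend themselves to a simple check: the `INT_[descriptor]` column-name pattern, the three allowed `Integration_analysis` values, and the `|`-separated list fields. The sketch below assumes each row is a dictionary keyed by field name; `split_list` and `check_integration_row` are hypothetical helper names.

```python
# Allowed values listed for Integration_analysis in the table above
ALLOWED_INTEGRATION_ANALYSIS = {"curated", "computational", "mixed"}


def split_list(cell: str) -> list[str]:
    """Split a '|'-separated cell into trimmed, non-empty items."""
    return [item.strip() for item in cell.split("|") if item.strip()]


def check_integration_row(row: dict) -> list[str]:
    """Return human-readable problems found in one integration metadata row."""
    problems = []
    if not str(row.get("Column_name", "")).startswith("INT_"):
        problems.append("Column_name should follow INT_[descriptor]")
    if row.get("Integration_analysis") not in ALLOWED_INTEGRATION_ANALYSIS:
        problems.append("Integration_analysis should be curated, computational, or mixed")
    return problems
```

For example, `split_list` turns the example cell for `Evidence_stream_name` into seven separate stream names, which can then be compared against the streams declared in the evidence metadata.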
Source
The source metadata file mainly records the provenance of source files and detailed biosample descriptions.
- Provenance
- Biosample
Field | Description | Mandatory | Data_format / Allowed values | Example |
---|---|---|---|---|
source_tag | Unique identifier for the source. This tag is referenced in the evidence metadata and integration metadata. | Mandatory | any (preferred: lowercase with underscores) | source_gtex_pancreas |
Provenance | Project, database, or lab providing the data. | Mandatory | • Use | GTEx |
File_Name | Exact filename of the source dataset. | Optional | any | GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct.gz |
Version | Version or release of the dataset. | Optional | any | v8 |
URL | Web link to the source file used in the analysis. If multiple files are used, list each on a new line; other columns should remain identical. | Optional | URL | https://gtexportal.org/ |
Accession | Accession identifier if the source file comes from a repository. | Optional | any (e.g., GEO, dbGaP, ENA ID) | phs000424.v8.p2 |
DOI | DOI for the publication containing the source file. | Optional | DOI string | 10.1038/ng.2653 |
Field | Description | Mandatory | Data_format / Allowed values | Example |
---|---|---|---|---|
Tissue | Primary tissue sampled (broad anatomical source). | Optional | any | pancreas |
Sample_origin | Biological origin of the sample. | Optional | primary-tissue, organoid, cell-line, iPSC-derived, etc. | primary-tissue |
Cell_type | Specific cell type, if applicable. | Optional | any | alpha cells |
Cell_line | Cell line name if Sample_origin = "cell-line". | Optional | any | HeLa, K562 |
Disease | Disease status of the donor or sample. | Optional | healthy or disease name | healthy |
Life_stage | Developmental stage of the biosample. | Optional | any (e.g., fetal, adult, embryonic, iPSC) | adult |
Treatment | Treatments or perturbations applied prior to or during data generation. | Optional | any | anti-IgM treated |
Sex | Sex composition of samples. | Optional | male, female, mixed | mixed |
Age | Age of donors (number, range, or developmental notation). | Optional | any | 20–70 years |
Species | Organism from which the biosample is derived. | Mandatory | Latin species name | Homo sapiens |
Description | Brief summary of sample characteristics and intended use. | Optional | any | Bulk pancreas tissue from healthy adult donors in GTEx v8. Used for eQTL discovery. Donors male and female, aged 20–70 years, no treatment. |
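Because each `Source_tag` in the evidence metadata is expected to point at a `source_tag` defined in the source metadata, dangling references can be detected with a set difference. This sketch assumes both tabs have been loaded as lists of dictionaries (for instance from exported CSV files); the function name `unresolved_source_tags` is hypothetical.

```python
def unresolved_source_tags(evidence_rows: list[dict], source_rows: list[dict]) -> list[str]:
    """Return Source_tag values used in the evidence tab but not defined in the source tab."""
    # Tags defined in the source tab (lowercase field name, as in the source table)
    defined = {row["source_tag"] for row in source_rows}
    # Tags referenced from the evidence tab (capitalised field name, as in the evidence table)
    referenced = {row["Source_tag"] for row in evidence_rows}
    return sorted(referenced - defined)
```

The same pattern would apply to `Method_tag` values checked against the `method_tag` column of the method metadata.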
Method
Field | Description | Mandatory | Data_format / Allowed values | Example |
---|---|---|---|---|
method_tag | Unique identifier for the method, used in the PEG evidence and integration metadata. | Mandatory | free text (lowercase with underscores) | soft_cadd |
method_mode | Specifies whether the method is a published software tool or a manual approach. If software, provide name, version, URL, and DOI. If manual, describe in method_description. | Mandatory | software, manual | software |
software_name | Name of the software used (if method_mode = software). | Optional | any | CADD |
software_version | Version of the software used. | Optional | any | v1.6 |
software_url | Link to the official software resource. | Optional | URL | https://cadd.gs.washington.edu/ |
software_doi | DOI of the software or associated publication. | Optional | DOI string | 10.1038/ng.2474 |
method_description | Detailed description of the method, workflow, or customisation applied. | Optional | any | Custom scoring model combining eQTL and chromatin interaction data. |
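The `method_mode` rule (software methods need name, version, URL, and DOI; manual methods need a description) can be expressed as a small check. The following is a minimal sketch, assuming each method row is a dictionary keyed by field name; `missing_method_fields` is a hypothetical helper name.

```python
# Fields that the method_mode description asks for in each mode
REQUIRED_BY_MODE = {
    "software": ["software_name", "software_version", "software_url", "software_doi"],
    "manual": ["method_description"],
}


def missing_method_fields(row: dict) -> list[str]:
    """Return the fields that still need to be filled in for the row's method_mode."""
    mode = row.get("method_mode")
    if mode not in REQUIRED_BY_MODE:
        # method_mode itself is mandatory and must be one of the two allowed values
        return ["method_mode"]
    missing = []
    for name in REQUIRED_BY_MODE[mode]:
        value = row.get(name)
        # Treat absent values and whitespace-only strings as missing
        if value is None or not str(value).strip():
            missing.append(name)
    return missing
```

An empty result means the row satisfies the mode-specific requirements described in the table.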