Data element definitions

The terms describing the data stored in the network of AIRR-seq repositories federated by the iReceptor Gateway are driven by the recommendations of the AIRR Community: the AIRR Minimal Standards (MiAIRR) for study metadata and the AIRR Rearrangement schema representing annotated rearrangements. For more information, visit the AIRR Community documentation page.

The definitions and relevant examples for each field are given below:

Study

Example
Contact info (collection) Full contact information of the data collector, i.e. the person who is legally responsible for data collection and release. This should include an e-mail address. Dr. P. Stibbons, p.stibbons@unseenu.edu
Contact info (deposition) Full contact information of the data depositor, i.e. the person submitting the data to a repository. This is supposed to be a short-lived and technical role until the submission is released. Adrian Turnipseed, a.turnipseed@unseenu.edu
Grant funding agency Funding agencies and grant numbers NIH, award number R01GM987654
Inclusion/exclusion criteria List of criteria for inclusion/exclusion for the study Include: Clinical P. falciparum infection; Exclude: Seropositive for HIV
Lab address Institution and institutional address of data collector School of Medicine, Unseen University, Ankh-Morpork, Disk World
Lab name Department of data collector Department for Planar Immunology
Relevant publications Publications describing the rationale and/or outcome of the study PMID85642
Study ID Unique ID assigned by study registry PRJNA001
Study title Descriptive study title Effects of sunlight exposure on the Treg repertoire
Study type Generic study design Case-Control Study

Subject

Example
Age Absolute age of subject at time point `Age event` 65 a
Age event Event in the study schedule to which `Age` refers. For NCBI BioSample this MUST be `sampling`. For other implementations, submitters need to be aware that there is currently no mechanism to encode the potential delta between `Age event` and `Sample collection time`; hence, the chosen events should be in temporal proximity. enrollment
Ancestry population Broad geographic origin of ancestry (continent) list of continents, mixed or unknown
Ethnicity Ethnic group of subject (defined as cultural/language-based membership) English, Kurds, Manchu, Yakuts (and other fields from Wikipedia)
Organism Species of subject (using binomial nomenclature) Homo sapiens
Race Racial group of subject (as defined by NIH) White, American Indian or Alaska Native, Black, Asian, Native Hawaiian or Other Pacific Islander, Other
Relation to other subjects Subject ID to which `Relation type` refers SUB1355648
Relation type Relation between subject and `linked_subjects`; can be genetic or environmental (e.g. exposure) father, daughter, household
Sex Biological sex of subject female
Strain name Non-human: designation of the strain or breed of animal used C57BL/6J
Subject ID Subject ID assigned by submitter, unique within study SUB856413
Synthetic library TRUE for libraries in which the diversity has been synthetically generated (e.g. phage display) FALSE

Sample

Example
Anatomic site The anatomic location of the tissue, e.g. Inguinal, femur Iliac crest
Biomaterial provider Name and address of the entity providing the sample Tissues-R-Us, Tampa, FL, USA
Collection time event Event in the study schedule to which `Sample collection time` relates Primary vaccination
Sample collection time Time point at which sample was taken, relative to `Collection time event` 14 d
Sample disease state Histopathologic evaluation of the sample Tumor infiltration
Sample ID Sample ID assigned by submitter, unique within study SUP52415
Sample type The way the sample was obtained, e.g. fine-needle aspirate, organ harvest, peripheral venous puncture Biopsy
Tissue The actual tissue sampled, e.g. lymph node, liver, peripheral blood Bone marrow
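
The `Sample collection time` example above (14 d) pairs a number with a unit, as does the `Age` example (65 a). A minimal Python sketch of splitting such a value into its parts is shown below. The `Quantity` type and the `parse_quantity` helper are hypothetical names used for illustration, and the sketch assumes the value and unit are separated by whitespace, as in the examples; it does not validate the unit against any vocabulary.

```python
from typing import NamedTuple


class Quantity(NamedTuple):
    """A numeric value with its unit, e.g. 14 and 'd' (hypothetical helper type)."""
    value: float
    unit: str


def parse_quantity(text: str) -> Quantity:
    """Split a string such as '14 d' into a number and a unit.

    Assumes exactly one number followed by one unit token, separated by
    whitespace. Raises ValueError for anything else.
    """
    parts = text.split()
    if len(parts) != 2:
        raise ValueError(f"expected '<number> <unit>', got {text!r}")
    return Quantity(float(parts[0]), parts[1])


collection_time = parse_quantity("14 d")
age = parse_quantity("65 a")
```

Keeping the unit as an uninterpreted string avoids guessing at unit semantics that the definitions above do not spell out.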

Diagnosis

Example
Diagnosis Diagnosis of subject Multiple myeloma
Disease stage Stage of disease at current intervention Stage II
Immunogen/agent Antigen, vaccine or drug applied to subject at this intervention bortezomib
Intervention definition Description of intervention systemic chemotherapy, 6 cycles, 1.25 mg/m2
Length of disease Time duration between initial diagnosis and current intervention 23 months
Medical history Medical history of subject that is relevant to assess the course of disease and/or treatment MGUS, first diagnosed 5 years prior
Prior therapies List of all relevant previous therapies applied to subject for treatment of `Diagnosis` melphalan/prednisone
Study group Designation of study arm to which the subject is assigned control

Cell Processing

Example
# cells/experiment Total number of cells that went into the experiment 1000000
# cells/sequencing reaction Number of cells for each biological replicate 50000
Cell isolation procedure Description of the procedure used for marker-based isolation or enrichment of cells Cells were stained with fluorochrome-labeled antibodies and then sorted on a FlowMerlin (CE) cytometer
Cell quality Relative amount of viable cells after preparation and (if applicable) thawing 90% viability as determined by 7-AAD
Cell storage TRUE if cells were cryo-preserved between isolation and further processing TRUE
Cell subset Commonly-used designation of isolated cell population class switched memory B cell
Cell subset phenotype List of cellular markers and their expression levels used to isolate the cell population CD19+ CD38+ CD27+ IgM- IgD-
Processing protocol Description of the methods applied to the sample, including cell preparation/isolation/enrichment and nucleic acid extraction. This should closely mirror the Materials and methods section in the manuscript Stimulated with anti-CD3/anti-CD28
Single-cell sort TRUE if single cells were isolated into separate compartments FALSE
Tissue processing Enzymatic digestion and/or physical methods used to isolate cells from sample Collagenase A/DNase I digested, followed by Percoll gradient

Nucleic Acid Processing

Example
Complete sequences To be considered `complete`, the procedure used for library construction MUST generate sequences that 1) include the first V segment codon that encodes the mature polypeptide chain (i.e. after the leader sequence) and 2) include the last complete codon of the J segment (i.e. 1 bp before the J->C splice site) and 3) provide sequence information for all positions between 1) and 2). To be considered `complete & untemplated`, the sections of the sequences defined in points 1) to 3) of the previous sentence MUST be untemplated, i.e. MUST NOT overlap with the primers used in library preparation. partial
Fwd PCR primer target Position of the most distal nucleotide templated by the forward primer or primer mix IGHV, +23
Library generation method Generic type of library generation RT(oligo-dT)+PCR
Library generation protocol Description of processes applied to substrate to obtain a library that is ready for sequencing cDNA was generated using
Linkage of loci Describes the mode of linkage if a method was used which physically links nucleic acids derived from distinct loci in a single-cell context IGH-IGK/IGL-head/head
Protocol IDs When using a library generation protocol from a commercial provider, provide the protocol version number v2.1 (2016-09-15)
Rev PCR primer target Position of the most proximal nucleotide templated by the reverse primer or primer mix IGHG, +57
Target locus for PCR Designation of the target locus according to IMGT nomenclature IGK
Target substrate The class of nucleic acid that was used as primary starting material for the following procedures RNA
Target substrate quality Description and results of the quality control performed on the template material RIN 9.2
Template amount Amount of template that went into the process 1000 ng
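
The conditions in the `Complete sequences` definition above can be read as a small decision rule. The Python sketch below is illustrative only: the function name and its boolean parameters are assumptions, and the returned labels are the ones used in the definition and example above rather than a confirmed controlled vocabulary.

```python
def classify_completeness(
    has_first_v_codon: bool,
    has_last_j_codon: bool,
    covers_all_between: bool,
    overlaps_primers: bool,
) -> str:
    """Classify a library construction procedure per the definition above.

    has_first_v_codon:   condition 1), first codon of the mature V segment is included
    has_last_j_codon:    condition 2), last complete J segment codon is included
    covers_all_between:  condition 3), every position between 1) and 2) is sequenced
    overlaps_primers:    True if that span overlaps the library preparation primers
    """
    is_complete = has_first_v_codon and has_last_j_codon and covers_all_between
    if not is_complete:
        return "partial"
    if overlaps_primers:
        # Complete, but part of the span is templated by primers.
        return "complete"
    return "complete & untemplated"


# A procedure whose reads start inside the V segment fails condition 1).
label = classify_completeness(False, True, True, False)
```

Here `label` is "partial", matching the example value given for the field.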

Sequencing Run

Example
Batch number ID of sequencing run assigned by the sequencing facility 160101_M01234_0201_000000000-D2T7V
Date of sequencing run Date of sequencing run 2016-12-16
Read lengths Read length in bases for each direction [300,300]
Reads passing QC Number of usable reads for analysis 10365118
Sequencing facility Name and address of sequencing facility Seqs-R-Us, Vancouver, BC, Canada
Sequencing kit Name, manufacturer, order and lot numbers of sequencing kit FullSeq 600, Alumina, #M123456C0, 789G1HK
Sequencing platform Designation of sequencing instrument used Alumina LoSeq 1000

Software Processing

Example
Collapsing method The method used for combining multiple sequences from (4) into a single sequence in (5) MUSCLE 3.8.31
Data protocols General description of how QC is performed
Paired read assembly How paired end reads were assembled into a single receptor sequence PandaSeq (minimal overlap 50, threshold 0.8)
Primer match cutoffs How primers were identified in the sequences and whether they were removed, masked, etc.
Quality thresholds How sequences were removed from (4) based on base quality scores
Software tools/versions Version number and/or date; include company pipelines IgBLAST 1.6

Other

Example
Full-text search Search across all metadata fields (case insensitive) cancer tumor
Repository
Sequences

Rearrangement

Example
C Call C region gene with allele. For example, IGHM*01.
C Cigar CIGAR string for the C gene alignment.
C Germline Alignment Aligned constant region germline sequence spanning the same region as the c_sequence_alignment field and including the same set of corrections and spacers (if any).
C Germline Alignment AA Amino acid translation of the c_germline_alignment field.
C Identity Fractional identity for the C gene alignment.
C Score Alignment score for the C gene alignment.
C Sequence Alignment Aligned portion of query sequence assigned to the constant region, including any indel corrections or numbering spacers.
C Sequence Alignment AA Amino acid translation of the c_sequence_alignment field.
C Support C gene alignment E-value, p-value, likelihood, probability or other similar measure of support for the C gene assignment as defined by the alignment tool.
CDR1 Nucleotide sequence of the aligned CDR1 region.
CDR1 AA Amino acid translation of the cdr1 field.
CDR1 End CDR1 end position in the query sequence (1-based closed interval).
CDR1 Start CDR1 start position in the query sequence (1-based closed interval).
CDR2 Nucleotide sequence of the aligned CDR2 region.
CDR2 AA Amino acid translation of the cdr2 field.
CDR2 End CDR2 end position in the query sequence (1-based closed interval).
CDR2 Start CDR2 start position in the query sequence (1-based closed interval).
CDR3 Nucleotide sequence of the aligned CDR3 region.
CDR3 AA Amino acid translation of the cdr3 field.
CDR3 End CDR3 end position in the query sequence (1-based closed interval).
CDR3 Start CDR3 start position in the query sequence (1-based closed interval).
Cell ID Identifier defining the cell of origin for the query sequence.
Clone ID Clonal cluster assignment for the query sequence.
Consensus Count Number of reads contributing to the (UMI) consensus for this sequence. For example, the sum of the number of reads for all UMIs that contribute to the query sequence.
D Alignment End End position of the D segment in both the sequence_alignment and germline_alignment fields (1-based closed interval).
D Alignment Start Start position of the D segment in both the sequence_alignment and germline_alignment fields (1-based closed interval).
D Call D gene with allele. For example, IGHD3-10*01.
D Cigar CIGAR string for the D gene alignment.
D Germline Alignment Aligned D gene germline sequence spanning the same region as the d_sequence_alignment field and including the same set of corrections and spacers (if any).
D Germline Alignment AA Amino acid translation of the d_germline_alignment field.
D Germline End Alignment end position in the D gene reference sequence (1-based closed interval).
D Germline Start Alignment start position in the D gene reference sequence (1-based closed interval).
D Identity Fractional identity for the D gene alignment.
D Score Alignment score for the D gene alignment.
D Sequence Alignment Aligned portion of query sequence assigned to the D segment, including any indel corrections or numbering spacers.
D Sequence Alignment AA Amino acid translation of the d_sequence_alignment field.
D Sequence End End position of the D segment in the query sequence (1-based closed interval).
D Sequence Start Start position of the D segment in the query sequence (1-based closed interval).
D Support D gene alignment E-value, p-value, likelihood, probability or other similar measure of support for the D gene assignment as defined by the alignment tool.
Duplicate Count Copy number or number of duplicate observations for the query sequence. For example, the number of UMIs sharing an identical sequence or the number of identical observations of this sequence absent UMIs.
FWR1 Nucleotide sequence of the aligned FWR1 region.
FWR1 AA Amino acid translation of the fwr1 field.
FWR1 End FWR1 end position in the query sequence (1-based closed interval).
FWR1 Start FWR1 start position in the query sequence (1-based closed interval).
FWR2 Nucleotide sequence of the aligned FWR2 region.
FWR2 AA Amino acid translation of the fwr2 field.
FWR2 End FWR2 end position in the query sequence (1-based closed interval).
FWR2 Start FWR2 start position in the query sequence (1-based closed interval).
FWR3 Nucleotide sequence of the aligned FWR3 region.
FWR3 AA Amino acid translation of the fwr3 field.
FWR3 End FWR3 end position in the query sequence (1-based closed interval).
FWR3 Start FWR3 start position in the query sequence (1-based closed interval).
FWR4 Nucleotide sequence of the aligned FWR4 region.
FWR4 AA Amino acid translation of the fwr4 field.
FWR4 End FWR4 end position in the query sequence (1-based closed interval).
FWR4 Start FWR4 start position in the query sequence (1-based closed interval).
Germline Alignment Assembled, aligned, full-length inferred germline sequence spanning the same region as the sequence_alignment field (typically the V(D)J region) and including the same set of corrections and spacers (if any).
Germline Alignment AA Amino acid translation of the assembled germline sequence.
Germline Database Source of germline V(D)J genes with version number or date accessed. For example, 'IMGT/GENE-DB 3.1.18 (15 March 2018)'.
J Alignment End End position of the J segment in both the sequence_alignment and germline_alignment fields (1-based closed interval).
J Alignment Start Start position of the J segment in both the sequence_alignment and germline_alignment fields (1-based closed interval).
J Call J gene with allele. For example, IGHJ4*02.
J Cigar CIGAR string for the J gene alignment.
J Germline Alignment Aligned J gene germline sequence spanning the same region as the j_sequence_alignment field and including the same set of corrections and spacers (if any).
J Germline Alignment AA Amino acid translation of the j_germline_alignment field.
J Germline End Alignment end position in the J gene reference sequence (1-based closed interval).
J Germline Start Alignment start position in the J gene reference sequence (1-based closed interval).
J Identity Fractional identity for the J gene alignment.
J Score Alignment score for the J gene alignment.
J Sequence Alignment Aligned portion of query sequence assigned to the J segment, including any indel corrections or numbering spacers.
J Sequence Alignment AA Amino acid translation of the j_sequence_alignment field.
J Sequence End End position of the J segment in the query sequence (1-based closed interval).
J Sequence Start Start position of the J segment in the query sequence (1-based closed interval).
J Support J gene alignment E-value, p-value, likelihood, probability or other similar measure of support for the J gene assignment as defined by the alignment tool.
Junction AA Junction region amino acid sequence.
Junction Length Number of nucleotides in the junction sequence.
Junction NT Junction region nucleotide sequence, where the junction is defined as the CDR3 plus the two flanking conserved codons.
Locus Gene locus (chain type). For example, IGH, IGI, IGK, IGL, TRA, TRB, TRD, or TRG.
N1 Length Number of untemplated nucleotides 5' of the D segment.
N2 Length Number of untemplated nucleotides 3' of the D segment.
Np1 Nucleotide sequence of the combined N/P region between the V and D segments or V and J segments.
Np1 AA Amino acid translation of the np1 field.
Np1 Length Number of nucleotides between the V and D segments or V and J segments.
Np2 Nucleotide sequence of the combined N/P region between the D and J segments.
Np2 AA Amino acid translation of the np2 field.
Np2 Length Number of nucleotides between the D and J segments.
P3D Length Number of palindromic nucleotides 3' of the D segment.
P3V Length Number of palindromic nucleotides 3' of the V segment.
P5D Length Number of palindromic nucleotides 5' of the D segment.
P5J Length Number of palindromic nucleotides 5' of the J segment.
Productive True if the V(D)J sequence is predicted to be productive.
Rearrangement ID Identifier for the Rearrangement object. May be identical to sequence_id, but will usually be a universally unique record locator for database applications.
Rearrangement Set ID Identifier to the sequence annotation object for the repertoire in study metadata with the associated software processing for this rearrangement. If this field is empty then the primary sequence annotation is assumed.
Repertoire ID Identifier to the associated repertoire in study metadata.
Rev Comp True if the alignment is on the opposite strand (reverse complemented) with respect to the query sequence. If True then all output data, such as alignment coordinates and sequences, are based on the reverse complement of 'sequence'.
Sequence The query nucleotide sequence. Usually, this is the unmodified input sequence, which may be reverse complemented if necessary. In some cases, this field may contain consensus sequences or other types of collapsed input sequences if these steps are performed prior to alignment.
Sequence AA Amino acid translation of the query nucleotide sequence.
Sequence Alignment Aligned portion of query sequence, including any indel corrections or numbering spacers, such as IMGT-gaps. Typically, this will include only the V(D)J region, but that is not a requirement.
Sequence Alignment AA Amino acid translation of the aligned query sequence.
Sequence ID Unique query sequence identifier within the file. Most often this will be the input sequence header or a substring thereof, but may also be a custom identifier defined by the tool in cases where query sequences have been combined in some fashion prior to alignment.
Stop Codon True if the aligned sequence contains a stop codon.
V Alignment End End position of the V segment in both the sequence_alignment and germline_alignment fields (1-based closed interval).
V Alignment Start Start position of the V segment in both the sequence_alignment and germline_alignment fields (1-based closed interval).
V Call V gene with allele. For example, IGHV4-59*01.
V Cigar CIGAR string for the V gene alignment.
V Germline Alignment Aligned V gene germline sequence spanning the same region as the v_sequence_alignment field and including the same set of corrections and spacers (if any).
V Germline Alignment AA Amino acid translation of the v_germline_alignment field.
V Germline End Alignment end position in the V gene reference sequence (1-based closed interval).
V Germline Start Alignment start position in the V gene reference sequence (1-based closed interval).
V Identity Fractional identity for the V gene alignment.
V Score Alignment score for the V gene alignment.
V Sequence Alignment Aligned portion of query sequence assigned to the V segment, including any indel corrections or numbering spacers.
V Sequence Alignment AA Amino acid translation of the v_sequence_alignment field.
V Sequence End End position of the V segment in the query sequence (1-based closed interval).
V Sequence Start Start position of the V segment in the query sequence (1-based closed interval).
V Support V gene alignment E-value, p-value, likelihood, probability or other similar measure of support for the V gene assignment as defined by the alignment tool.
Vj In Frame True if the V and J segment alignments are in-frame.
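
Many of the position fields above are described as a "1-based closed interval", meaning that the first nucleotide is position 1 and both the start and end positions are included. The Python sketch below shows how such coordinates map onto Python's 0-based, half-open slices. The helper name `region_from_query` and the toy record are hypothetical; the sequence is far shorter than real data and is only meant to make the arithmetic easy to follow.

```python
def region_from_query(sequence: str, start: int, end: int) -> str:
    """Extract a region given 1-based closed (inclusive) coordinates.

    A 1-based closed interval [start, end] corresponds to the Python
    slice sequence[start - 1:end].
    """
    if start < 1 or end < start or end > len(sequence):
        raise ValueError(f"invalid interval [{start}, {end}] for length {len(sequence)}")
    return sequence[start - 1:end]


# Toy record using field names from the Rearrangement definitions (values invented).
record = {
    "sequence": "AAACCCGGGTTT",
    "cdr3_start": 4,
    "cdr3_end": 9,
}

cdr3 = region_from_query(record["sequence"], record["cdr3_start"], record["cdr3_end"])
# The length of a closed interval is end - start + 1.
cdr3_length = record["cdr3_end"] - record["cdr3_start"] + 1
```

With these values, `cdr3` is "CCCGGG" and `cdr3_length` is 6. The same conversion applies to the other start and end fields that refer to the query sequence, for example `v_sequence_start` and `v_sequence_end`.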