Translon: a single term for translated regions

Viero, Gabriella
2025

Abstract

During translation, ribosomes synthesize polypeptides using RNA molecules as templates. All cellular proteins are products of translation, and identifying protein-coding regions is the primary goal of genome annotation. Beyond protein synthesis, translation has long been known to have regulatory functions independent of its products [1,2]. However, the broad scale and complexity of translated regions were fully appreciated only with the advent of ribosome profiling. Recent interest in pervasive translation has exposed the lack of a general term for translated regions that does not depend on their sequence or on the properties of their products. In the absence of such a term, a range of inconsistently defined terms is used to describe them. These are typically variations on ‘open reading frame’ (ORF): for example, non-canonical ORF (ncORF), RiboSeq ORF, alternative ORF (altORF), translated ORF (tORF), small ORF (smORF), short ORF (sORF) and others. Such terms largely overlap and redundantly describe the same core concept: that the region in question is translated.
Denoting translated regions with ORF-based terms is problematic for two main reasons. First, in sequence analysis, ORFs are defined purely by the nucleotide sequence and the relevant genetic code (starts and stops; Fig. 1a). ORFs are therefore found everywhere in the genome, including regions that are not even transcribed. The ORF in this sense is used to predict protein-coding regions: because stop codons are avoided in coding regions, longer ORFs are more likely to contain sequences encoding proteins. Reusing the same term, ORF, for regions that are known or confidently predicted to be translated therefore leads to confusion. Indeed, for annotated protein-coding regions, a different technical term, CDS (historically, from ‘coding DNA sequence’), is commonly used.
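To make the sequence-only definition of an ORF concrete, here is a minimal sketch in Python. It assumes the standard genetic code, AUG as the only start, and a single strand; the function name `find_orfs` and the toy sequence are illustrative and not taken from the article. It reports ORFs from the sequence alone, with no evidence that any of them is transcribed or translated.

```python
# Toy ORF scan: an ORF here is AUG ... first in-frame stop, found from sequence alone.
STOPS = {"UAA", "UAG", "UGA"}  # assumption: standard genetic code, no recoding


def find_orfs(rna, min_codons=1):
    """Return (start, end, frame) for every AUG-to-stop ORF on the given strand.

    Coordinates are 0-based and end-exclusive, and include the stop codon.
    min_codons counts sense codons (AUG included, stop excluded).
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(rna) - 2, 3):
            codon = rna[i:i + 3]
            if start is None and codon == "AUG":
                start = i  # first AUG after the previous stop opens an ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next AUG in this frame
    return orfs


rna = "GGAUGAAAUGACCAUGGCUUAAUG"  # hypothetical sequence
for start, end, frame in find_orfs(rna):
    print(frame, start, end, rna[start:end])
# prints:
# 1 7 22 AUGACCAUGGCUUAA
# 2 2 11 AUGAAAUGA
```

The two ORFs found in this short toy sequence overlap in different frames, which illustrates why an ORF call by itself says nothing about whether the region is decoded by a ribosome.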
The second problem is that, although most known protein-coding regions in the majority of model organisms conform to an ORF, starting at an AUG and terminating at the first in-frame stop, there are many exceptions. In addition to AUG, other triplets can initiate translation [3] (Fig. 1a), making it challenging to determine start codons purely from the nucleotide sequence. The incorporation of non-standard proteinogenic amino acids (selenocysteine or pyrrolysine) usually occurs at stop codons within the CDS, with over a hundred such codons in selenoprotein-encoding CDSs in some species [4] (Fig. 1a). Furthermore, in some organisms, termination codons are defined not only by their sequence but also by their position within an mRNA. Ribosomal frameshifting (Fig. 1a), in which two different reading frames are translated to produce a single protein, is common in viruses. It also occurs in nuclear-encoded genes of most organisms and, in some species, in 5–20% of the genes [6].
Because the products of translating CDSs that contain frameshifting cannot be derived automatically by converting nucleotide triplets to amino acids, current annotation practice is to introduce a ‘pseudo-intron’ between two partial ORFs in the annotation of these genes. In the specific example shown in Fig. 1b, conceptual translation of the sequence containing such a 2-nt pseudo-intron gives a polypeptide sequence that differs from the full-length product of actual translation by one amino acid. Most concerning is the practice of automatically modifying RNA sequences inferred from the genome to obtain a full-length protein product via triplet translation (Fig. 1c). Such artificial fitting of well-established translation mechanisms to a simplified model reinforces the widespread fallacy that every translated region can be represented by a single ORF.
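The one-amino-acid discrepancy described above can be reproduced on a toy sequence. The sketch below is not the gene from Fig. 1b: the sequence, the slip position and the placement of the pseudo-intron are all assumptions chosen for illustration, and −1 frameshifting is simplified to "re-read one nucleotide". The helper `translate_segments` is a hypothetical name; the codon table is the standard genetic code.

```python
# Toy comparison: -1 frameshift decoding vs. a 2-nt pseudo-intron annotation.
from itertools import product

BASES = "UCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code, codons enumerated in UCAG order (third base fastest).
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_segments(rna, segments):
    """Join 0-based, end-exclusive segments, then translate triplets up to a stop."""
    joined = "".join(rna[s:e] for s, e in segments)
    protein = []
    for i in range(0, len(joined) - 2, 3):
        aa = CODE[joined[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


rna = "AUGGCUAAAAACUGAGCCCAUAA"  # hypothetical mRNA; assumed slip after nt 12

single_orf = translate_segments(rna, [(0, 23)])              # naive triplet reading
frameshift = translate_segments(rna, [(0, 12), (11, 23)])    # nt 11 is read twice
pseudo_intron = translate_segments(rna, [(0, 12), (14, 23)])  # nt 12-13 skipped

print(single_orf)     # MAKN
print(frameshift)     # MAKNLSP
print(pseudo_intron)  # MAKNSP
```

Skipping two nucleotides lands in the same reading frame as stepping back one, so the pseudo-intron version recovers the downstream residues but loses the codon that spans the slip site. Under these assumptions, that is where the single missing amino acid comes from.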
This artificial fitting leads to an inaccurate and oversimplified representation of the genetic information and molecular composition of cells, with potentially serious consequences for those unaware of the underlying makeshift solutions. Similar terminological confusion, such as the routine conflation of ‘exons’ and ‘protein-coding regions’ [7], highlights the need for precise vocabulary relating to gene expression.
These problems can be alleviated by introducing a specific term for a translated region, defined without reference to the product of translation or to the sequence of the region. For this purpose, we suggest the term ‘translon’ (short for ‘translated region’), which was introduced previously but failed to gain traction [8]. It aligns well with other terms describing gene structures: intron (‘intragenic region’) and exon (‘expressed region’) [9]. We intend the term translon to denote any region that is decoded by the ribosome. This ranges from minimal sequences with detectable translation (AUG followed by a stop) to sequences encoding long proteins in multiple reading frames, and even those interrupted by non-coding sequences, as in translational bypassing [10] (Fig. 1a). The term will also facilitate efforts to characterize unannotated translation: newly identified translated regions can be described as novel translons. Their biological roles, if any, can remain unassigned until specific information is obtained. These roles may be purely regulatory and independent of the translation products, or they may involve those products, such as short peptides that modulate the functions of other proteins, signaling peptides or antigens; some translons could encode neutral or even harmful ‘protein junk’. Translon would fill a gap in the vocabulary used to describe units of genetic information (Fig. 1d). Unlike the existing terminology, translon is defined directly, by the process it aims to capture, rather than indirectly, through sequence or function.
Most ORFs are not translons because they are not translated (Fig. 1d). All CDSs are translons, but not all translons are ORFs. We expect that the term translon will reduce confusion when discussing translated regions and will facilitate development of more biologically realistic annotations.
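The statement that not all translons are ORFs can be sketched as a small data structure. The `Translon` record and `conforms_to_orf` check below are hypothetical and are not a format proposed in the article. A translon is recorded as the ordered RNA segments that the ribosome decodes, and the check asks whether that record happens to coincide with a single AUG-to-stop ORF under the standard genetic code. The reverse question, whether a given ORF is a translon, depends on experimental evidence of translation and cannot be computed from sequence.

```python
from dataclasses import dataclass

STOPS = {"UAA", "UAG", "UGA"}  # standard-code stops; recoding is ignored here


@dataclass(frozen=True)
class Translon:
    """Hypothetical record: the RNA segments a ribosome decodes, in order.

    Segments are 0-based and end-exclusive. Overlaps can model -1 frameshifting;
    gaps can model +1 frameshifting or translational bypassing.
    """
    name: str
    segments: tuple


def conforms_to_orf(rna, translon):
    """True only for one contiguous AUG-to-first-in-frame-stop segment."""
    if len(translon.segments) != 1:
        return False
    start, end = translon.segments[0]
    region = rna[start:end]
    if len(region) < 6 or len(region) % 3 or not region.startswith("AUG"):
        return False
    codons = [region[i:i + 3] for i in range(0, len(region), 3)]
    return codons[-1] in STOPS and not any(c in STOPS for c in codons[:-1])


rna = "AUGUAAGGAUGGCUAAAAACUGAGCCCAUAA"  # hypothetical transcript
translons = [
    Translon("minimal (AUG + stop)", ((0, 6),)),
    Translon("canonical CDS", ((8, 23),)),
    Translon("-1 frameshift", ((8, 20), (19, 31))),
    Translon("bypass (hypothetical gap)", ((8, 17), (22, 31))),
]
for t in translons:
    print(t.name, conforms_to_orf(rna, t))
# prints:
# minimal (AUG + stop) True
# canonical CDS True
# -1 frameshift False
# bypass (hypothetical gap) False
```

Storing decoded segments rather than a single start and end is one possible design. It keeps frameshifted and bypassed regions representable without editing the underlying sequence.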
Istituto di Biofisica - IBF - Sede Secondaria Trento
Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/557963