The following table presents some of the largest genes in the human genome. Sizes are the spans of the transcribed regions, rounded to the nearest 0.01 Mb. For comparison, genes encoding some of the largest proteins have been included. For more information about these genes, see the section listed in the right column. Some of these genes produce a very large number of transcripts and isoforms. Many have functions in the development of the nervous system. Note that the links for the gene names point to a single isoform / transcript and that the reference set may include others.
|Largest Genes in the Genome|
|PTPRD||2.30||receptor protein tyrosine phosphatase D||Protein Tyrosine Phosphatases|
|CSMD1||2.06|| ||Additional Interaction Domain Families|
|MACROD2||2.06|| ||Additional Genes in Development|
|EYS||1.99|| ||Crystallins and Other Eye proteins|
|LRP1B||1.90||lipoprotein receptor family||Lipoproteins|
|CTNNA3||1.78||α catenin 3||Cadherins and Related Proteins|
|A2BP1||1.69||ataxin 2 binding protein||Cerebellum|
|FHIT||1.50||dinucleoside triphosphate hydrolase||Nucleotide Pathways|
|GPC5||1.47||glypican 5||Protein Glycosylation|
|MAGI2||1.44||membrane guanylate kinase||PDZ Domain|
|DPP10||1.40||dipeptidyl peptidase family||Serine Proteases|
|IL1RAPL1||1.37||receptor accessory protein||Interleukins and Their Receptors|
|PRKG1||1.30||protein kinase||Cyclic Nucleotides|
|DAB1||1.25||D. melanogaster disabled homolog 1||Additional Membrane Functions|
|ANKS1B||1.25||cajalin 2||Nucleus and Nucleolus|
|CSMD3||1.21|| ||Additional Interaction Domain Families|
|IL1RAPL2||1.20||receptor accessory protein||Interleukins and Their Receptors|
|AUTS2||1.19|| ||Fibroblast Growth Factors|
|DCC||1.19||netrin receptor||Netrins and Laminins|
|GPC6||1.18||glypican 6||Protein Glycosylation|
|CDH13||1.17||cadherin 13||Cadherins and Related Proteins|
|ERBB4||1.16||EGF receptor family||Epidermal Growth Factor|
|ACCN1||1.14||cation channel||Sodium Channels|
|CTNNA2||1.14||α catenin 2||Cadherins and Related Proteins|
|SPAG16||1.13||sperm antigen||Testes and Sperm|
|PTPRT||1.12||protein tyrosine phosphatase||Protein Tyrosine Phosphatases|
|CDH12||1.10||cadherin 12||Cadherins and Related Proteins|
|DPP6||~ 1.10||dipeptidyl peptidase family||Serine Proteases|
|PARD3B||1.07||tight junction protein||PDZ Domain|
|PTPRN2||1.05||protein tyrosine phosphatase||Protein Tyrosine Phosphatases|
|SOX5||1.03||transcription factor||SOX Family|
|Genes for Large Proteins|
|MUC16||0.13||mucin 16 (CA-125 antigen)||Mucins|
TTN and MUC16 are the largest proteins in the reference set but their genes are only a fraction of the size of the largest genes.
As can be seen in the following figure, in general, these large genes are dispersed along the chromosomes; however, SPAG16 and ERBB4 are very close to each other on chromosome 2. GPC5 and GPC6 are near each other on chromosome 13. Note the absence of large genes on the gene-rich chromosomes 19 and 22.
Related proteins are sometimes encoded by genes that have very different sizes. Although utrophin (UTRN) is encoded by a large gene (0.56 Mb), it is only a fraction of the size of dystrophin (DMD, 2.22 Mb). DAB2 is a 0.05-Mb gene, much smaller than DAB1 (1.25 Mb). LRP1 (0.08 Mb) is also much smaller than LRP1B (1.9 Mb).
As seen in the preceding table, two of the neurexins are encoded by very large genes but the third family member, NRXN2, is only 0.12 Mb. A similar situation is found with the roundabout (ROBO) family and several other neuronal protein families (see Neurons). The SNRPN gene in the Prader–Willi imprinted region and the SNRPB gene (see Capping and Splicing) also differ greatly in size but encode similar-sized proteins.
Alternate transcripts are a mechanism for producing isoforms targeted to distinct subcellular compartments. Isoforms are also produced in a tissue-specific manner. HK1 (see Hexokinases and Initial Sugar Metabolism) produces several isoforms from different transcripts, some of which are testes-specific.
The following figure shows the distribution of exon number for human genes. The number of genes with a given exon count is the y coordinate. It uses the gene set described at the beginning of this section. The distribution has a mode of four exons and a median of eight exons. The small number of genes with over 100 exons (see table later in this section) is not plotted.
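The summary statistics quoted for the exon-number distribution can be reproduced from a list of per-gene exon counts. The following is a minimal sketch, not the method used to prepare the figure: the function name, the 100-exon plotting cutoff as a parameter, and the toy counts are all assumptions made for illustration.

```python
from collections import Counter
from statistics import median


def exon_count_distribution(exon_counts, max_plotted=100):
    """Summarize per-gene exon counts.

    Returns (plotted, mode, median_count), where `plotted` maps an exon
    count to the number of genes with that count, omitting genes with
    more than `max_plotted` exons (assumed cutoff, mirroring the text).
    """
    histogram = Counter(exon_counts)
    # Most frequent exon count; the smaller count wins a tie.
    mode = max(histogram, key=lambda n: (histogram[n], -n))
    plotted = {n: c for n, c in sorted(histogram.items()) if n <= max_plotted}
    return plotted, mode, median(exon_counts)


# Hypothetical toy data, not taken from the reference set.
toy_counts = [1, 4, 4, 4, 8, 8, 9, 12, 146]
plotted, mode, med = exon_count_distribution(toy_counts)
print(plotted, mode, med)
```

The median is computed over all genes, including those left off the plot, so the cutoff affects only what is drawn.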
Many genes are interrupted by an extremely large number of introns. The following table presents some of them. Note that these are not the largest genes in the genome, but they encode many of the largest proteins. Only one transcript from each gene was used. The total number of exons for the gene may be larger than shown. Also, not all transcripts from these genes may be present in the current data sets. The links in the table point to the isoform / transcript with the indicated number of exons.
|Genes with the Most Exons|
|SYNE1||146||nesprin 1||Spectrin and Plectin Families|
|COL7A1||118||collagen type VII||Collagen|
|SYNE2||116||nesprin 2||Spectrin and Plectin Families|
|HMCN1||107||hemicentin 1||Additional Immunoglobulin-related Receptors|
|RYR1||106||skeletal muscle ryanodine receptor||Muscle|
|UBR4||106||retinoblastoma-associated protein||RB1 and Related Functions|
|RYR2||105||cardiac muscle ryanodine receptor||Muscle|
|SSPO||103||subcommissural organ spondin||Additional Genes in Development|
|MDN1||102||midasin||Nucleus and Nucleolus|
Many proteins are encoded by genes with a single exon or have multiple exons but no introns in their protein-coding regions. Examples are found in the histones, the olfactory and other G protein-coupled receptors, the interferons, and some members of the FOX family. As seen in the preceding table, large proteins are generally encoded in genes interrupted by many introns. A notable exception is EPPK1 (epiplakin, a protein of over 5000 amino acids), which may lack introns in most or all of its coding sequence.
The following figures show that exon number correlates more strongly with protein size than with gene size, notably for genes with many exons.
The plot on the left has protein size (log scale) on the x-axis. Gene size (log scale) is the x-axis in the plot at right. Exon number is given on the y-axis (log scale). Single-exon genes are the points along the x-axis. Note the differing scales on the x-axes. The log scales help present the wide data range. The gene set used here is the same as that used in the figure on exon numbers for human genes. Gene size is the span of the transcribed region. The UTRs may be underestimated (see below).
The final plot in this series presents gene size against protein size. A positive correlation is observed.
The same gene / transcript set was used as in the previous figures. The roughly linear set of points at the bottom of the cluster derives from single-exon genes with very small reported UTRs.
Introns vary over a very large size range. The following table uses the same gene set used to produce the figures on exon numbers to present median intron sizes. The table shows data for genes with 2 through 16 exons (1 through 15 introns). The "Gene count" column is the number of examples of that type. Note the greatly increased size of the first introns of genes compared to their subsequent introns and the increasing size of first and other early introns for genes with many exons.
|Median Intron Sizes|
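A table of median intron sizes by position can be derived from exon coordinates. The sketch below is one plausible way to do it, under stated assumptions: exons are given as 1-based inclusive (start, end) pairs, each transcript carries a strand so that "first intron" means first in transcript order, and all function names and toy coordinates are hypothetical.

```python
from collections import defaultdict
from statistics import median


def intron_sizes(exons, strand="+"):
    """Intron lengths in transcript order from (start, end) exon pairs."""
    ordered = sorted(exons)
    gaps = [nxt[0] - prev[1] - 1 for prev, nxt in zip(ordered, ordered[1:])]
    # On the minus strand the transcript runs from high to low coordinates.
    return gaps if strand == "+" else gaps[::-1]


def median_intron_table(transcripts, min_exons=2, max_exons=16):
    """Median intron size keyed by (exon_count, intron_position)."""
    sizes = defaultdict(list)
    gene_counts = defaultdict(int)
    for exons, strand in transcripts:
        n = len(exons)
        if not (min_exons <= n <= max_exons):
            continue  # single-exon genes have no introns
        gene_counts[n] += 1
        for position, size in enumerate(intron_sizes(exons, strand), start=1):
            sizes[(n, position)].append(size)
    medians = {key: median(values) for key, values in sizes.items()}
    return medians, dict(gene_counts)


# Hypothetical toy transcripts (coordinates invented for illustration).
toy = [
    ([(1, 100), (1101, 1200), (1401, 1500)], "+"),
    ([(1, 100), (301, 400), (2401, 2500)], "-"),
    ([(1, 50)], "+"),
]
medians, gene_counts = median_intron_table(toy)
print(medians, gene_counts)
```

Keying the result by both exon count and intron position is what allows first introns to be compared with later ones within each class of gene.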
The following table lists some of the largest documented introns in the genome. Very large introns are, by necessity, found in large genes. This list overlaps with the list of the largest genes earlier in this section. Note how the genes with the largest introns vary considerably in the number of introns they contain. DPP6, a very large gene spanning an assembly gap, is also likely to contain a very large intron. Many of the genes listed in this table have multiple entries in the reference set for distinct isoforms and transcripts. The links in the following table point to the isoform / transcript with the indicated large intron.
|Genes with the Largest Introns|
|Gene||Gene size (bp)||Intron count||Largest intron (bp)||Protein||Section|
|KCNIP4||1,220,136||7||1,097,903||Kv channel interacting protein||Potassium Channels|
|ACCN1||1,143,721||9||1,043,911||cation channel||Sodium Channels|
|DPP10||1,402,038||25||866,399||dipeptidyl peptidase family||Serine Proteases|
|HS6ST3||748,720||1||740,920||heparan sulfate sulfotransferase||Protein Glycosylation|
|GPC5||1,468,556||7||721,292||glypican 5||Protein Glycosylation|
|PDE4D||924,757||14||677,200||cAMP phosphodiesterase||Cyclic Nucleotides|
|PCDH9||927,503||3||593,993||protocadherin 9||Cadherins and Related Proteins|
|RORA||741,020||10||550,366||RAR-related receptor||Nuclear Receptors|
|MACROD2||2,057,697||16||544,980|| ||Additional Genes in Development|
|IL1RAPL2||1,200,827||10||536,480||receptor accessory protein||Interleukins and Their Receptors|
|FGF14||680,920||4||526,174||fibroblast growth factor 14||Fibroblast Growth Factors|
|FHIT||1,502,098||9||522,714||dinucleoside triphosphate hydrolase||Nucleotide Pathways|
|ODZ2||979,320||28||500,512|| ||Additional Brain Proteins|
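The base-pair sizes in this table can be related to the megabase sizes in the table of largest genes, and to the share of each gene taken up by its largest intron. The following short sketch uses four rows copied from the table above; the helper names are hypothetical, and 1 Mb is taken as 1,000,000 bp.

```python
# (gene size in bp, largest intron in bp), copied from the table above.
rows = {
    "MACROD2": (2_057_697, 544_980),
    "DPP10": (1_402_038, 866_399),
    "HS6ST3": (748_720, 740_920),
    "FHIT": (1_502_098, 522_714),
}


def to_mb(bp):
    """Convert base pairs to megabases, rounded to the nearest 0.01 Mb."""
    return round(bp / 1_000_000, 2)


def largest_intron_fraction(gene_bp, intron_bp):
    """Fraction of the transcribed region occupied by the largest intron."""
    return intron_bp / gene_bp


for gene, (gene_bp, intron_bp) in rows.items():
    share = largest_intron_fraction(gene_bp, intron_bp)
    print(f"{gene}: {to_mb(gene_bp):.2f} Mb, largest intron = {share:.1%} of gene")
```

For HS6ST3, which has a single intron, that intron accounts for nearly the whole transcribed region, whereas the largest intron of MACROD2 covers only about a quarter of the gene.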
The following table gives exon size data for genes with up to 15 exons using the same gene set described for the corresponding intron table. The sizes of the first exons are likely underestimated because of incomplete cDNA clones. The sizes of the final exons are likely overestimated because longer mRNAs are often mapped onto the genome. They may include other poly(A) processing sites that would result in shorter mRNAs. Middle exons have a relatively consistent median size. This number declines modestly as the number of exons in the transcript increases. For all middle exons from the full set of selected transcripts, the median value is 123 nucleotides.
|Median Exon Sizes|
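The distinction between first, middle, and final exons can be made explicit in code. This is a minimal sketch under assumptions not stated in the text: exon lengths are listed in transcript order, single-exon transcripts are skipped because their one exon is both first and last, and the names and toy lengths are hypothetical.

```python
from statistics import median


def split_exons(exon_lengths):
    """Split exon lengths (transcript order) into first, middle, last."""
    if len(exon_lengths) < 2:
        return [], [], []  # assumption: skip single-exon transcripts
    return [exon_lengths[0]], list(exon_lengths[1:-1]), [exon_lengths[-1]]


def median_exon_sizes(transcripts):
    """Median length of first, middle, and last exons across transcripts."""
    pooled = {"first": [], "middle": [], "last": []}
    for lengths in transcripts:
        first, middle, last = split_exons(lengths)
        pooled["first"] += first
        pooled["middle"] += middle
        pooled["last"] += last
    return {kind: median(vals) for kind, vals in pooled.items() if vals}


# Hypothetical toy exon lengths in nucleotides.
toy = [
    [150, 120, 130, 900],
    [80, 123, 1500],
    [200, 110, 125, 140, 700],
]
summary = median_exon_sizes(toy)
print(summary)
```

Pooling middle exons separately keeps the long, UTR-containing terminal exons from distorting the middle-exon median.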
A clear trend is seen where increased gene density (shown here as less sequenced DNA per gene) is associated with a decrease in gene size. The increased gene density is not simply similar-sized genes being closer to each other. The genes still occupy only a fraction of the DNA of the chromosomes (even if predicted genes were added to the set). The X and Y chromosomes and the autosomes with the highest and lowest gene densities are labeled. A notable exception to the trend is seen with the Y chromosome (and to a lesser degree with the X chromosome). The Y chromosome has relatively few genes compared to the other chromosomes.
The preceding figure is a cumulative plot of the distances from the center of the nearest CpG island to the 5' ends of the transcripts mapped onto the reference genome (selected as before but also excluding genes assigned to chromosome fragments with no CpG island). Negative distances indicate an upstream relative location. Almost 61% of the selected transcripts have their starts within a CpG island. As can be seen in the figure, an even higher fraction of genes has RNA starts close to strictly defined CpG islands. The distribution has very long tails. For comparison, the equivalent calculation for mRNA 3' ends has a relatively flat distribution (not shown).
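The distance calculation behind such a plot can be sketched as follows. This is an illustration only, and it rests on an assumed sign convention: a negative value means the island center lies upstream of the transcript start, measured in the direction of transcription. The function names, the coordinate convention (inclusive start and end on one chromosome), and the toy values are hypothetical.

```python
def distance_to_nearest_island(tss, strand, islands):
    """Signed distance from the nearest CpG island center to a transcript start.

    `islands` is a list of (start, end) pairs on the same chromosome.
    Returns (distance, inside), where `inside` tells whether the start
    falls within any island.
    """
    best = None
    for start, end in islands:
        offset = (start + end) / 2 - tss
        if strand == "-":
            offset = -offset  # measure along the direction of transcription
        if best is None or abs(offset) < abs(best):
            best = offset
    inside = any(start <= tss <= end for start, end in islands)
    return best, inside


def cumulative_curve(distances):
    """Points (distance, fraction of transcripts at or below that distance)."""
    ordered = sorted(distances)
    return [(d, (i + 1) / len(ordered)) for i, d in enumerate(ordered)]


# Hypothetical toy islands and transcript starts.
islands = [(1000, 2000), (10000, 10400)]
plus_hit = distance_to_nearest_island(1200, "+", islands)
minus_hit = distance_to_nearest_island(9000, "-", islands)
curve = cumulative_curve([plus_hit[0], minus_hit[0]])
print(plus_hit, minus_hit, curve)
```

The fraction of transcripts starting within an island is then the share of results whose `inside` flag is true.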
The gene information for this section is based on the release 37.1 reference genome sequence and the NCBI Map Viewer tables. The size of DPP6 is an estimate as it spans a gap in the assembly. The CpG islands used to prepare the figure were those defined as "strict" in the genome annotation.
The transcript set used to prepare the figures and tables was constructed from the set of transcripts in the Map Viewer tables. For each named gene, only one transcript encoding the largest protein (in amino acids) was used. If a gene had multiple transcripts encoding proteins of that size, the one with the most exons was retained. Transcript predictions were excluded. Similarly, genes reported with no untranslated region were also generally excluded (many of these were olfactory receptor genes). The retained set had 18,159 transcripts. It also excluded a small number of ambiguously placed transcripts and a few genes that span gaps in the assembly.
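The selection rule described above (largest encoded protein per gene, with ties broken by exon count) can be expressed compactly. The sketch below is an approximation under stated assumptions: the record fields, the accession strings, and the per-transcript handling of the no-UTR exclusion are hypothetical, and the exclusions for ambiguous placement and assembly gaps are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    gene: str
    accession: str        # hypothetical identifier, for illustration only
    protein_length: int   # amino acids
    exon_count: int
    is_predicted: bool = False
    has_utr: bool = True


def select_representatives(transcripts):
    """Keep one transcript per gene: longest protein, then most exons."""
    best = {}
    for t in transcripts:
        if t.is_predicted or not t.has_utr:
            continue  # assumed filters, applied per transcript
        current = best.get(t.gene)
        key = (t.protein_length, t.exon_count)
        if current is None or key > (current.protein_length, current.exon_count):
            best[t.gene] = t
    return best


# Hypothetical toy records.
toy = [
    Transcript("GENE_A", "tx1", 500, 10),
    Transcript("GENE_A", "tx2", 500, 12),
    Transcript("GENE_A", "tx3", 400, 20),
    Transcript("GENE_A", "tx4", 900, 30, is_predicted=True),
    Transcript("GENE_B", "tx5", 300, 1, has_utr=False),
]
selected = select_representatives(toy)
print({gene: t.accession for gene, t in selected.items()})
```

Comparing (protein length, exon count) tuples applies both criteria in a single pass, with exon count consulted only when protein lengths are equal.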
See also the additional reading for this chapter.