In the following table, two methods were used to calculate amino acid usage in the 18,886 selected proteins. In the "By protein" column, compositions were calculated for each of the proteins and then averaged. In the "By sequence" column, the usage is from treating the 18,886 sequences as one long sequence. The latter method is weighted toward the usage in larger proteins. These numbers are not weighted for expression.
|Amino acid||Usage (%), by protein||Usage (%), by sequence|
|A alanine||7.214||7.010|
|C cysteine||2.491||2.284|
|D aspartate||4.591||4.767|
|E glutamate||6.839||7.124|
|F phenylalanine||3.830||3.664|
|G glycine||6.716||6.577|
|H histidine||2.592||2.623|
|I isoleucine||4.378||4.352|
|K lysine||5.749||5.745|
|L leucine||10.091||9.964|
|M methionine||2.284||2.138|
|N asparagine||3.484||3.603|
|P proline||6.174||6.285|
|Q glutamine||4.578||4.751|
|R arginine||5.804||5.636|
|S serine||7.944||8.302|
|T threonine||5.149||5.315|
|U selenocysteine||0.001||0.000|
|V valine||6.023||5.980|
|W tryptophan||1.277||1.207|
|Y tyrosine||2.793||2.670|
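The difference between the two columns can be illustrated with a minimal sketch. The function names and the two toy sequences are hypothetical and are not part of the original analysis; the sketch only shows why the "By sequence" figure leans toward the composition of the longer protein.

```python
from collections import Counter

def composition(seq):
    """Percent usage of each residue in a single sequence."""
    counts = Counter(seq)
    return {aa: 100.0 * n / len(seq) for aa, n in counts.items()}

def usage_by_protein(seqs):
    """Average the per-protein percentages, so every protein counts equally."""
    totals = Counter()
    for seq in seqs:
        for aa, pct in composition(seq).items():
            totals[aa] += pct
    return {aa: total / len(seqs) for aa, total in totals.items()}

def usage_by_sequence(seqs):
    """Treat all proteins as one long sequence, so longer proteins weigh more."""
    return composition("".join(seqs))

# Hypothetical toy input: one short all-alanine protein, one longer A/L protein.
toy = ["AAAA", "AL" * 8]
by_protein = usage_by_protein(toy)
by_sequence = usage_by_sequence(toy)
```

With this input, the short protein pulls the by-protein alanine figure upward, while the by-sequence figure stays closer to the composition of the longer protein.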
There are significant variations from the values above in the usage of many amino acids at the amino termini and carboxyl termini of proteins. These differences may be related to frequent modifications, or other processing and degradation pathways. One example of note is the elevated level of cysteine four positions from the carboxyl terminus, likely reflecting prenylation.
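Position-specific usage near a terminus can be tallied in a similar way. The sketch below assumes that "four positions from the carboxyl terminus" counts the final residue as position 1, so the residue of interest is the fourth from the end. The function name and toy sequences are hypothetical.

```python
def cterm_usage(seqs, offset, residue):
    """Percent of sequences carrying `residue` at `offset` positions from the
    carboxyl terminus, where offset 1 is the last residue (an assumption)."""
    eligible = [s for s in seqs if len(s) >= offset]
    if not eligible:
        return 0.0
    hits = sum(1 for s in eligible if s[-offset] == residue)
    return 100.0 * hits / len(eligible)

# Hypothetical toy sequences; the last one is too short to be counted.
toy = ["MKTACVLS", "MGGCAIM", "MSTNPKPQ", "MAL"]
cys_at_minus4 = cterm_usage(toy, 4, "C")
```

Comparing such a value with the overall cysteine usage in the table above would show whether a position is enriched.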
The genome encodes several families of proteins with very unusual amino acid compositions. Many of these are smaller proteins such as the protamines, late cornified envelope proteins, and metallothioneins.
The following table provides some additional examples of individual proteins and gene families in which larger proteins have unusual compositions. The numbers given are the residue count for that amino acid and the total size of the protein. Some predicted proteins have been excluded. The relative fractions vary among the amino acids, with the tryptophan-rich proteins being considerably lower than the others. For additional information about these proteins, see the sections listed in the right column of the table.
|Proteins with High Fractions of Individual Amino acids|
|Amino acid||Protein (aa fraction)||Section|
|alanine||histone H1 family||Histones, Related Proteins, and Modifying Enzymes|
|BASP1 (57/227)||Additional Brain Proteins|
|HOXA13 (93/388)||HOX Genes|
|arginine||arginine- / serine-rich splicing factors||Capping and Splicing|
|asparagine||PYGO1 (50/419)||B cells|
|aspartate||DSPP (259/1301)||Bone and Related Tissues|
|ACRC (122/691)||Nucleus and Nucleolus|
|SPP1 (48/314)||Bone and Related Tissues|
|ANP32B (38/251)||Nucleus and Nucleolus|
|glutamate||TCHH (526/1943)||Skin and Related Tissues|
|RPGR (307/1152)||Crystallins and Other Eye Proteins|
|ANP32E (71/268)||Nucleus and Nucleolus|
|NSBP1 (73/282)||Nonhistone Chromosomal Proteins|
|glutamine||ZNF853 (264/659)||Krüppel-related Zinc Finger Proteins|
|IVL (150/585)||Skin and Related Tissues|
|glycine||LOR (145/312)||Skin and Related Tissues|
|GAR1 (73/217)||Nucleus and Nucleolus|
|histidine||HRC (89/699)||Calmodulin and Calcium|
|SLC39A7 (57/469)||Solute Carrier Families|
|isoleucine||olfactory receptor families||Olfactory Receptors|
|type 2 taste receptors||Taste Receptors|
|leucine||MFSD3 (104/412)||Solute Carrier Families|
|GP1BB (47/206)||Platelets and Megakaryocytes|
|SLC39A5 (123/540)||Solute Carrier Families|
|lysine||histone H1 family||Histones, Related Proteins, and Modifying Enzymes|
|CYLC2 (92/348)||Testes and Sperm|
|methionine||RGAG1 (145/1388)||DNA Transposons and Retrovirus-related Sequences|
|phenylalanine||DERL2 (31/239)||ER, Golgi, and the Secretory Pathway|
|ALG10 (58/473)||Protein Glycosylation|
|ALG10B (57/473)||Protein Glycosylation|
|proline||proline-rich salivary proteins||Lacrimal and Salivary Glands|
|serine||DSPP (542/1301)||Bone and Related Tissues|
|HRNR (957/2850)||Skin and Related Tissues|
|tryptophan||CCDC70 (16/233)||Coiled-Coil Proteins|
|tyrosine||DAZ2 (66/558)||Testes and Sperm|
|DAZ3 (46/438)||Testes and Sperm|
|valine||PRLHR (54/370)||Growth Hormone and Related Hormones|
|GPR141 (40/305)||G-Protein-coupled Receptors|
|FAHD2A (41/314)||Additional Enzymes and Related Sequences|
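The residue counts in the table can be turned into percentages and compared with the genome-wide usage. The sketch below is illustrative only: it takes three entries from the table and the corresponding "By sequence" values from the first table, and the `enrichment` helper is a hypothetical name.

```python
# Background usage (%) from the "By sequence" column of the first table.
background = {"G": 6.577, "S": 8.302, "W": 1.207}

# (residue, residue count, protein length) copied from the table above.
examples = {
    "LOR": ("G", 145, 312),
    "HRNR": ("S", 957, 2850),
    "CCDC70": ("W", 16, 233),
}

def enrichment(residue, count, length):
    """Return (percent of the protein, fold over background usage)."""
    pct = 100.0 * count / length
    return pct, pct / background[residue]

results = {name: enrichment(*entry) for name, entry in examples.items()}
for name, (pct, fold) in results.items():
    print(f"{name}: {pct:.1f}% ({fold:.1f}x background)")
```

Expressed this way, the tryptophan-rich example has a much lower absolute fraction than the glycine-rich and serine-rich examples, even though it is still several-fold above background.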
Many proteins contain short proline-rich regions. Some proteins, such as certain members of the formin family, have very large proline-rich regions that affect the overall composition of the proteins. A similar situation is seen with the leucine-rich repeat proteins.
The small number of proteins containing selenocysteine are described separately (see Selenium Proteins).
|Proteins with Large Homopolymer tracts|
|Amino Acid||Protein||Tract length (aa)||Section|
|alanine||PHOX2B||20||Homeobox and Related Proteins|
|FBRS||19||Fibroblast Growth Factors|
|aspartate||HRC||16||Calmodulin and Calcium|
|ASPN||14||Leucine-rich Repeat Family|
|glutamate||MYT1||32||Oligodendrocytes and Myelin|
|EHMT2||24||Histones, Related Proteins, and Modifying Enzymes|
|TTBK1||23||Tubulin and Microtubules|
|DYRK1A||13||Dual-Specificity Protein Kinases|
|MEOX2||13||Homeobox and Related Proteins|
|ZFHX4||20||Homeobox and Related Proteins|
|TBP||38||RNA Polymerase and General Transcription Factors|
|EP400||29||Nonhistone Chromosomal Proteins|
|THAP11||29||Zinc Finger Proteins|
|SLC24A3||10||Solute Carrier Families|
|SRRM2||42||Capping and Splicing|
|MLLT3||42||PHD Finger Proteins|
|SETD1A||24||Histones, Related Proteins, and Modifying Enzymes|
|DACH1||24||Additional Genes in Development|
|threonine||CADM1||13||Additional Genes in Development|
|KDM6B||11||Histones, Related Proteins, and Modifying Enzymes|
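Tract lengths like those in the table can be found by scanning each sequence for its longest uninterrupted run of a single residue. The following is a minimal sketch using a hypothetical toy sequence; it is not necessarily the procedure used to build the table.

```python
from itertools import groupby

def longest_tracts(seq):
    """Map each residue to the length of its longest uninterrupted run."""
    best = {}
    for aa, run in groupby(seq):
        n = sum(1 for _ in run)
        if n > best.get(aa, 0):
            best[aa] = n
    return best

# Hypothetical toy sequence containing runs of Q, A, and L.
toy = "MSQQQQQAAQQLLLS"
tracts = longest_tracts(toy)
```

Only the longest run per residue is kept, so the second, shorter glutamine run in the toy sequence does not affect the result.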
The following table provides a list of the largest proteins in the reference set. Only one isoform is listed for each. Predicted proteins are not listed. Note also the very large predicted LOC643677 (7081 aa) and HMCN2 (5065 aa).
|Protein||Size (aa)||Name||Section|
|MUC16||14507||mucin 16 (CA-125 antigen)||Mucins|
|SYNE1||8797||nesprin 1||Spectrin and Plectin Families|
|SYNE2||6907||nesprin 2||Spectrin and Plectin Families|
|MACF1||5938||filament crosslinking protein||Spectrin and Plectin Families|
|DST||5675||dystonin||Spectrin and Plectin Families|
|HMCN1||5635||hemicentin||Additional Genes in Development|
|MDN1||5596||midasin||Nucleus and Nucleolus|
|MLL2||5537|| ||PHD Finger Proteins|
|FCGBP||5405||Fc-binding protein||Fc Receptors|
|USH2A||5202||usherin||Auditory and Vestibular Functions|
|UBR4||5183||retinoblastoma-associated protein||RB1 and Related Functions|
|SSPO||5147||subcommissural organ spondin||Additional Genes in Development|
|HYDIN||5120|| ||Additional Brain Proteins|
|EPPK1||5090||epiplakin 1||Spectrin and Plectin Families|
|ABCA13||5058|| ||ATP-binding Cassette Proteins|
Many of the proteins listed above contain spectrin-type repeats. Additional large proteins are listed with that family. Larger proteins often contain repeating domains such as those first identified in epidermal growth factor and fibronectin.
Proteins with the γ-carboxyglutamate modification are described in the section on coagulation. The following figure shows the amino acid usage (darker being more conserved) in a partial alignment of 11 of these proteins (see Notes and References). Note the completely conserved glutamate residues near the center of the alignments. Interpretation of such alignments can be complex. In this case, a number of these proteins are also processed by cleavage amino-terminal to the relatively conserved alanine at position 18 in the figure.
Another example of shared sequences around the location of a modified amino acid is seen at the active site of sulfatases. In these enzymes, a cysteine is converted to formylglycine.
The tables in this section were constructed using the human RefSeq protein set available at the time release 37.1 of the human reference genome sequence became available. There are some differences between this protein set and the genes annotated onto the reference genome.
The RefSeq proteins are associated with specific transcripts, and there are often multiple transcripts for a given gene that may produce distinct or identical protein products. As explained in this section, this protein set was reduced by eliminating gene predictions and then choosing a single largest isoform for each gene. Also, for mitochondrially encoded proteins, only the sequences derived from the reference mitochondrial genome were retained.
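The reduction to one isoform per gene can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used: the record layout, the function name, and the toy accessions are hypothetical, and gene predictions are assumed to be recognizable by the `XP_` accession prefix that RefSeq uses for model proteins. Ties in length are resolved by keeping the first record seen.

```python
def largest_isoforms(records, skip_prefixes=("XP_",)):
    """Keep the longest protein per gene.

    records: iterable of (gene, accession, sequence) tuples.
    Accessions starting with any of skip_prefixes (assumed to mark
    predicted proteins) are dropped before the comparison.
    """
    best = {}
    for gene, accession, seq in records:
        if accession.startswith(skip_prefixes):
            continue
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (accession, seq)
    return best

# Hypothetical toy records; the accessions are placeholders, not real entries.
records = [
    ("GENE1", "NP_toy1", "MAAA"),
    ("GENE1", "NP_toy2", "MAAAAA"),
    ("GENE1", "XP_toy3", "MAAAAAAAA"),  # predicted, so excluded
    ("GENE2", "NP_toy4", "MK"),
]
chosen = largest_isoforms(records)
```

Filtering predictions before comparing lengths matters here: the longest toy record for GENE1 is a prediction, so the longest curated isoform is selected instead.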
To produce the figure on carboxyglutamate-containing proteins, amino acids 24-85 from PROZ were used in searches to produce the alignments. The proteins used are those listed in the example in the section on coagulation except for PRRG2. MGP and BGLAP were also omitted.
See also the additional reading for this chapter.