Chemistry - Where did amino acids get their one-letter codes?

Solution 1:

Tryptophan

For instance, is there a reason 'W' specifically was chosen for tryptophan (other than the fact that 'T' was taken)?

Once you have assigned the other 19 amino acids, there are only 7 letters of the alphabet left: B, J, O, U, W, X, and Z. (Certainly not a nice Scrabble hand to have!)

If one wants to use a letter found within the name of the amino acid, the only available letter would be O. However, the usage of U and O was historically discouraged because these letters could be easily confused with other letters (U with V; O with G, Q, C, D, and the number 0).
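The letter counting above is easy to check. Here is a minimal Python sketch, assuming the other 19 standard amino acids already hold their usual one-letter codes (the variable names are only illustrative):

```python
import string

# The 19 one-letter codes assumed to be already assigned
# (every standard code except W for tryptophan).
ASSIGNED = set("ACDEFGHIKLMNPQRSTVY")

# Letters of the alphabet that are still free.
leftover = sorted(set(string.ascii_uppercase) - ASSIGNED)
print(leftover)  # ['B', 'J', 'O', 'U', 'W', 'X', 'Z']

# Free letters that actually occur in the name "tryptophan".
in_name = [c for c in leftover if c.lower() in "tryptophan"]
print(in_name)  # ['O']
```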

It turns out that the choice was made because W is a very fat letter and was reminiscent of the indole ring system present in tryptophan (the only amino acid to contain a bicyclic system). (See below for source.)


What happened to the other six letters, then?

All 26 letters of the alphabet now find use as a one-letter code for amino acids or various combinations thereof.

After their discoveries, selenocysteine and pyrrolysine (the latter found only in some archaea and bacteria) were assigned U and O respectively. Furthermore, B is used to represent aspartic acid OR asparagine; J is used to represent leucine OR isoleucine; Z is used to represent glutamic acid OR glutamine; and X is used to represent an unknown amino acid.

I believe B and Z find use because, in protein sequencing, acid hydrolysis is often used to break peptide bonds. This has the undesirable side-effect of hydrolysing the amide groups in asparagine/glutamine, leading to the formation of aspartic/glutamic acids, which means that one cannot tell exactly which amino acid it was at the start. J is useful in mass spectrometry, where leucine and isoleucine have identical masses and are therefore difficult to distinguish.
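To make the ambiguity codes concrete, here is a small Python sketch that lists every sequence consistent with a partly determined one. The helper name `expand` is hypothetical, and X is left as-is because it can stand for any residue:

```python
from itertools import product

# Ambiguity codes described above; X (unknown) is deliberately not expanded.
AMBIGUOUS = {
    "B": "DN",  # aspartic acid or asparagine
    "Z": "EQ",  # glutamic acid or glutamine
    "J": "LI",  # leucine or isoleucine
}

def expand(sequence: str) -> list[str]:
    """Return every concrete sequence consistent with the ambiguity codes."""
    choices = [AMBIGUOUS.get(residue, residue) for residue in sequence]
    return ["".join(combo) for combo in product(*choices)]

print(expand("ABZ"))  # ['ADE', 'ADQ', 'ANE', 'ANQ']
```

Each ambiguous position doubles the number of candidates, so a sequence with n such positions expands to 2^n possibilities.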


Why did they choose the letters they did?

As far as I am aware, the one-letter symbols were adopted by both IUPAC and IUB (since 1991, IUBMB) in their joint 1983 recommendations on "Nomenclature and Symbolism for Amino Acids and Peptides" [1]. ankit7540's concise summary of the historical development already mentioned these recommendations.

In particular, Section 3AA-21.2 "The Code Symbols" has a description of why the letters were chosen. This document is probably the most authoritative stance on the matter. The rationale is mostly in line with Jan's answer:

Initial letters of the names of the amino acids were chosen where there was no ambiguity. There are six such cases: cysteine, histidine, isoleucine, methionine, serine, and valine. All the other amino acids share the initial letters A, G, L, P or T, so arbitrary assignments were made. These letters were assigned to the most frequently occurring and structurally most simple of the amino acids with these initials, alanine (A), glycine (G), leucine (L), proline (P) and threonine (T).

Other assignments were made on the basis of associations that might be helpful in remembering the code, e.g. the phonetic associations of F for phenylalanine and R for arginine. For tryptophan the double ring of the molecule is associated with the bulky letter W. The letters N and Q were assigned to asparagine and glutamine respectively; D and E to aspartic and glutamic acids respectively. K and Y were chosen for the two remaining amino acids, lysine and tyrosine, because, of the few remaining letters, they were close alphabetically to the initial letters of the names. U and O were avoided because U is easily confused with V in handwritten material, and O with G, Q, C and D in imperfect computer print-outs, and also with zero. J was avoided because it is absent from several languages.

Two other symbols are often necessary for partly determined sequences, so B was assigned to aspartic acid or asparagine when these have not been distinguished; Z was similarly assigned to glutamic acid or glutamine. X means that the identity of an amino acid is undetermined, or that the amino acid is atypical.

One can only hypothesise what they meant by "associations that might be helpful in remembering the code" in the case of N/Q/D/E. My best guess is:

  • D and E were possibly chosen for aspartic and glutamic acids because they were the only consecutive pair of letters left, emphasising their chemical similarity. Aspartic acid is shorter than glutamic acid by one methylene group (CH2), so it gets the earlier letter D.
  • Glutamine sounds like Q-tamine. If you don't think it sounds similar, repeat it 50 times until you do.
  • AsparagiNe was assigned N.
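The first rule in the quoted recommendations (use the initial letter where it is unambiguous) can be verified directly. This is a minimal Python sketch; the list of names is simply the 20 standard amino acids, and the variable names are illustrative:

```python
from collections import defaultdict

NAMES = [
    "alanine", "arginine", "asparagine", "aspartic acid", "cysteine",
    "glutamic acid", "glutamine", "glycine", "histidine", "isoleucine",
    "leucine", "lysine", "methionine", "phenylalanine", "proline",
    "serine", "threonine", "tryptophan", "tyrosine", "valine",
]

# Group the amino acid names by their initial letter.
by_initial = defaultdict(list)
for name in NAMES:
    by_initial[name[0].upper()].append(name)

# Amino acids whose initial letter is used by no other amino acid.
unambiguous = sorted(names[0] for names in by_initial.values() if len(names) == 1)
# Initial letters claimed by more than one amino acid.
shared = sorted(letter for letter, names in by_initial.items() if len(names) > 1)

print(unambiguous)
print(shared)
```

This reproduces the six unambiguous cases named in the recommendations and the five contested initials A, G, L, P and T.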

Reference

  1. IUPAC-IUB Joint Commission on Biochemical Nomenclature. Nomenclature and Symbolism for Amino Acids and Peptides: Recommendations 1983. FEBS J. 1984, 138 (1), 9–37. DOI: 10.1111/j.1432-1033.1984.tb07877.x. An HTML version (perhaps more user-friendly) can be found at this address.

Solution 2:

Some single letter codes that aren’t the amino acid’s starting letter actually make sense when viewed from certain angles. Here’s the list starting with the bloomin’ obvious:

  • G — Glycine
  • A — Alanine
  • V — Valine
  • L — Leucine
  • I — Isoleucine
  • P — Proline
  • S — Serine
  • T — Threonine
  • C — Cysteine
  • H — Histidine
  • M — Methionine

Some amino acids have a letter that alludes to how the name is spoken rather than written:

  • F — Phenylalanine (starts with /f/)
  • R — Arginine (the first syllable could be are or the pronunciation of the letter R)
  • N — Asparagine[1]

One amino acid is labelled by its second letter (two if you count arginine above):

  • Y — Tyrosine

A few were labelled with letters close to their starting letter because the actual starting letter is taken:

  • K — Lysine (L is already leucine)
  • E — Glutamate (F and G were taken; but so were H and I. E is at least a letter of the name)
  • D — Aspartate (Basically a shorthand version of E, glutamate)

Which leaves us with tryptophan and glutamine. Glutamine is another really stretchy one:

  • Q — Glutamine (C and G were taken but Q is round on the left just like a G. Totally obvious?)

And finally, for tryptophan, they decided to take the actual shape of the molecule. You don’t see it? Here it is:

  • W — Tryptophan [image: the W in tryptophan]

Actually, perhaps it’s mere coincidence, as if they had tossed a coin among the remaining letters for both glutamine and tryptophan. But I like this explanation.
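The grouping above can be cross-checked mechanically. Here is a rough Python sketch that sorts each code by whether it is the initial letter, appears elsewhere in the name, or does not appear at all. It assumes the names as written in this answer (glutamate and aspartate rather than the acid forms), and `classify` is a hypothetical helper:

```python
# One-letter codes and names as used in this answer.
CODES = {
    "G": "glycine", "A": "alanine", "V": "valine", "L": "leucine",
    "I": "isoleucine", "P": "proline", "S": "serine", "T": "threonine",
    "C": "cysteine", "H": "histidine", "M": "methionine",
    "F": "phenylalanine", "R": "arginine", "N": "asparagine",
    "Y": "tyrosine", "K": "lysine", "E": "glutamate", "D": "aspartate",
    "Q": "glutamine", "W": "tryptophan",
}

def classify(code: str, name: str) -> str:
    """Say where (if anywhere) the code letter occurs in the name."""
    if name[0].upper() == code:
        return "initial"
    if code.lower() in name:
        return "elsewhere in name"
    return "not in name"

groups: dict[str, list[str]] = {}
for code, name in CODES.items():
    groups.setdefault(classify(code, name), []).append(code)

for label, codes in groups.items():
    print(label, sorted(codes))
```

Only D, F, K, Q and W fall in the last group, which is why those five need a phonetic, alphabetical or visual mnemonic.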


Solution 3:

See IUPAC-IUB Commission on Biochemical Nomenclature, "A One-Letter Notation for Amino Acid Sequences", The Journal of Biological Chemistry, Vol. 243, No. 13, pp. 3557-3559, 10 July 1968:

The possibility of using one-letter symbols was mentioned by Gamow and Ycas (2) in 1958. The idea was systematized by Šorm et al. (3) in 1961. It was used by this group (4-10) and also by Fitch (11) in several papers on the structure of proteins. In extensive compilations of protein structures, Eck and Dayhoff (see References 12 to 14) systematically used one-letter symbols derived partly from the code of Šorm and Keil. Independent proposals were made by Wiswesser (15) and by Braunstein.

...

3.2 Initial letters are used where there is no ambiguity. There are six such cases: cysteine, histidine, isoleucine, methionine, serine, and valine. All the other amino acids share the initial letters A, G, L, P, or T; therefore, assignments of them must be somewhat arbitrary. These letters are assigned to the most frequently occurring and structurally most simple amino acids. On this basis, the letters A, G, L, P, and T are assigned to alanine, glycine, leucine, proline, and threonine, respectively.

3.3 The assignment of the other abbreviations is more arbitrary. However, certain clues are helpful. Two are phonetically suggestive, F for phenylalanine, and R for arginine. For tryptophan, the double ring in the molecule is associated with the bulky letter W. The letters N and Q are assigned to asparagine and glutamine, respectively; D and E are assigned to aspartic acid and glutamic acid, respectively. This leaves lysine and tyrosine, to which K and Y are assigned. These are chosen rather than any of the few other remaining letters because they are alphabetically nearest the initial letters L and T. U and O are avoided because U is easily confused with V in handwritten work and O is confused with G, Q, C, and D in imperfect computer print-outs and also with zero. J is avoided for linguistic reasons.

3.4 Two other abbreviations are necessary in order to avoid ambiguity. B is assigned to aspartic acid or asparagine when this distinction has not been determined. Z is assigned when glutamic acid and glutamine have not been distinguished. X means that the identity of an amino acid is undetermined or that the amino acid is atypical.


Solution 4:

The three-letter designation came before the single-letter code. The three-letter codes themselves went through several refinements and revisions before finally being accepted by IUPAC and the International Union of Biochemistry under the leadership of HBF Dixon. Single-letter codes were proposed later (apparently because the three-letter codes were tiresome to write). This proposal came from Czech biochemists. (check history of the nomenclature of amino acids)

Later, the single-letter codes were recommended and accepted in 1983 (in a similar meeting of IUPAC and IUB). The recommendation can be found here: Amino acid nomenclature recommendation, also available here: NCBI. After acceptance of this new nomenclature, several major publications published the new guidelines.
