RNA to Protein Translation
JavaScript (ES6) 167 177 chars encoded in UTF8 as 167 177 bytes
... so I hope everyone is happy.
Edit: In fact, there is no need for a special case when the last block is too short. If the last 2 (or 1) chars are not mapped, the result string does not end with '*', and that gives an error anyway.
F=s=>/^M[^*]*\*$/.test(s=s.replace(/.../g,x=>
"KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSS*CWCLFLF"
[[for(c of(r=0,x))r=r*4+"ACGU".search(c)]|r]))?s:'Error'
Explained
Each char in a triplet can have 4 values, so there are exactly 4^3 == 64 triplets. The C function maps each triplet to a number between 0 and 63. No error check is needed, as the input chars are only ACGU.
C=s=>[for(c of(r=0,s))r=r*4+"ACGU".search(c)]|r
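The `[for(c of ...)]` array comprehension was, as far as I know, an experimental Firefox-only syntax, so the one-liner above will not run in current engines. Here is a minimal sketch of the same base-4 arithmetic in standard ES6; the name `codonIndex` is my own and not part of the golfed answer.

```javascript
// Illustrative helper (hypothetical name): same arithmetic as the golfed C function.
// Each letter is a base-4 digit, with A=0, C=1, G=2, U=3.
const codonIndex = codon =>
  [...codon].reduce((r, c) => r * 4 + "ACGU".indexOf(c), 0);

console.log(codonIndex("AAA")); // 0
console.log(codonIndex("AUG")); // 14  (0*16 + 3*4 + 2)
console.log(codonIndex("UUU")); // 63
```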
Each triplet maps to an amino acid identified by a single char. We can encode this in a 64-character string. To obtain the string, start with the Codon Map:
zz=["* UAA UAG UGA","A GCU GCC GCA GCG","C UGU UGC","D GAU GAC","E GAA GAG"
,"F UUU UUC","G GGU GGC GGA GGG","H CAU CAC","I AUU AUC AUA","K AAA AAG"
,"L UUA UUG CUU CUC CUA CUG","M AUG","N AAU AAC","P CCU CCC CCA CCG","Q CAA CAG"
,"R CGU CGC CGA CGG AGA AGG","S UCU UCC UCA UCG AGU AGC","T ACU ACC ACA ACG"
,"V GUU GUC GUA GUG","W UGG","Y UAU UAC"]
a=[],zz.map(v=>v.slice(2).split(' ').map(x=>a[C(x)]=v[0])),a.join('')
...obtaining "KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSS*CWCLFLF"
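The table-building step can also be written in plain ES6 without array comprehensions. This is only a sketch: the codon map is copied from above, while `codonMap`, `codonIndex` and `table` are names I picked for illustration.

```javascript
// Codon map copied from the answer: amino acid letter, then its codons.
const codonMap = [
  "* UAA UAG UGA", "A GCU GCC GCA GCG", "C UGU UGC", "D GAU GAC", "E GAA GAG",
  "F UUU UUC", "G GGU GGC GGA GGG", "H CAU CAC", "I AUU AUC AUA", "K AAA AAG",
  "L UUA UUG CUU CUC CUA CUG", "M AUG", "N AAU AAC", "P CCU CCC CCA CCG", "Q CAA CAG",
  "R CGU CGC CGA CGG AGA AGG", "S UCU UCC UCA UCG AGU AGC", "T ACU ACC ACA ACG",
  "V GUU GUC GUA GUG", "W UGG", "Y UAU UAC"
];

// Hypothetical helper: triplet -> 0..63 with A=0, C=1, G=2, U=3.
const codonIndex = codon =>
  [...codon].reduce((r, c) => r * 4 + "ACGU".indexOf(c), 0);

// Put each amino acid letter at the index of every codon that encodes it.
const table = [];
for (const entry of codonMap) {
  const [amino, ...codons] = entry.split(" ");
  for (const codon of codons) table[codonIndex(codon)] = amino;
}

console.log(table.join(""));
```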
So we can scan the input string and use the same logic as the C function to obtain the code 0..63, and from the code the amino acid char. The replace function will split the input string into 3-char blocks, possibly leaving 1 or 2 chars unprocessed (which gives an invalid result string, not ending in '*').
Finally, check whether the encoded string is valid using a regexp: it must start with 'M', must end with '*', and must not contain any other '*'.
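Putting the pieces together, an ungolfed version might look like the sketch below. It assumes, like the golfed code, that the input contains only A, C, G and U; `TABLE` and `translate` are illustrative names.

```javascript
const TABLE = "KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSS*CWCLFLF";

function translate(rna) {
  // Replace every complete triplet with its amino acid letter.
  // A trailing block of 1 or 2 chars is left untouched by /.../g.
  const protein = rna.replace(/.../g, codon =>
    TABLE[[...codon].reduce((r, c) => r * 4 + "ACGU".indexOf(c), 0)]);
  // Valid only if it starts with M and the single '*' is the last char.
  return /^M[^*]*\*$/.test(protein) ? protein : "Error";
}

console.log(translate("AUGAGGUGUAGCUGA")); // MRCS*
console.log(translate("AUGCUAG"));         // Error (leftover "G")
console.log(translate("AUGUAGUGA"));       // Error (two stops)
```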
Test In FireBug/FireFox console
;['AUGCUAG','GGGCACUAG','AUGAACGGA','AUGUAGUGA','AAAAAAA',
'AUGUUUGUUCCGUCGAAAUACCUAUGAACACGCUAA',
'AUGAGGUGUAGCUGA','AUGCCAGUCGCACGAUUAGUUCACACGCUCUUGUAA',
'AUGCUGCGGUCCUCGCAUCUAGCGUUGUGGUUAGGGUGUGUAACUUCGAGAACAGUGAGUCCCGUACCAGGUAGCAUAAUGCGAGCAAUGUCGUACGAUUCAUAG']
.forEach(c=>console.log(c,'->',F(c)))
Output
AUGCUAG -> Error
GGGCACUAG -> Error
AUGAACGGA -> Error
AUGUAGUGA -> Error
AAAAAAA -> Error
AUGUUUGUUCCGUCGAAAUACCUAUGAACACGCUAA -> Error
AUGAGGUGUAGCUGA -> MRCS*
AUGCCAGUCGCACGAUUAGUUCACACGCUCUUGUAA -> MPVARLVHTLL*
AUGCUGCGGUCCUCGCAUCUAGCGUUGUGGUUAGGGUGUGUAACUUCGAGAACAGUGAGUCCCGUACCAGGUAGCAUAAUGCGAGCAAUGUCGUACGAUUCAUAG -> MLRSSHLALWLGCVTSRTVSPVPGSIMRAMSYDS*
C,190 bytes (function)
f(char*x){int a=0,i=0,j=0,s=1;for(;x[i];i%3||(s-=(x[j++]=a-37?a-9?"KNRSIITTEDGGVVAA*Y*CLFSSQHRRLLPP"[a/2]:77:87)==42,x[j]=a=0))a=a*4+(-x[i++]/2&3);puts(*x-77||i%3||s||x[j-1]-42?"Error":x);}
199 194 bytes (program)
a,i,j;char x[999];main(s){for(gets(x);x[i];i%3||(s-=(x[j++]=a-37?a-9?"KNRSIITTEDGGVVAA*Y*CLFSSQHRRLLPP"[a/2]:77:87)==42,x[j]=a=0))a=a*4+(-x[i++]/2&3);puts((*x-77||i%3||s||x[j-1]-42)?"Error":x);}
Saved a few bytes by improving hash formula.
Here's a fun test case:
AUGUAUCAUGAGCUCCUUCAGUGGCAAAGACUUGACUGA --> MYHELLQWQRLD*
Explanation
The triplet of letters is converted to a base 4 number. Each letter is hashed as follows.
x[i]   ASCII code         Hashed to (-x[i]/2&3)
A      64+ 1   1000001    00
G      64+ 7   1000111    01
U      64+21   1010101    10
C      64+ 3   1000011    11
This gives a number in the range 0..63. The idea now is to use a lookup table similar to the ones used by edc65 and Optimizer. However, the hash is designed so that G and A are next to each other and U and C are next to each other.
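To check the letter hash without a C compiler, here is a small JavaScript port (the names `hash` and `codonHash` are mine). It relies on JavaScript's `&` truncating its operand toward zero, which matches C integer division for these values.

```javascript
// Port of the C expression (-x[i]/2&3); illustrative names only.
// -65/2 is -32.5 in JavaScript, but & converts it to the integer -32 first.
const hash = ch => -ch.charCodeAt(0) / 2 & 3;

for (const ch of "AGUC") console.log(ch, hash(ch)); // A 0, G 1, U 2, C 3

// Three letters combined as base-4 digits give a number in 0..63.
const codonHash = codon => [...codon].reduce((a, ch) => a * 4 + hash(ch), 0);

console.log(codonHash("AUG"), codonHash("UGG")); // 9 37
```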
Looking at the table at https://en.wikipedia.org/wiki/Genetic_code#RNA_codon_table, we see that with the letters ordered in this way, generally the last bit can be ignored. Only a 32-character lookup table is needed, except in two special cases.
See below first two letters, and corresponding amino acids (where 3rd letter is G/A, and where 3rd letter is U/C). Corrections for the two special cases that do not fit the 32-character table are hardcoded.
      A/G U/C        A/G U/C        A/G U/C        A/G U/C
AAX>  K   N    AGX>  R   S    AUX>  I   I    ACX>  T   T
GAX>  E   D    GGX>  G   G    GUX>  V   V    GCX>  A   A
UAX>  *   Y    UGX>  *   C    UUX>  L   F    UCX>  S   S
CAX>  Q   H    CGX>  R   R    CUX>  L   L    CCX>  P   P
Corrections for special cases (where last bit cannot be ignored)
AUG 001001=9 --> M
UGG 100101=37--> W
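The halved table plus the two corrections can be tried out with the JavaScript sketch below. It is a port for illustration, not the C code itself; `HALF_TABLE`, `hash` and `aminoAcid` are names I made up.

```javascript
// 32-entry table from the C answer: one entry per pair of hash values.
const HALF_TABLE = "KNRSIITTEDGGVVAA*Y*CLFSSQHRRLLPP";

// Letter hash ported from the C expression (-x[i]/2&3): A=0, G=1, U=2, C=3.
const hash = ch => -ch.charCodeAt(0) / 2 & 3;

function aminoAcid(codon) {
  const a = [...codon].reduce((acc, ch) => acc * 4 + hash(ch), 0);
  if (a === 9) return "M";   // AUG: would otherwise read as I
  if (a === 37) return "W";  // UGG: would otherwise read as *
  return HALF_TABLE[a >> 1]; // drop the last bit, as a/2 does in C
}

const protein = "AUGUAUCAUGAGCUCCUUCAGUGGCAAAGACUUGACUGA"
  .match(/.../g).map(aminoAcid).join("");
console.log(protein); // MYHELLQWQRLD*
```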
Commented code
In the golfed version the i%3 code is in the increment position of the for bracket, but it is moved into a more readable position in the commented code.
a,i,j;char x[999]; //Array x used for storing both input and output. i=input pointer, j=output pointer.
main(s){ //s is argc (the command-line argument count), which is 1 when the program is run with no arguments. Will be used to count stops.
for(gets(x);x[i];) //Get a string, loop until end of string (zero byte) found
a=a*4+(-x[i++]/2&3), //Hash character x[i] to a number 0-3. leftshift any value already in a and add the new value. Increment i.
i%3||( //if i divisible by 3,
s-=(x[j++]=a-37?a-9?"KNRSIITTEDGGVVAA*Y*CLFSSQHRRLLPP"[a/2]:77:87)==42, //lookup the correct value in the table. The special cases a=9 and a=37 map to 'M' and 'W' (ASCII 77 and 87). If character is '*' (ASCII 42) decrement s.
x[j]=a=0 //reset a to 0. clear x[j] to terminate output string.
);
puts((*x-77||i%3||s||x[j-1]-42)?"Error":x); //if first character not M or i not divisible by 3 or number of stops not 1 or last character not * print "Error" else print amino acid chain.
}
CJam, 317 121 104 bytes
q3/{{"ACGU"#}%4b"KN T RS IIMI QH P R L ED A G V *Y S *CWC LF"S/{_,4\/*}%s=}%_('M=\)'*=\'*/,1=**\"Error"?
This can still be golfed further.
Updated the mapping mechanism to the one used in edc65's answer. Even though I came up with this on my own, he beat me to it :)
UPDATE : Shortened the codon table map by observing the pattern in it.
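The pattern seems to be that each group of four entries in the 64-character table is either one letter repeated four times, a pair repeated twice, or four distinct letters. As I read it, the `S/{_,4\/*}%s` part splits the compact string on spaces and repeats each chunk 4/length times. A JavaScript sketch of that expansion, with `compact` and `full` as illustrative names:

```javascript
// Compact table copied from the CJam code.
const compact = "KN T RS IIMI QH P R L ED A G V *Y S *CWC LF";

// Repeat each chunk until it is 4 chars long, then join the chunks.
const full = compact.split(" ").map(g => g.repeat(4 / g.length)).join("");

console.log(full.length); // 64
console.log(full);
```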