What's the Language?
Bash, 100%, 100 bytes
sed sX..s.2./s.XX|grep -Po '(?<=>)[^<]+?(?=(,(?! W)| [-&–5]| ?<| [0-79]\d| ?\((?!E|1\.)))'|head -1
Try it online on Ideone.
Verification
$ wget -q https://gist.githubusercontent.com/vihanb/1d99599b50c82d4a6d7f/raw/cd8225de96e9920db93613198b012749f9763e3c/testcases
$ grep -Po '(?<= - ).*' < testcases > input
$ grep -Po '^.*?(?= - )' < testcases > output
$ while read line; do bash headers.sh <<< "$line"; done < input | diff -s - output
Files - and output are identical
Retina 0.8.2, 100%, 75 71 70 68 67 64 59 53 51 bytes
<.*?>
(,| [-&(–5]| [0-7]\d)(?! W|...\)).*
2 |:
This is essentially code golf now, so I had to switch languages.
Try it online!
Verification
$ wget -q https://gist.githubusercontent.com/vihanb/1d99599b50c82d4a6d7f/raw/cd8225de96e9920db93613198b012749f9763e3c/testcases
$ grep -Po '(?<= - ).*' < testcases > input
$ grep -Po '^.*?(?= - )' < testcases > output
$ mono retina/Retina.exe headers.ret < input | head -n -1 | diff -s - output
Files - and output are identical
How it works
The code consists of three simple substitutions (or eliminations). Instead of trying to match the language name, we get rid of all parts of the input string that do not form part of it.
`<.*?>` will match all HTML tags, so the substitution will eliminate them from the input. `.*?` matches any number of characters, but since `?` makes the quantifier lazy, it will match the fewest characters that still allow the entire pattern to match. This avoids deleting the entire input, which will always begin with a `<` and end with a `>`.

The language name now begins with the first character of the remaining modified input string.
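To illustrate why the lazy quantifier matters, here is a minimal sketch that ports the first substitution to Python's `re` module. This is not the Retina program itself (Retina uses .NET regexes), but the two engines should behave the same way for this pattern. The header is the Zozotez Lisp test case quoted later in this answer.

```python
import re

header = ('<h1><a href="http://sylwester.no/zozotez/" rel="nofollow">'
          'Zozotez Lisp</a>: 73</h1>')

# Lazy: every tag is matched on its own, so the text between tags survives.
lazy = re.sub(r"<.*?>", "", header)

# Greedy: a single match runs from the first "<" to the last ">".
greedy = re.sub(r"<.*>", "", header)

print(repr(lazy))    # 'Zozotez Lisp: 73'
print(repr(greedy))  # ''
```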
After the language's name, we will almost always find one of the following endings: `,`, ` -`, ` &`, ` (`, ` –`, ` 5`, or a space followed by two digits.

The first two endings are rather common, and `Python 2 & PuLP...` should be parsed as `Python 2`, `Ruby (2.2.2p95)...` as `Ruby`, `>PHP – 3302 bytes` as `PHP`, and `Perl 5...` as `Perl`.

`(,| [-&(–5]| \d\d).*` would match all these endings (and all characters after them), but it will result in a few false positives:
`,` will match the comma in the language name `Help, WarDoq!`.
` (` will match the version of `JavaScript (ESx)` and `Java (1.8)`.
` \d\d` will match the version in `Ti-Basic 84`.
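The false positives can be reproduced with a short Python sketch of the naive pattern. This is again a port to Python's `re` and not the original Retina code; the helper name `strip_naive` is made up for illustration, and the sample strings are the tag-free fragments mentioned above (with `ES6` standing in for `ESx`).

```python
import re

# Naive pattern from the text: any two digits, no lookahead.
NAIVE = r"(,| [-&(–5]| \d\d).*"

def strip_naive(text: str) -> str:
    """Delete the first ending and everything after it (hypothetical helper)."""
    return re.sub(NAIVE, "", text)

# Endings that are handled as intended.
print(strip_naive("Python 2 & PuLP..."))   # Python 2
print(strip_naive("Ruby (2.2.2p95)..."))   # Ruby
print(strip_naive("PHP – 3302 bytes"))     # PHP
print(strip_naive("Perl 5..."))            # Perl

# False positives: part of the language name is cut off.
print(strip_naive("Help, WarDoq!"))        # Help
print(strip_naive("JavaScript (ES6)"))     # JavaScript
print(strip_naive("Java (1.8)"))           # Java
print(strip_naive("Ti-Basic 84"))          # Ti-Basic
```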
We can fix the third problem case by using `[0-7]\d` instead of `\d\d`, to avoid matching the `8` in `84`.

For the other problem cases, we use the negative lookahead `(?! W|...\))` that will prevent the preceding pattern from matching if it is followed by ` W` (as in `Help, WarDoq!`) or by exactly three characters and a closing parenthesis (as in `(ES6)` or `(1.8)`).

Putting it all together, `(,| [-&(–5]| [0-7]\d)(?! W|...\)).*` matches everything after the language name.

We're left with two problem cases:
<h1>Python <s>2</s> 3, <s>255</s> <s>204</s> <s>180</s> 178 bytes</h1>
<h1><a href="http://sylwester.no/zozotez/" rel="nofollow">Zozotez Lisp</a>: 73</h1>
gets parsed as
Python 2 3
Zozotez Lisp:
We can fix the first one by removing `2 ` and the second one by removing `:` from the output. This is achieved by replacing `2 |:` with the empty string.
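The whole approach can be sketched in Python as three successive `re.sub` calls. This is a port for illustration only, assuming Python's `re` treats these patterns the same way as Retina's .NET engine; the names `STAGES` and `language` are made up. The first two headers are the problem cases quoted above, while the last three are hypothetical headers built around language names mentioned in this answer.

```python
import re

# The three patterns from the Retina program; each match is replaced by "".
STAGES = [
    r"<.*?>",                                # strip HTML tags
    r"(,| [-&(–5]| [0-7]\d)(?! W|...\)).*",  # strip everything after the name
    r"2 |:",                                 # clean up the two remaining cases
]

def language(header: str) -> str:
    """Apply the three eliminations in order (hypothetical helper)."""
    for pattern in STAGES:
        header = re.sub(pattern, "", header)
    return header

# Problem cases quoted in the text.
print(language('<h1>Python <s>2</s> 3, <s>255</s> <s>204</s> '
               '<s>180</s> 178 bytes</h1>'))                  # Python 3
print(language('<h1><a href="http://sylwester.no/zozotez/" '
               'rel="nofollow">Zozotez Lisp</a>: 73</h1>'))   # Zozotez Lisp

# Hypothetical headers exercising the lookahead and the [0-7] restriction.
print(language("<h1>Help, WarDoq!, 100 bytes</h1>"))  # Help, WarDoq!
print(language("<h1>Java (1.8), 50 bytes</h1>"))      # Java (1.8)
print(language("<h1>Ti-Basic 84, 40 bytes</h1>"))     # Ti-Basic 84
```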
CJam, 78.38% (76 bytes)
l{_'>#)>_c'<=}g_'<#<_{",-"&}#)_{_1$=',=+(<}{;}?
Try it online! or count the correct headers.