What's the Language?

Bash, 100%, 100 bytes

sed sX..s.2./s.XX|grep -Po '(?<=>)[^<]+?(?=(,(?! W)| [-&–5]| ?<| [0-79]\d| ?\((?!E|1\.)))'|head -1

Try it online on Ideone.

Verification

$ wget -q https://gist.githubusercontent.com/vihanb/1d99599b50c82d4a6d7f/raw/cd8225de96e9920db93613198b012749f9763e3c/testcases
$ grep -Po '(?<= - ).*' < testcases > input
$ grep -Po '^.*?(?= - )' < testcases > output
$ while read line; do bash headers.sh <<< "$line"; done < input | diff -s - output
Files - and output are identical

Retina 0.8.2, 100%, 75 71 70 68 67 64 59 53 51 bytes

<.*?>

(,| [-&(–5]| [0-7]\d)(?! W|...\)).*

2 |:

This is essentially code golf now, so I had to switch languages.

Try it online!

Verification

$ wget -q https://gist.githubusercontent.com/vihanb/1d99599b50c82d4a6d7f/raw/cd8225de96e9920db93613198b012749f9763e3c/testcases
$ grep -Po '(?<= - ).*' < testcases > input
$ grep -Po '^.*?(?= - )' < testcases > output
$ mono retina/Retina.exe headers.ret < input | head -n -1 | diff -s - output
Files - and output are identical

How it works

The code consists of three simple substitutions (or eliminations). Instead of trying to match the language name, we get rid of all parts of the input string that do not form part of it.

<.*?> will match all HTML tags, so the substitution will eliminate them from the input.

.*? matches any number of characters, but since ? makes the quantifier lazy, it matches as few as possible while still allowing the entire pattern to match. This avoids deleting the entire input, which will always begin with a < and end with a >.
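As a rough illustration of the lazy quantifier, here is a minimal Python sketch. It assumes Python's `re` module behaves like Retina's .NET regex engine for these two patterns; the greedy pattern `<.*>` is not part of the answer and appears only for contrast.

```python
import re

# One of the headers quoted in this answer.
header = ('<h1><a href="http://sylwester.no/zozotez/" rel="nofollow">'
          'Zozotez Lisp</a>: 73</h1>')

lazy = re.sub(r"<.*?>", "", header)   # each match stops at the nearest '>'
greedy = re.sub(r"<.*>", "", header)  # one match runs from the first '<' to the last '>'

print(repr(lazy))    # 'Zozotez Lisp: 73'
print(repr(greedy))  # ''
```

The greedy variant swallows the whole header, which is exactly what the lazy quantifier prevents.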

The language name now begins with the first character of the remaining modified input string.
After the language's name, we will almost always find one of the following endings:

,, -, &, (, –, 5 (each of the last five preceded by a space), or a space followed by two digits.

The first two endings are rather common, and Python 2 & PuLP... should be parsed as Python 2, Ruby (2.2.2p95)... as Ruby, >PHP – 3302 bytes as PHP, and Perl 5... as Perl.

(,| [-&(–5]| \d\d).* would match all these endings (and all characters after them), but it will result in a few false positives:
- , will match the comma in the language name Help, WarDoq!.
- ( will match the version of JavaScript (ESx) and Java (1.8).
- \d\d will match the version in Ti-Basic 84.
We can fix the third problem case by using [0-7]\d instead of \d\d, to avoid matching the 8 in 84.
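The false positives listed above can be reproduced with a small Python sketch. The language names are the ones mentioned in this answer, but the ", N bytes" tails are invented for illustration, and Python's `re` stands in for Retina's regex engine.

```python
import re

NAIVE = re.compile(r"(,| [-&(–5]| \d\d).*")
NARROWED = re.compile(r"(,| [-&(–5]| [0-7]\d).*")

# Hypothetical header texts (tags already stripped); only the names are from the answer.
samples = [
    "Help, WarDoq!, 1 byte",
    "JavaScript (ES6), 50 bytes",
    "Java (1.8), 100 bytes",
    "Ti-Basic 84, 40 bytes",
]

naive_results = {text: NAIVE.sub("", text) for text in samples}
for text, name in naive_results.items():
    print(f"{text!r} -> {name!r}")
# 'Help, WarDoq!, 1 byte' -> 'Help'
# 'JavaScript (ES6), 50 bytes' -> 'JavaScript'
# 'Java (1.8), 100 bytes' -> 'Java'
# 'Ti-Basic 84, 40 bytes' -> 'Ti-Basic'

# With [0-7]\d, " 84" no longer counts as an ending, so the comma ends the name instead.
ti_basic = NARROWED.sub("", "Ti-Basic 84, 40 bytes")
print(repr(ti_basic))  # 'Ti-Basic 84'
```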

For the other problem cases, we use the negative lookahead (?! W|...\)), which prevents the preceding pattern from matching if it is followed by a space and W (as in Help, WarDoq!) or by exactly three characters and a closing parenthesis (as in (ES6) or (1.8)).
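The effect of the lookahead can be checked with a short Python sketch. As before, the ", N bytes" tails are made up for illustration, and Python's `re` is assumed to agree with Retina's engine on this pattern.

```python
import re

ENDING = re.compile(r"(,| [-&(–5]| [0-7]\d)(?! W|...\)).*")

# Hypothetical header texts (tags already stripped); only the names are from the answer.
samples = [
    "Help, WarDoq!, 1 byte",       # first comma is followed by " W": skipped
    "JavaScript (ES6), 50 bytes",  # " (" is followed by "ES6)": skipped
    "Java (1.8), 100 bytes",       # " (" is followed by "1.8)": skipped
    "Ruby (2.2.2p95), 1 byte",     # " (" is followed by "2.2.2": still an ending
]

names = [ENDING.sub("", text) for text in samples]
print(names)
# ['Help, WarDoq!', 'JavaScript (ES6)', 'Java (1.8)', 'Ruby']
```

Whenever the lookahead rejects a candidate ending, the regex engine simply moves on to the next one, which in the first three samples is the comma before the byte count.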

Putting it all together, (,| [-&(–5]| [0-7]\d)(?! W|...\)).* matches everything after the language name.
We're left with two problem cases:
```
<h1>Python <s>2</s> 3, <s>255</s> <s>204</s> <s>180</s> 178 bytes</h1>
<h1><a href="http://sylwester.no/zozotez/" rel="nofollow">Zozotez Lisp</a>: 73</h1>
```
These headers get parsed as
```
Python 2 3
Zozotez Lisp:
```
We can fix the first by removing the 2 together with the space after it, and the second by removing the : from the output.

This is achieved by replacing 2 |: with the empty string.
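For readers without Retina at hand, the three substitutions can be approximated in Python as follows. This is a sketch of the same idea, not the submitted program: the function name `language_name` is hypothetical, and it processes one header at a time rather than the whole input.

```python
import re

def language_name(header: str) -> str:
    """Apply the three eliminations described above to a single header line."""
    s = re.sub(r"<.*?>", "", header)                           # 1. strip HTML tags
    s = re.sub(r"(,| [-&(–5]| [0-7]\d)(?! W|...\)).*", "", s)  # 2. cut at the first ending
    s = re.sub(r"2 |:", "", s)                                 # 3. clean up the two leftovers
    return s

# The two problem headers quoted above.
python_header = ('<h1>Python <s>2</s> 3, <s>255</s> <s>204</s> '
                 '<s>180</s> 178 bytes</h1>')
zozotez_header = ('<h1><a href="http://sylwester.no/zozotez/" rel="nofollow">'
                  'Zozotez Lisp</a>: 73</h1>')

print(language_name(python_header))   # Python 3
print(language_name(zozotez_header))  # Zozotez Lisp
```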

CJam, 78.38% (76 bytes)

l{_'>#)>_c'<=}g_'<#<_{",-"&}#)_{_1$=',=+(<}{;}?

Try it online! or count the correct headers.

Tags:

Parsing

Code Challenge

Test Battery
