Predict whether a message will be starred or not in 50 bytes

Retina, 50 bytes, 72.15%

^.*([[CE;ಠ-ﭏ]|tar|ol|l.x|eo|a.u|pin|nu|o.f|"$)

Tried some regex golfing at @MartinBüttner's suggestion. This matches 704 starred messages and doesn't match 739 unstarred messages.

The ^.*( ... ) wrapper makes sure that each line produces either 0 or 1 match, since Retina outputs the number of matches by default. You can score the program on the input files by prepending m` for multiline mode, then running

Retina stars.retina < starred.txt

and likewise for unstarred.txt.
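To see why the ^.*( ... ) wrapper matters, here is a minimal JavaScript sketch (not Retina itself) that counts matches with and without it. The shortened pattern and the sample lines are made up for illustration.

```javascript
// Hypothetical sample line containing two fragments of interest.
const line = 'star golf';

// Without the wrapper, a global search counts every fragment separately.
const bare = (line.match(/tar|ol/g) || []).length;

// With ^.*( ... ), the match must start at the beginning of the line,
// so each line contributes at most one match.
const wrapped = (line.match(/^.*(tar|ol)/gm) || []).length;

// In multiline mode the total match count equals the number of matching lines.
function countMatchingLines(text) {
  return (text.match(/^.*(tar|ol)/gm) || []).length;
}

console.log(bare, wrapped);
console.log(countMatchingLines('star golf\nhello\npolar star'));
```

With the wrapper in place, the number printed for a whole file can be read directly as the number of lines predicted as starred.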

Analysis / explanation

I generated the above snippets (and many more) using a program, then selected the ones I wanted manually. Here's some intuition as to why the above snippets work:

C: Matches PPCG, @CᴏɴᴏʀO'Bʀɪᴇɴ
E: Matches @ETHproductions, @El'endiaStarman
;: Because the test cases are HTML, this matches the entities &lt; and &gt; (escaped < and >)
ಠ-ﭏ: Matches a range of Unicode characters, most prominently for ಠ_ಠ and @Doorknob冰
tar: Matches variations of star, @El'endiaStarman (again) and also gravatar which appears in the oneboxes posted by new posts bots
ol: Matches rel="nofollow" which is in a lot of links and oneboxes
l.x: Matches @AlexA., @trichoplax
eo: Mainly matches people, but also three cases for @Geobits
a.u: Mainly matches graduation, status, feature and abuse
pin: Matches ping and words ending in ping. Also matches a few posts in a discussion about pineapple, as an example of overfitting.
nu: Matches a mixed bag of words, the most common of which is number
o.f: Matches golf, conf(irm|use)
"$: Matches a double quote as a last character, e.g. @phase He means "Jenga."

The [ is nothing special - I just had a character left over so I figured I could use it to match one more case.
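The breakdown above can be checked fragment by fragment. This is a rough JavaScript sketch, assuming the fragments behave the same in a JavaScript regex as in Retina for these inputs; the helper name whichFragments and the sample messages are only for illustration.

```javascript
// The alternatives from the Retina answer, split out one by one.
const fragments = ['[[CE;ಠ-ﭏ]', 'tar', 'ol', 'l.x', 'eo', 'a.u', 'pin', 'nu', 'o.f', '"$'];

// Hypothetical helper: list every fragment that matches a message.
function whichFragments(message) {
  return fragments.filter(f => new RegExp(f).test(message));
}

console.log(whichFragments('@AlexA. nice'));             // matched by l.x
console.log(whichFragments('golf'));                     // matched by ol and o.f
console.log(whichFragments('@phase He means "Jenga."')); // matched by "$
console.log(whichFragments('hello world'));              // matched by nothing
```

A message is predicted as starred as soon as the list is non-empty, so overlapping fragments (such as ol and o.f on golf) are wasted bytes for that message.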

JavaScript ES6, 50 bytes, 71.10%

Correctly identifies 670 starred and 752 non-starred.

x=>/ .[DERv]|tar|a.u|l.x|<i|eo|ol|[C;ಠ]/.test(x)

Now across the 70% barrier, and beating everyone except Retina!

Returns true if the message contains any of these things:

A word of which the second letter is D, E, R, or v;
tar (usually star);
a and u with one char in between;
l and x with one char in between (usually alex);
italic text;
eo or ol;
a C, a semicolon, or a ಠ.
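A quick way to try the two rules that are specific to this answer (the second-letter rule and the italic rule) is to call the function on a few strings. The sample messages below are invented for illustration.

```javascript
// The answer's function, given a name so it can be called.
const predict = x => / .[DERv]|tar|a.u|l.x|<i|eo|ol|[C;ಠ]/.test(x);

console.log(predict('that is <i>nice</i>')); // italic markup, matched by <i
console.log(predict('I REALLY mean it'));    // " RE": a word whose second letter is E
console.log(predict('see you'));             // none of the fragments occur
```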

Here are a few more fruitful matches, though none of them seems worth swapping in for one of the others:

nf
nu
yp
n.m

This has been growing closer and closer to the Retina answer, but I have found most of the improvements on my own.

Test it out in the console of one of these pages: star texts, no-star texts

var r=document.body.textContent.replace(/\n<br/g,"<br").split("\n").slice(0,-1);
var s=r.filter(function(x){return/ .[DERv]|tar|a.u|l.x|<i|eo|ol|[C;ಠ]/.test(x)}).length;
console.log("Total:",r.length,"Matched:",s,"Not matched:",r.length-s);

Here's an alternate version. /a/.test is technically a function, but doesn't satisfy our criteria:

/ .[ERv]|a.u|l.x|<i|eo|yp|ol|nf|tar|[C;ÿ-ﬀ]/.test

This scores 71.90% (697 starred, 741 unstarred).

I've been running some analyses on the lists to see which regex groups match the most starred and the least unstarred posts. The analyses can be found in this Gist. So far, I've checked aa and a.a matches. a.u is down at around #50 with a score of 28, yet it's the most efficient match of its format...
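An analysis like this could be sketched as follows. The score used here (starred lines matched minus unstarred lines matched) is an assumption, since the exact scoring in the Gist is not described, and the two message lists are toy stand-ins for the real files.

```javascript
// Toy stand-ins for the real starred/unstarred message lists.
const starred = ['status update', 'new feature', 'plain text'];
const unstarred = ['feature creep', 'hello there', 'ok'];

// Assumed score: starred lines matched minus unstarred lines matched.
function scorePattern(pattern, good, bad) {
  const re = new RegExp(pattern);
  const hits = lines => lines.filter(l => re.test(l)).length;
  return hits(good) - hits(bad);
}

// Rank candidate patterns from best to worst by that score.
function rank(patterns, good, bad) {
  return patterns
    .map(p => ({ pattern: p, score: scorePattern(p, good, bad) }))
    .sort((a, b) => b.score - a.score);
}

console.log(rank(['a.u', 'e.t', 'l.o'], starred, unstarred));
```

For byte-limited golfing, a natural refinement would be to divide the score by the pattern's length, but that is a design choice rather than something taken from the analysis above.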

Pyth, 50 bytes, 67.9%

0000000: 21 40 6a 43 22 03 91 5d d3 c3 84 d5 5c df 46 69 b5 9d  !@jC"..]....\.Fi..
0000012: 42 9a 75 fa 74 71 d9 c1 79 1d e7 5d fc 25 24 63 f8 bd  B.u.tq..y..].%$c..
0000024: 1d 53 45 14 d7 d3 31 66 5f e8 22 32 43 7a              .SE...1f_."2Cz

This hashes the input into one of 322 buckets and chooses the Boolean depending on that bucket.
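The general idea of a bucket lookup can be sketched in JavaScript. The hash below (folding the characters as base-256 digits, reduced modulo the table size) and the five-entry table are assumptions for illustration; they are not a byte-for-byte translation of the Pyth program.

```javascript
// Toy table: one Boolean per bucket. The real answer uses 322 buckets;
// these five values are made up.
const table = [true, false, false, true, true];

// Assumed hash: fold the characters as base-256 digits, modulo the table size.
function bucketOf(message, size) {
  let h = 0;
  for (const ch of message) {
    h = (h * 256 + ch.codePointAt(0)) % size;
  }
  return h;
}

// The prediction is simply whatever the table stores for the message's bucket.
function predictStarred(message) {
  return table[bucketOf(message, table.length)];
}

console.log(bucketOf('ab', table.length), predictStarred('ab'));
```

Because the table is fitted to the test files, an approach like this memorizes the data rather than learning anything about what makes a message star-worthy.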

Scoring

$ xxd -c 18 -g 1 startest.pyth
0000000: 72 53 6d 21 40 6a 43 22 03 91 5d d3 c3 84 d5 5c df 46  rSm!@jC"..]....\.F
0000012: 69 b5 9d 42 9a 75 fa 74 71 d9 c1 79 1d e7 5d fc 25 24  i..B.u.tq..y..].%$
0000024: 63 f8 bd 1d 53 45 14 d7 d3 31 66 5f e8 22 32 43 64 2e  c...SE...1f_."2Cd.
0000036: 7a 38                                                  z8
$ echo $LANG
en_US
$ pyth/pyth.py startest.pyth < starred.txt
[[345, False], [655, True]]
$ pyth/pyth.py startest.pyth < unstarred.txt
[[703, False], [297, True]]
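The 67.9% figure follows from these two outputs. Here is a small JavaScript sketch of the arithmetic, assuming each pair reads as [count, prediction].

```javascript
// Counts copied from the scoring output above, as [count, prediction] pairs.
const starredRun = [[345, false], [655, true]];
const unstarredRun = [[703, false], [297, true]];

// Sum the counts whose prediction equals the expected label.
function correct(run, expected) {
  return run
    .filter(([, prediction]) => prediction === expected)
    .reduce((sum, [count]) => sum + count, 0);
}

// Accuracy = correctly classified messages / all messages.
function accuracy(starred, unstarred) {
  const total = [...starred, ...unstarred].reduce((sum, [count]) => sum + count, 0);
  return (correct(starred, true) + correct(unstarred, false)) / total;
}

console.log((accuracy(starredRun, unstarredRun) * 100).toFixed(2) + '%');
```

That is 655 correct starred plus 703 correct unstarred out of 2000 messages.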
