Predict whether a message will be starred or not in 50 bytes
Retina, 50 bytes, 72.15% (previously 71.8%)
^.*([[CE;ಠ-ﭏ]|tar|ol|l.x|eo|a.u|pin|nu|o.f|"$)
Tried some regex golfing at @MartinBüttner's suggestion. This matches 704 starred messages and doesn't match 739 unstarred messages.
The ^.*( ... ) is to make sure that there is always either 0 or 1 match, since Retina outputs the number of matches by default. You can score the program on the input files by prepending m` for multiline mode, then running
Retina stars.retina < starred.txt
and likewise for unstarred.txt.
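As a rough illustration of the point above, here is a minimal JavaScript sketch (not Retina) showing how the ^.*( ... ) wrapper caps the match count at one per line, and how the percentage follows from the two counts. The shortened fragment list and the sample lines are made up for the example; the total of 2000 assumes 1000 lines in each input file.

```javascript
// Sketch: why wrapping the alternation in ^.*( ... ) gives at most one match per line.
// The fragment list is an illustrative subset, not the full 50-byte regex.
const bare = /tar|ol|eo/gm;
const wrapped = /^.*(tar|ol|eo)/gm;

// Counts non-overlapping matches, as a match-counting tool would report them.
function countMatches(regex, text) {
  return (text.match(regex) || []).length;
}

// Made-up sample lines.
const lines = ["people golf with a star", "nothing here"];
const bareCounts = lines.map(l => countMatches(bare, l));
const wrappedCounts = lines.map(l => countMatches(wrapped, l));

// Score in percent: starred lines matched plus unstarred lines not matched, over all lines.
function scorePercent(starredMatched, unstarredUnmatched, total) {
  return (100 * (starredMatched + unstarredUnmatched)) / total;
}

console.log(bareCounts, wrappedCounts);
console.log(scorePercent(704, 739, 2000).toFixed(2) + "%");
```

The bare alternation can match several times in one line, which would inflate a raw match count; the anchored, greedy prefix swallows the whole line up to the last hit, so each line contributes 0 or 1.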
Analysis / explanation
I generated the above snippets (and many more) using a program, then selected the ones I wanted manually. Here's some intuition as to why the above snippets work:
- C: Matches PPCG, @CᴏɴᴏʀO'Bʀɪᴇɴ
- E: Matches @ETHproductions, @El'endiaStarman
- ;: Because the test cases are HTML, this matches the entities &lt; and &gt; (escaped < and >)
- ಠ-ﭏ: Matches a range of Unicode characters, most prominently for ಠ_ಠ and @Doorknob冰
- tar: Matches variations of star, @El'endiaStarman (again) and also gravatar, which appears in the oneboxes posted by new posts bots
- ol: Matches rel="nofollow", which is in a lot of links and oneboxes
- l.x: Matches @AlexA., @trichoplax
- eo: Mainly matches people, but also three cases for @Geobits
- a.u: Mainly matches graduation, status, feature and abuse
- pin: Matches ping and words ending in ping. Also matches a few posts in a discussion about pineapple, as an example of overfitting.
- nu: Matches a mixed bag of words, the most common of which is number
- o.f: Matches golf, conf(irm|use)
- "$: Matches a double quote as a last character, e.g. @phase He means "Jenga."
The [ is nothing special - I just had a character left over so I figured I could use it to match one more case.
JavaScript ES6, 50 bytes, 71.10%
Correctly identifies 670 starred and 752 non-starred.
x=>/ .[DERv]|tar|a.u|l.x|<i|eo|ol|[C;ಠ]/.test(x)
Now across the 70% barrier, and beating everyone except Retina!
Returns true if the message contains any of these things:
- A word of which the second letter is D, E, R, or v;
- tar (usually star);
- a and u with one char in between;
- l and x with one char in between (usually alex);
- italic text;
- eo or ol;
- a C, a semicolon, or a ಠ.
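The list above can be checked piece by piece. The following is a small sketch that splits the regex into its alternatives and reports which ones fire on a message; the labels and the sample messages are made up for illustration, and the helper name explain is hypothetical.

```javascript
// Sketch: report which alternatives of the 50-byte regex fire on a message.
// Labels paraphrase the list above; sample messages below are invented.
const parts = [
  [/ .[DERv]/, "second letter of a word is D, E, R or v"],
  [/tar/, "tar"],
  [/a.u/, "a, any char, u"],
  [/l.x/, "l, any char, x"],
  [/<i/, "italic tag"],
  [/eo/, "eo"],
  [/ol/, "ol"],
  [/[C;ಠ]/, "C, semicolon or ಠ"],
];

// The full regex from the answer, for comparison.
const full = / .[DERv]|tar|a.u|l.x|<i|eo|ol|[C;ಠ]/;

// Returns the labels of every alternative that matches the message.
function explain(message) {
  return parts.filter(([re]) => re.test(message)).map(([, label]) => label);
}

for (const msg of ["I gave it a star", "hello world", "@AlexA. nice"]) {
  console.log(JSON.stringify(msg), full.test(msg), explain(msg));
}
```

A message is predicted as starred exactly when at least one alternative fires, so explain returning an empty list corresponds to the full regex returning false.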
Here are a few more fruitful matches that don't seem worth dropping any of the others for:
- nf
- nu
- yp
- n.m
This has been growing closer and closer to the Retina answer, but I have found most of the improvements on my own.
Test it out in the console of one of these pages: star texts, no-star texts
var r=document.body.textContent.replace(/\n<br/g,"<br").split("\n").slice(0,-1);
var s=r.filter(function(x){return/ .[DERv]|tar|a.u|l.x|<i|eo|ol|[C;ಠ]/.test(x)}).length;
console.log("Total:",r.length,"Matched:",s,"Not matched:",r.length-s);
Here's an alternate version. /a/.test is technically a function, but doesn't satisfy our criteria:
/ .[ERv]|a.u|l.x|<i|eo|yp|ol|nf|tar|[C;ÿ-ﬀ]/.test
This scores 71.90% (697 starred, 741 unstarred).
I've been running some analyses on the lists to see which regex groups match the most starred and the least unstarred posts. The analyses can be found in this Gist. So far, I've checked aa and a.a matches. a.u is down at around #50 with a score of 28, yet it's the most efficient match of its format...
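An analysis of this kind might look roughly like the sketch below. It is not the code from the Gist: the score definition (starred hits minus unstarred hits), the function name rankFragments, and the tiny sample lists are all assumptions made for illustration.

```javascript
// Sketch: rank candidate regex fragments by (starred hits - unstarred hits).
// ASSUMPTION: this score definition and the sample lists are illustrative only.
function rankFragments(fragments, starred, unstarred) {
  const hits = (re, lines) => lines.filter(l => re.test(l)).length;
  return fragments
    .map(src => {
      const re = new RegExp(src);
      const s = hits(re, starred);
      const u = hits(re, unstarred);
      return { fragment: src, starred: s, unstarred: u, score: s - u };
    })
    .sort((a, b) => b.score - a.score); // best fragments first
}

// Made-up sample data standing in for the real message lists.
const starredSample = ["gold star for you", "status update", "people say so"];
const unstarredSample = ["ok", "see the status page", "thanks"];
const ranking = rankFragments(["tar", "a.u", "eo"], starredSample, unstarredSample);

console.log(ranking);
```

On the real lists one would generate every two-letter and letter-wildcard-letter fragment and feed them in, then pick the top entries by hand while keeping an eye on the byte budget.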
Pyth, 50 bytes, 67.9%
0000000: 21 40 6a 43 22 03 91 5d d3 c3 84 d5 5c df 46 69 b5 9d !@jC"..]....\.Fi..
0000012: 42 9a 75 fa 74 71 d9 c1 79 1d e7 5d fc 25 24 63 f8 bd B.u.tq..y..].%$c..
0000024: 1d 53 45 14 d7 d3 31 66 5f e8 22 32 43 7a .SE...1f_."2Cz
This hashes the input into one of 322 buckets and chooses the Boolean depending on that bucket.
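To make the idea concrete, here is a minimal JavaScript sketch of a hash-bucket classifier. It is not a translation of the Pyth program: the hash used here (reading the message as a base-256 integer and reducing it modulo the table size) is an assumption, and the 7-entry table is invented rather than being the 322-bucket table packed into the byte string.

```javascript
// Sketch of a hash-bucket classifier.
// ASSUMPTIONS: the hash (base-256 integer modulo table size) and the table
// contents are illustrative; this is not the table from the Pyth answer.
const table = [1, 0, 0, 1, 1, 0, 1]; // hypothetical bucket bits, 1 = predict "starred"

// Maps a message to a bucket index in [0, buckets).
function bucketOf(message, buckets) {
  let n = 0n;
  for (const ch of message) {
    n = n * 256n + BigInt(ch.codePointAt(0));
  }
  return Number(n % BigInt(buckets));
}

// Looks up the precomputed answer for the message's bucket.
function predictStarred(message) {
  return table[bucketOf(message, table.length)] === 1;
}

console.log(bucketOf("A", 7), predictStarred("A"));
console.log(bucketOf("AB", 7), predictStarred("AB"));
```

With such a scheme the table is chosen offline so that each bucket holds the majority label of the training messages that land in it, which is why the approach fits the given data but says little about unseen messages.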
Scoring
$ xxd -c 18 -g 1 startest.pyth
0000000: 72 53 6d 21 40 6a 43 22 03 91 5d d3 c3 84 d5 5c df 46 rSm!@jC"..]....\.F
0000012: 69 b5 9d 42 9a 75 fa 74 71 d9 c1 79 1d e7 5d fc 25 24 i..B.u.tq..y..].%$
0000024: 63 f8 bd 1d 53 45 14 d7 d3 31 66 5f e8 22 32 43 64 2e c...SE...1f_."2Cd.
0000036: 7a 38 z8
$ echo $LANG
en_US
$ pyth/pyth.py startest.pyth < starred.txt
[[345, False], [655, True]]
$ pyth/pyth.py startest.pyth < unstarred.txt
[[703, False], [297, True]]