Hungarian alphabetical order
Java 8, 742 bytes
Could save another 3 bytes by naming the function s instead of sort, or another 16 bytes by not counting the class definition.
public class H{String d="cs|dzs?|gy|ly|sz|ty|zs";void sort(java.util.List<String>l){l.sort((a,b)->{String o="-a-á-b-cs-dzs-e-é-f-gy-h-i-í-j-k-ly-m-ny-o-ó-ö-ő-p-q-r-sz-ty-u-ú-ü-ű-v-w-x-y-zs-";int i=c(r(a),r(b),r(o));return i!=0?i:(i=c(a,b,o))!=0?i:b.charAt(0)-a.charAt(0);});}String r(String a){for(int i=0;i<8;i++)a=a.toLowerCase().replace("ááéíóőúű".charAt(i),"aaeioöuü".charAt(i));return a;}int c(String a,String b,String o){a=n(a);b=n(b);while(!"".equals(a+b)){int i=p(a,o),j=p(b,o);if(i!=j)return i-j;a=a.substring(i%4);b=b.substring(j%4);}return 0;}int p(String a,String o){a=(a+1).replaceAll("("+d+"|.).*","-$1");return o.indexOf(a)*4+a.length()-1;}String n(String a){return a.toLowerCase().replaceAll("(.)(?=\\1)("+d+")| |-","$2$2");}}
Can be used like this:
new H().sort(list);
Test-suite:
// Needs java.util.ArrayList, java.util.Arrays, java.util.List and java.util.Random.
public static void main(String[] args) {
    test(Arrays.asList("cudar", "cukor", "cuppant", "csalit", "csata"));
    test(Arrays.asList("kasza", "kaszinó", "kassza", "kaszt", "nagy", "naggyá", "nagygyakorlat", "naggyal",
            "nagyít"));
    test(Arrays.asList("jácint", "Jácint", "Zoltán", "zongora"));
    test(Arrays.asList("Eger", "egér", "író", "iroda", "irónia", "kerek", "kerék", "kérek", "szúr", "szül"));
    test(Arrays.asList("márvány", "márványkő", "márvány sírkő", "Márvány-tenger", "márványtömb"));
}

// Shuffles the expected order, sorts it, and prints whether the result matches the input.
private static void test(final List<String> input) {
    final ArrayList<String> random = randomize(input);
    System.out.print(input + " -> " + random);
    new H().sort(random);
    System.out.println(" -> " + random + " -> " + input.equals(random));
}

private static ArrayList<String> randomize(final List<String> input) {
    final ArrayList<String> temp = new ArrayList<>(input);
    final ArrayList<String> randomOrder = new ArrayList<>(input.size());
    final Random r = new Random();
    for (int i = 0; i < input.size(); i++) {
        randomOrder.add(temp.remove(r.nextInt(temp.size())));
    }
    return randomOrder;
}
yielding
[cudar, cukor, cuppant, csalit, csata] -> [csata, cudar, cuppant, csalit, cukor] -> [cudar, cukor, cuppant, csalit, csata] -> true
[kasza, kaszinó, kassza, kaszt, nagy, naggyá, nagygyakorlat, naggyal, nagyít] -> [naggyá, kassza, kaszinó, nagygyakorlat, nagyít, nagy, kaszt, kasza, naggyal] -> [kasza, kaszinó, kassza, kaszt, nagy, naggyá, nagygyakorlat, naggyal, nagyít] -> true
[jácint, Jácint, Zoltán, zongora] -> [Zoltán, jácint, zongora, Jácint] -> [jácint, Jácint, Zoltán, zongora] -> true
[Eger, egér, író, iroda, irónia, kerek, kerék, kérek, szúr, szül] -> [egér, Eger, kerék, iroda, író, kerek, kérek, szúr, irónia, szül] -> [Eger, egér, író, iroda, irónia, kerek, kerék, kérek, szúr, szül] -> true
[márvány, márványkő, márvány sírkő, Márvány-tenger, márványtömb] -> [márványtömb, márványkő, Márvány-tenger, márvány sírkő, márvány] -> [márvány, márványkő, márvány sírkő, Márvány-tenger, márványtömb] -> true
Ungolfed:
public class HungarianOrder {
    String d = "cs|dzs?|gy|ly|sz|ty|zs";

    void sort(java.util.List<String> l) {
        l.sort((a, b) -> {
            String o = "-a-á-b-cs-dzs-e-é-f-gy-h-i-í-j-k-ly-m-ny-o-ó-ö-ő-p-q-r-sz-ty-u-ú-ü-ű-v-w-x-y-zs-";
            int i = c(r(a), r(b), r(o));
            return i != 0 ? i
                 : (i = c(a, b, o)) != 0 ? i
                 : b.charAt(0) - a.charAt(0);
        });
    }

    // toLower + remove long accent
    String r(String a) {
        for (int i = 0; i < 8; i++)
            a = a.toLowerCase().replace("ááéíóőúű".charAt(i), "aaeioöuü".charAt(i));
        return a;
    }

    // iterate over a and b comparing positions of chars in o
    int c(String a, String b, String o) {
        a = n(a);
        b = n(b);
        while (!"".equals(a + b)) {
            int i = p(a, o), j = p(b, o);
            if (i != j)
                return i - j;
            a = a.substring(i % 4);
            b = b.substring(j % 4);
        }
        return 0;
    }

    // find index in o, then check whether the following characters match
    // returns index * 4 + length of match; if the String is empty or its first
    // character is unknown, indexOf yields -1 and the result is negative (-3)
    int p(String a, String o) {
        a = (a + 1).replaceAll("(" + d + "|.).*", "-$1");
        return o.indexOf(a) * 4 + a.length() - 1;
    }

    // expand ddz -> dzdz and such; also drops spaces and hyphens
    String n(String a) {
        return a.toLowerCase().replaceAll("(.)(?=\\1)(" + d + ")| |-", "$2$2");
    }
}
I am using Java's List type and its sort() method, but the comparator is all mine.
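The digraph handling done by n() and p() can be restated more readably. The following is a minimal Python sketch, not the author's code: the names ALPHABET, MULTI, letters and rank_key are hypothetical, the full 44-letter alphabet (including ny) is an assumption of the sketch, and it only illustrates letter splitting and geminate expansion, not the accent and case tie-breaking of the comparator above.

```python
import re

# The 44-letter Hungarian alphabet in collation order (an assumption of this
# sketch; the golfed Java encodes a compressed form of it in `o`).
ALPHABET = ("a á b c cs d dz dzs e é f g gy h i í j k l ly m n ny o ó ö ő "
            "p q r s sz t ty u ú ü ű v w x y z zs").split()
RANK = {letter: i for i, letter in enumerate(ALPHABET)}

# Multi-character letters; "dzs" is listed before "dz" so the longer one wins.
MULTI = "dzs|dz|cs|gy|ly|ny|sz|ty|zs"
TOKEN = re.compile(MULTI + "|.")
# A doubled first character followed by a digraph/trigraph, e.g. "ssz", "ggy".
GEMINATE = re.compile(r"(.)(?=\1)(" + MULTI + ")")


def letters(word):
    """Split a word into Hungarian letters, expanding ssz -> sz sz and such."""
    word = word.lower().replace(" ", "").replace("-", "")
    word = GEMINATE.sub(r"\2\2", word)
    return TOKEN.findall(word)


def rank_key(word):
    """Sort key: alphabet position of each letter (unknown letters sort last)."""
    return [RANK.get(t, len(ALPHABET)) for t in letters(word)]


print(letters("kassza"))
print(sorted(["csata", "cukor", "csalit", "cudar", "cuppant"], key=rank_key))
```

Like the Java version, this greedy split cannot tell a real digraph from two adjacent consonants across a compound boundary, so a word such as házszám is split as h, á, zs, z, á, m.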
Perl, 250
Includes +11 for -Mutf8 -CS.
use Unicode::Normalize;$r="(?=cs|zs|dz|sz|[glnt]y)";print map/\PC*
/g,sort map{$d=$_;s/d\Kd(zs)|(.)\K$r\2(.)/\L$+\E$&/gi;s/d\Kzs/~$&/gi;s/$r.\K./~$&/gi;s/(\p{Ll}*)(\w?)\s*-*/\U$1\L$2/g;$c=$_;$b=$_=NFD lc;y/̈̋/~~/d;join$;,$_,$b,$c,$d}<>
Uses the decorate-sort-undecorate idiom (AKA Schwartzian Transform), and multilevel sorting†, where the levels are:
- L1: compare base letters, ignore diacritics, case, and some punctuation.
- L2: compare base letters and diacritics, ignore case and some punctuation.
- L3: compare base letters, diacritics and case, ignore some punctuation.
- Ln: tie-breaking byte-level comparison.
Internally, ␜ (ASCII 0x1C Field Separator, whose value is less than any character in the alphabet for this challenge) is used as a level separator.
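The level scheme above can be sketched in Python as follows. This is a simplified illustration, not a port of the Perl: the names decorate and multilevel_sorted are hypothetical, the digraph and trigraph handling is left out entirely, and the input is assumed not to contain the separator character.

```python
import unicodedata

SEP = "\x1c"  # ASCII Field Separator, sorts below every letter used here

# L1 mapping on NFD text: diaeresis and double acute become "~" (so ö/ő and
# ü/ű sort after o/u plus any letter), the acute accent is dropped.
L1_TABLE = {0x308: "~", 0x30B: "~", 0x301: None}


def decorate(word):
    """Build one string holding all levels, most significant first."""
    # L3: drop spaces and hyphens, swap case so lower sorts before upper.
    l3 = word.replace(" ", "").replace("-", "").swapcase()
    # L2: case-insensitive, diacritics kept as combining marks.
    l2 = unicodedata.normalize("NFD", l3.lower())
    # L1: base letters only (apart from the "~" marker).
    l1 = l2.translate(L1_TABLE)
    # Ln: the untouched word as a final tie-breaker.
    return SEP.join([l1, l2, l3, word])


def multilevel_sorted(words):
    # decorate, sort, undecorate
    return [d.rsplit(SEP, 1)[1] for d in sorted(decorate(w) for w in words)]


print(multilevel_sorted(["Zoltán", "jácint", "zongora", "Jácint"]))
print(multilevel_sorted(["kérek", "kerék", "kerek", "egér", "Eger"]))
```

Joining the levels with a separator that sorts below every other character makes a plain string comparison behave like a level-by-level comparison: a later level is only reached when all earlier levels are equal.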
This implementation has many limitations, amongst them:
- No support for foreign characters.
- Cannot disambiguate between contracted geminated (long) digraphs/trigraphs, and consonant+digraph/trigraph, e.g: könnyű should collate as <k><ö><ny><ny><ű>, while tizennyolc should collate as <t><i><z><e><n><ny><o><l><c>; házszám 'address = house (ház) number (szám)' should collate as <h><á><z><sz><á><m> and not as *<h><á><zs><z><á><m>.‡
- Collation for contracted long digraphs is not that consistent (but it is stable): we disambiguate at the identical level (ssz <n szsz, ..., zszs <n zzs); glibc collates the short forms before the full forms (ssz < szsz, ..., zzs < zszs); ICU collates the long forms before the short forms starting at L3 Case and Variants (szsz <3 ssz, ..., zszs <3 zzs).
Expanded version:
use Unicode::Normalize;
$r="(?=cs|zs|dz|sz|[glnt]y)"; # look-ahead for digraphs
print map/\PC*\n/g, # undecorate
sort # sort
map{ # decorate
$d=$_; # Ln: identical level
# expand contracted digraphs and trigraphs
s/d\Kd(zs)|(.)\K$r\2(.)/\L$+\E$&/gi;
# transform digraphs and trigraphs so they
# sort correctly
s/d\Kzs/~$&/gi;s/$r.\K./~$&/gi;
# swap case, so lower sorts before upper
# also, get rid of space, hyphen, and newline
s/(\p{Ll}*)(\w?)\s*-*/\U$1\L$2/g;
$c=$_; # L3: Case
$b=$_=NFD lc; # L2: Diacritics
# transform öő|üű so they sort correctly
# ignore diacritics (acute) at this level
y/\x{308}\x{30b}\x{301}/~~/d;
# L1: Base characters
join$;,$_,$b,$c,$d
}<>
†. Some well-known multi-level collation algorithms are the Unicode Collation Algorithm (UCA, Unicode UTS#10), ISO 14651 (available at the ISO ITTF site), the LC_COLLATE parts of ISO TR 30112 (draft available at the ISO/IEC JTC1/SC35/WG5 home), which obsoletes ISO/IEC TR 14652 (available at the ISO/IEC JTC1/SC22/WG20 home), and LC_COLLATE in POSIX.
‡. Doing this correctly would require a dictionary. ICU treats weirdly capitalized groups as non-contractions/non-digraphs/non-trigraphs, e.g: ccS <3 CcS <3 cCs <3 cCS <3 CCs <3 cS <3 cs <3 Cs <3 CS <3 ccs <3 Ccs <3 CCS
Python 3, 70
Saved 8 bytes thanks to shooqie.
I love Python. :D
Expects a list of strings.
from locale import*;setlocale(0,'hu')
f=lambda x:sorted(x,key=strxfrm)
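The one-liner relies on a Hungarian locale being installed under the name 'hu', which is not the case on every system. A more defensive, ungolfed sketch might look like this; the candidate locale names and the function name hungarian_sorted are assumptions, and the function returns None when no candidate locale can be set.

```python
import locale

# Assumed locale names; which of them exist depends on the operating system.
CANDIDATES = ("hu_HU.UTF-8", "hu_HU", "hu")


def hungarian_sorted(words):
    """Sort with the system's Hungarian collation, or return None if absent."""
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    for name in CANDIDATES:
        try:
            locale.setlocale(locale.LC_COLLATE, name)
        except locale.Error:
            continue  # this name is not available here, try the next one
        try:
            return sorted(words, key=locale.strxfrm)
        finally:
            locale.setlocale(locale.LC_COLLATE, saved)  # restore
    return None


print(hungarian_sorted(["csata", "cukor", "cudar"]))
```

The resulting order is whatever the platform's collation tables define, so it can differ between systems.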