Frequency rank of Russian words
Pursuant to my comments, here is a method to get all the frequencies by paginating requests to the WRI servers.
pages[l_, d_] :=
Span @@@ NestList[{#[[2]], #[[2]] + d} &, {1, d}, Round[Length@l/d]]
words = WordList["KnownWords", Language -> "Russian"]
freqs = Join @@
Table[WordFrequencyData[words[[p]], Language -> "Russian"], {p,
pages[words, 100]}];
This will get all the frequencies, 100 at a time (to prevent request timeouts). Sadly this takes approximately forever (which is why I suggest finding and downloading the source dataset for yourself, if you can).
I ran this for the first 1000 words in words
with a page size of 100, which took about two minute, by doing
freqs = Join @@
Table[WordFrequencyData[words[[p]], Language -> "Russian"], {p,
pages[words[[ ;; 1000]], 100]}];
Now your function is fairly simple: simply order the results by the frequency data and get the position of a word:
Position[Keys@ReverseSort@DeleteMissing@freqs, "была"][[1, 1]]
Of the first 1000 words, that is the 18th most common.
Or in function form:
russianWordFrequencyRank[w_] :=
Position[Keys@ReverseSort@DeleteMissing@freqs, w][[1, 1]]
russianWordFrequencyRank["дан"]
which gives 357 - that is, it is the 357th most-common word in the Russian language that exists in the first 1000 words in words
. If you create freqs
based on all of words
, you should get roughly the correct result.
Using Carl Lange's excellent idea of "paginating requests to prevent timeouts" and some psychological balm from Monitor
, I downloaded all the data from Wolfram's server.
I then sorted it by rank and uploaded it github for future convenience.
data = Import["https://raw.githubusercontent.com/ccosnett/russianWordFrequenciesRanks/main/russianWordFrequencies.mx"]
russianWordFrequencyRank[w_] := Position[Import["https://raw.githubusercontent.com/ccosnett/russianWordFrequenciesRanks/main/russianWordFrequencies.mx"], {w, _, _}, 3][[1, 1]];
.
russianWordFrequencyRank["товарищ"]
1239