Frequency rank of Russian words

Pursuant to my comments, here is a method to get all the frequencies by paginating requests to the WRI servers.

(* Non-overlapping spans 1;;d, d+1;;2d, ..., clipped to the length of l *)
pages[l_, d_] := 
 Table[i ;; Min[i + d - 1, Length@l], {i, 1, Length@l, d}]

words = WordList["KnownWords", Language -> "Russian"]

freqs = Join @@ 
   Table[WordFrequencyData[words[[p]], Language -> "Russian"], {p, 
     pages[words, 100]}];

This will get all the frequencies, 100 at a time (to prevent request timeouts). Sadly this takes approximately forever (which is why I suggest finding and downloading the source dataset for yourself, if you can).

I ran this for the first 1000 words in words with a page size of 100, which took about two minutes:

freqs = Join @@ 
   Table[WordFrequencyData[words[[p]], Language -> "Russian"], {p, 
     pages[words[[ ;; 1000]], 100]}];

Now your function is fairly simple: order the results by frequency and take the position of the word:

Position[Keys@ReverseSort@DeleteMissing@freqs, "была"][[1, 1]]

Of the first 1000 words, that is the 18th most common.

Or in function form:

russianWordFrequencyRank[w_] := 
 Position[Keys@ReverseSort@DeleteMissing@freqs, w][[1, 1]]

russianWordFrequencyRank["дан"]

which gives 357; that is, it is the 357th most common of the first 1000 words in words. If you build freqs from all of words, you should get roughly the correct rank for the language as a whole.
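Note that the Position-based function throws a Part error for a word that is not in freqs, and it re-sorts the whole association on every call. A minimal sketch of one way around both, assuming hypothetical helper names rankTable and rankOf and a toy stand-in for freqs (the numbers below are made up for illustration, not real frequencies):

```wolfram
(* Toy stand-in for freqs; real values would come from WordFrequencyData. *)
toyFreqs = <|"и" -> 0.03, "была" -> 0.001, "дан" -> 0.0001, 
   "ъъ" -> Missing["NotAvailable"]|>;

(* Hypothetical helper: build a word -> rank association once.
   Rank 1 is the most frequent word; Missing entries are dropped. *)
rankTable[f_Association] := 
 With[{sorted = Keys@ReverseSort@DeleteMissing@f}, 
  AssociationThread[sorted -> Range@Length@sorted]]

(* Hypothetical helper: look a word up, returning Missing["NotFound"]
   for unknown words instead of failing. *)
rankOf[ranks_Association, w_String] := 
 Lookup[ranks, w, Missing["NotFound"]]

toyRanks = rankTable[toyFreqs];
rankOf[toyRanks, "была"]
rankOf[toyRanks, "нет"]
```

With the real data you would call rankTable[freqs] once and reuse the result for every lookup.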

Using Carl Lange's excellent idea of "paginating requests to prevent timeouts" and some psychological balm from Monitor, I downloaded all the data from Wolfram's server.
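The download script itself is not shown here, so the following is only a sketch of how chunked requests might be combined with Monitor. The helper name fetchInChunks is hypothetical, and the fetch function is passed in as an argument so the same code can be tried offline with a stub before pointing it at the server:

```wolfram
(* Hypothetical helper: split list into chunks of at most d items,
   apply fetch to each chunk, and merge the resulting associations.
   Monitor shows which chunk is in progress (in a notebook front end). *)
fetchInChunks[list_List, d_Integer?Positive, fetch_] := 
 Module[{chunks = Partition[list, UpTo[d]], i = 0}, 
  Monitor[
   Join @@ Table[i++; fetch[chunk], {chunk, chunks}], 
   Row[{"chunk ", i, " of ", Length@chunks}]]]

(* Offline check with a stub that maps each word to its length *)
fetchInChunks[{"a", "bb", "ccc"}, 2, 
 AssociationThread[# -> StringLength /@ #] &]

(* Against the Wolfram servers (slow), assuming words is defined as above: *)
(* freqs = fetchInChunks[words, 100, 
     WordFrequencyData[#, Language -> "Russian"] &]; *)
```

Partition with UpTo keeps the final short chunk, so no words are dropped when the list length is not a multiple of the chunk size.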

I then sorted it by rank and uploaded it to GitHub for future convenience.

data = Import["https://raw.githubusercontent.com/ccosnett/russianWordFrequenciesRanks/main/russianWordFrequencies.mx"]


(* Reuse the already-imported data rather than downloading the file on every call *)
russianWordFrequencyRank[w_] := Position[data, {w, _, _}, 3][[1, 1]];

russianWordFrequencyRank["товарищ"]

1239

Tags: Linguistics, Curated Data
