Frequency of words in non-English language text: how can I merge singular and plural forms etc.?

You really are not going to be able to do this with a simplistic sed script. I’m assuming that you will want to reduce to “citation forms”, collapsing all inflections into a base form.

That means that adjectives like protégé, protégés, protégée, protégées all count as the same thing, the base adjective/participle protégé. Similarly, all inflections of the verb protéger — like protège, protégeons, protégeais, protégeasse, protégeâmes, protégeront, protégeraient, etc. — would all reduce to that base verb.

That means you need to know things about the inflectional morphology of the language. Even worse, you will need to understand something about the actual syntax of the language, both to resolve the inflections and to distinguish homographs.

I have taken very simple approaches to at least the first part of this using Perl. It’s really rather a pain in the butt. Here’s a sample of code I used for generating sort keys for cities and towns on the Iberian Peninsula:

    # 1st strip leading articles
    s/^L'//;                # Catalan
    s{ ^
       (?:
           # Castilian
             El
           | Los
           | La
           | Las

           # Catalan
           | Els
           | Les
           | Sa
           | Es

           # Gallego
           | O
           | Os
           | A
           | As
       )
       \s+
    }{}x;

    # 2nd strip interior particles
    s/\b[dl]'//g;           # Catalan
    s{
       \b
       (?:
             el  | los | la | las | de  | del | y          # ES
           | els | les | i  | sa  | es  | dels             # CA
           | o   | os  | a  | as  | do  | da  | dos | das  # GAL
       )
       \b
    }{}gx;

That strips the articles and particles so that they don’t count for purposes of sorting. But you will have to deal with forms like l’autre with a so-called curly-quote, which is really U+2019 RIGHT SINGLE QUOTATION MARK, the preferred form for the apostrophe. I normalized those into straight ones by running s/’/'/g first.
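For a pipeline built from shell tools rather than Perl, the same normalization can be sketched with sed. This is only an illustration: the function name normalize_apostrophes is made up here, and it assumes both the script and the input are UTF-8.

```shell
# Hypothetical helper: replace U+2019 (curly apostrophe) with a plain
# ASCII apostrophe on standard input. Assumes UTF-8 input.
normalize_apostrophes() {
    sed "s/’/'/g"
}
```

Run it before any step that splits or strips on the apostrophe, for example `normalize_apostrophes < texte.txt | ...`, where texte.txt stands for your own input file.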

Oh, and you will have to deal with encodings: MacRoman is not the same as UTF-8 or ISO-8859-1 — not by a long shot.
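One way to take encodings out of the picture is to convert everything to UTF-8 up front with iconv. The sketch below assumes you already know the source encoding; the function name to_utf8 is invented for this example, and the encoding names accepted vary between iconv implementations (MacRoman, for instance, is often spelled MACINTOSH).

```shell
# Hypothetical helper: convert standard input from the encoding named in $1
# to UTF-8. The name must be one that the local iconv understands.
to_utf8() {
    iconv -f "$1" -t UTF-8
}
```

For example, `to_utf8 ISO-8859-1 < ancien.txt > ancien.utf8.txt` converts a Latin-1 file, where ancien.txt is a placeholder for your own file. Feeding it the wrong source encoding will not fail loudly; it will just produce the wrong accented letters.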

Honestly, you probably want to use something like the Snowball stemming algorithm, specifying French as the language. Certainly Perl’s Lingua::Stem::Snowball module knows how to do this. You can search for Perl modules having to do with French linguistics using this query.

But stemming will only take you so far. You won’t really do a good job until you apply morphosyntactic analysis — which means you have to generate a parse for the sentences and assign parts of speech to each element there.

This requires much more work. The good news is that there are dedicated tools for this out there, some of which do indeed work on French. But this really is biting off a great deal, because now you’ve ventured into the fields of Natural Language Processing and Computational Linguistics. There is no great home for such questions here, but they might be better answered on Linguistics.SE; I don’t know.

Natural language processing is complex. Doing it with regular expressions is like parsing HTML with regular expressions, only worse. Read tchrist's excellent answer for some insight as to how to approach your problem. I'm going to briefly answer the part about the portability of your use of unix text processing tools.

The common denominator of all modern unix-like systems is the POSIX specification. The most useful resource is the Open Group Base Specifications Issue 6, a.k.a. Single UNIX Specification version 3 (Issue 7, i.e. SUS version 4, is not fully implemented on many systems), which includes and extends POSIX and, usefully, is available online and for download (e.g. in Debian). If you're only interested in portability to non-embedded Linux (and Cygwin) and to OSX, check the GNU manuals and the OSX man pages.

You are using several non-POSIX options to grep, but all of them are available in both GNU and OSX (OSX uses the grep from FreeBSD which seeks to emulate most GNU constructs). If you want POSIX, you'll need to avoid a few options:

grep -h to suppress the file name: call grep on one file at a time, or pass the files to cat first.
grep -o to output only the matched part: use sed or awk instead.
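grep -w to match only whole words: search with an extended regular expression (grep -E) for a pattern like (^|[^[:alnum:]_])needle($|[^[:alnum:]_]). The underscore is included because grep -w counts it as a word character.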
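The three workarounds can be sketched as small shell functions. The names grep_h, grep_o and grep_w are invented for this illustration, and the code is a rough approximation rather than a drop-in replacement. In particular, grep_o takes an extended regular expression, passes it through awk -v (so backslashes in the pattern are subject to awk's escape processing), and does not treat ^ correctly after the first match on a line.

```shell
# Hypothetical stand-in for "grep -h PATTERN FILE...": concatenate the files
# first, so grep never sees file names and cannot print them.
grep_h() {
    pattern=$1
    shift
    cat -- "$@" | grep -e "$pattern"
}

# Hypothetical stand-in for "grep -o ERE": print every non-empty match of the
# extended regular expression $1, one per line, using awk match().
grep_o() {
    awk -v re="$1" '{
        line = $0
        while (length(line) > 0 && match(line, re)) {
            if (RLENGTH > 0)
                print substr(line, RSTART, RLENGTH)
            # Advance by at least one character so empty matches cannot loop.
            line = substr(line, RSTART + (RLENGTH > 0 ? RLENGTH : 1))
        }
    }'
}

# Hypothetical stand-in for "grep -w WORD": require a non-word character
# (or a line boundary) on both sides of the word.
grep_w() {
    grep -E "(^|[^[:alnum:]_])${1}(\$|[^[:alnum:]_])"
}
```

For plain word extraction, which is what a frequency count needs, `tr -cs '[:alpha:]' '\n'` is often a simpler substitute for grep -o, though how it treats accented letters depends on the tr implementation and locale.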

You are using one GNU-only construct in sed: the \L directive to lowercase a replacement in the s command. There's nothing like that in other sed implementations. In general, you can use awk instead: break down the input to isolate the string to replace and call tolower. To lowercase the whole input, call tr '[:upper:]' '[:lower:]'.
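As a rough sketch of both suggestions, the functions below lowercase either the whole input or only the first part of each line that matches a pattern. The names lower_all and lower_match are made up for this example, and lower_match only handles the first match per line. Note that many tr and awk implementations only change case for ASCII letters, so accented capitals such as É in UTF-8 text may pass through unchanged depending on the implementation and locale.

```shell
# Hypothetical helper: lowercase all of standard input.
lower_all() {
    tr '[:upper:]' '[:lower:]'
}

# Hypothetical helper: lowercase only the first substring of each line that
# matches the extended regular expression $1, leaving the rest untouched.
lower_match() {
    awk -v re="$1" '{
        if (match($0, re))
            $0 = substr($0, 1, RSTART - 1) \
                 tolower(substr($0, RSTART, RLENGTH)) \
                 substr($0, RSTART + RLENGTH)
        print
    }'
}
```

For instance, `lower_match '[A-Z][A-Z]+'` turns the first run of two or more ASCII capitals on each line into lowercase, which is roughly what a sed replacement using \L on a captured group would do.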
