Classifying Text Based on Groups of Keywords?

You might want to look at the category of "similarity measures" or "distance measures" (which, in data mining lingo, is different from "classification").

Basically, a similarity measure is a mathematical way to:

  1. Take two sets of data (in your case, words)
  2. Do some computation/equation/algorithm
  3. Get back a number that tells you how "similar" the two sets are.

With similarity measures, the result is a number between 0 and 1, where 0 means "nothing matches at all" and 1 means "identical".

So you can actually think of your sentence as a vector - and each word in your sentence represents an element of that vector. Likewise for each category's list of keywords.

And then you can do something very simple: take the "cosine similarity" or "Jaccard index" (depending on how you structure your data.)

Both of these metrics take two vectors (your input sentence and your keyword list) and give you a number. If you do this across all of your categories, you can rank the results to see which category has the greatest similarity coefficient.
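For instance, here is a minimal Python sketch of the Jaccard index on word sets (the naive whitespace/punctuation tokenization is just a placeholder, not a recommendation):

    def jaccard(a, b):
        """Jaccard index of two sets: |intersection| / |union|, in [0, 1]."""
        a, b = set(a), set(b)
        if not a and not b:
            return 1.0  # convention: two empty sets are identical
        return len(a & b) / len(a | b)

    sentence = "The system shall apply deposits to a customer's specified account."
    keywords = {"deposits", "deposit", "customer", "account", "accounts"}
    words = {w.strip(".,").lower() for w in sentence.split()}

    print(jaccard(words, keywords))  # ~0.15: 2 shared words out of 13 in the union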

As an example:

From your question:

Customer Transactions: deposits, deposit, customer, account, accounts

So you could construct a vector with 5 elements: (1, 1, 1, 1, 1). This means that the "Customer Transactions" keyword list has 5 words, and (this will sound obvious, but) each of those words is present in the list. Stay with me.

So now you take your sentence:

The system shall apply deposits to a customer's specified account.

This has 3 words from the "Customer Transactions" set: {deposits, account, customer}

(Actually, this illustrates another nuance: the sentence contains "customer's". Is that equivalent to "customer"?)

The vector for your sentence might be (1, 0, 1, 1, 0)

The 1's in this vector are in the same position as the 1's in the first vector - because those words are the same.

So we could ask: how many times do these vectors differ? Let's compare:

(1, 1, 1, 1, 1)
(1, 0, 1, 1, 0)

Hm. They have the same "bit" 3 times - in the 1st, 3rd, and 4th positions. They only differ by 2 bits. So let's say that when we compare these two vectors, we have a "distance" of 2. Congrats, we just computed the Hamming distance! The lower your Hamming distance, the more "similar" the data.
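In code, that comparison is a one-liner (a tiny sketch using the two vectors above):

    def hamming(u, v):
        """Number of positions at which two equal-length vectors differ."""
        assert len(u) == len(v)
        return sum(a != b for a, b in zip(u, v))

    category = (1, 1, 1, 1, 1)
    sentence = (1, 0, 1, 1, 0)

    d = hamming(category, sentence)
    print(d)                      # 2
    print(1 - d / len(category))  # 0.6 - one way to normalize it into a similarity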

(The difference between a "similarity" measure and a "distance" measure is that the former is normalized - it gives you a value between 0 and 1. A distance is just any number, so it only gives you a relative value.)

Anyway, this might not be the best way to do natural language processing, but it is the simplest and might actually work pretty well for your application, or at least as a starting point.

(PS: "classification" - as you have in your title - would be answering the question "If you take my sentence, which category is it most likely to fall into?" Which is a bit different than saying "how much more similar is my sentence to category 1 than category 2?" which seems to be what you're after.)

good luck!


The main characteristics of the problem are:

  • Externally defined categorization criteria (keyword list)
  • Items to be classified (lines from the requirement document) are made of a relatively small number of attribute values, over effectively a single dimension: "keyword".
  • As defined, no feedback/calibration (although it may be appropriate to suggest some of that)

These characteristics bring both good and bad news: the implementation should be relatively straightforward, but a consistent level of accuracy of the categorization process may be hard to achieve. Also, the small amounts of various quantities (number of possible categories, max/average number of words in an item, etc.) should give us room to select solutions that may be CPU- and/or space-intensive, if need be.

Yet, even with this license to "go fancy", I suggest starting with (and staying close to) a simple algorithm, and expanding on this basis with a few additions and considerations, while remaining vigilant of the ever-present danger called overfitting.

Basic algorithm (conceptual, i.e. no focus on performance tricks at this time)

   Parameters:
     CatKWs = an array/hash of lists of strings.  Each list contains the possible
              keywords for a given category.
         usage: CatKWs[CustTx] = ('deposits', 'deposit', 'customer' ...)
     NbCats = integer number of pre-defined categories
   Variables:
      CatAccu = an array/hash of numeric values with one entry for each of the
                possible categories.  usage:  CatAccu[3] = 4 (if array) or
                 CatAccu['CustTx'] += 1  (if hash)
      TotalKwOccurences = counts the total number of keyword matches (a word is
       counted multiple times when it is found in several pre-defined categories)

    Pseudo code:  (for categorizing one input item)
       1. TotalKwOccurences = 0
          for x in 1 to NbCats
            CatAccu[x] = 0    // reset the accumulators
       2. for each word W in Item
             for each x in 1 to NbCats
                 if W found in CatKWs[x]
                      TotalKwOccurences++
                      CatAccu[x]++
       3. if TotalKwOccurences > 0   // avoid dividing by zero when nothing matched
             for each x in 1 to NbCats
                 CatAccu[x] = CatAccu[x] / TotalKwOccurences  // calculate rating
       4. Sort CatAccu by value
       5. Return the ordered list of (CategoryID, rating) pairs
              for all CatAccu[x] values above a given threshold.

Simple but plausible: we favor the categories that have the most matches, but we divide by the overall number of matches, as a way of lessening the confidence rating when many words were found. Note that this division does not affect the relative ranking of a category selection for a given item, but it may be significant when comparing the ratings of different items.
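Here is a minimal Python sketch of the above (the 'CustTx' keyword list comes from the question; the tokenization and the threshold handling are just illustrative assumptions):

    from collections import defaultdict

    # Hypothetical keyword table; extend with the rest of your categories.
    CAT_KWS = {
        "CustTx": {"deposits", "deposit", "customer", "account", "accounts"},
        # "Reporting": {...},
    }

    def categorize(item, cat_kws=CAT_KWS, threshold=0.0):
        """Return (category, rating) pairs, best rating first."""
        words = [w.strip(".,;:").lower() for w in item.split()]
        cat_accu = defaultdict(int)   # matches per category
        total = 0                     # total keyword matches across all categories
        for w in words:
            for cat, kws in cat_kws.items():
                if w in kws:
                    cat_accu[cat] += 1
                    total += 1
        if total == 0:
            return []
        results = [(cat, n / total) for cat, n in cat_accu.items()
                   if n / total > threshold]
        results.sort(key=lambda cr: cr[1], reverse=True)
        return results

    print(categorize("The system shall apply deposits to a customer's specified account."))
    # [('CustTx', 1.0)] - both matches ('deposits', 'account') fall in one category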

Now, several simple improvements come to mind (I'd seriously consider the first two, and give thought to the other ones; deciding on each of these is very much tied to the scope of the project, the statistical profile of the data to be categorized, and other factors...):

  • We should normalize the keywords read from the input items and/or match them in a fashion that is tolerant of misspellings. Since we have so few words to work with, we need to ensure we do not lose a significant one because of a silly typo. (A sketch covering this and the next bullet follows this list.)
  • We should give more importance to words that appear less frequently across CatKWs. For example, the word 'Account' should count for less than the word 'foo' or 'credit'.
  • We could (but maybe that won't be useful or even helpful) give more weight to the ratings of items that have fewer [non-noise] words.
  • We could also include consideration based on bigrams (two consecutive words), for in natural languages (and requirements documents are not quite natural :-) ) word proximity is often a stronger indicator than the words themselves.
  • We could add a tiny bit of importance to the category assigned to the preceding (or even following, with look-ahead logic) item. Items will likely come in related series, and we can benefit from this regularity.
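For the first two bullets, a hedged sketch (the possessive-stripping rule and the inverse-count weighting are illustrative choices, not the only reasonable ones):

    def normalize(word):
        """Lowercase, strip punctuation, and drop simple possessives
        ("customer's" -> "customer")."""
        w = word.strip(".,;:").lower()
        return w[:-2] if w.endswith("'s") else w

    def keyword_weights(cat_kws):
        """Weight each keyword by the inverse of the number of categories
        listing it, so widely shared words like 'account' count for less."""
        counts = {}
        for kws in cat_kws.values():
            for kw in kws:
                counts[kw] = counts.get(kw, 0) + 1
        return {kw: 1.0 / n for kw, n in counts.items()}

In the basic algorithm, you would then normalize each word before the lookup and add the word's weight to CatAccu[x] instead of incrementing by 1.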

Also, aside from the calculation of the rating per se, we should also consider:

  • some metrics that would be used to rate the algorithm outcome itself (tbd)
  • some logic to collect the list of words associated with an assigned category and to eventually run statistics on these. This may allow the identification of words that are representative of a category but were not initially listed in CatKWs.

The question of metrics should be considered early, but this would also require a reference set of input items: a "training set" of sorts, even though we are working off a pre-defined category-keyword dictionary (typically, training sets are used to determine this very list of category keywords, along with a weight factor). Of course, such a reference/training set should be both statistically significant and statistically representative [of the whole set].
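A minimal sketch of what such a metric could look like, assuming you hand-label a small reference set (the labeled pairs below are hypothetical, and categorize() is the sketch from earlier):

    def accuracy(labeled_items, categorize_fn):
        """Fraction of reference items whose top-rated category matches
        the hand-assigned label."""
        hits = 0
        for item, expected in labeled_items:
            ranked = categorize_fn(item)
            if ranked and ranked[0][0] == expected:
                hits += 1
        return hits / len(labeled_items) if labeled_items else 0.0

    reference = [
        ("The system shall apply deposits to a customer's specified account.", "CustTx"),
        # ... more hand-labeled lines from the requirements document
    ]
    print(accuracy(reference, categorize))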

To summarize: stick to simple approaches; the context doesn't leave much room to be very fancy anyway. Consider introducing a way of measuring the efficiency of particular algorithms (or of particular parameters within a given algorithm), but beware that such metrics may be flawed and prompt you to specialize the solution for a given set to the detriment of the other items (overfitting).