Detecting programming language from a snippet

Language detection solved by others:

Ohloh's approach: https://github.com/blackducksw/ohcount/

Github's approach: https://github.com/github/linguist


I think that the method used in spam filters would work very well. You split the snippet into words. Then you compare the occurences of these words with known snippets, and compute the probability that this snippet is written in language X for every language you're interested in.

http://en.wikipedia.org/wiki/Bayesian_spam_filtering

If you have the basic mechanism then it's very easy to add new languages: just train the detector with a few snippets in the new language (you could feed it an open source project). This way it learns that "System" is likely to appear in C# snippets and "puts" in Ruby snippets.

I've actually used this method to add language detection to code snippets for forum software. It worked 100% of the time, except in ambiguous cases:

print "Hello"

Let me find the code.

I couldn't find the code so I made a new one. It's a bit simplistic but it works for my tests. Currently if you feed it much more Python code than Ruby code it's likely to say that this code:

def foo
   puts "hi"
end

is Python code (although it really is Ruby). This is because Python has a def keyword too. So if it has seen 1000x def in Python and 100x def in Ruby then it may still say Python even though puts and end is Ruby-specific. You could fix this by keeping track of the words seen per language and dividing by that somewhere (or by feeding it equal amounts of code in each language).

I hope it helps you:

class Classifier
  def initialize
    @data = {}
    @totals = Hash.new(1)
  end

  def words(code)
    code.split(/[^a-z]/).reject{|w| w.empty?}
  end

  def train(code,lang)
    @totals[lang] += 1
    @data[lang] ||= Hash.new(1)
    words(code).each {|w| @data[lang][w] += 1 }
  end

  def classify(code)
    ws = words(code)
    @data.keys.max_by do |lang|
      # We really want to multiply here but I use logs 
      # to avoid floating point underflow
      # (adding logs is equivalent to multiplication)
      Math.log(@totals[lang]) +
      ws.map{|w| Math.log(@data[lang][w])}.reduce(:+)
    end
  end
end

# Example usage

c = Classifier.new

# Train from files
c.train(open("code.rb").read, :ruby)
c.train(open("code.py").read, :python)
c.train(open("code.cs").read, :csharp)

# Test it on another file
c.classify(open("code2.py").read) # => :python (hopefully)

Guesslang is a possible solution:

http://guesslang.readthedocs.io/en/latest/index.html

There's also SourceClassifier:

https://github.com/chrislo/sourceclassifier/tree/master

I became interested in this problem after finding some code in a blog article which I couldn't identify. Adding this answer since this question was the first search hit for "identify programming language".