Creating an ARFF file from python output
I know it's pretty easy to generate an arff file on your own, but I still wanted to make it simpler so I wrote a python package
https://github.com/ubershmekel/arff
It's also on pypi so easy_install arff
There are details on the ARFF file format here and it's very simple to generate. For example, using a cut-down version of your Python dictionary, the following script:
import re
d = { 'gardai-plan-crackdown-on-troublemakers-at-protest-2438316.html':
{'dail': 1,
'focus': 1,
'actions': 1,
'trade': 2,
'protest': 1,
'identify': 1 }}
for original_filename in d.keys():
m = re.search('^(.*)\.html$',original_filename,)
if not m:
print "Ignoring the file:", original_filename
continue
output_filename = m.group(1)+'.arff'
with open(output_filename,"w") as fp:
fp.write('''@RELATION wordcounts
@ATTRIBUTE word string
@ATTRIBUTE count numeric
@DATA
''')
for word_and_count in d[original_filename].items():
fp.write("%s,%d\n" % word_and_count)
Generates output of the form:
@RELATION wordcounts
@ATTRIBUTE word string
@ATTRIBUTE count numeric
@DATA
dail,1
focus,1
actions,1
trade,2
protest,1
identify,1
... in a file called gardai-plan-crackdown-on-troublemakers-at-protest-2438316.arff
. If that's not exactly what you want, I'm sure you can easily alter it. (For example, if the "words" might have spaces or other punctuation in them, you probably want to quote them.)