python+zope++:: Smarter apps with reverend
Posted at 26.Feb,2009 21:55

Wouldn't it be great if we could classify or categorize articles automatically? We've been doing this ever since spam became a standard feature of online life. Yup, maybe we can use Bayesian probability to guess an article's inclination.

At work, we've got lots of articles tagged or categorized manually so that we can generate related articles. If we can prepare a corpus of articles for each inclination, train on it, and then guess the inclination of newer articles from that data, maybe we can do this automatically. Granted, it won't be 100% correct, but it's a start.

That's where the reverend Python package comes in: it lets us put Bayesian smarts in our apps.
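
To make the idea concrete, here's a minimal sketch of a Bayesian text classifier in plain Python. This is NOT how reverend works internally (its method list mentions the robinson and robinsonFisher combiners); it's a textbook naive Bayes with add-one smoothing, and the class name TinyBayes and the sample sentences are made up for illustration. Only the train/guess shape mirrors what reverend offers.

```python
import math
import re
from collections import Counter, defaultdict


class TinyBayes(object):
    """Toy multinomial naive Bayes classifier (hypothetical, illustration only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of training docs

    @staticmethod
    def tokenize(text):
        # Lowercase the text and keep runs of ASCII letters only.
        return re.findall(r"[a-z]+", text.lower())

    def train(self, category, text):
        self.word_counts[category].update(self.tokenize(text))
        self.doc_counts[category] += 1

    def guess(self, text):
        """Return (category, probability) pairs, most likely first."""
        if not self.doc_counts:
            return []
        words = self.tokenize(text)
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_docs = float(sum(self.doc_counts.values()))
        log_scores = {}
        for cat, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # log prior + sum of log likelihoods, with add-one smoothing
            score = math.log(self.doc_counts[cat] / total_docs)
            for w in words:
                score += math.log((counts[w] + 1.0) / (total_words + len(vocab)))
            log_scores[cat] = score
        # Normalise so the probabilities sum to 1.
        top = max(log_scores.values())
        weights = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(weights.values())
        return sorted(((c, v / norm) for c, v in weights.items()),
                      key=lambda pair: pair[1], reverse=True)


clf = TinyBayes()
clf.train('sport', 'the striker scored a late goal in the football match')
clf.train('politics', 'parliament debated the election and the new constitution')
result = clf.guess('the football match ended with a goal')
print(result)
```

With only two tiny training sentences, the 'sport' category should come out on top for the test sentence, because it shares more words with the sport sample.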

I also looked at nltk, which is the right tool, methinks, provided that I can wrap my brain around it :P. So, while waiting for that to happen, I might as well use some of its methods together with reverend.

Here's what I did:

  • Prepare the corpora from my zope site. This is easily done with search. To be specific, we can use the AND and OR operators, with parentheses thrown in. For example, to classify articles with a politics inclination, I searched for:
       bn OR pr OR umno OR pas OR dap OR mic OR mca OR pkr #matches articles mentioning any of those political parties.
       football AND (johor OR selangor) #matches articles containing football together with johor or selangor
       Note: the capitalization of OR/AND matters.
  • Use wget to get all the articles.
  • Start training on the corpora. Here's the quick script I used:
     import os
     import stripogram
     import nltk
     from nltk.corpus import stopwords
     from reverend.thomas import Bayes

     DATAFILE = '/tmp/base.bayes'  # the same file the test session loads
     STOPWORDS = set(stopwords.words('english'))

     def convertText(f):
        """convert html file object to text, and take out all stopwords"""
        x = stripogram.html2text(f.read(), ignore_tags=('img',))
        # the stopword list is lowercase, so compare lowercased tokens
        return ' '.join([i for i in nltk.word_tokenize(x) if i.lower() not in STOPWORDS])

     if __name__ == "__main__":
        g = Bayes()
        if os.path.exists(DATAFILE):
            g.load(DATAFILE)  # keep the categories trained in earlier runs
        for myfile in os.listdir('.'):
            if not os.path.isfile(myfile):
                continue
            f = open(myfile)
            ctext = convertText(f)
            f.close()
            print(ctext)
            g.train('sport', ctext)  #the category for this run, manually inserted.
                                     # this is a quick script :P
        print('saving...')
        g.save(DATAFILE)
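
The script above hard-codes one category per run. Here's a sketch of a variant that trains every category in one pass. It assumes a layout I'm making up for illustration: one sub-directory per category (say corpora/sport/, corpora/politics/) holding the fetched articles. The function name train_from_dirs and its parameters are hypothetical; the trainer can be any object with a train(pool, text) method, which is the call the script above makes on reverend's Bayes.

```python
import os


def train_from_dirs(root, trainer, to_text):
    """Train `trainer` on every file found under root/<category>/.

    root    -- directory with one sub-directory per category (assumed layout)
    trainer -- any object with a train(pool, text) method
    to_text -- callable that turns an open file object into plain text
    Returns a dict mapping each category to the number of files trained.
    """
    counts = {}
    for category in sorted(os.listdir(root)):
        cat_dir = os.path.join(root, category)
        if not os.path.isdir(cat_dir):
            continue  # ignore stray files at the top level
        counts[category] = 0
        for name in sorted(os.listdir(cat_dir)):
            path = os.path.join(cat_dir, name)
            if not os.path.isfile(path):
                continue
            with open(path) as f:
                trainer.train(category, to_text(f))
            counts[category] += 1
    return counts
```

With the script above, the call would look like train_from_dirs('corpora', g, convertText), followed by saving the trained data as before. The returned counts also make it easy to spot a category with far fewer samples than the others.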

Once we've trained on all our corpora, giving each the appropriate category, we can start testing.

 In [1]: from reverend.thomas import Bayes
 In [2]: g=Bayes()
 In [3]: g.load('/tmp/base.bayes')
 In [14]: g.
 g.__class__         g.__repr__          g.dirty             g.removePool
 g.__delattr__       g.__setattr__       g.getProbs          g.renamePool
 g.__dict__          g.__str__           g.getTokens         g.robinson  
 g.__doc__           g.__weakref__       g.guess             g.robinsonFisher
 g.__getattribute__  g._tokenizer        g.load              g.save          
 g.__hash__          g._train            g.mergePools        g.train         
 g.__init__          g._untrain          g.newPool           g.trainCount    
 g.__len__           g.buildCache        g.poolData          g.trainedOn     
 g.__module__        g.combiner          g.poolNames         g.untrain       
 g.__new__           g.commit            g.poolProbs                         
 g.__reduce__        g.corpus            g.poolTokens                        
 g.__reduce_ex__     g.dataClass         g.pools 
 In [20]: testf="""KUALA LUMPUR, Mon:
   ....:     Yang di-Pertuan Agong Tuanku Mizan Zainal Abidin today reminded the people to respect the federal constitution and expressed the hope that there will be no attempts to create laws that contravene it.                                                                                        
   ....:     He said the history of the country's independence and the federal constitution must be explained to the young so that they would have a better understanding of the basis for the formation of the country.                                                                                    
   ....:     "The young generation is the country's back-up and hope for the future. The principles of the Rukun Negara must be understood and appreciated by all strata of the society," he said when opening the Dewan Rakyat sitting here.                                                               
   ....:     Tuanku Mizan expressed regret that despite the country having gained independence for 51 years, certain parties were still raising narrow racial issues for public debate.                 
   ....:     "I want to stress that my government will not hesitate to take action against anyone who tries to disunite the people, to ensure that racial harmony and peace in the country are maintained," he said to applause from the members of Parliament. """                                         

 In [21]: g.guess(testf)
 [('politics', 0.33878603481067071),
 ('national', 0.15704945244227841),
 ('sport', 0.11483068763601062)]   

 In [22]: testf="""KANGAR, Sun:
   ....:     A senior police officer was shocked when she found an infant inside a cupboard drawer at her home in Taman Fauziah near here today.                                                        
   ....:     The girl had her umbilical cord intact.                                                
   ....:     The female officer, who declined to be identified, said she found the baby in her Indonesian maid’s room when she whent to investigate the source of incessant crying about 9.30am.        
   ....:     When questioned, the 20-year-old maid claimed that she found the baby outside on Friday, while taking out the garbage.                                                                     
   ....:     The maid told the officer that she mistook a bundle of clothes for garbage before realising the baby was wrapped in it.                                                                    
   ....:     The officer has lodged a police report at the Kangar police station.                   
   ....:     Kangar police chief Supt Yusof Mohd Diah said the baby was sent to the Tuanku Fauziah Hospital for examination as she has not been fed milk for several days.                              
   ....:     He said the maid would also be sent to the hospital to determine whether she was the baby’s mother, adding that the case would be investigated under Section 317 of the Penal Code for exposure and abandonment of a child under 12 years. —  BERNAMA """                                       

 In [23]: g.guess(testf)
 [('national', 0.46357716353401907),
 ('politics', 0.34181058354823524),
 ('sport', 0.28431728232132303)]   
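
The scores in these results don't add up to 1 (the three in the second run sum to more than 1), and the top two can sit fairly close together. So before auto-tagging an article it may be worth checking that the best guess is both strong enough and clearly ahead of the runner-up. Here's a small sketch of that check. The function name pick_category and the two thresholds are my own assumptions, not part of reverend, and the cut-offs would need tuning on real data.

```python
def pick_category(guesses, min_score=0.3, min_margin=0.1):
    """Return the top category, or None if the guess is weak or ambiguous.

    guesses    -- list of (category, score) pairs, like the output above
    min_score  -- hypothetical minimum score for the best guess
    min_margin -- hypothetical minimum lead over the second-best guess
    """
    ranked = sorted(guesses, key=lambda pair: pair[1], reverse=True)
    if not ranked or ranked[0][1] < min_score:
        return None
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < min_margin:
        return None
    return ranked[0][0]


# The two results from the session above.
first = [('politics', 0.33878603481067071),
         ('national', 0.15704945244227841),
         ('sport', 0.11483068763601062)]
second = [('national', 0.46357716353401907),
          ('politics', 0.34181058354823524),
          ('sport', 0.28431728232132303)]
print(pick_category(first))
print(pick_category(second))
```

With these default thresholds both results pass. A stricter min_margin of 0.15 would reject the second one, where 'national' leads 'politics' by only about 0.12.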

One thing I discovered is that the training samples should be evenly distributed across the categories; otherwise the results will be skewed a bit.
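
One simple way to even things out before training is to downsample every category to the size of the smallest one. This is just a sketch under assumptions: the function name balance and the dict-of-lists structure are made up for illustration, and throwing away samples is only one option (collecting more articles for the thin categories is another).

```python
import random


def balance(samples, seed=0):
    """Downsample every category to the size of the smallest one.

    samples -- dict mapping category -> list of documents (hypothetical structure)
    seed    -- fixed seed so the same subset is picked on every run
    """
    if not samples:
        return {}
    rng = random.Random(seed)
    smallest = min(len(docs) for docs in samples.values())
    return dict((cat, rng.sample(docs, smallest))
                for cat, docs in samples.items())


corpus = {
    'sport': ['s1.html', 's2.html', 's3.html', 's4.html', 's5.html'],
    'politics': ['p1.html', 'p2.html'],
    'national': ['n1.html', 'n2.html', 'n3.html'],
}
balanced = balance(corpus)
print(dict((cat, len(docs)) for cat, docs in balanced.items()))
```

After this, every category holds two documents, matching the smallest category ('politics').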
