Wikipedia processing. PyPy vs CPython benchmark

Comments:"Wikipedia processing. PyPy vs CPython benchmark — Robert Zaremba Scale it blog"

URL:http://rz.scale-it.pl/2013/02/18/wikipedia_processing._PyPy_vs_CPython_benchmark.html

Lately I’ve done some data mining tasks on Wikipedia. It consist of:

processing enwiki-pages-articles.xml Wikipedia dump
storing pages and categories into mongodb
using redis for mapping category titles

I made a benchmark on a real tasks for CPython 2.7.3 and PyPy 2b. Libraries I used:

redis 2.7.2
pymongo 2.4.2

Furthermore CPython was supported by:

hiredis
pymongo c-extensions

The benchmark mostly involve databases processing so I fought I won’t have huge PyPy benefit (since CPython drivers are supported by C-extensions).

Below I will describe some interesting results

Filtering categories from enwiki.xml

To facilitate work with categories I needed to filter categories from enwiki-pages-articles.xml and store them back with the same xml format. For this task I used SAX parser, which in both PyPy and CPython is a wrapper around expat parser. expat is native complied package (in both PyPy and CPython).

The code is quiet simple:

classWikiCategoryHandler(handler.ContentHandler):"""Class which detecs category pages and stores them separately """ignored=set(('contributor','comment','meta'))def__init__(self,f_out):handler.ContentHandler.__init__(self)self.f_out=f_outself.curr_page=Noneself.curr_tag=''self.curr_elem=Element('root',{})self.root=self.curr_elemself.stack=Stack()self.stack.push(self.curr_elem)self.skip=0defstartElement(self,name,attrs):ifself.skip>0ornameinself.ignored:self.skip+=1returnself.curr_tag=nameelem=Element(name,attrs)ifname=='page':elem.ns=-1self.curr_page=elemelse:# we don't want to keep old pages in memoryself.curr_elem.append(elem)self.stack.push(elem)self.curr_elem=elemdefendElement(self,name):ifself.skip>0:self.skip-=1returnifname=='page':self.task()self.curr_page=Noneself.stack.pop()self.curr_elem=self.stack.top()self.curr_tag=self.curr_elem.tagdefcharacters(self,content):ifcontent.isspace():returnifself.skip==0:self.curr_elem.append(TextElement(content))ifself.curr_tag=='ns':self.curr_page.ns=int(content)defstartDocument(self):self.f_out.write("<root>\n")defendDocument(self):self.f_out.write("<\root>\n")print("FINISH PROCESSING WIKIPEDIA")deftask(self):ifself.curr_page.ns==14:self.f_out.write(self.curr_page.render())classElement(object):def__init__(self,tag,attrs):self.tag=tagself.attrs=attrsself.childrens=[]self.append=self.childrens.appenddef__repr__(self):return"Element {}".format(self.tag)defrender(self,margin=0):ifnotself.childrens:returnu"{0}<{1}{2} />".format(" "*margin,self.tag,"".join([' {}="{}"'.format(k,v)fork,vin{}.iteritems()]))ifisinstance(self.childrens[0],TextElement)andlen(self.childrens)==1:returnu"{0}<{1}{2}>{3}</{1}>".format(" "*margin,self.tag,"".join([u' {}="{}"'.format(k,v)fork,vin{}.iteritems()]),self.childrens[0].render())returnu"{0}<{1}{2}>\n{3}\n{0}</{1}>".format(" "*margin,self.tag,"".join([u' {}="{}"'.format(k,v)fork,vin{}.iteritems()]),"\n".join((c.render(margin+2)forcinself.childrens)))classTextElement(object):def__init__(self,content):self.content=contentdef__repr__(self):return"TextElement"defrender(self,margin=0):returnself.content

The Element and TextElement objects holds information about element tag and body, and provides a method to render it.

Here I expect similar result for both PyPy and CPython.

/ time PyPy 2169.90s CPython 4494.69s

I’m positively surprised with PyPy result.

Computing interesting categories set

I wanted to compute a interesting categories set - which, in my use case, consist of categories derived from Computing category. For this I needed to construct a category graph which will provide category - sub categories relation.

Construction category - sub categories relation

This task uses data from mongodb and redis constructed before. The algorithms is:

for each category.id in redis_categories (it holds *category.id -> category title mapping*) do:
 title = redis_categories.get(category.id)
 parent_categories = mongodb get categories for title
 for each parent_cat in parent categories do:
 redis_tree.sadd(parent_cat, title) # add to parent_cat set title

Sorry for this pseudo code, but I think it looks more compact.

So this task is just copying data between databases. The result here are made after mongodb warming up (without it the results are biased because mongodb latency - the python job uses only 10% CPU) The timing is:

/ time PyPy 175.11s user 66.11s system 64% cpu CPython 457.92s user 72.86s system 81% cpu

Another points to PyPy.

Traversing redis_tree

If we have redis_tree database, then only thing left is to traverse all nodes which are achievable from Computing category. To preserve falling into cycle we need to remember which categories we visited. Since I want to test Python for database tasks I used redis set field for this.

/ time PyPy 14.79s user 6.22s system 69% cpu 30.322 total CPython 44.20s user 13.86s system 71% cpu 1:20.91 total

To be honest this task also requires constructing some tabu list - to prevent jumping into unwanted categories. But this is not a point of this article.

Conclusions

Presented tasks are only introduction for my final work. It requires a knowledge base which I got by extracting appropriate articles from Wikipedia.

PyPy gives me 2-3 times performance boost compared to CPython for simple data bases operations (I’m not counting sql parser here, which is almost 8x).

Thanks PyPy my work was more pleasant - I got Python productivity without frustrating with waiting for results to correct my algorithms. Moreover PyPy doesn’t kill my CPU as CPython does so in a meantime I could normally use my laptop (check % CPU time usage).

The tasks was mostly database manipulation, and CPython has some speed-up from developers contribute into dirty C-extensions. PyPy doesn’t use them and at the end is faster!

All my work required a lot of cycles so I’m really happy with PyPy.

Wikipedia processing. PyPy vs CPython benchmark — Robert Zaremba Scale it blog

Filtering categories from enwiki.xml

Computing interesting categories set

Construction category - sub categories relation

Traversing redis_tree

Conclusions

Trending Articles

Sarah Samis, Emil Bove III

The Young & The Wealthy: Crimetown Podcast Points The Spotlight On Detroit’s...

Bureau of Internal Revenue: Regional Offices (Directory)

Practice Sheet of Pronoun References for HSC Students

Re: XXX esx.problem.hyperthreading.unmitigated.formatOnHost not found XXX...

Blackstone — Befi Mano (Throw Back Thursday)

Muloraki Au

Step by step MIM PAM setup and evaluation Guide – Part 2

NARSAPUR Mandal Sarpanch | Upa-Sarpanch | Ward member Mobile Numbers Medak...

Custom TAB in ML81N (Entry Sheet) Header

Efendi – Cleopatra – Single [iTunes Plus M4A]

VMOU RSCIT Result 2017, RSCIT Result VMOU rkcl.vmou.ac.in Name Wise

Black Angus Grilled Artichokes

Arrest logs for Wednesday, March 20, 2019

【個人撮影】スマホのプライベート映像♪「中に出さないで///」カラオケ屋での生ハメ撮りが流出ｗ【リベンジポルノ】＠PornHub

ZARIA CUMMINGS

99 Rain Status for Whatsapp - Best Rain Dp Collection

'Kamikaze' killer driver Ian Milligan loses appeal

Burnet County Jail bookings for the period of Nov. 27-Dec. 5

【急ぎ】LabVIEWのインストール中のエラー