My first demo was how to build a minimal Pyrex module that has a Python statement and a C call.
Then I talked about using Pyrex to optimize a simple Python program.
#!/usr/bin/python from glob import glob import itertools class Bigrams: """original python version""" def __init__(self): self.pairs = {} def add(self,word): for i in range(len(word)-1): pair = word[i]+word[i+1] self.pairs[pair] = self.pairs.get(pair,0) + 1 def freqs(self): """get list of (pair,freq)""" return self.pairs.items() files = "/usr/include/*.h" b = Bigrams() for line in itertools.chain(*[file(x) for x in glob(files)]): for word in line.split(): b.add(word) freq_pair = [(v,k) for k,v in b.freqs()] freq_pair.sort() for fp in freq_pair[-8:]: print "%8d %s" % fpOn my box, I get this list of the most popular bigrams:
% cat /proc/cpuinfo | egrep "vendor|MHz" vendor_id : AuthenticAMD cpu MHz : 1400.081 vendor_id : AuthenticAMD cpu MHz : 1400.081 % time ./bigrams 12210 te 12486 fi 13053 nt 13267 ef 13780 er 13940 ne 17057 de 30403 in ./bigrams 4.19s user 0.07s system 92% cpu 4.604 total
cdef class Bigrams4: """pack results into a fixed array like a C programmer would""" cdef int pairs[65536] def __init__(self): for x in range(65536): self.pairs[x]=0 def add(self,word): cdef char *cword cdef unsigned short pairidx cword = word for i from 0 <= i < len(word)-1: pairidx = (cword[i]<<8) + cword[i+1] self.pairs[pairidx] = self.pairs[pairidx] + 1 def freqs(self): """get list of (pair,freq)""" letter_pairs = [] for x in range(65536): if self.pairs[x]>0: letter_pairs.append(("%s%s" % (chr(x >> 8), chr(x & 0xff)), self.pairs[x])) return letter_pairsNote 'cdef class', which means this is a new C type. Pyrex writes this C struct for the new type:
struct __pyx_obj_6bigram_Bigrams4 { PyObject_HEAD int (pairs[65536]); };My 'self.pairs' code, formerly an attribute lookup, is now a C struct member access:
Pyrex source: | self.pairs[pairidx] = self.pairs[pairidx] + 1 |
C emitted from the Pyrex compiler: |
/* "/home/drewp/projects/digrams/bigram.pyx":51 */ (((struct __pyx_obj_6bigram_Bigrams4 *)__pyx_v_self)->pairs[__pyx_v_pairidx]) = ((((struct __pyx_obj_6bigram_Bigrams4 *)__pyx_v_self)->pairs[__pyx_v_pairidx]) + 1); |
This runs in 1.2 sec, which I felt was good enough for my demo. Note that my rewritten algorithm could also be ported to plain Python. That version takes 3.7 sec-- the C datatypes made the big difference here.
Then, try "time bigrams" after setting line 29 to the class version you want to run. Edit line 25 to adjust which files that are scanned. The file "bigrams5" contains the fixed-array version written in Python, which I didn't want to have cluttering the original bigrams program.
These source files also contain a demo of calling uuid_generate from libuuid. Remove all the uuid code if you're on a platform that doesn't have that lib, I guess.