My first demo was how to build a minimal Pyrex module that has a Python statement and a C call.
Then I talked about using Pyrex to optimize a simple Python program.
#!/usr/bin/python
from glob import glob
import itertools
class Bigrams:
"""original python version"""
def __init__(self):
self.pairs = {}
def add(self,word):
for i in range(len(word)-1):
pair = word[i]+word[i+1]
self.pairs[pair] = self.pairs.get(pair,0) + 1
def freqs(self):
"""get list of (pair,freq)"""
return self.pairs.items()
files = "/usr/include/*.h"
b = Bigrams()
for line in itertools.chain(*[file(x) for x in glob(files)]):
for word in line.split():
b.add(word)
freq_pair = [(v,k) for k,v in b.freqs()]
freq_pair.sort()
for fp in freq_pair[-8:]:
print "%8d %s" % fp
On my box, I get this list of the most popular bigrams:
% cat /proc/cpuinfo | egrep "vendor|MHz" vendor_id : AuthenticAMD cpu MHz : 1400.081 vendor_id : AuthenticAMD cpu MHz : 1400.081 % time ./bigrams 12210 te 12486 fi 13053 nt 13267 ef 13780 er 13940 ne 17057 de 30403 in ./bigrams 4.19s user 0.07s system 92% cpu 4.604 total
cdef class Bigrams4:
"""pack results into a fixed array like a C programmer would"""
cdef int pairs[65536]
def __init__(self):
for x in range(65536):
self.pairs[x]=0
def add(self,word):
cdef char *cword
cdef unsigned short pairidx
cword = word
for i from 0 <= i < len(word)-1:
pairidx = (cword[i]<<8) + cword[i+1]
self.pairs[pairidx] = self.pairs[pairidx] + 1
def freqs(self):
"""get list of (pair,freq)"""
letter_pairs = []
for x in range(65536):
if self.pairs[x]>0:
letter_pairs.append(("%s%s" % (chr(x >> 8), chr(x & 0xff)),
self.pairs[x]))
return letter_pairs
Note 'cdef class', which means this is a new C type. Pyrex writes this
C struct for the new type:
struct __pyx_obj_6bigram_Bigrams4 {
PyObject_HEAD
int (pairs[65536]);
};
My 'self.pairs' code, formerly an attribute lookup, is now a C struct
member access:
| Pyrex source: | self.pairs[pairidx] = self.pairs[pairidx] + 1 |
| C emitted from the Pyrex compiler: |
/* "/home/drewp/projects/digrams/bigram.pyx":51 */
(((struct __pyx_obj_6bigram_Bigrams4 *)__pyx_v_self)->pairs[__pyx_v_pairidx]) = ((((struct __pyx_obj_6bigram_Bigrams4 *)__pyx_v_self)->pairs[__pyx_v_pairidx]) + 1);
|
This runs in 1.2 sec, which I felt was good enough for my demo. Note that my rewritten algorithm could also be ported to plain Python. That version takes 3.7 sec-- the C datatypes made the big difference here.
Then, try "time bigrams" after setting line 29 to the class version you want to run. Edit line 25 to adjust which files that are scanned. The file "bigrams5" contains the fixed-array version written in Python, which I didn't want to have cluttering the original bigrams program.
These source files also contain a demo of calling uuid_generate from libuuid. Remove all the uuid code if you're on a platform that doesn't have that lib, I guess.