Using numpy efficiently often means running broadcasting operations on large matrices, for example to calculate a difference over all possible combinations of feature vectors and then take the minimum.
This kind of thing is very CPU efficient compared to looping and calculating in Python, as all the looping is handled by numpy's internal libraries, but it can be memory inefficient because the whole broadcast intermediate has to sit in memory at once (I keep running out of RAM on our server, which has 32 GB of the stuff...). As a monolithic operation it also doesn't make use of multiple threads (the server has 16 cores, so I want to use them all).
Anyway, I discovered the following usage pattern for easily splitting a big matrix operation into smaller chunks, so that it fits into memory, and for parallelising the operation (being able to do this is one of the big benefits of having all the heavy lifting done in numpy's C code rather than in GIL-encumbered Python).
First off you decide how you are going to split things up. So something like this:
cs = 10
This means I am going to split things up into chunks of 10 rows for processing (the slice A[x:x+cs] used below runs along the first axis of A).
Then define a functor or lambda expression that operates on the data you want to process, taking a start index as the parameter:
f = lambda x : log(A[x:x+cs] + B[newaxis])**2
Note that this bakes A, B and cs into the expression. You could wrap things up in a functor and then specify A, B and cs in your final expression, but I can't be bothered.
Next, you run parallel_map on your function and stitch together the results:
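For anyone who can be bothered, here is a minimal sketch of the wrapped-up version. It uses a plain closure rather than a class, make_chunk_func is a made-up name, and it assumes A and B are 2-D arrays with one feature vector per row, so the explicit newaxis on A is my assumption about the shapes rather than something taken from the expression above.

```python
import numpy as np

def make_chunk_func(A, B, cs):
    """Hypothetical helper: return a function that processes rows x..x+cs of A against all of B."""
    def chunk(x):
        # (cs, 1, d) + (1, m, d) broadcasts to (cs, m, d)
        return np.log(A[x:x + cs, np.newaxis] + B[np.newaxis]) ** 2
    return chunk

A = np.ones((6, 3))
B = np.ones((4, 3))
f = make_chunk_func(A, B, cs=2)
print(f(0).shape)  # (2, 4, 3)
```

The nice thing about this form is that A, B and cs are passed in explicitly, so the same helper can be reused for different arrays and chunk sizes.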
nt = 10
res = vstack(parallel_map(f,range(0,len(A),cs), nt))
This would split it over 10 threads. If you just want the parallelism speed-up, set cs = len(A)//nt (integer division, since in Python 3 a plain / gives a float, which can't be used as a slice bound; round up if len(A) doesn't divide evenly). If you want both memory and speed optimisation, set nt and cs independently. If (for memory reasons) you want to subdivide your problem further, you can use map within your lambda function to split up the indexing of B.
Finally, vstack will do something weird to your results if you are iterating over a single item each time, so you may need to do something like this:
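Putting the pieces together, a rough end-to-end sketch might look like the following. The parallel_map defined here is only a stand-in built on the standard library's ThreadPoolExecutor, so its signature is an assumption and not necessarily the one used above. The array shapes (2-D A and B, one feature vector per row) and the explicit newaxis on A are also assumptions made for the example.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_map(func, items, n_threads):
    """Stand-in (assumption): apply func to each item on a thread pool, keeping order."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(func, items))

rng = np.random.default_rng(0)
A = rng.random((25, 3)) + 1.0  # 25 feature vectors of length 3
B = rng.random((7, 3)) + 1.0   # 7 feature vectors of length 3

nt = 4
cs = -(-len(A) // nt)  # ceiling division, so there are at most nt chunks

def f(x):
    # (cs, 1, 3) + (1, 7, 3) broadcasts to (cs, 7, 3)
    return np.log(A[x:x + cs, np.newaxis] + B[np.newaxis]) ** 2

res = np.vstack(parallel_map(f, range(0, len(A), cs), nt))

# Check against the monolithic version
full = np.log(A[:, np.newaxis] + B[np.newaxis]) ** 2
print(res.shape)                # (25, 7, 3)
print(np.allclose(res, full))   # True
```

The last chunk is shorter than the others when len(A) isn't a multiple of cs, which is fine because slicing past the end of an array just truncates.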
f = lambda x : [log(A[x] + B[newaxis])**2]
so you get a bunch of single-entry lists, which vstack will stack correctly.
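To see the difference, here is a small sketch. It assumes each per-item result is a 2-D array of shape (m, d), which may not match your shapes exactly.

```python
import numpy as np

A = np.ones((5, 3))
B = np.ones((4, 3))

# Without the wrapper, the per-item (4, 3) results are joined along their first axis
flat = np.vstack([np.log(A[x] + B) ** 2 for x in range(len(A))])

# With the single-entry list, each result becomes (1, 4, 3) before stacking
stacked = np.vstack([[np.log(A[x] + B) ** 2] for x in range(len(A))])

print(flat.shape)     # (20, 3)
print(stacked.shape)  # (5, 4, 3)
```

In the first case the boundary between items is lost. In the second you keep one entry per row of A.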
So there you go - happy and efficient processing!