Do you have computational problems that take hours, if not days, to solve? You can often distribute your work over a cluster or cloud of computers to solve the problem in only minutes.
This tutorial will teach various ways to distribute Python-based computation. Tools covered include Hadoop, Google AppEngine, Oracle (Sun) Grid Engine, PiCloud, and Elastic MapReduce.
Attendees should bring a laptop with Python 2.x (x>=6) installed, as the tutorial is example-based.
Intermediate-level Python programmers. While no familiarity with distributed computing is assumed, attendees should be very comfortable reading Python code. Familiarity with scientific programming (e.g., NumPy, SciPy) helps but is not required.
Class size: ideally 20, up to 30.
* Introduction to distributed computing
o Types of parallelizable problems
* Low-level primitives
* Job processing on own cluster
o Oracle (Sun) Grid Engine
* Cloud Computing Solutions
o Google AppEngine
* MapReduce for large data
o Hadoop (using the dumbo library for Python)
o Elastic MapReduce (overview only)
* Benchmarks showing how different problems perform on each system
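To give a flavor of the "low-level primitives" topic above, here is a minimal sketch of the core pattern the tutorial builds on: splitting independent work items across processes and collecting the results. The function and variable names are illustrative, not taken from the tutorial materials; the same map-style pattern is what the cluster and cloud tools scale out across many machines.

```python
from multiprocessing import Pool

def slow_square(n):
    # Stand-in for an expensive, independent computation
    # (e.g., training one SVM fold or processing one data chunk).
    return n * n

def parallel_map(func, items, workers=4):
    # Distribute independent work items across local worker processes.
    # Written without the Pool context manager so it runs on Python 2.6+.
    pool = Pool(workers)
    try:
        return pool.map(func, items)
    finally:
        pool.close()
        pool.join()

if __name__ == "__main__":
    print(parallel_map(slow_square, range(8)))
```

Because the work items are independent ("embarrassingly parallel"), swapping the local `Pool` for a cluster- or cloud-backed executor changes where the work runs without changing the program's structure.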
Examples used in the presentation include:
* Parallelizing Support Vector Machine training (Python’s libsvm wrapper) across a hundred nodes
* Determining features in brain waves using NumPy and distributed computing
* Using NumPy, SciPy, and lots of computers for analyzing data from human cells.
* The classic MapReduce distributed grep.
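For the last example, the classic distributed grep can be sketched without a Hadoop cluster: the mapper emits every input line that matches a pattern, and no reduce-side combining is needed. This is a hedged, Hadoop-free illustration of the idea — the function names and the local driver are invented for this sketch; on Hadoop (e.g., via dumbo), the key passed to the mapper would be the line's byte offset in the input file.

```python
import re

def grep_mapper(key, line, pattern):
    # Emit the line if it matches; non-matching lines produce no output.
    # 'key' models the byte offset Hadoop would supply for each line.
    if re.search(pattern, line):
        yield line

def run_local_grep(lines, pattern):
    # Minimal local stand-in for the MapReduce run: apply the mapper to
    # each (offset, line) pair and gather the emitted matches in order.
    matches = []
    for offset, line in enumerate(lines):
        for match in grep_mapper(offset, line, pattern):
            matches.append(match)
    return matches
```

Distributed grep is a useful benchmark case because the map phase does all the work; it isolates the framework's input-splitting and scheduling overhead from any reduce-side cost.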