PyEMD: Earth Mover's Distance for Python (and MATLAB) ===================================================== by Gary Doran () Overview -------- An efficient, accurate, easy-to-use EMD implementation in C with Python wrapper. New bonus MATLAB wrapper also included. Installation ------------ *Python:* this package can be installed in two ways (the easy way): # If needed: # pip install numpy # pip install scipy pip install -e git+https://github.com/garydoranjr/pyemd.git#egg=pyemd or by running the setup file manually git clone [the url for pyemd] cd pyemd python setup.py install Note the code requires Python 2 (Python 3 is not supported) and depends on the `numpy` and `scipy` packages. So have those installed first. The build will likely fail if it can't find them. For more information, see: + [NumPy](http://www.numpy.org/): Library for efficient matrix math in Python + [SciPy](http://www.scipy.org/): Library for more MATLAB-like functionality *MATLAB:* clone the repository and `cd` to the `matlab` subdirectory. Either set the `MATLABDIR` environment variable, or edit the first line of the Makefile to set the path to the desired MATLAB installation, and then run `make`. After the MEX file has compiled, add the `matlab` subdirectory to the MATLAB path (e.g., by using the `addpath` command in MATLAB). Why? ---- Several Python wrappers for C-based EMD implementations already exist, so why is another one necessary? There are two popular alternative approaches, each with their limitations: - *Solve a LP with GLPK:* Since the transportation problem is a special case of the general LP formulation, it can be solved more efficiently and with much less use of memory. This implementation is approximately 7-8 times faster than a GLPK-based solution, and uses about 500 MB of memory when the GLPK-based solution uses over 10 GB before it crashes my machine. These figures are from problems in which samples of size ~1000 with ~100 features are compared using the EMD for the multiple-instance learning problems I study. - *Wrap Yossi Rubner's Implementation:* There exist several wrappers of [Yossi Rubner's EMD code](http://robotics.stanford.edu/~rubner/emd/default.htm), the most popular of which is in the OpenCV library. The first limitation of this code is the use of single-precison versus double-precision floating point numbers. Another issue is a hard-coded `MAX_SIG_SIZE`, which limits the size of the samples that can be used in the EMD computation. PyEMD is a more "Pythonic" EMD implementation than other wrappers. Only the minimal amount of computation is done in C (the core transportation algorithm). This means that the distance computation is done in Python using the efficient SciPy library, and a custom, precomputed distance matrix can be easily provided. Usage ----- The EMD implementation can be used simply in Python as: >>> from emd import emd >>> emd(X, Y) where `X` and `Y` are each n-dimensional samples of points. Each argument should be a NumPy array with n columns, but possibly different numbers of rows. If the sample is weighted, the weights can be specified with optional `X_weights` and `Y_weights` arguments. By default, uniform weights are used. Because the EMD is a distance between probability measures, the *total weights of each of the two samples must sum to 1*. By default, the Euclidean distance between points is used. However, an optional argument `distance` takes a string that specifies a valid distance type accepted by the `scipy.spatial.cdist` function. Alternatively, if `distance='precomputed'`, then a precomputed distance matrix is expected to be supplied to the optional argument `D`. Finally, PyEMD can also return the flows between the two samples that are used to compute the distance. If the `return_flows` argument is `True`, then two arguments, the distance an array of the flows, are returned. See the docstring for a more formal description of the functionality. In MATLAB, the functionality is essentially the same; see the help for details. Citing ------ If you have used PyEMD for your research and would like to cite it, you can use the following BibTeX entry: @Misc{, author = {Gary Doran}, title = {{PyEMD}: Earth Mover's Distance for {Python}}, year = {2014--}, url = "https://github.com/garydoranjr/pyemd", note = {[Online; accessed ]} } Questions and Issues -------------------- If you find any bugs or have any questions about this code, please create an issue on [GitHub](https://github.com/garydoranjr/pyemd/issues), or contact Gary Doran at . Of course, I cannot guarantee any support for this software.