This page contains some important tips for working with Python.
Python 2.7.8 is preinstalled and configured on all user VMs and on the processing cluster, and is currently the default version. Python 3.5 is also available, both on the VMs and on the cluster.
To switch to the Python 3.5 environment, you should run:
scl enable rh-python35 bash
Once inside this environment, all commands will default to using Python 3.5.
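To confirm which interpreter a script actually runs under, a quick check along the following lines may help. This is a minimal sketch; the function name describe_interpreter is purely illustrative and not part of the platform.

```python
import sys


def describe_interpreter():
    """Return the (major, minor) version of the running interpreter and its path."""
    return (sys.version_info[0], sys.version_info[1]), sys.executable


if __name__ == "__main__":
    version, path = describe_interpreter()
    # Works unchanged on both Python 2.7 and Python 3.5.
    print("Python %d.%d at %s" % (version[0], version[1], path))
```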
Even though we already provide a number of popular Python packages by default, you will probably need to install additional ones at some point. We recommend doing this with the pip package manager. For example:
pip install --user owslib
This installs the owslib library. The '--user' argument is required to avoid needing root permissions and to ensure that the correct Python version is used. Do not use the yum package manager to install packages!
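Packages installed with '--user' end up in a per-user site-packages directory that belongs to the active interpreter; on Linux its path normally includes the Python version. The sketch below prints that directory using the standard site module. The helper name user_site_info is an assumption made for this example.

```python
from __future__ import print_function

import site
import sys


def user_site_info():
    """Return the per-user site-packages path and whether it is on sys.path."""
    path = site.getusersitepackages()
    return path, path in sys.path


if __name__ == "__main__":
    path, active = user_site_info()
    print("pip install --user targets:", path)
    # The directory is only added to sys.path at startup if it already exists.
    print("already on sys.path:", active)
```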
More advanced options are explained here: https://proba-v-mep.esa.int/documentation/manuals/python-package-management
On the processing cluster
To run your code on the cluster, you need to make sure that all its dependencies are available there as well. You cannot install packages on the cluster yourself, but feel free to request the installation of a specific package.
If more freedom is needed, you can also submit your dependencies together with your application code. This is easier if you already use a Python virtual environment for development.
At the time of writing, these packages are installed on the cluster and on the user VMs:
distribute, pandas, numpy, scipy, matplotlib, sympy, nose, seaborn, rasterio, requests, python-dateutil, pyproj, netCDF4, Pillow, lxml, gdal
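Before submitting a job, it can be useful to check which of these packages the current environment can actually find. The sketch below uses importlib.util.find_spec, which requires Python 3.4 or later, so it should be run inside the Python 3.5 environment. Several import names differ from the package names (for example Pillow is imported as PIL and the GDAL bindings as osgeo). The helper missing_packages is an illustrative name, not a platform tool.

```python
import importlib.util

# Package name -> top-level import name; several differ from the package name.
IMPORT_NAMES = {
    "distribute": "setuptools",
    "pandas": "pandas",
    "numpy": "numpy",
    "scipy": "scipy",
    "matplotlib": "matplotlib",
    "sympy": "sympy",
    "nose": "nose",
    "seaborn": "seaborn",
    "rasterio": "rasterio",
    "requests": "requests",
    "python-dateutil": "dateutil",
    "pyproj": "pyproj",
    "netCDF4": "netCDF4",
    "Pillow": "PIL",
    "lxml": "lxml",
    "gdal": "osgeo",
}


def missing_packages(import_names):
    """Return the sorted package names whose import name cannot be found."""
    return sorted(
        package
        for package, module in import_names.items()
        if importlib.util.find_spec(module) is None
    )


if __name__ == "__main__":
    print("Missing:", missing_packages(IMPORT_NAMES))
```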
Two sample Python/Spark projects are available on Bitbucket; they show how to use Spark (http://spark.apache.org/) for distributed processing on the PROBA-V Mission Exploitation Platform.
The basic code sample implements an (intentionally) very simple computation: for each PROBA-V tile in a given bounding box and time range, a histogram is computed. The results are then summed and printed. The computation of the histograms runs in parallel.
The project's README file includes additional information on how to run it and inspect its output.
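The pattern behind the basic sample (one histogram per tile, then a sum over all histograms) can be sketched in plain Python without Spark. This is not the code from the sample project: the tiles are hypothetical lists of pixel values, and the function names are chosen for illustration. In a Spark job, the same two functions could be passed to the map and reduce operations of an RDD so that the per-tile step runs in parallel.

```python
from collections import Counter
from functools import reduce


def tile_histogram(pixels):
    """Count how often each pixel value occurs in one tile."""
    return Counter(pixels)


def sum_histograms(first, second):
    """Merge two histograms by adding their counts."""
    return first + second


# Hypothetical stand-ins for PROBA-V tiles: flat lists of pixel values.
tiles = [[0, 1, 1, 2], [1, 2, 2, 2], [0, 0, 3]]

# Map step (one histogram per tile), then reduce step (sum of all histograms).
total = reduce(sum_histograms, map(tile_histogram, tiles))
print(sorted(total.items()))  # [(0, 3), (1, 3), (2, 4), (3, 1)]
```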
The advanced code sample implements a more complex computation: it calculates the mean of a time series of PROBA-V tiles and outputs it as a new set of tiles. The tiles are split up into sub-tiles for increased parallelism. The mean is just one example of an operation that can be applied to a time series.
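The idea of splitting tiles into sub-tiles and averaging them over time can be illustrated with a small pure-Python sketch. It is not taken from the sample project: the 2x2 tiles, the row-wise splitting strategy and the helper names are assumptions made for this example. Each group of matching sub-tiles is independent of the others, which is what makes the work easy to distribute.

```python
def split_rows(tile, rows_per_part):
    """Split a tile (a list of rows) into sub-tiles of at most rows_per_part rows."""
    return [tile[i:i + rows_per_part] for i in range(0, len(tile), rows_per_part)]


def mean_over_time(sub_tiles):
    """Per-pixel mean of same-shaped sub-tiles taken at different dates."""
    count = float(len(sub_tiles))
    return [
        [sum(values) / count for values in zip(*rows)]
        for rows in zip(*sub_tiles)
    ]


# Hypothetical 2x2 tiles for three dates.
series = [
    [[1, 2], [3, 4]],
    [[3, 4], [5, 6]],
    [[5, 6], [7, 8]],
]

# Split every date's tile the same way, then average the matching sub-tiles.
parts_per_date = [split_rows(tile, 1) for tile in series]
mean_parts = [mean_over_time(same_part) for same_part in zip(*parts_per_date)]

# Reassemble the averaged sub-tiles into one output tile.
mean_tile = [row for part in mean_parts for row in part]
print(mean_tile)  # [[3.0, 4.0], [5.0, 6.0]]
```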
Application-specific files
When your script depends on specific files, there are a few options:
- Use the --files and --archives options of the spark-submit command to put local files in the working directory of your program, as explained here.
- Put the files on a network drive, as explained here.
- Put the files on the HDFS shared filesystem, where they can be read directly from Spark, or can be placed into your working directory using --files or --archives.
The second option is recommended mainly for larger files. The other options are convenient for distributing a number of smaller files. Spark has a caching mechanism in place to avoid unneeded file transfers across multiple similar runs.
When using compiled code, such as C++, code compiled on your VM will also work on the cluster. Hence you can safely follow the compilation instructions of your tool to build a binary, and then distribute it in the same way as any other type of auxiliary data. In your script, you may need to configure environment variables such as PATH and LD_LIBRARY_PATH to ensure that your binaries can be found and are used.
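One way to configure these variables from within a Python script is to build a modified copy of the environment and pass it to the child process that runs the binary. The sketch below assumes that the binary and its libraries sit in a single directory; the helper name env_with_binaries and the tool name in the comment are hypothetical. Note that changing LD_LIBRARY_PATH this way affects child processes only, not the Python process that is already running.

```python
import os


def env_with_binaries(tool_dir, base_env=None):
    """Return a copy of the environment with tool_dir prepended to PATH and LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    for name in ("PATH", "LD_LIBRARY_PATH"):
        current = env.get(name)
        env[name] = tool_dir if not current else tool_dir + os.pathsep + current
    return env


if __name__ == "__main__":
    # Assumption: the binary was shipped into the current working directory.
    env = env_with_binaries(os.getcwd())
    print(env["PATH"].split(os.pathsep)[0])
    # A hypothetical tool could then be started with, for example:
    #   import subprocess
    #   subprocess.check_call(["mytool", "--help"], env=env)
```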