When writing scripts for remote sensing, taking a few things into account from the start saves work later on:
- Run and test your code on a Linux distribution. Writing cross-platform code is not hard in any modern language, but porting a codebase after it has been written is no fun.
- Prepare your code for distributed processing from day one. Spark is well suited to writing processing jobs that also run locally, and is quite easy to learn. This again avoids costly refactoring work at a later point.
- Try running your code on the cluster as soon as possible. You will learn a number of very useful things by doing so, and will know what to avoid in the future.
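As an illustration of the points above, here is a minimal sketch of a PySpark job that runs unchanged on a workstation and on the cluster. The input format (text lines of the form 'tile_id,value'), the function names and the application name are assumptions made for this example only; they are not part of any platform API. The per-record logic lives in plain functions, so it can be tested without Spark, and no master is hard-coded in the script.

```python
import sys


def parse_record(line):
    """Parse one 'tile_id,value' text line into a (tile_id, float) pair."""
    tile_id, value = line.strip().split(",")
    return tile_id, float(value)


def mean_per_tile(records):
    """Plain-Python version of the aggregation, useful for local unit tests."""
    sums = {}
    for tile_id, value in records:
        total, count = sums.get(tile_id, (0.0, 0))
        sums[tile_id] = (total + value, count + 1)
    return {tile: total / count for tile, (total, count) in sums.items()}


def run_job(input_path):
    """Compute the mean value per tile with Spark and return it as a dict."""
    # Imported lazily so the helpers above can be tested without Spark installed.
    from pyspark import SparkConf, SparkContext

    # The master is deliberately not set here: it is taken from the
    # --master option of spark-submit (local or YARN).
    sc = SparkContext(conf=SparkConf().setAppName("mean-per-tile"))
    try:
        means = (
            sc.textFile(input_path)
            .map(parse_record)
            .mapValues(lambda v: (v, 1))
            .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
            .mapValues(lambda s: s[0] / s[1])
            .collect()
        )
    finally:
        sc.stop()
    return dict(means)


# Only start Spark when an input path was passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    for tile, mean in sorted(run_job(sys.argv[1]).items()):
        print(tile, mean)
```

Because the Spark-specific part is confined to run_job, the same file can be checked quickly on a laptop and later submitted to the cluster without refactoring.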
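Running a Spark job on the cluster can be done using the spark-submit command, specifying 'yarn-cluster' as master (on Spark 2.0 and later, where that master value is deprecated, the equivalent is '--master yarn --deploy-mode cluster'). More info on writing Spark jobs can be found here.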
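The sketch below shows one way to assemble such a spark-submit call from Python, for example in a small launcher script. It uses the '--master yarn --deploy-mode cluster' form, which on recent Spark versions corresponds to the 'yarn-cluster' master. The file name my_job.py, the input path and the resource values are placeholders chosen for this example; suitable values depend on your job and on the cluster.

```python
import shlex


def build_spark_submit(app_file, app_args=(), num_executors=4,
                       executor_memory="2g", py_files=()):
    """Return the spark-submit argument list for a YARN cluster-mode run."""
    cmd = [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--num-executors", str(num_executors),
        "--executor-memory", executor_memory,
    ]
    if py_files:
        # Extra Python modules or zip files shipped along with the job.
        cmd += ["--py-files", ",".join(py_files)]
    cmd.append(app_file)   # the application file comes after all options
    cmd.extend(app_args)   # everything after it is passed to the application
    return cmd


def as_shell(cmd):
    """Render the argument list as a copy-pasteable shell command."""
    return " ".join(shlex.quote(part) for part in cmd)


# Placeholder file name and input path, for illustration only.
print(as_shell(build_spark_submit("my_job.py", ["hdfs:///path/to/input"])))
```

The returned list can be passed to subprocess.run(cmd, check=True) on a machine where spark-submit is installed and configured for the cluster.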
The JobControl application also supports defining and submitting PySpark and Java Spark jobs.
Under the menu item Jobs > New, a new job can be defined with the same parameters that are available when submitting a job using the spark-submit command. Currently, the application files must exist on the HDFS filesystem.
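Since the application files have to be on HDFS, a small helper for copying them there can be convenient. The following is a sketch only, assuming the standard 'hdfs dfs' command-line client is available on the machine you work on; the function names and the example paths are hypothetical.

```python
import os
import posixpath
import subprocess


def hdfs_target(local_file, hdfs_dir):
    """Return the HDFS path the file will have after the upload."""
    return posixpath.join(hdfs_dir, os.path.basename(local_file))


def hdfs_put_command(local_file, hdfs_dir):
    """Build the 'hdfs dfs -put' call; -f overwrites an existing copy."""
    return ["hdfs", "dfs", "-put", "-f", local_file, hdfs_dir]


def upload(local_file, hdfs_dir):
    """Create the target directory if needed, copy the file, return its HDFS path."""
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
    subprocess.run(hdfs_put_command(local_file, hdfs_dir), check=True)
    return hdfs_target(local_file, hdfs_dir)
```

For example, upload("my_job.py", "/user/<your-user>/jobs") would return the path to enter as the application file when defining the job.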
At the bottom, you can find two input boxes. The box on the left can be used to describe the input parameters for the job. The box on the right can be used to override the default form that is generated when submitting the job, using bootstrap schema forms.
Once the job is defined, it can be started. A job can also be shared within a project, so that people working on the same project can use the same jobs. For this purpose, we can define one or more projects and assign user accounts to them; you can then share your jobs with one or more of those projects. If you want a project to be set up, let us know.