When processing large amounts of data, the resources of your VM or notebook will be insufficient. For such workloads we provide a distributed processing cluster with up to 50 physical machines. To run a job on this cluster, we recommend using Apache Spark.

Further info on writing and running a Spark job can be found here.
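To give you an idea of what such a job looks like, below is a minimal PySpark sketch. The application name, input path, and output path are placeholders; the actual structure of your job will depend on your data.

    # minimal_job.py -- minimal PySpark sketch (names and paths are placeholders)
    from pyspark.sql import SparkSession

    # Create (or reuse) a Spark session; cluster settings are supplied by spark-submit.
    spark = SparkSession.builder.appName("minimal-example").getOrCreate()

    # Read a CSV file from HDFS into a DataFrame (replace the path with your own data).
    df = spark.read.csv("hdfs:///user/your_username/input.csv", header=True)

    # A trivial transformation: count rows per value of the first column.
    counts = df.groupBy(df.columns[0]).count()

    # Write the result back to HDFS (replace the output path as needed).
    counts.write.csv("hdfs:///user/your_username/output", header=True)

    spark.stop()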

Command line interfaces

The main way to interact with the processing cluster is through the command-line tools on your virtual machine. Here's an overview of the most important tools to get you started (example invocations follow the list):

  • spark-submit: submits Java, Scala or Python jobs to the cluster
  • yarn: allows you to retrieve logs from your jobs, e.g. 'yarn logs -applicationId xxxxx'
  • hdfs: allows you to interact with the HDFS distributed file system
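For illustration, a few typical invocations are shown below. The file names, paths, and application ID are placeholders, and the exact options you need may differ for your job.

    # Submit a Python job to the cluster (file name and options are placeholders).
    spark-submit --master yarn --deploy-mode cluster minimal_job.py

    # Fetch the logs of a running or finished job by its application ID.
    yarn logs -applicationId application_1234567890123_0001

    # Copy a local file into HDFS and list the contents of your home directory.
    hdfs dfs -put input.csv /user/your_username/
    hdfs dfs -ls /user/your_username/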

Cluster user interfaces

The cluster also has some useful graphical interfaces:

Dashboard of running jobs:

Swiss army knife for all tools on the cluster: