Processing cluster upgraded

Thursday, June 15, 2017

Upgrade succeeded

As part of our ongoing efforts to improve the MEP, we upgraded our processing cluster over the past two days. This was the first major upgrade to one of the platform's core components, and thus an important milestone for the platform and the team that runs it.

As we rely on this system for operational services and have a steadily growing user base, we spent considerable time preparing this upgrade. This allowed us to rehearse the procedure and identify possible issues up front. Nevertheless, we cannot test every application our users are running, and there remain small, mostly unavoidable differences between our testing system and the production system. If you encounter issues after the upgrade, please let us know so we can look into them.

If you haven't tried using the cluster before, more info can be found here and here.


Feature-wise, this upgrade to Hortonworks Data Platform (HDP) 2.5 brings two important improvements that are relevant to the MEP:

Spark 1.4 has been upgraded to 1.6, and Spark 2.0 is available alongside it. Spark is the component we recommend for implementing your distributed processing jobs. These new versions bring performance improvements and new features; most visibly, the Spark UI has become more usable, giving even more insight into your running jobs.
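On HDP 2.5, switching between the two installed Spark versions is typically done per session with the SPARK_MAJOR_VERSION environment variable. A minimal sketch (the script name is a placeholder for your own job):

```shell
# Select Spark 2.0 for this session (HDP-specific switch);
# leave unset, or set to 1, to keep using Spark 1.6.
export SPARK_MAJOR_VERSION=2

# Submit a job as usual; spark-submit now resolves to the Spark 2.0 client.
spark-submit --master yarn --deploy-mode cluster my_job.py
```

Since both versions live side by side, you can validate an existing job against 2.0 without touching the 1.6 setup it currently runs on.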

Additional monitoring capabilities are now available, which should give us a better view of the running system. This is critical as the number of (concurrent) users grows, and it should also give us better insight into overall platform usage.

Alongside these major improvements, a number of smaller issues have also been fixed. For instance, you can now use Spark actions inside Oozie workflows to schedule your jobs more easily. While these may seem like small changes, more incremental improvements will follow, and as we gain experience we should be able to further minimize upgrade downtime.
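As an illustration, a minimal Oozie workflow with a Spark action might look like the following sketch. All names, paths, and property values here are placeholders; consult the Oozie documentation shipped with HDP 2.5 for the exact schema version supported on the cluster:

```xml
<workflow-app name="spark-example" xmlns="uri:oozie:workflow:0.5">
    <start to="spark-node"/>
    <action name="spark-node">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>yarn-cluster</master>
            <name>my-spark-job</name>
            <jar>${nameNode}/apps/my_job.py</jar>
        </spark>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Spark job failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Combined with an Oozie coordinator, this lets you run a Spark job on a schedule without any custom glue scripts.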

Next steps

Now that this milestone has been reached, we can look to the future. One clear area for improvement is maximizing the use of our resources. When only a single user is active on the platform, we want that user to benefit from all available resources; when multiple users want to run jobs, we want resources to be distributed fairly. This is supported by scheduler features such as dynamic resource allocation and preemption, and we will look into enabling them as soon as possible.
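For reference, dynamic resource allocation in Spark is controlled by configuration along these lines (the values and script name are illustrative; it also requires the external shuffle service to be running on the YARN NodeManagers):

```shell
spark-submit \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=50 \
  my_job.py
```

With these settings, a job grows toward the executor ceiling when the cluster is idle and shrinks back when other users need capacity, which is exactly the fair-sharing behavior described above.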

In the more distant future, new releases of the platform will bring support for using virtual environments with PySpark, so that you can more easily use your own versions of Python packages. Or what about running TensorFlow jobs on the MEP? Or using Docker to isolate your operational workflows? This platform is a solid foundation for any type of processing workload, so don't hesitate to let us know what you'd like to run on it!
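Until native virtual environment support lands, one workaround that is commonly used on YARN clusters is shipping a packed Python environment with the job. A sketch under the assumption of a YARN deployment, with all archive and path names illustrative:

```shell
# Pack the environment into an archive (once, on a machine with the same
# OS/architecture as the cluster nodes). Zipping from inside the directory
# keeps "bin/python" at the top level of the archive.
cd my_env && zip -r ../my_env.zip . && cd ..

# Ship the archive with the job; YARN unpacks it into each container under
# the link name "my_env", and PYSPARK_PYTHON points Spark at its interpreter.
spark-submit \
  --master yarn \
  --archives my_env.zip#my_env \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=my_env/bin/python \
  --conf spark.executorEnv.PYSPARK_PYTHON=my_env/bin/python \
  my_job.py
```

This keeps your package versions isolated from the cluster-wide Python installation, at the cost of distributing the archive with every submission.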