
Scientific workflows play a vital role in modern science as they enable scientists to specify, share and reuse computational experiments. To maximize these benefits, workflows need to support the reproducibility of the experimental methods they capture. Reproducibility enables effective sharing, as scientists can re-execute experiments developed by others and quickly derive new or improved results. However, achieving reproducibility in practice is problematic: previous analyses highlight issues due to uncontrolled changes in the input data, configuration parameters, workflow description and the software used to implement the workflow tasks. The resulting problems have become known as workflow decay.
In this paper we present a novel framework that addresses workflow decay through the integration of system description, version control, container management and automated deployment techniques. We also introduce a set of performance optimization techniques that significantly reduce the runtime overheads incurred by making workflows reproducible. By combining a method for uniquely identifying task and workflow images with an automated image capture facility and a multi-level cache, the resulting system significantly improves performance and repeatability, as well as the ability to share and re-use workflows.
The system is evaluated through an extensive set of experiments that validate the approach and highlight the key benefits of the proposed optimizations. These include methods that reduce workflow runtime by up to an order of magnitude in cases where workflows are enacted concurrently on the same host VM or in different clouds, and where they share tasks.