[MO] How to industrialize a Hive data production chain

Zeppelin notebooks

We developed Spark Scala code in a Zeppelin notebook to transform raw data into results.

For quick checks, we use the Hue interface to run SQL queries directly against Hive.

Versioning the notebook

We started versioning the notebook code in git so that nothing gets lost.

Raw data is updated every day, and we have new features to implement.

To deploy new output data, we run the notebook manually.

Splitting the notebook into smaller steps

To debug and test features faster, we split the notebook into several coherent notebooks that roughly look like:

  • Cleaning

  • Computation

  • Export output data to an Elasticsearch index

Each notebook produces an intermediate result stored in a table.
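A minimal sketch of what one such step could look like, assuming a cleaning step that reads a raw Hive table and writes an intermediate table for the next notebook. The object name `CleaningStep`, the table names, the `user_id` column, and the cleaning rule itself are hypothetical and only serve as an illustration:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object CleaningStep {
  // Hypothetical cleaning rule: drop rows without a user_id,
  // then remove exact duplicate rows.
  def clean(raw: DataFrame): DataFrame =
    raw.filter(col("user_id").isNotNull).dropDuplicates()

  // Reads the input table, cleans it, and overwrites the intermediate
  // table that the next step (for example Computation) will read.
  def run(spark: SparkSession, inputTable: String, outputTable: String): Unit =
    clean(spark.table(inputTable))
      .write
      .mode("overwrite")
      .saveAsTable(outputTable)
}
```

In a Zeppelin Spark paragraph, where a `spark` session is already provided, this could be called as `CleaningStep.run(spark, "raw_events", "cleaned_events")` (both table names are placeholders).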

Use Maven + Oozie to be independent from Zeppelin

To collaborate easily and be independent from the Zeppelin interface, we created a Scala Maven project, with the code versioned in git.

The project is split into multiple modules following our notebooks.

To deploy the code, we package it with Maven and upload the JAR archives to HDFS.

To run the code, we created Oozie workflows that use our different archives.

Use different databases to test before deploying

We created a snapshot of the database and a separate set of Oozie workflows that are not linked to our production data:

  • If a workflow fails or produces wrong data, there is no impact on production.

  • The snapshot is not updated daily, so we know what results to expect at each step and can tell whether a modification breaks something.
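One way to let the same JAR run against either the snapshot or production is to pass the database name as a program argument. The sketch below assumes this approach; `JobConfig`, its methods, and the database and table names are hypothetical and not taken from our project:

```scala
// Hypothetical job settings: the database a run reads from and writes to.
final case class JobConfig(database: String) {
  // Builds a fully qualified table name such as "snapshot_db.cleaned_events".
  def table(name: String): String = s"$database.$name"
}

object JobConfig {
  // Expects the database name as the first program argument, for example
  // supplied by the workflow that launches the job.
  // Returns None when the argument is missing or blank.
  def fromArgs(args: Array[String]): Option[JobConfig] =
    args.headOption.map(_.trim).filter(_.nonEmpty).map(db => JobConfig(db))
}
```

With this shape, a test workflow would pass the snapshot database name and the production workflow the production one, while the code inside each step only ever refers to `config.table("...")`.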

Use coordinators to automatically run our algorithms

To run the code automatically every day when we receive new data, we created an Oozie coordinator that uses our workflows.

Test locally

We used Hadoop Unit to write unit tests for our algorithms, so we can develop without deploying to the cluster.

We split our code so that the Spark configuration and the side effects are isolated from the logic under test.
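The separation could look roughly like the sketch below. It is an illustration rather than our actual code: the object names, the `user_id` and `amount` columns, and the aggregation are hypothetical, and the check uses a plain local `SparkSession` instead of Hadoop Unit, whose setup is not shown here.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, sum}

// Pure logic: no session creation, no table reads, no writes.
object Computation {
  def totalPerUser(cleaned: DataFrame): DataFrame =
    cleaned.groupBy(col("user_id")).agg(sum(col("amount")).as("total"))
}

// Side effects stay at the edge: session settings, table reads and writes.
object ComputationJob {
  def main(args: Array[String]): Unit = {
    // Assumes two arguments: the input table and the output table.
    val inputTable = args(0)
    val outputTable = args(1)
    val spark = SparkSession.builder()
      .appName("computation")
      .enableHiveSupport()
      .getOrCreate()
    try {
      Computation.totalPerUser(spark.table(inputTable))
        .write.mode("overwrite").saveAsTable(outputTable)
    } finally spark.stop()
  }
}

// Local check: the same logic runs on in-memory data, without the cluster.
object ComputationCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("computation-check")
      .getOrCreate()
    import spark.implicits._
    try {
      val input = Seq(("u1", 10L), ("u1", 5L), ("u2", 7L)).toDF("user_id", "amount")
      val result = Computation.totalPerUser(input).as[(String, Long)].collect().toMap
      assert(result == Map("u1" -> 15L, "u2" -> 7L))
    } finally spark.stop()
  }
}
```

Because `Computation.totalPerUser` only takes and returns a `DataFrame`, a test decides where the data comes from and which Spark settings apply, and nothing in the test touches Hive tables.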

Version and deploy Oozie workflows

TODO

Future

Continuous deployment

Continuous integration
