[MO] How to industrialize a Hive data production chain
Zeppelin notebooks
We developed Spark Scala code in a Zeppelin notebook to transform raw data into results.
For quick checks, we use the Hue interface to run SQL queries directly against Hive.
Versioning the notebook
We started versioning the notebook code in git so that nothing gets lost.
Raw data is updated every day and we have new features to implement.
To deploy new output data, we launch the notebook manually.
Splitting the notebook into smaller steps
To debug and test features faster, we split the notebook into several coherent notebooks, roughly as follows:
Cleaning
Computation
Export output data to an Elasticsearch index
Each notebook produces an intermediate result stored in a table.
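The chain of steps above can be sketched as DataFrame transformations, each reading one table and writing the next. This is only a minimal sketch under assumptions: the table names (`raw_events`, `cleaned_events`, `computed_totals`), the columns (`id`, `amount`) and the helper `runStep` are hypothetical, and the Elasticsearch export step is left out.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, sum}

object PipelineSketch {
  // Cleaning step (hypothetical rule): drop rows without an id, then exact duplicates.
  def clean(raw: DataFrame): DataFrame =
    raw.filter(col("id").isNotNull).dropDuplicates()

  // Computation step (hypothetical rule): one total amount per id.
  def compute(cleaned: DataFrame): DataFrame =
    cleaned.groupBy("id").agg(sum("amount").as("total"))

  // Runs one step: read the input table, transform it, and store the
  // intermediate result in a table that the next step can read.
  def runStep(spark: SparkSession, in: String, out: String)(step: DataFrame => DataFrame): Unit =
    step(spark.table(in)).write.mode("overwrite").saveAsTable(out)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("pipeline-sketch")
      .enableHiveSupport() // assumes the tables live in the Hive metastore
      .getOrCreate()

    runStep(spark, "raw_events", "cleaned_events")(clean)
    runStep(spark, "cleaned_events", "computed_totals")(compute)
    spark.stop()
  }
}
```

Because every step persists its output, a failing step can be rerun and inspected from Hue without recomputing the steps before it.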
Use Maven + Oozie to be independent from Zeppelin
To collaborate easily and be independent from the Zeppelin interface, we created a Scala Maven project and versioned its code with git.
The project is split into multiple modules following our notebooks.
To deploy the code, we package it with Maven and upload the JAR archives to HDFS.
To run the code, we created Oozie workflows that use our different archives.
Use different databases to test before deploying
We created a snapshot of the database, plus separate Oozie workflows that are not linked to our production data:
If a workflow fails or produces wrong data, there is no impact on production.
The snapshot is not updated daily, so we know what results to expect at each step and can tell whether our modifications break something.
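One way to keep test workflows away from production data is to pass the database name to each job as an argument, so the same JAR can run against either database. The following is a minimal sketch under that assumption; `JobConfig`, `prod_db`, `snapshot_db` and `cleaned_events` are hypothetical names, and the source does not say how the workflows actually select their database.

```scala
// Hypothetical job configuration: the target database is the only setting.
final case class JobConfig(database: String) {
  // Qualifies a table name with the configured database.
  def table(name: String): String = s"$database.$name"
}

object JobConfig {
  // Reads the database name from the first program argument.
  // Returns Left with a usage message when the argument is missing or blank.
  def fromArgs(args: Array[String]): Either[String, JobConfig] =
    args.headOption.map(_.trim).filter(_.nonEmpty) match {
      case Some(db) => Right(JobConfig(db))
      case None     => Left("usage: <database>")
    }
}

object JobConfigDemo {
  def main(args: Array[String]): Unit = {
    // The production workflow and the test workflow differ only in this argument.
    val production = JobConfig.fromArgs(Array("prod_db"))
    val snapshot   = JobConfig.fromArgs(Array("snapshot_db"))

    println(production.map(_.table("cleaned_events")))
    println(snapshot.map(_.table("cleaned_events")))
  }
}
```

With this layout, an Oozie workflow would only need to supply a different argument to switch between the snapshot and production.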
Use coordinators to automatically run our algorithms
To run the code automatically every day when we receive new data, we created an Oozie coordinator that uses our workflows.
Test locally
We used Hadoop Unit to write unit tests for our algorithms, so we can develop without deploying to the cluster.
We split our code so that tests can isolate the Spark configuration and the side effects from the transformation logic.
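A minimal sketch of such a split, assuming hypothetical names (`Cleaning`, `dropNegativeAmounts`, the `id` and `amount` columns). The check uses a plain local `SparkSession` instead of Hadoop Unit, only to keep the example short. The transformation takes and returns a `DataFrame`, while session creation and table reads and writes stay at the edges.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object Cleaning {
  // Pure transformation: creates no session and touches no table.
  def dropNegativeAmounts(df: DataFrame): DataFrame =
    df.filter(col("amount") >= 0)

  // Side effects live here; the caller supplies the session and the table names.
  def run(spark: SparkSession, in: String, out: String): Unit =
    dropNegativeAmounts(spark.table(in)).write.mode("overwrite").saveAsTable(out)
}

object CleaningLocalCheck {
  def main(args: Array[String]): Unit = {
    // Local session for the check: no cluster and no Hive metastore required.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("cleaning-check")
      .getOrCreate()
    import spark.implicits._

    // In-memory input replaces the table read.
    val input = Seq(("a", 10.0), ("b", -5.0), ("c", 0.0)).toDF("id", "amount")
    val kept = Cleaning.dropNegativeAmounts(input).select("id").as[String].collect().sorted

    assert(kept.sameElements(Array("a", "c")), s"unexpected rows: ${kept.mkString(",")}")
    spark.stop()
  }
}
```

Only `run` needs a cluster-like environment; the transformation itself can be checked on a laptop with a handful of rows.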
Version and deploy Oozie workflows
TODO
Future
Continuous deployment
Continuous integration