
Automation of large-scale Hadoop deployments

2017-08-02

The need for custom-tailored, automated deployment of Hadoop environments came from one of our Fortune 500 clients. Their existing infrastructure was hard to operate and, because of manageability issues, consistently hit its cost thresholds.

 

Starschema came up with a solution that provided the required functionality within a shorter development timeline by using existing provisioning tools and leveraging some of the seldom-used features of Hadoop services. The Hadoop deployment runs solely on AWS infrastructure. In contrast to EMR, the built-in AWS Hadoop service, this solution gave our customer the flexibility to install alternative Hadoop stack versions with a range of optional supporting services. It also included additional custom packages and libraries, making it possible to support different development teams within the immediate organization as well as other departments across the company.

 

The deployed infrastructure consists of three major parts which are responsible for:

 

– Infrastructure provisioning (Networks, Firewalls, IAM roles, Instances, etc.)

– Software provisioning (OS packages, Hadoop dependencies, custom libraries, etc.)

– Hadoop provisioning
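
The three stages listed above can be pictured as a small pipeline that reads one cluster description and runs each stage in order. The following is only an illustrative sketch, not the tooling used in the project: the `ClusterSpec` fields, the stage functions, and the `deploy` helper are all hypothetical names, and each stage just records what it would do instead of calling AWS or a provisioning tool.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ClusterSpec:
    """Hypothetical cluster description; every field name is illustrative."""
    name: str
    instance_count: int
    os_packages: List[str] = field(default_factory=list)
    hadoop_services: List[str] = field(default_factory=list)


def provision_infrastructure(spec: ClusterSpec, log: List[str]) -> None:
    # Stage 1: networks, firewalls, IAM roles, instances (recorded, not executed).
    log.append(f"infrastructure: {spec.instance_count} instances for {spec.name}")


def provision_software(spec: ClusterSpec, log: List[str]) -> None:
    # Stage 2: OS packages, Hadoop dependencies, custom libraries.
    log.append(f"software: {', '.join(spec.os_packages)}")


def provision_hadoop(spec: ClusterSpec, log: List[str]) -> None:
    # Stage 3: the Hadoop services themselves.
    log.append(f"hadoop: {', '.join(spec.hadoop_services)}")


# The stages always run in this fixed order, each reading the same spec.
STAGES: List[Callable[[ClusterSpec, List[str]], None]] = [
    provision_infrastructure,
    provision_software,
    provision_hadoop,
]


def deploy(spec: ClusterSpec) -> List[str]:
    """Run every stage in order and return the recorded actions."""
    log: List[str] = []
    for stage in STAGES:
        stage(spec, log)
    return log


if __name__ == "__main__":
    spec = ClusterSpec("analytics", 3, ["openjdk"], ["hdfs", "yarn"])
    for entry in deploy(spec):
        print(entry)
```

Keeping the stages separate but feeding them the same description is one way to make each layer replaceable without touching the others.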

 

Because all settings live in a central configuration repository, configuring and managing the environment is straightforward. The repository supports all the necessary features, such as multi-tenant environments, highly available cluster deployments, tailored Hadoop versions, Kerberos-enabled services, and additional custom configuration options.

 

Different pre-configured packages were created to help teams deploy the Hadoop cluster configuration best suited to their departmental needs. These packages provide cost savings by eliminating unnecessary configurations and services.
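
One way to picture such pre-configured packages is as named profiles that a team selects and then adjusts with a few overrides. The sketch below is an assumption-laden illustration rather than the actual package format: the profile names, the keys (`services`, `high_availability`, `kerberos`), and the `build_config` helper are all invented for this example.

```python
from copy import deepcopy
from typing import Any, Dict, Optional

# Hypothetical profiles; names and keys are illustrative only.
PROFILES: Dict[str, Dict[str, Any]] = {
    "minimal": {
        "services": ["hdfs", "yarn"],
        "high_availability": False,
        "kerberos": False,
    },
    "secure-ha": {
        "services": ["hdfs", "yarn", "hive"],
        "high_availability": True,
        "kerberos": True,
    },
}


def build_config(profile: str, overrides: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Return a copy of the named profile with team-specific overrides applied."""
    if profile not in PROFILES:
        raise KeyError(f"unknown profile: {profile}")
    # Copy first so the shared profile definitions are never mutated.
    config = deepcopy(PROFILES[profile])
    config.update(overrides or {})
    return config


if __name__ == "__main__":
    print(build_config("minimal", {"kerberos": True}))
```

Starting from the smallest profile that fits and overriding only what differs keeps unneeded services out of a deployment, which is where the cost savings described above would come from.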

 

The deployed Hadoop environments were created automatically from code and managed from a central repository. This was a considerable improvement over earlier versions, where most configuration work, manual interventions, and unplanned outages were a burden during daily operation. The infrastructure-as-code methodology not only eliminated these problems but also significantly reduced operational cost.
