This blog was co-authored by Ryan Peterson, Head of Global Data Segment at AWS.
Central to empowering businesses to deliver the right data, in the right environment, for the right use case is the ability to replicate data securely and independently of location — encapsulating and copying it seamlessly between private on-premises storage and public cloud environments.
Hortonworks’ Data Lifecycle Manager (DLM), an extensible service built on the Hortonworks DataPlane Platform (DPS), provides a complete solution for replicating HDFS and Hive data, metadata, and security policies between on-premises clusters and Amazon S3. This data movement enables data science and ML workloads to execute models in Amazon SageMaker and bring the resulting data back on-premises. To facilitate this use case, here are the steps for replicating data from HDFS to the AWS cloud:
Step 1: Add Source Cluster for Replication
Using DPS, add an Ambari-managed HDP cluster. DPS provides information such as the cluster’s location, number of nodes, and uptime details. (Fig 1). Identify and designate one of the clusters as the source cluster for the AWS S3 cloud replication.
1.1 Make sure that the logged-in user has the DataPlane Admin role. (Fig 2)
1.2 From the top left corner, go to the Data Lifecycle Manager app and click the Clusters page to see the list of clusters that can be used for replication. (Fig 3)
1.3 You are then directed to the DLM Cluster Dashboard page, which shows each cluster’s location, status, usage, HDP and DLM versions, and node count. There is also an option to go to the Ambari page from the cluster menu. (Fig 4)
Step 2: Add Cloud Credentials
To replicate data to AWS S3, add cloud credentials using either key-based or role-based authentication. Click the Cloud Credentials tab, then click the Add button. Ensure the credentials are validated before saving them. (Fig 5.1, 5.2 and 5.3)
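Whichever authentication method you choose, the credentials must carry S3 permissions that allow DLM to list the bucket and write objects into it. As a rough sketch (the bucket name is hypothetical, and your exact action list may differ — check the AWS IAM documentation for your setup), a minimal policy looks something like this:

```python
import json

# Hypothetical bucket name used only for illustration.
BUCKET = "my-dlm-target-bucket"

# Minimal illustrative IAM policy: list/locate the bucket, and
# read/write/delete objects under it.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
            "Resource": f"arn:aws:s3:::{BUCKET}",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        },
    ],
}

print(json.dumps(policy, indent=2))
```

Note that bucket-level actions (ListBucket) and object-level actions (Get/Put/DeleteObject) attach to different resource ARNs — a common source of "access denied" errors during credential validation.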
Step 3: Create DLM Policy
3.1 In the DLM navigation pane, click Policies. (Fig 6)
3.2 The Replication Policies page displays a list of existing policies. Click “Add Policy”. (Fig 7)
3.3 Enter or select the following information:
- Policy Name (Required) –> Enter the policy name of your choice
- Service –> Select HDFS (Fig 8)
3.4 On the Select Source page → select Type as “Cluster” → select as Source Cluster the cluster that you added in the previous section. (Fig 9)
3.5 Use the file browser to navigate to and select an existing folder path on the source cluster (e.g., /apps/traffic_data). (Fig 10)
- On the Select Destination page → select Type as “S3” → select a cloud credential. (Fig 11)
- Select the path to the S3 bucket. If the bucket exists, DLM replicates the content to it, provided the cloud credentials have write access to the bucket. If the bucket does not exist, DLM creates it and then replicates the content from the source cluster.
- Select the encryption type. Two encryption protocols are supported: SSE-S3 and SSE-KMS. DLM overrides the S3 bucket’s encryption with the type selected in this step. (Fig 11)
- Click “Validate”, which ensures that the user has the file permissions required to copy to the destination, then click “Schedule”. (Fig 11)
3.6 In the Run Job section, select “From Now” and enter the replication frequency (e.g., 5 Minute(s), for demo purposes), then click Advanced Settings. (Fig 12)
3.7 Queue, Maximum Bandwidth, and Number of Mappers are optional parameters; set them if required for the replication. Click “Create Policy”. (Fig 13)
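DLM’s HDFS replication jobs are built on DistCp-style copying, so the settings above map onto familiar Hadoop knobs. As a hedged sketch (cluster, bucket, and path names are hypothetical; DLM assembles and schedules the real job itself), the helper below shows roughly how the policy settings — queue, mappers, bandwidth, and the standard s3a server-side-encryption properties — would appear on an equivalent command line:

```python
def build_distcp_command(src_path, bucket, dest_path,
                         encryption="AES256",   # "AES256" = SSE-S3; use "SSE-KMS" for KMS
                         kms_key_arn=None,      # needed only with SSE-KMS
                         num_mappers=None, bandwidth_mb=None, queue=None):
    """Assemble a DistCp command roughly equivalent to a DLM HDFS->S3
    policy run. Illustrative only -- DLM constructs the actual job."""
    cmd = ["hadoop", "distcp"]
    if queue:
        cmd += [f"-Dmapreduce.job.queuename={queue}"]
    # Standard Hadoop s3a connector properties for server-side encryption.
    cmd += [f"-Dfs.s3a.server-side-encryption-algorithm={encryption}"]
    if encryption == "SSE-KMS" and kms_key_arn:
        cmd += [f"-Dfs.s3a.server-side-encryption.key={kms_key_arn}"]
    if num_mappers:
        cmd += ["-m", str(num_mappers)]           # number of map tasks
    if bandwidth_mb:
        cmd += ["-bandwidth", str(bandwidth_mb)]  # MB/s per mapper
    cmd += [f"hdfs://{src_path}", f"s3a://{bucket}{dest_path}"]
    return cmd

# Hypothetical source cluster and bucket names:
cmd = build_distcp_command("source-nn:8020/apps/traffic_data",
                           "my-dlm-target-bucket", "/apps/traffic_data",
                           num_mappers=10, bandwidth_mb=100)
print(" ".join(cmd))
```

Bandwidth here is a per-mapper cap, so total throughput is roughly mappers × bandwidth — worth keeping in mind when sizing a policy against a shared network link.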
3.8 Once the policy is created, an alert confirms that it was created successfully. (Fig 14)
3.9 Now you can see that the policy is created and bootstrap is in progress. Bootstrap is the initial, full copy of the dataset from the source to the destination AWS S3 bucket. Subsequent policy instances perform incremental replication, copying only the data that has changed or been added since the previous run. (Fig 15)
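The bootstrap-then-incremental behavior can be sketched with a toy file-set comparison. This is purely illustrative — DLM tracks changes efficiently (e.g., via snapshot-style diffs) rather than diffing full listings as this toy model does:

```python
def plan_replication(source, target):
    """Decide which files to copy: everything on bootstrap (empty target),
    otherwise only new or changed files.
    `source` and `target` map file path -> modification time (toy model)."""
    if not target:  # bootstrap: first run copies the full dataset
        return sorted(source)
    return sorted(p for p, mtime in source.items()
                  if p not in target or target[p] != mtime)

source = {"/apps/traffic_data/a.csv": 100, "/apps/traffic_data/b.csv": 100}

bootstrap = plan_replication(source, {})        # full copy of the dataset
target = dict(source)                           # destination now in sync

source["/apps/traffic_data/b.csv"] = 200        # b.csv updated at source
source["/apps/traffic_data/c.csv"] = 300        # c.csv newly added
incremental = plan_replication(source, target)  # only changed/new files

print(bootstrap)
print(incremental)
```

The bootstrap run returns both existing files, while the incremental run returns only the updated b.csv and the new c.csv — the same economy that lets frequent DLM policy runs stay cheap after the first full copy.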
3.10 Once the bootstrap completes successfully, the next DLM policy instance initiates incremental replication. (Fig 16)
Step 4. Review the replicated data in the AWS S3 bucket. You can see the data replicated from the source cluster into the S3 environment. (Fig 17)
Step 5. A critical final step is bringing successfully executed data science models back from the cloud to the on-premises cluster. Create a DLM policy to replicate the selected content (models and data) from AWS S3 back to the on-premises cluster. (Fig 18)
Visit Hortonworks at booth #629 in the Sands Expo Center during AWS re:Invent 2018 to learn more about Hortonworks, DLM and the winning combination of Big Data and Cloud.