Saturday, June 2, 2018

How to use Azure Blob Storage as Default File System for Local HDP Cluster

Set aside, for the moment, whether or not it is a good idea to use blob storage as your default file system. Let's just see how to make it happen.

I have a use case which requires storing a non-trivial amount of data. My definition of "non-trivial" for this purpose is "more than I want to store on a SAN" and "not enough to justify a separate storage purchase." Additionally, Azure agreements and resources are already in place. This data will have some recurring analysis value, probably on a monthly or quarterly basis, and performance is not a primary factor.

For this use case, WASB might be a good place to start. If the performance is poor, the data can be moved to Azure Data Lake, Managed Disks, or the compute can be moved to the cloud alongside the data. For this post, the focus is on block blobs, since the access will likely be via Hive. Page blobs are good for HBase use cases.

Not covered in this post are page blobs and protecting your credentials. Future posts will discuss using blob storage with other HDP and HDF services.

I used a number of articles as resources and will refer to them as needed. The following two are good reads to understand many of the considerations for this use of WASB.

A good overview of Azure blob storage and Hadoop is provided at this Apache support page:
https://hadoop.apache.org/docs/stable/hadoop-azure/index.html

This thread discusses configuration of WASB for use in Azure IaaS implementation:
https://community.hortonworks.com/questions/1635/instructions-to-setup-wasb-as-storage-for-hdp-on-a.html 

To use WASB as the default File System for your local HDP cluster, here are the high-level points:
  1. Use a clean HDP cluster: Use a cluster on which you can safely replace your data directories. A clean install is ideal for this example.
  2. Configure authentication between local cluster and Azure
    • Retrieve Azure container Access key(s) from Azure portal
    • Configure custom key: fs.azure.account.key.<StorageAccount>.blob.core.windows.net
  3. Configure core-site.xml to use WASB as file system
    • Configure custom key: fs.AbstractFileSystem.wasb.impl
    • Configure custom key: fs.defaultFS

Configure Authentication

Retrieve the Access key for your storage account from the Azure portal; this key will be configured as the value of the fs.azure.account.key property in the next section.

Before moving forward, verify that the WASB connector is functioning properly. You may want to create two containers - one container for miscellaneous WASB testing, and a second container kept clean so you can see the folder structure that is created during this example.
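A quick smoke test can address the container explicitly, before WASB becomes the default file system. This sketch assumes a hypothetical test container named testcontainer in a storage account named youraccount, and requires the account key property (shown later in core-site.xml) to already be set:

```shell
# List the root of the test container over WASB
# (testcontainer and youraccount are placeholder names).
hadoop fs -ls wasb://testcontainer@youraccount.blob.core.windows.net/

# Write and read back a small file to confirm the connector round-trips data.
echo "wasb smoke test" > /tmp/wasb-test.txt
hadoop fs -put /tmp/wasb-test.txt wasb://testcontainer@youraccount.blob.core.windows.net/wasb-test.txt
hadoop fs -cat wasb://testcontainer@youraccount.blob.core.windows.net/wasb-test.txt
```

If the -cat command echoes the file contents back, authentication and the connector are working.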

Configure core-site.xml 

Using Ambari, navigate to the HDFS Configs and add/modify these keys:


Add these properties:
  • fs.azure.account.key.<StorageAccount>.blob.core.windows.net
  • fs.AbstractFileSystem.wasb.impl

Modify this property:
  • fs.defaultFS

The resulting properties in the core-site.xml will look like this:
<property>
  <name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
  <value>[StorageAccountAccessKey]</value>
</property> 
<property>
  <name>fs.AbstractFileSystem.wasb.impl</name>
  <value>org.apache.hadoop.fs.azure.Wasb</value>
</property> 
<property>
  <name>fs.defaultFS</name>
  <value>wasb://[ContainerName]@[StorageAccount].blob.core.windows.net</value>
</property>
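Once the configuration is saved, you can sanity-check the effective value from any cluster node; hdfs getconf reads a property as the cluster resolves it:

```shell
# Print the effective default file system; it should echo the
# wasb:// URI configured in core-site.xml above.
hdfs getconf -confKey fs.defaultFS
```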

Restart Services

After making these changes, Ambari will prompt for a restart of services. Restart those services. You will know the change is successful if...
  • ...there are no errors on restart of services
  • ...the new properties appear in Advanced core-site in HDFS Configs
  • ...the Azure blob container has the Hadoop file structure

Run Some Tests 


  • Use Ambari to create folders and upload files
  • Use the command line to mkdir and put files into HDFS
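The command-line tests above might look like the following (the directory and file paths are illustrative). With fs.defaultFS pointing at WASB, relative HDFS paths now resolve to the Azure blob container:

```shell
# Create a directory and upload a file; these land in the
# Azure blob container because it is now the default file system.
hdfs dfs -mkdir -p /tmp/wasb-demo
hdfs dfs -put /etc/hosts /tmp/wasb-demo/
hdfs dfs -ls /tmp/wasb-demo
```

After running these, the uploaded objects should also be visible in the container through the Azure portal.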