Saturday, June 2, 2018

How to use Azure Blob Storage as Default File System for Local HDP Cluster

Set aside, for the moment, whether it is a good idea to use blob storage as your default file system. Let's just see how to make it happen.

I have a use case which requires storing a non-trivial amount of data. My definition of "non-trivial" for this purpose is "more than I want to store on a SAN" and "not enough to justify a separate storage purchase." Additionally, Azure agreements and resources are already in place. This data will have some recurring analysis value, probably on a monthly or quarterly basis, and performance is not a primary factor.

For this use case, WASB might be a good place to start. If performance is poor, the data can be moved to Azure Data Lake or Managed Disks, or the compute can be moved to the cloud alongside the data. This post focuses on block blobs, since access will likely be via Hive; page blobs are better suited to HBase use cases.

Not covered in this post are page blobs and protecting your credentials. Future posts will discuss using blob storage with other HDP and HDF services.

I used a number of articles as resources and will refer to them as needed. The following two are good reads to understand many of the considerations for this use of WASB.

A good overview of Azure blob storage and Hadoop is provided at this Apache documentation page:
https://hadoop.apache.org/docs/stable/hadoop-azure/index.html

This thread discusses configuration of WASB for use in Azure IaaS implementation:
https://community.hortonworks.com/questions/1635/instructions-to-setup-wasb-as-storage-for-hdp-on-a.html 

To use WASB as the default File System for your local HDP cluster, here are the high-level points:
  1. Use a clean HDP cluster: use one on which you can safely replace your data directories. A clean install is ideal for this example.
  2. Configure authentication between local cluster and Azure
    • Retrieve the Azure storage account access key(s) from the Azure portal
    • Configure custom key: fs.azure.account.key.<StorageAccount>.blob.core.windows.net
  3. Configure core-site.xml to use WASB as file system
    • Configure custom key: fs.AbstractFileSystem.wasb.impl
    • Configure custom key: fs.defaultFS

Configure Authentication


In Ambari, add a custom core-site property named fs.azure.account.key.<StorageAccount>.blob.core.windows.net and set its value to the storage account access key retrieved from the Azure portal.

Before moving forward, verify that the WASB connector is functioning properly. You may want to create two containers: one for miscellaneous WASB testing, and a second kept clean so you can see the folder structure that is created during this example.
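A quick smoke test from a cluster node once the key is in place - the container name "testcontainer" and account "youraccount" below are placeholders for your own:

  hadoop fs -mkdir wasb://testcontainer@youraccount.blob.core.windows.net/smoketest
  hadoop fs -put /etc/hosts wasb://testcontainer@youraccount.blob.core.windows.net/smoketest/
  hadoop fs -ls wasb://testcontainer@youraccount.blob.core.windows.net/smoketest

If these commands succeed, the connector and credentials are working.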

Configure core-site.xml 

Using Ambari, navigate to the HDFS Configs and add/modify these keys:

Add these properties:
  • fs.azure.account.key.<StorageAccount>.blob.core.windows.net (added during the authentication step above)
  • fs.AbstractFileSystem.wasb.impl

Modify this property:
  • fs.defaultFS

The resulting properties in the core-site.xml will look like this:
<property>
  <name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
  <value>[StorageAccountAccessKey]</value>
</property> 
<property>
  <name>fs.AbstractFileSystem.wasb.impl</name>
  <value>org.apache.hadoop.fs.azure.Wasb</value>
</property> 
<property>
  <name>fs.defaultFS</name>
  <value>wasb://[ContainerName]@[StorageAccount].blob.core.windows.net</value>
</property>

Restart Services

After making these changes, Ambari will prompt for a restart of services. Restart those services. You will know the change is successful if:
  • ...there are no errors on restart of services
  • ...the new properties appear under Advanced Core-Site in HDFS configs
  • ...the Azure blob container has the Hadoop folder structure
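
A quick command-line check from a cluster node - if the change took, this returns the wasb:// URI configured above:

  hdfs getconf -confKey fs.defaultFS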

Run Some Tests 


  • Use Ambari to create folders and upload files
  • Use the command line to mkdir and put files into HDFS, as shown below
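
With fs.defaultFS pointing at WASB, the familiar HDFS-style commands operate directly against the blob container. For example (the paths here are illustrative):

  hdfs dfs -mkdir -p /tmp/wasbtest
  hdfs dfs -put /etc/hosts /tmp/wasbtest/
  hdfs dfs -ls /tmp/wasbtest

The resulting folders and files should also be visible in the Azure portal's container browser.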

Saturday, July 15, 2017

How-to: Capture and Ingest Windows Event Logs Using NiFi

One of the use cases I wanted to prove out was the consumption of Windows Event logs. For this proof-of-concept I am using Apache NiFi. The walk-through will reference other posts that cover individual components of this approach.

This article focuses on collecting event data from the local host and landing the data in both the local Windows file system as well as in HDFS.

Future posts will dive into using MiNiFi for remote hosts, as well as methods to visualize these event data.

My setup:

  • Hadoop cluster based on Hortonworks HDP 2.6
    • CentOS 7
  • Windows 10 PC
    • Windows Subsystem for Linux enabled
    • MobaXterm
    • WinSCP

Following are the basic steps:
  1. Verify that Java JDK is installed and JAVA_HOME environment variable is set
  2. Download, install, and configure NiFi on a Windows host
  3. Develop and configure the NiFi process
  4. Run the NiFi job
  5. Validate that data are captured as expected

Download, Install, Configure, and Run NiFi on a Windows Host

The  "Getting Started Guide" on Apache's website is straightforward - I've abbreviated the portions needed for this use case.

From the Downloads page select the appropriate version of the binary .zip (for this example I used 1.3.0). If you use a different version, modify the oaths used accordingly.
  1. Extract .zip to your desired location
  2. The default port for NiFi is 8080
    • Since 8080 is a popular port for web-enabled applications, you may want to change the port on which NiFi listens
    • The port can be configured in the nifi.properties file:

      <install directory>\nifi-1.3.0\conf\nifi.properties

      # web properties #

      nifi.web.http.port=8080
       
  3. Copy the core-site.xml and hdfs-site.xml configuration files to the Windows file system.
    • This is only necessary if you are landing the data in HDFS
    • In my setup, these files are located in /etc/hadoop/conf/. Depending on your Hadoop implementation, there may be slight variations
    • Windows Subsystem for Linux now has support for scp. Alternatively, you can use MobaXterm to transfer files between Windows and Linux systems
    • Use WinSCP (or scp, as shown below) to get core-site.xml and hdfs-site.xml to Windows. I placed mine in <install directory>\nifi-1.3.0\conf\ with the rest of the NiFi configuration files. It doesn't matter where you place them, as long as you can navigate to them later
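      For example, from a WSL shell the transfer might look like this (the user "hdpadmin" and host "hdp-master" are placeholders for your own):

      scp hdpadmin@hdp-master:/etc/hadoop/conf/core-site.xml .
      scp hdpadmin@hdp-master:/etc/hadoop/conf/hdfs-site.xml .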
  4. Start NiFi 
    • Open a CMD prompt with administrator permissions
    • Navigate to <install directory>\nifi-1.3.0\bin\ and execute run-nifi.bat 
    • You should see NiFi's startup logging in the console. Once startup completes, the UI is available in a browser at http://localhost:8080/nifi (substitute the port you configured)

Develop and Configure the NiFi Process

NiFi uses Processors to do work. Creating a combination of Processors provides a powerful way to manage data flows. Windows event log data is presented as XML. This process will take the XML and transform it to JSON, flatten that JSON, and store that data for future use.

Processors are added by dragging the processor icon onto the NiFi canvas; you are then prompted to select the processor to be used.
  1. Get setup to consume Windows Event data
    • Apply permissions in Windows to allow programmatic access to the event data channel(s)
      • There are instructions linked on the processor help screen, but to keep as much of this in one place as possible, I've included a brief explanation here
      • Open a new Administrator permission version of CMD and use these commands to get info and apply permissions
      • wmic useraccount get name,sid

        (in some cases you may still not have permission to access this data. If so, use this command instead to show the logged-in user's info)

        whoami /user

        ...at least one of these commands will return the info needed - your SID value
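
        A typical result looks something like this (the values are made up - it's the SID in the right-hand column you need):

        USER INFORMATION
        ----------------
        User Name    SID
        ============ =============================================
        mypc\myuser  S-1-5-21-1111111111-2222222222-3333333333-1001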
      • wevtutil gl <CHANNEL>

        ...where <CHANNEL> is the event channel to which NiFi will listen. This command displays the current permissions for listening to the channel as an SDDL string in the channelAccess field.
      • Channels are equivalent to the logs you see in Windows Event Viewer (Application, Security, etc.). For my case, I used the 'Security' channel
      • Append an access control entry for your SID to the channel's existing SDDL - the trailing (A;;0x1;;;<SID>) entry below grants read access (A = allow, 0x1 = read). For the Security channel's default SDDL, the command is:

        wevtutil sl Security /ca:O:BAG:SYD:(A;;0xf0005;;;SY)(A;;0x5;;;BA)(A;;0x1;;;S-1-5-32-573)(A;;0x1;;;<SID>)
    • Develop the XPath Query for the events of interest
      • The easiest way to do this is to open the Windows Event Viewer and use Filter Current Log to select the desired options
      • After making your filter selections, click on the XML tab and copy the XPath query
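
        For example, filtering the Security channel to successful logons (Event ID 4624, used here purely as an illustration) produces a query like the following, which can be pasted into the ConsumeWindowsEventLog processor's Query property:

        <QueryList>
          <Query Id="0" Path="Security">
            <Select Path="Security">*[System[(EventID=4624)]]</Select>
          </Query>
        </QueryList>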




  2. Add ConsumeWindowsEventLog processor

  3. Add TransformXml processor

  4. Add JoltTransformJSON processor

  5. Add a second JoltTransformJSON processor (a sketch of a flattening spec follows this list)

  6. Add PutHDFS processor

  7. Add PutFile processor
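
The two JoltTransformJSON processors do the flattening. The exact specs depend on the JSON your XSLT produces, but as a sketch, a shift spec along these lines would flatten an event's System and EventData blocks into top-level keys (the input structure assumed here - an Event object containing System and EventData - is an assumption, not a given):

  [
    {
      "operation": "shift",
      "spec": {
        "Event": {
          "System": {
            "*": "System_&"
          },
          "EventData": {
            "*": "EventData_&"
          }
        }
      }
    }
  ]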


Run the NiFi Job

With the processors configured and connected, start them from the NiFi canvas. The ConsumeWindowsEventLog processor will begin emitting flow files as events matching your query arrive.

Validate that Data are Captured as Expected
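
A couple of quick checks, assuming PutHDFS was configured to write to /tmp/eventlogs and PutFile to C:\eventlogs (both paths are placeholders for whatever you configured in the processors). From a cluster node:

  hdfs dfs -ls /tmp/eventlogs

From a Windows CMD prompt:

  dir C:\eventlogs

Both listings should show JSON files accumulating as new events arrive.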



How-to: Setup Java JDK on Windows for Development Purposes

Most of the posts I make here will have a Java component. With that in mind, it makes sense to include some information about making sure it is installed and configured correctly. You can always go directly to the source, but since you are here, I've included the basics.

If you are going to be doing development on a Windows computer, these are the basic steps you need to complete to ease you along:
  1. Download the JDK needed
  2. Install the JDK
  3. Set the JAVA_HOME environment variable
  4. Add the JDK's /bin folder to the PATH environment variable

Download the JDK

Oracle's download page has a lot of good information that can help you determine what to do next - sometimes too much. In this case, we are going to install JDK 8.
  1. Navigate to Oracle's Java Downloads page
  2. Click the Java icon for Java Platform (JDK) 8uXXX 
  3. Check the radio button to 'Accept the License Agreement' in the middle of the page
  4. Click on the JDK you want to download (Windows x86 or x64), in this case, jdk-8u131-windows-x64.exe

Install the JDK

  1. Once the .exe installer is downloaded, run it
  2. Unless you have a compelling reason, it is easiest to stick with the default install path
  3. Select the default selections, unless you know why you need to make changes

Set the JAVA_HOME Environment Variable

  1. Get the path to the JDK root folder

  2. Navigate to the environment variables. In Windows 10, right-click the Start icon, select System, and then choose Advanced System Settings



    ~or~

    left-click Start and start typing 'Environment' until Edit the System Environment Variables appears as an option. Select that option


  3. Create and set the environment variable.
    Variable Name: JAVA_HOME
    Variable Value: C:\Program Files\Java\jdk1.8.0_131

  4. Click OK two times to save the changes
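
To confirm the variable took effect, open a new CMD window (existing windows will not see the change) and run:

    echo %JAVA_HOME%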

Add JDK /bin folder to the PATH

  1. To add the /bin folder to the PATH variable, use the same dialog box where you created the JAVA_HOME variable
  2. Select the PATH variable and click Edit...
  3. Add the same path used in the JAVA_HOME variable, but add the subfolder /bin to the end of that path. Your new value should be something like this:

    C:\Program Files\Java\jdk1.8.0_131\bin
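
To confirm the PATH change, open a new CMD window and run:

    where java
    java -version

The reported version should match the JDK you installed (1.8.0_131 in this example).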