Saturday, June 2, 2018

How to use Azure Blob Storage as Default File System for Local HDP Cluster

Set aside, for the moment, whether it is a good idea to use blob storage as your default file system. Let's just see how to make it happen.

I have a use case which requires storing a non-trivial amount of data. My definition of "non-trivial" for this purpose is "more than I want to store on a SAN" and "not enough to justify a separate storage purchase." Additionally, Azure agreements and resources are already in place. This data will have some recurring analysis value, probably on a monthly or quarterly basis, and performance is not a primary factor.

For this use case, WASB might be a good place to start. If performance is poor, the data can be moved to Azure Data Lake or Managed Disks, or the compute can be moved to the cloud alongside the data. This post focuses on block blobs, since access will likely be via Hive; page blobs are better suited to HBase use cases.

Not covered in this post are page blobs and protecting your credentials. Future posts will discuss using blob storage with other HDP and HDF services.

I used a number of articles as resources and will refer to them as needed. The following two are good reads to understand many of the considerations for this use of WASB.

A good overview of Azure blob storage and Hadoop is provided at this Apache documentation page:
https://hadoop.apache.org/docs/stable/hadoop-azure/index.html

This thread discusses configuration of WASB for use in Azure IaaS implementation:
https://community.hortonworks.com/questions/1635/instructions-to-setup-wasb-as-storage-for-hdp-on-a.html 

To use WASB as the default File System for your local HDP cluster, here are the high-level points:
  1. Use a clean HDP cluster: use one on which you can safely replace your data directories. A clean install is ideal for this example.
  2. Configure authentication between local cluster and Azure
    • Retrieve the Azure storage account access key(s) from the Azure portal
    • Configure custom key: fs.azure.account.key.<StorageAccount>.blob.core.windows.net
  3. Configure core-site.xml to use WASB as file system
    • Configure custom key: fs.AbstractFileSystem.wasb.impl
    • Configure custom key: fs.defaultFS

Configure Authentication


In Ambari, add a custom core-site property named fs.azure.account.key.<StorageAccount>.blob.core.windows.net and set its value to the storage account access key retrieved from the Azure portal.

Before moving forward, verify that the WASB connector is functioning properly. You may want to create two containers: one for miscellaneous WASB testing, and a second kept clean so you can see the folder structure that is created during this example.
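A quick smoke test from a cluster node once the key is in place - the container name "testcontainer" and account "youraccount" below are placeholders for your own:

  hadoop fs -mkdir wasb://testcontainer@youraccount.blob.core.windows.net/smoketest
  hadoop fs -put /etc/hosts wasb://testcontainer@youraccount.blob.core.windows.net/smoketest/
  hadoop fs -ls wasb://testcontainer@youraccount.blob.core.windows.net/smoketest

If these commands succeed, the connector and credentials are working.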

Configure core-site.xml 

Using Ambari, navigate to the HDFS Configs and add/modify these keys:

Add these properties:
  • fs.azure.account.key.<StorageAccount>.blob.core.windows.net (added during the authentication step above)
  • fs.AbstractFileSystem.wasb.impl

Modify this property:
  • fs.defaultFS

The resulting properties in the core-site.xml will look like this:
<property>
  <name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
  <value>[StorageAccountAccessKey]</value>
</property> 
<property>
  <name>fs.AbstractFileSystem.wasb.impl</name>
  <value>org.apache.hadoop.fs.azure.Wasb</value>
</property> 
<property>
  <name>fs.defaultFS</name>
  <value>wasb://[ContainerName]@[StorageAccount].blob.core.windows.net</value>
</property>

Restart Services

After making these changes, Ambari will prompt for a restart of services. Restart those services. You will know the change is successful if:
  • ...there are no errors on restart of services
  • ...the new properties appear under Advanced Core-Site in HDFS configs
  • ...the Azure blob container has the Hadoop folder structure
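
A quick command-line check from a cluster node - if the change took, this returns the wasb:// URI configured above:

  hdfs getconf -confKey fs.defaultFS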

Run Some Tests 


  • Use Ambari to create folders and upload files
  • Use the command line to mkdir and put files into HDFS, as shown below
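
With fs.defaultFS pointing at WASB, the familiar HDFS-style commands operate directly against the blob container. For example (the paths here are illustrative):

  hdfs dfs -mkdir -p /tmp/wasbtest
  hdfs dfs -put /etc/hosts /tmp/wasbtest/
  hdfs dfs -ls /tmp/wasbtest

The resulting folders and files should also be visible in the Azure portal's container browser.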

Saturday, July 15, 2017

How-to: Capture and Ingest Windows Event Logs Using NiFi

One of the use cases I wanted to prove out was the consumption of Windows Event logs. For this proof-of-concept I am using Apache NiFi. The walk-through will reference other posts that cover individual components of this approach.

This article focuses on collecting event data from the local host and landing the data in both the local Windows file system as well as in HDFS.

Future posts will dive into using MiNiFi for remote hosts, as well as methods to visualize these event data.

My setup:

  • Hadoop cluster based on Hortonworks HDP 2.6
    • CentOS 7
  • Windows 10 PC
    • Windows Subsystem for Linux enabled
    • MobaXterm
    • WinSCP

Following are the basic steps:
  1. Verify that Java JDK is installed and JAVA_HOME environment variable is set
  2. Download, install, and configure NiFi on a Windows host
  3. Develop and configure the NiFi process
  4. Run the NiFi job
  5. Validate that data are captured as expected

Download, Install, Configure, and Run NiFi on a Windows Host

The  "Getting Started Guide" on Apache's website is straightforward - I've abbreviated the portions needed for this use case.

From the Downloads page select the appropriate version of the binary .zip (for this example I used 1.3.0). If you use a different version, modify the oaths used accordingly.
  1. Extract .zip to your desired location
  2. The default port for NiFi is 8080
    • Since 8080 is a popular port for web-enabled applications, you may want to change the port on which NiFi listens
    • The port can be configured in the nifi.properties file:

      <install directory>\nifi-1.3.0\conf\nifi.properties

      # web properties #

      nifi.web.http.port=8080
       
  3. Copy the core-site.xml and hdfs-site.xml configuration files to the Windows file system.
    • This is only necessary if you are landing the data in HDFS
    • In my setup, these files are located in /etc/hadoop/conf/. Depending on your Hadoop implementation, there may be slight variations
    • Windows Subsystem for Linux now has support for scp. Alternatively, you can use MobaXterm to transfer files between Windows and Linux systems
    • Use WinSCP (or scp, as shown below) to get core-site.xml and hdfs-site.xml to Windows. I placed mine in <install directory>\nifi-1.3.0\conf\ with the rest of the NiFi configuration files. It doesn't matter where you place them, as long as you can navigate to them later
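      For example, from a WSL shell the transfer might look like this (the user "hdpadmin" and host "hdp-master" are placeholders for your own):

      scp hdpadmin@hdp-master:/etc/hadoop/conf/core-site.xml .
      scp hdpadmin@hdp-master:/etc/hadoop/conf/hdfs-site.xml .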
  4. Start NiFi 
    • Open a CMD prompt with administrator permissions
    • Navigate to <install directory>\nifi-1.3.0\bin\ and execute run-nifi.bat 
    • You should see NiFi's startup logging in the console. Once startup completes, the UI is available in a browser at http://localhost:8080/nifi (substitute the port you configured)

Develop and Configure the NiFi Process

NiFi uses Processors to do work. Creating a combination of Processors provides a powerful way to manage data flows. Windows event log data is presented as XML. This process will take the XML and transform it to JSON, flatten that JSON, and store that data for future use.

Processors are added by dragging the processor icon onto the NiFi canvas; you are then prompted to select the processor to be used.
  1. Get setup to consume Windows Event data
    • Apply permissions in Windows to allow programmatic access to the event data channel(s)
      • There are instructions linked on the processor help screen, but to keep as much of this in one place as possible, I've included a brief explanation here
      • Open a new Administrator permission version of CMD and use these commands to get info and apply permissions
      • wmic useraccount get name,sid

        (in some cases you may still not have permission to access this data. If so, use this command instead to show the logged-in user's info)

        whoami /user

        ...at least one of these commands will return the info needed - your SID value
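
        A typical result looks something like this (the values are made up - it's the SID in the right-hand column you need):

        USER INFORMATION
        ----------------
        User Name    SID
        ============ =============================================
        mypc\myuser  S-1-5-21-1111111111-2222222222-3333333333-1001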
      • wevtutil gl <CHANNEL>

        ...where <CHANNEL> is the event channel to which NiFi will listen. This command displays the current permissions for listening to the channel as an SDDL string in the channelAccess field.
      • Channels are equivalent to the logs you see in Windows Event Viewer (Application, Security, etc.). For my case, I used the 'Security' channel
      • Append an access control entry for your SID to the channel's existing SDDL - the trailing (A;;0x1;;;<SID>) entry below grants read access (A = allow, 0x1 = read). For the Security channel's default SDDL, the command is:

        wevtutil sl Security /ca:O:BAG:SYD:(A;;0xf0005;;;SY)(A;;0x5;;;BA)(A;;0x1;;;S-1-5-32-573)(A;;0x1;;;<SID>)
    • Develop the XPath Query for the events of interest
      • The easiest way to do this is to open the Windows Event Viewer and use Filter Current Log to select the desired options
      • After making your filter selections, click on the XML tab and copy the XPath query
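
        For example, filtering the Security channel to successful logons (Event ID 4624, used here purely as an illustration) produces a query like the following, which can be pasted into the ConsumeWindowsEventLog processor's Query property:

        <QueryList>
          <Query Id="0" Path="Security">
            <Select Path="Security">*[System[(EventID=4624)]]</Select>
          </Query>
        </QueryList>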




  2. Add ConsumeWindowsEventLog processor

  3. Add TransformXml processor

  4. Add JoltTransformJSON processor

  5. Add a second JoltTransformJSON processor (a sketch of a flattening spec follows this list)

  6. Add PutHDFS processor

  7. Add PutFile processor
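
The two JoltTransformJSON processors do the flattening. The exact specs depend on the JSON your XSLT produces, but as a sketch, a shift spec along these lines would flatten an event's System and EventData blocks into top-level keys (the input structure assumed here - an Event object containing System and EventData - is an assumption, not a given):

  [
    {
      "operation": "shift",
      "spec": {
        "Event": {
          "System": {
            "*": "System_&"
          },
          "EventData": {
            "*": "EventData_&"
          }
        }
      }
    }
  ]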


Run the NiFi Job

With the processors configured and connected, start them from the NiFi canvas. The ConsumeWindowsEventLog processor will begin emitting flow files as events matching your query arrive.

Validate that Data are Captured as Expected
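
A couple of quick checks, assuming PutHDFS was configured to write to /tmp/eventlogs and PutFile to C:\eventlogs (both paths are placeholders for whatever you configured in the processors). From a cluster node:

  hdfs dfs -ls /tmp/eventlogs

From a Windows CMD prompt:

  dir C:\eventlogs

Both listings should show JSON files accumulating as new events arrive.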



How-to: Setup Java JDK on Windows for Development Purposes

Most of the posts I make here will have a Java component. With that in mind, it makes sense to include some information about making sure it is installed and configured correctly. You can always go directly to the source, but since you are here, I've included the basics.

If you are going to be doing development on a Windows computer, these are the basic steps you need to complete to ease you along:
  1. Download the JDK needed
  2. Install the JDK
  3. Set the JAVA_HOME environment variable
  4. Add the JDK's /bin folder to the PATH environment variable

Download the JDK

Oracle's download page has a lot of good information that can help you determine what to do next - sometimes too much. In this case, we are going to install JDK 8.
  1. Navigate to Oracle's Java Downloads page
  2. Click the Java icon for Java Platform (JDK) 8uXXX 
  3. Check the radio button to 'Accept the License Agreement' in the middle of the page
  4. Click on the JDK you want to download (Windows x86 or x64), in this case, jdk-8u131-windows-x64.exe

Install the JDK

  1. Once the .exe installer is downloaded, run it
  2. Unless you have a compelling reason, it is easiest to stick with the default install path
  3. Select the default selections, unless you know why you need to make changes

Set the JAVA_HOME Environment Variable

  1. Get the path to the JDK root folder

  2. Navigate to the environment variables. In Windows 10, right-click the Start icon, select System, and then choose Advanced System Settings



    ~or~

    left-click Start and start typing 'Environment' until Edit the System Environment Variables appears as an option. Select that option


  3. Create and set the environment variable.
    Variable Name: JAVA_HOME
    Variable Value: C:\Program Files\Java\jdk1.8.0_131

  4. Click OK two times to save the changes
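
To confirm the variable took effect, open a new CMD window (existing windows will not see the change) and run:

    echo %JAVA_HOME%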

Add JDK /bin folder to the PATH

  1. To add the /bin folder to the PATH variable, use the same dialog box where you created the JAVA_HOME variable
  2. Select the PATH variable and click Edit...
  3. Add the same path used in the JAVA_HOME variable, but add the subfolder /bin to the end of that path. Your new value should be something like this:

    C:\Program Files\Java\jdk1.8.0_131\bin
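
To confirm the PATH change, open a new CMD window and run:

    where java
    java -version

The reported version should match the JDK you installed (1.8.0_131 in this example).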