Cloudera(tm) Hadoop Integration Guide

This integration guide is focused on configuring your QuantaStor storage appliances as Hadoop data nodes for better Hadoop performance with less hardware. Note that you'll still need to set up the Hadoop master services like NameNode and ResourceManager on your Hadoop master node servers or virtual machines. Note also that you can still use your QuantaStor system normally as a SAN/NAS appliance with the Hadoop services installed.

Setting up Cloudera® Hadoop™ within your QuantaStor® storage appliance is very similar to the steps you would use with a standard Ubuntu™ Precise server, as this is the distro that QuantaStor v3 builds upon. There are some important differences though, and this guide covers them. Also note that if you're following instructions from the Cloudera web site for steps outside of this how-to, be sure to refer to the sections regarding Ubuntu Precise (v12.04).

To get started you'll need to log in to your QuantaStor v3 storage appliance using SSH or via the console. Note that all of the commands shown in the how-to guide steps below should be run as root, so be sure to run 'sudo -i' to get super-user privileges before you start.

Lastly, if you don't yet have a QuantaStor v3 storage appliance, you can get the CD-ROM ISO and a license key here.

Step 1. Add the Cloudera package server

We'll be installing Cloudera's pre-packaged version of Hadoop, so we'll start by adding their GPG key to our configuration so that the QuantaStor server trusts the packages they've signed. To do this, run:

sudo -i 
curl -s http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key | sudo apt-key add -

Next you'll need to tell QuantaStor where the Cloudera package servers are located by creating a sources file called /etc/apt/sources.list.d/cloudera.list with the following contents:

deb [arch=amd64] http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4 contrib
deb-src http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4 contrib

You can create the above file using nano or vi or you can quickly create this file with the necessary content using a couple of echo commands:

echo "deb [arch=amd64] http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4 contrib" > /etc/apt/sources.list.d
echo "deb-src http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4 contrib" >> /etc/apt/sources.list.d

Step 2. Update the package repository

Now we can update the local system's package repository database like so:

apt-get update
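
If you'd like to confirm that apt can now see the Cloudera repository, apt-cache can show where the DataNode package would come from (the exact version string will vary with the current CDH4 release):

apt-cache policy hadoop-hdfs-datanode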

Step 3. Configure Java Path (JAVA_HOME)

Hadoop includes a nice script that tries to detect where the Java runtime is installed, but unfortunately it has a little trouble finding the amd64 OpenJDK we include with QuantaStor, so you'll need to make this symbolic link to help it out:

ln -s /usr/lib/jvm/java-6-openjdk-amd64 /usr/lib/jvm/java-openjdk
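
To verify the link points at a working JVM you can ask it for its version directly; you should see the OpenJDK 1.6 banner rather than a 'No such file or directory' error:

ls -l /usr/lib/jvm/java-openjdk
/usr/lib/jvm/java-openjdk/bin/java -version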

If you forget this step you'll see an error like this when the Hadoop DataNode service tries to start:

root@QSDGRID-node001:/etc/apt/sources.list.d# service hadoop-hdfs-datanode start
 * Starting Hadoop datanode:
Error: JAVA_HOME is not set and could not be found.

Step 4. Install Hadoop DataNode package

Finally, we're ready to install Hadoop's DataNode service. Generally speaking you'll run your NameNode on servers separate from QuantaStor, with the QuantaStor nodes running just the DataNode plus the NodeManager or TaskTracker services, like so:

sudo apt-get install hadoop-hdfs-datanode
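
The package install registers an init script, so you can check the service state right away. Don't worry if it isn't running cleanly yet; it typically won't be until the data directory is configured in Step 7:

service hadoop-hdfs-datanode status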

More detailed instructions can be found on the Cloudera web site: Install CDH4 with MRv1

Step 5. Install Hadoop MapReduce v1 (optional)

If you're going to use MapReduce v1 then run this step; otherwise skip it and go to Step 6. You'll install MapReduce v2/YARN when you get to Step 8.

sudo apt-get install hadoop-0.20-mapreduce-tasktracker 

Fixup JAVA_HOME for MapReduce Task Tracker service

If you're using MapReduce v1 you'll find that the startup script for the mapreduce-tasktracker service doesn't have its JAVA_HOME set up properly. You can fix this by adding this line to the top of the file '/etc/init.d/hadoop-0.20-mapreduce-tasktracker', right after the line containing '### END INIT INFO'.

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64

Alternatively, you can add this script logic to the top of '/etc/init.d/hadoop-0.20-mapreduce-tasktracker' (the same logic is already included in '/etc/init.d/hadoop-hdfs-datanode'); it auto-detects JAVA_HOME using the symbolic link we created in Step 3:

# Autodetect JAVA_HOME if not defined
if [ -e /usr/libexec/bigtop-detect-javahome ]; then
  . /usr/libexec/bigtop-detect-javahome
elif [ -e /usr/lib/bigtop-utils/bigtop-detect-javahome ]; then
  . /usr/lib/bigtop-utils/bigtop-detect-javahome
fi
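
Either way, restart the TaskTracker service afterwards so it picks up the JAVA_HOME change:

service hadoop-0.20-mapreduce-tasktracker restart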

Step 6. Create a Network Share via QuantaStor web management interface

When you create a network share in QuantaStor it will create a folder within your storage pool and will mount it to /export/<SHARENAME> where SHARENAME is the name you gave to your network share. For example, if you name your share 'hadoopdata' there will be a folder created called /export/hadoopdata which is bound to your storage pool. This folder is where you will store the Hadoop DFS data rather than in a folder on the QuantaStor system/boot drive.

Besides using the /export/hadoopdata path you can optionally use the full absolute path to the network share folder which would be something like '/mnt/storage-pools/qs-<POOL-GUID>/hadoopdata'.

NOTE: Although you could access this share over NFS/CIFS to do backups, you'll probably want to disable all other access for security reasons, since this folder will be the repository for internal Hadoop DFS data and shouldn't be directly accessed by users via NFS.

NOTE: You'll need to create a Storage Pool within your QuantaStor appliance before you can create a Network Share. Click on the 'System Checklist' button in the QuantaStor Manager ribbon-bar for a quick run-down of the initial configuration steps you'll need to complete to get your appliance set up.
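
Once the share exists, a quick sanity check from the shell doesn't hurt. As a sketch, assuming the default 'hdfs' service account created by the CDH4 packages in Step 4, make sure that account can write to the share before pointing HDFS at it:

ls -ld /export/hadoopdata
chown -R hdfs:hdfs /export/hadoopdata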

Step 7. Setting the Hadoop DFS data directory

By default the hdfs-site.xml file is not configured to use the network share that you created to house your Hadoop data. To set this you'll need to edit the /etc/hadoop/conf/hdfs-site.xml file and add a property to it. If you named your new share hadoopdata then you'd add the dfs.data.dir property to the config file like so:

  <property>
     <name>dfs.data.dir</name>
     <value>/export/hadoopdata</value>
  </property>

After you've added the entry, the default config file will look something like this:

<configuration>
  <property>
     <name>dfs.name.dir</name>
     <value>/var/lib/hadoop-hdfs/cache/hdfs/dfs/name</value>
  </property>
  <property>
     <name>dfs.data.dir</name>
     <value>/export/hadoopdata</value>
  </property>
</configuration>

Step 8. Installing Hadoop YARN / MapReduce v2 / NodeManager (optional)

If you're using MapReduce v1 you can skip this step. Otherwise to install MRv2/YARN run this apt-get command:

sudo apt-get install hadoop-yarn-nodemanager hadoop-hdfs-datanode hadoop-mapreduce
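
To double-check which Hadoop packages ended up on the node, you can list them with dpkg:

dpkg -l | grep -i hadoop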

More detailed instructions can be found here: Install CDH4 with YARN

Fix QuantaStor's embedded Tomcat service to use a different port number

As it happens, the NodeManager component wants to use port 8080, which QuantaStor's Tomcat service is also using. The easiest way to fix this is to move QuantaStor off that port number by editing two files: change instances of port '8080' to '9080' in the /etc/init.d/iptables file and in the /opt/osnexus/quantastor/tomcat/conf/server.xml file. After you edit the server.xml file you can grep for 9080 and the output should look like this:

root@QSDGRID-node001:/opt/osnexus/quantastor/tomcat/conf# grep 9080 *
server.xml:         Define a non-SSL HTTP/1.1 Connector on port 9080
server.xml:    <Connector port="9080" protocol="HTTP/1.1"
server.xml:               port="9080" protocol="HTTP/1.1"

After you edit the /etc/init.d/iptables file the output should look like this:

root@QSDGRID-node001:/etc/init.d# grep 9080 iptables
/sbin/iptables -t nat -I PREROUTING -p tcp --dport 80 -j REDIRECT --to-port 9080
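
If you'd rather make both edits from the shell, a blanket sed substitution is a reasonable sketch; the '-i.bak' flag keeps a backup copy of each file, and you should re-run the greps above to review the result before restarting anything:

sed -i.bak 's/8080/9080/g' /opt/osnexus/quantastor/tomcat/conf/server.xml
sed -i.bak 's/8080/9080/g' /etc/init.d/iptables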

Now restart the QuantaStor iptables configuration and embedded Tomcat service:

service iptables restart
service tomcat restart

Step 9. Opening up access in the firewall to Hadoop service port numbers

As a last step you'll need to add these entries to the QuantaStor iptables configuration file located at '/etc/init.d/iptables'. Look in the iptables_start() section and then paste these lines right after the port range for Samba Support.

        #port range for Hadoop support
        /sbin/iptables -A INPUT -p tcp --dport 50010 -j ACCEPT
        /sbin/iptables -A INPUT -p tcp --dport 50020 -j ACCEPT
        /sbin/iptables -A INPUT -p tcp --dport 50030 -j ACCEPT
        /sbin/iptables -A INPUT -p tcp --dport 50060 -j ACCEPT
        /sbin/iptables -A INPUT -p tcp --dport 50075 -j ACCEPT

After you've updated the file just update your iptables configuration like so:

service iptables restart
service iptables status

The status will show your new 'ACCEPT' entries for Hadoop.
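
For a more targeted check you can grep the live ruleset for the Hadoop ports added above:

iptables -L INPUT -n | grep -E '500(10|20|30|60|75)'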

Summary

That's the basics of getting Hadoop running with QuantaStor. Now you're ready to deploy CDH and install components if you haven't done so already.

We are looking forward to automating some of the above steps and adding deeper integration features for monitoring Hadoop this year, and we would appreciate any feedback on what you would most like to see. If you have some ideas you'd like to share, or would like to be a beta customer for new Hadoop integration features, write us at support@osnexus.com or write me directly at steve (at) osnexus.com.

Thanks and Happy Hadooping!


Cloudera is a registered trademark of Cloudera Corporation, Hadoop is a registered trademark of the Apache Software Foundation, Ubuntu is a registered trademark of Canonical, and QuantaStor is a registered trademark of OS NEXUS Corporation.