Difference between revisions of "Cloudera(tm) Hadoop Integration Guide"

From OSNEXUS Online Documentation Site
==== Step 1.1 Potential error, package dependency ====

Note that this step includes the installation of a Java package using the apt-get utility. Successful installation requires that the target node meet the package dependency requirements. If this stage fails and the log shows package dependency errors:

- Return to the command line
- Run 'apt-get -f install'
- Respond with 'Y' (must be uppercase) and allow this to complete
- Accept the PAM configuration modification screen, if offered
- Retry the hadoop-install script
Figure 2-11

Hit 'Continue' to move to the next phase.

==== Step 2.1 Potential error, package dependency ====

As in Step 1.1 above, this step uses the apt-get utility to install a Java package on all of the other nodes in the cluster (besides the manager node), and the same package dependency errors can cause the cluster installation to fail on the non-manager nodes, as shown in the example figure below.

[[File:Hadoop_install_webintf_072.jpg| Figure 2-12]]

Figure 2-12

To correct this, log into each of the nodes that failed and perform the same steps as shown in Step 1.1. When this is complete, hit the 'Retry Failed Nodes' button on the screen.

==== Step 2.2 Potential error, heartbeat detection failure ====

If installation succeeds past the package installation, the last phase of this step consists of heartbeat tests, as shown in the example figure below.

[[File:Hadoop_install_webintf_074.jpg| Figure 2-13]]

Figure 2-13

Sometimes these all succeed on the first try; other times some or all of the nodes fail.

[[File:Hadoop_install_webintf_075.jpg| Figure 2-14]]

Figure 2-14

Our testing showed that hitting "Retry Failed Nodes" (sometimes 2-3 times) resulted in heartbeat detection success.

[[File:Hadoop_install_webintf_076.jpg| Figure 2-15]]

Figure 2-15

=== Step 3. Configure Java Path (JAVA_HOME) ===

Hadoop includes a script that tries to detect where the Java runtime is installed, but it has trouble finding the amd64 OpenJDK we include with QuantaStor, so you'll need to make this symbolic link to help it out:

<pre>
ln -s /usr/lib/jvm/java-6-openjdk-amd64 /usr/lib/jvm/java-openjdk
</pre>

If you forget this step you'll see an error like this when the Hadoop DataNode service tries to start:

<pre>
root@QSDGRID-node001:/etc/apt/sources.list.d# service hadoop-hdfs-datanode start
* Starting Hadoop datanode:
Error: JAVA_HOME is not set and could not be found.
</pre>
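A quick way to sanity-check the link (a suggested verification, not part of the original procedure) is to list the JVM directory and confirm the Java runtime responds:

<pre>
ls -l /usr/lib/jvm/
java -version
</pre>

Both the java-openjdk symbolic link and its java-6-openjdk-amd64 target should appear in the listing.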
=== Step 4. Install Hadoop DataNode package ===

Finally, we're ready to install Hadoop's DataNode service. Generally speaking you'll run your NameNodes on separate servers from QuantaStor, with the QuantaStor nodes running just the DataNode plus the NodeManager or Task Tracker services, like so:

<pre>
sudo apt-get install hadoop-hdfs-datanode
</pre>

More detailed instructions can be found on the Cloudera web site:
[https://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation#CDH4Installation-Step2%3AInstallCDH4withMRv1 Install CDH4 with MRv1]
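If you want to confirm the package landed before moving on, a simple check (our suggestion, not a required step) is to query dpkg and ask the init script for its status; note that the service may not start cleanly until Steps 3 and 7 are complete:

<pre>
dpkg -l hadoop-hdfs-datanode
service hadoop-hdfs-datanode status
</pre>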
  
=== Step 5. Install Hadoop MapReduce v1 (optional) ===
  
If you're going to use MapReduce v1, run this step; otherwise skip it and go to Step 6.  You'll install MapReduce v2/YARN when you get to Step 8.
 
<pre>
sudo apt-get install hadoop-0.20-mapreduce-tasktracker
</pre>
 
  
==== Fixup JAVA_HOME for MapReduce Task Tracker service ====
 
If you're using MapReduce v1 you'll find that the startup script for the mapreduce-tasktracker service doesn't have its JAVA_HOME set up properly. You can fix this by adding this line to the top of the file '/etc/init.d/hadoop-0.20-mapreduce-tasktracker', right after the line containing '### END INIT INFO'.
 
  
<pre>
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
</pre>
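If you'd rather script that edit than open the file by hand, a one-liner along these lines should work (this assumes GNU sed; review the file afterwards to confirm the export line landed just below '### END INIT INFO'):

<pre>
sed -i '/### END INIT INFO/a export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64' /etc/init.d/hadoop-0.20-mapreduce-tasktracker
</pre>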
 
  
Alternatively, you can add the following JAVA_HOME auto-detection logic to the top of '/etc/init.d/hadoop-0.20-mapreduce-tasktracker' (the same logic is included in '/etc/init.d/hadoop-hdfs-datanode'); it works once the symbolic link from Step 3 is in place:
 
  
<pre>
# Autodetect JAVA_HOME if not defined
if [ -e /usr/libexec/bigtop-detect-javahome ]; then
  . /usr/libexec/bigtop-detect-javahome
elif [ -e /usr/lib/bigtop-utils/bigtop-detect-javahome ]; then
  . /usr/lib/bigtop-utils/bigtop-detect-javahome
fi
</pre>
  
=== Step 6. Create a Network Share via QuantaStor web management interface ===
 
 
When you create a ''network share'' in QuantaStor it will create a folder within your storage pool and will mount it to /export/<SHARENAME> where SHARENAME is the name you gave to your network share.  For example, if you name your share 'hadoopdata' there will be a folder created called /export/hadoopdata which is bound to your storage pool.  This folder is where you will store the Hadoop DFS data rather than in a folder on the QuantaStor system/boot drive.
 
 
Besides using the /export/hadoopdata path you can optionally use the full absolute path to the network share folder which would be something like '/mnt/storage-pools/qs-<POOL-GUID>/hadoopdata'.
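Once the share exists, you can confirm the mount point from the shell before pointing Hadoop at it (a suggested check; 'hadoopdata' is just the example share name used above):

<pre>
ls -ld /export/hadoopdata
df -h /export/hadoopdata
</pre>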
 
 
NOTE: Although you could access this share over NFS/CIFS to do backups, you'll probably want to disable all other access for security reasons, since this folder is the repository for internal Hadoop DFS data and shouldn't be accessed directly by users via NFS.
 
 
NOTE: You'll need to create a Storage Pool within your QuantaStor appliance before you can create a Network Share.  Click on the 'System Checklist' button in the QuantaStor manager ribbon-bar for a quick run-down of the initial configuration steps you'll need to do to get your appliance set up.
 
 
=== Step 7. Setting the Hadoop DFS data directory ===
 
 
By default the hdfs-site.xml file is not configured to use the network share that you created to house your Hadoop data.  To set this you'll need to edit the /etc/hadoop/conf/hdfs-site.xml file and add a property to it.  If you named your new share ''hadoopdata'' then you'd add the ''dfs.data.dir'' property to the config file like so:
 
<pre>
  <property>
    <name>dfs.data.dir</name>
    <value>/export/hadoopdata</value>
  </property>
</pre>
 
 
After you've added the entry, the default config file will look something like this:
 
 
<pre>
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/var/lib/hadoop-hdfs/cache/hdfs/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/export/hadoopdata</value>
  </property>
</configuration>
</pre>
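Two follow-up items worth noting (these are suggestions for a typical CDH4 setup rather than steps from the original procedure): the 'hdfs' service account needs to own the new data directory, and the DataNode service must be restarted before it will use the new dfs.data.dir value:

<pre>
chown -R hdfs:hdfs /export/hadoopdata
service hadoop-hdfs-datanode restart
</pre>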
 
 
=== Step 8. Installing Hadoop YARN / MapReduce v2 / NodeManager (optional) ===
 
 
If you're using MapReduce v1 you can skip this step.  Otherwise to install MRv2/YARN run this apt-get command:
 
 
<pre>
sudo apt-get install hadoop-yarn-nodemanager hadoop-hdfs-datanode hadoop-mapreduce
</pre>
 
 
More detailed instructions can be found here: [https://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation#CDH4Installation-Step3%3AInstallCDH4withYARN Install CDH4 with YARN]
 
 
==== Fix QuantaStor's embedded Tomcat service to use a different port number ====
 
As it happens the NodeManager component wants to use port 8080, which QuantaStor's Tomcat service is also using.  The easiest way to fix this is to kick QuantaStor off that port number by editing two files.  Change instances of port '8080' to '9080' in the /etc/init.d/iptables file and in the /opt/osnexus/quantastor/tomcat/conf/server.xml file.  After you edit the server.xml file you can grep for 9080 and the output should look like this:
 
<pre>
root@QSDGRID-node001:/opt/osnexus/quantastor/tomcat/conf# grep 9080 *
server.xml:        Define a non-SSL HTTP/1.1 Connector on port 9080
server.xml:    <Connector port="9080" protocol="HTTP/1.1"
server.xml:              port="9080" protocol="HTTP/1.1"
</pre>
 
After you edit the /etc/init.d/iptables file the output should look like this:
 
<pre>
root@QSDGRID-node001:/etc/init.d# grep 9080 iptables
/sbin/iptables -t nat -I PREROUTING -p tcp --dport 80 -j REDIRECT --to-port 9080
</pre>
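If you prefer to script the port change rather than edit the two files by hand, something like the following should produce the grep results shown above (this assumes GNU sed and that '8080' appears only where the port is referenced; the -i.bak option keeps a backup copy of each file):

<pre>
sed -i.bak 's/8080/9080/g' /opt/osnexus/quantastor/tomcat/conf/server.xml /etc/init.d/iptables
</pre>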
 
 
Now restart QuantaStor's iptables configuration and embedded Tomcat service:
 
<pre>
service iptables restart
service tomcat restart
</pre>
 
 
=== Step 9. Opening up access in the firewall to Hadoop service port numbers ===
 
As a last step you'll need to add these entries to the QuantaStor iptables configuration file located at '/etc/init.d/iptables'.  Look in the iptables_start() section and then paste these lines right after the port range for Samba Support.
 
<pre>
        #port range for Hadoop support
        /sbin/iptables -A INPUT -p tcp --dport 50010 -j ACCEPT
        /sbin/iptables -A INPUT -p tcp --dport 50020 -j ACCEPT
        /sbin/iptables -A INPUT -p tcp --dport 50030 -j ACCEPT
        /sbin/iptables -A INPUT -p tcp --dport 50060 -j ACCEPT
        /sbin/iptables -A INPUT -p tcp --dport 50075 -j ACCEPT
</pre>
 
After you've updated the file, reload your iptables configuration like so:
 
<pre>
service iptables restart
service iptables status
</pre>
 
  
The status will show your new 'ACCEPT' entries for Hadoop.
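If you want to check a specific port rather than read the whole status listing, grep works fine here (50010 is just one of the ports added above):

<pre>
iptables -L INPUT -n | grep 50010
</pre>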
 
  
 

Revision as of 16:49, 7 October 2013

This integration guide is focused on configuring your QuantaStor storage grid as a Cloudera® Hadoop™ data cluster for better Hadoop performance with less hardware. Note that you can still use your QuantaStor system normally as a SAN/NAS appliance with the Hadoop services installed.

Setting up Cloudera® Hadoop™ within your QuantaStor® storage appliance is very similar to the steps you would use with a standard Ubuntu™ Precise server, as this is the Linux distribution upon which QuantaStor v3 is built. However, there are some important differences, and this guide covers them. Also note that if you're following instructions from the Cloudera web site for steps outside of this how-to, be sure to refer to the sections regarding Ubuntu Precise (v12.04).

The Hadoop installation procedure begins by invoking an installation script, which is included with the current version of QuantaStor®. The installation proceeds in stages, as illustrated by the following screen shots. Note that the illustrated example consists of a cluster of four QuantaStor® nodes, one of which is designated the "Hadoop Manager" node.

Note also that on the Hadoop Manager node, the Hadoop management functionality will impose additional resource costs on the system. For example, when testing the installation using QuantaStor® nodes based on Virtual Machines, the Manager node required at least 10 GB of memory.

To get started you'll need to log in to your QuantaStor v3 storage appliance(s) using SSH or via the console. Note that all of the commands shown in the how-to guide steps below should be run as root, so be sure to run 'sudo -i' to get super-user privileges before you begin.

Last, if you don't yet have a QuantaStor v3 storage appliance you can get the CD-ROM ISO and a license key here.

Step 1. Manager Installation Script

From the superuser command-line on the manager node (hostname "osn-grid2-mgr1" in this example), simply invoke the "hadoop-install" script. This executable is located in the /bin directory, but it is in the path so it can be run from anywhere.

$ hadoop-install 

This script will take you through a series of screens, simply accept the EULA dialogs and allow it to proceed. This stage can take 10-15 minutes, and will install the web server for the Hadoop™ Management interface on the Manager node. The script will end with instructions for browsing into that interface to begin the next stage of the installation.

See the following screen shot examples.


Figure 1-01

Figure 1-02

Figure 1-03

Figure 1-04

Figure 1-05

Figure 1-06

Figure 1-07


Step 1.1 Potential error, package dependency

Note that this step includes the installation of a Java package using the apt-get utility. Successful installation requires that the target node meet the package dependency requirements. If this stage fails and the log shows package dependency errors:

- Return to the command line
- Run 'apt-get -f install'
- Respond with 'Y' (must be uppercase) and allow this to complete
- Accept the PAM configuration modification screen, if offered
- Retry the hadoop-install script
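For reference, the recovery sequence from the shell looks like this (run as root on the node that failed):

<pre>
apt-get -f install     # answer 'Y' (uppercase) when prompted
hadoop-install         # re-run the installer once the dependencies are resolved
</pre>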

See the following two screens as examples:


Figure 1-08

Figure 1-09


Step 2. Initial Web-Based Installation

Follow the instructions seen on the last screen of the previous stage, for example:

Point your web browser to http://<hostname or IP>:7180/. Log in to the Cloudera Manager with the username and password set to 'admin' to continue installation.

Log into the interface using the 'admin/admin' user/password. You will see the initial screen asking you which edition you wish to install.


Figure 2-01

Figure 2-02

For the purposes of this example, the option for a minimal "Standard" edition is shown throughout. On this screen, as on all subsequent screens, hit 'Continue' to move on to the next step.

The next screen allows you to enter the host addresses of the nodes onto which Hadoop is to be installed. In this example, four IP addresses are shown, corresponding to the nodes 'osn-grid2-mgr1', 'osn-grid2a', 'osn-grid2b', and 'osn-grid2c'.

By invoking the 'Search' button, the installation will test those nodes/addresses and return a status that should indicate the nodes are ready and available for installation. When complete, hit 'Continue'.


Figure 2-03

Figure 2-04

Figure 2-05


The next screen, "Cluster Installation Screen 1", offers some installation options; again, in this example we are installing the bare minimum, selecting the base CDH package only. The screen after that, "Cluster Installation Screen 2", requests login options. In this case, we are using the 'root' user, where all nodes have the same root password.


Figure 2-06

Figure 2-07


The next screen, "Cluster Installation Screen 3", shows the cluster installation progress. Note the "Abort" popup dialog: do not hit "OK" here, as this will abort the installation. Simply dismiss the popup.


Figure 2-08


If all goes well, this should progress as shown in the following example figures, resulting in "Installation completed successfully".


Figure 2-09

Figure 2-10


Figure 2-11


Hit 'Continue' to move to the next phase.

Step 2.1. Potential error, package dependency


As in Step 1.1 above, this step uses the apt-get utility to install a Java package on all of the other nodes in the cluster (besides the manager node), and the same package dependency errors can cause the cluster installation to fail on the non-manager nodes, as shown in the example figure below.


Figure 2-12


To correct this, log into each of the nodes that failed and perform the same steps as shown in Step 1.1. When this is complete, hit the 'Retry Failed Nodes' button on the screen.

Step 2.2 Potential error, heartbeat detection failure


If installation succeeds past the package installation, the last phase of this step consists of heartbeat tests, as shown in the example figure below.


Figure 2-13


Sometimes these all succeed on the first try; other times some or all of the nodes fail.


Figure 2-14


Our testing showed that hitting "Retry Failed Nodes" (sometimes 2-3 times) resulted in heartbeat detection success.


Figure 2-15



Step 3. Configure Java Path (JAVA_HOME)

Summary

That's the basics of getting Hadoop running with QuantaStor. Now you're ready to deploy CDH and install components if you haven't done so already.

We are looking forward to automating some of the above steps and adding deeper integration features to monitor Hadoop this year, and we would appreciate any feedback on what you would most like to see. If you have ideas you'd like to share, or would like to be a beta customer for new Hadoop integration features, write us at support@osnexus.com or write me directly at steve (at) osnexus.com.

Thanks and Happy Hadooping!


Cloudera is a registered trademark of Cloudera Corporation, Hadoop is a registered trademark of the Apache Foundation, Ubuntu is a registered trademark of Canonical, and QuantaStor is a registered trademark of OS NEXUS Corporation.