Tuesday, 12 March 2013

Apache Sqoop Installation Guide

Hi Techies,

Here I am posting a blog on Apache Sqoop. This blog will contain definition of sqoop , installation guide and some basic operations.

Let's start with Definition of Sqoop

Sqoop is a Command-line interface for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

You can use Sqoop to import data from external structured datastores into Hadoop Distributed File System or related systems like Hive and HBase. Conversely, Sqoop can be used to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses.

Moreover, Sqoop integrates with Oozie, allowing you to schedule and automate import and export tasks.

Sqoop can be connected to various types of databases . For example it can talk to mysql , Oracle , Postgress databases. It uses JDBC to connect to them. JDBC driver for each of databases is needed by sqoop to connect to them.

JDBC driver jar for each of the database can be downloaded from internet.

Now we will start with the sqoop installation part.

1. First of all download the latest stable tar file of sqoop from http://sqoop.apache.org/ :

tar -xvzf sqoop-1.4.1-incubating__hadoop-0.23.tar.gz

2. Now declare SQOOP_HOME and PATH variables for sqoop.

Specify the SQOOP_HOME and add Sqoop path variable so that we can directly run the sqoop commands
For example i downloaded sqoop in following directory and my environment variables look like this
export SQOOP_HOME="/home/hadoop/software/sqoop-1.3.0"
export PATH=$PATH:$SQOOP_HOME/bin

3. Dowload the JDBC driver of the database you require and place it into $SQOOP_HOME/lib folder.

4. Setting up other environment variables :



Note :
When installing Sqoop from the tarball package, you must make sure that the environment variables
JAVA_HOME and HADOOP_HOME are configured correctly. The variable HADOOP_HOME should point to the root directory of Hadoop installation. Optionally, if you intend to use any Hive or HBase related functionality, you must also make sure that they are installed and the variables HIVE_HOME and HBASE_HOME are configured correctly to point to the root directory of their respective installation.

export HBASE_HOME="/home/ubantu/hadoop-hbase/hbase-0.94.2"
export HIVE_HOME="/home/ubantu/hadoop-hive/hive-0.10.0"
export HADOOP_HOME="/home/ubantu/hadoop/hadoop-1.0.4"
export PATH=$HADOOP_HOME/bin/:$PATH

Now we are done with installation part. Let's start with some basic operations of Sqoop :

List Databases in MySQL :

sqoop-list-databases -connect jdbc:mysql://hostname/ --username root -P

Sqoop Import :- Import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS) and its subprojects (Hive, HBase).

Import MySQL table into HDFS if table have primary key :

sqoop import -connect jdbc:mysql://hostname/sqoop -username root -P -table sqooptest

In this command the various options specified are as follows:
  • --connect , --username , --password : These are connection parameters that are used to connect with the database. This is no different from the connection parameters that you use when connecting to the database via a JDBC connection.
  • --table :
    This parameter specifies the table which will be imported.
Import MySQL table into HDFS if table have primary key with target file in HDFS:

sqoop import -connect jdbc:mysql://hostname/sqoop -username root -P -table sqooptest --target-dir /user/ubantu/sqooptest/test

In this command the various options specified are as follows:
  • import: This is the sub-command that instructs Sqoop to initiate an import.
  • --connect , --username , --password : These are connection parameters that are used to connect with the database. This is no different from the connection parameters that you use when connecting to the database via a JDBC connection.
  • --table :
    This parameter specifies the table which will be imported.
Import MySQL table into Hbase if table have primary key:

Please start Hbase instance by : start-hbase.sh

sqoop import -connect jdbc:mysql://hostname/sqoop -username root -P -table sqooptest --hbase-table hbasesqoop --column-family hbasesqoopcol1 --hbase-create-table

scan 'hbasesqoop'

get 'hbasesqoop','1'

Import MySQL table into Hive if table have primary key:

sqoop-import --connect jdbc:mysql://hostname/sqoop -username root -P -table sqooptest --hive-table hivesqoop --create-hive-table --hive-import --hive-home /home/ubantu/hadoop-hive/hive-0.10.0

Sqoop Export: export the HDFS and its subproject (Hive, HBase) data back into an RDBMS

Export from HDFS to RDBMS.

sqoop export --connect jdbc:mysql://hostname/sqoop -username root -P --export-dir /user/ubantu/sqooptest/test/import/part-m-00000 --table sqoopexporttest --input-fields-terminated-by ',' --input-lines-terminated-by '\n'

In this command the various options specified are as follows:
  • export: This is the sub-command that instructs Sqoop to initiate an export.
  • --connect , --username , --password : These are connection parameters that are used to connect with the database. This is no different from the connection parameters that you use when connecting to the database via a JDBC connection.
  • --table :
    This parameter specifies the table which will be populated.
  • --export-dir : This is the directory from which data will be exported.

Note :

I had faced an issue that my sqoop's "--hive  import" command was not working. It was always giving below mentioned error :

hive.HiveImport: Exception in thread "main" java.lang.NoSuchMethodError: org.apache.thrift.EncodingUtils.setBit(BIZ)B

Below description could be one of the solution for this issue :


Sqoop adds HBase and Hive jars to the classpath. HBase and Hive both use thrift server. If in your hadoop cluster Hbase and Hive have different version of thrift server then Sqoop will not be able to judge which thrift server to be used and it will not load RDBS tables to Hive. This problem generally occurs when HBase and Hive both are installed on the box where you're executing Sqoop. 
To identify if you are having above mentioned scenario in place, you can check the thrift version that Hbase and Hive is using, Just search for "*thrift*.jar". 
If they are different then set HBASE_HOME to something non-existing to force Sqoop not load HBase' version of thrift.