Hi Techies,
Here I am posting a blog on Apache Sqoop. This blog will contain definition of sqoop , installation guide and some basic operations.
Let's start with Definition of Sqoop :
Sqoop
is a Command-line interface for efficiently transferring bulk data
between Apache Hadoop and structured datastores such as relational
databases.
You
can use Sqoop to import data from external structured datastores into
Hadoop Distributed File System or related systems like Hive and
HBase. Conversely, Sqoop can be used to extract data from Hadoop and
export it to external structured datastores such as relational
databases and enterprise data warehouses.
Moreover,
Sqoop integrates with Oozie, allowing you to schedule and automate
import and export tasks.
Sqoop
can be connected to various types of databases . For example it can
talk to mysql , Oracle , Postgress databases. It uses JDBC to connect
to them. JDBC driver for each of databases is needed by sqoop to
connect to them.
JDBC driver jar for each of the database can be downloaded from internet.
Now
we will start with the sqoop
installation
part.
1.
First of all download the latest stable tar file of sqoop from
http://sqoop.apache.org/ :
tar
-xvzf sqoop-1.4.1-incubating__hadoop-0.23.tar.gz
2.
Now declare SQOOP_HOME and PATH variables for sqoop.
Specify
the SQOOP_HOME and add Sqoop path variable so that we can directly
run the sqoop commands
For example i downloaded sqoop in following directory and my environment variables look like this
export SQOOP_HOME="/home/hadoop/software/sqoop-1.3.0"
export PATH=$PATH:$SQOOP_HOME/bin
For example i downloaded sqoop in following directory and my environment variables look like this
export SQOOP_HOME="/home/hadoop/software/sqoop-1.3.0"
export PATH=$PATH:$SQOOP_HOME/bin
3.
Dowload the JDBC driver of the database you require and place it into
$SQOOP_HOME/lib folder.
4.
Setting up other environment variables :
Note
: When installing Sqoop from the tarball package, you must make sure that the environment variables JAVA_HOME and HADOOP_HOME are configured correctly. The variable HADOOP_HOME should point to the root directory of Hadoop installation. Optionally, if you intend to use any Hive or HBase related functionality, you must also make sure that they are installed and the variables HIVE_HOME and HBASE_HOME are configured correctly to point to the root directory of their respective installation. |
export
HBASE_HOME="/home/ubantu/hadoop-hbase/hbase-0.94.2"
export
HIVE_HOME="/home/ubantu/hadoop-hive/hive-0.10.0"
export
HADOOP_HOME="/home/ubantu/hadoop/hadoop-1.0.4"
export
PATH=$HADOOP_HOME/bin/:$PATH
Now
we are done with installation part. Let's start with some basic
operations of Sqoop :
List
Databases in MySQL :
sqoop-list-databases
-connect jdbc:mysql://hostname/ --username root -P
Sqoop
Import :- Import data
from a relational database management system (RDBMS) such as MySQL or
Oracle into the Hadoop Distributed File System (HDFS) and its
subprojects (Hive, HBase).
Import
MySQL table into HDFS if table have primary key :
sqoop
import -connect jdbc:mysql://hostname/sqoop -username root -P -table
sqooptest
In
this command the various options specified are as follows:
- --connect
, --username These are connection parameters that are used to connect with the database. This is no different from the connection parameters that you use when connecting to the database via a JDBC connection., --password : - --table :
This parameter specifies the table which will be imported.
Import
MySQL table into HDFS if table have primary key with target file in
HDFS:
sqoop
import -connect jdbc:mysql://hostname/sqoop -username root -P -table
sqooptest --target-dir /user/ubantu/sqooptest/test
In
this command the various options specified are as follows:
- import: This is the sub-command that instructs Sqoop to initiate an import.
- --connect
, --username These are connection parameters that are used to connect with the database. This is no different from the connection parameters that you use when connecting to the database via a JDBC connection., --password : - --table :
This parameter specifies the table which will be imported.
Import
MySQL table into Hbase if table have primary key:
Please start Hbase instance by : start-hbase.sh
Please start Hbase instance by : start-hbase.sh
sqoop
import -connect jdbc:mysql://hostname/sqoop -username root -P -table
sqooptest --hbase-table hbasesqoop --column-family hbasesqoopcol1
--hbase-create-table
scan
'hbasesqoop'
get
'hbasesqoop','1'
Import
MySQL table into Hive if table have primary key:
sqoop-import
--connect jdbc:mysql://hostname/sqoop -username root -P -table
sqooptest --hive-table hivesqoop --create-hive-table --hive-import
--hive-home /home/ubantu/hadoop-hive/hive-0.10.0
Sqoop
Export: export the HDFS and its subproject (Hive, HBase) data
back into an RDBMS.
Export
from HDFS to RDBMS.
sqoop
export --connect jdbc:mysql://hostname/sqoop -username root -P
--export-dir /user/ubantu/sqooptest/test/import/part-m-00000 --table
sqoopexporttest --input-fields-terminated-by ','
--input-lines-terminated-by '\n'
In
this command the various options specified are as follows:
- export: This is the sub-command that instructs Sqoop to initiate an export.
- --connect
, --username These are connection parameters that are used to connect with the database. This is no different from the connection parameters that you use when connecting to the database via a JDBC connection., --password : - --table :
This parameter specifies the table which will be populated.
- --export-dir
: This is the directory from which data will be exported.
Note :
I had faced an issue that my sqoop's "--hive import" command was not working. It was always giving below mentioned error :
hive.HiveImport: Exception in thread "main" java.lang.NoSuchMethodError: org.apache.thrift.EncodingUtils.setBit(BIZ)B
Below description could be one of the solution for this issue :
Sqoop adds HBase and Hive jars to the classpath. HBase and Hive both use thrift server. If in your hadoop cluster Hbase and Hive have different version of thrift server then Sqoop will not be able to judge which thrift server to be used and it will not load RDBS tables to Hive. This problem generally occurs when HBase and Hive both are installed on the box where you're executing Sqoop.
To identify if you are having above mentioned scenario in place, you can check the thrift version that Hbase and Hive is using, Just search for "*thrift*.jar".
If they are different then set HBASE_HOME to something non-existing to force Sqoop not load HBase' version of thrift.