Hi Techies,
Here I am posting a blog on Apache Sqoop. This blog contains the definition of Sqoop, an installation guide, and some basic operations.
Let's start with the definition of Sqoop:
Sqoop is a command-line interface for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
You can use Sqoop to import data from external structured datastores into the Hadoop Distributed File System (HDFS) or related systems like Hive and HBase. Conversely, Sqoop can be used to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses.
Moreover, Sqoop integrates with Oozie, allowing you to schedule and automate import and export tasks.
Sqoop can connect to various types of databases; for example, it can talk to MySQL, Oracle, and PostgreSQL. It uses JDBC to connect to them, so Sqoop needs the JDBC driver of each database it connects to.
The JDBC driver jar for each database can be downloaded from the internet.
Now we will start with the Sqoop installation part.
1. Download the Sqoop tarball and extract it:
tar -xvzf sqoop-1.4.1-incubating__hadoop-0.23.tar.gz
2. Now declare the SQOOP_HOME and PATH variables for Sqoop, so that we can run the sqoop commands directly. For example, I downloaded Sqoop into the following directory and my environment variables look like this:
export SQOOP_HOME="/home/hadoop/software/sqoop-1.3.0"
export PATH=$PATH:$SQOOP_HOME/bin
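Optionally, to make these variables persist across sessions, you can append them to your shell profile and reload it. This is just a sketch assuming a bash shell and my example paths above; adjust them for your own setup:
echo 'export SQOOP_HOME="/home/hadoop/software/sqoop-1.3.0"' >> ~/.bashrc
echo 'export PATH=$PATH:$SQOOP_HOME/bin' >> ~/.bashrc
source ~/.bashrc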
3. Download the JDBC driver of the database you require and place it into the $SQOOP_HOME/lib folder.
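For example, for MySQL you would drop the Connector/J jar into that folder. The jar name below is only an illustration; use whatever driver version you actually downloaded:
cp mysql-connector-java-5.1.25-bin.jar $SQOOP_HOME/lib/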
4. Setting up other environment variables:
Note: When installing Sqoop from the tarball package, you must make sure that the environment variables JAVA_HOME and HADOOP_HOME are configured correctly. The variable HADOOP_HOME should point to the root directory of the Hadoop installation. Optionally, if you intend to use any Hive or HBase related functionality, you must also make sure that they are installed and the variables HIVE_HOME and HBASE_HOME are configured correctly to point to the root directory of their respective installations.
export HBASE_HOME="/home/ubantu/hadoop-hbase/hbase-0.94.2"
export HIVE_HOME="/home/ubantu/hadoop-hive/hive-0.10.0"
export HADOOP_HOME="/home/ubantu/hadoop/hadoop-1.0.4"
export PATH=$HADOOP_HOME/bin/:$PATH
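At this point, a quick sanity check (assuming the paths above match your machine) is to ask Sqoop for its version and the list of available tools:
sqoop version
sqoop help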
Now we are done with the installation part. Let's start with some basic operations of Sqoop:
List databases in MySQL:
sqoop-list-databases --connect jdbc:mysql://hostname/ --username root -P
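Similarly, you can list the tables inside a particular database; here I am assuming the same sqoop database that is used in the import examples below:
sqoop-list-tables --connect jdbc:mysql://hostname/sqoop --username root -P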
Sqoop Import: Import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS) and its subprojects (Hive, HBase).
Import a MySQL table into HDFS if the table has a primary key:
sqoop import --connect jdbc:mysql://hostname/sqoop --username root -P --table sqooptest
In this command the various options specified are as follows:
--connect, --username, --password: These are connection parameters that are used to connect with the database. This is no different from the connection parameters that you use when connecting to the database via a JDBC connection.
--table: This parameter specifies the table which will be imported.
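If the table does not have a primary key, Sqoop cannot split the import across mappers on its own. In that case you can either supply a split column with --split-by or force a single mapper with -m 1, for example:
sqoop import --connect jdbc:mysql://hostname/sqoop --username root -P --table sqooptest -m 1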
Import a MySQL table into HDFS if the table has a primary key, with a target directory in HDFS:
sqoop import --connect jdbc:mysql://hostname/sqoop --username root -P --table sqooptest --target-dir /user/ubantu/sqooptest/test
In this command, the new option is --target-dir, which specifies the HDFS directory into which the imported data will be written.
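After the job finishes, you can confirm that the data landed in the target directory; the part-m-* files are the usual output of Sqoop's map tasks, so the exact file names may differ in your run:
hadoop fs -ls /user/ubantu/sqooptest/test
hadoop fs -cat /user/ubantu/sqooptest/test/part-m-00000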
Import a MySQL table into HBase if the table has a primary key:
Please start the HBase instance first by running: start-hbase.sh
sqoop import --connect jdbc:mysql://hostname/sqoop --username root -P --table sqooptest --hbase-table hbasesqoop --column-family hbasesqoopcol1 --hbase-create-table
You can then verify the imported data from the HBase shell:
scan 'hbasesqoop'
get 'hbasesqoop','1'
Import a MySQL table into Hive if the table has a primary key:
sqoop-import --connect jdbc:mysql://hostname/sqoop --username root -P --table sqooptest --hive-table hivesqoop --create-hive-table --hive-import --hive-home /home/ubantu/hadoop-hive/hive-0.10.0
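Once the import completes, you can query the new Hive table to check the data (a minimal check, assuming the hive CLI is on your PATH):
hive -e "SELECT * FROM hivesqoop LIMIT 10;"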
Sqoop Export: Export data from HDFS and its subprojects (Hive, HBase) back into an RDBMS.
Export from HDFS to RDBMS:
sqoop export --connect jdbc:mysql://hostname/sqoop --username root -P --export-dir /user/ubantu/sqooptest/test/import/part-m-00000 --table sqoopexporttest --input-fields-terminated-by ',' --input-lines-terminated-by '\n'
In this command the various options specified are as follows:
--export-dir: the HDFS file or directory containing the data to be exported.
--table: the RDBMS table into which the data will be exported.
--input-fields-terminated-by, --input-lines-terminated-by: the field and line delimiters of the data in HDFS, so that Sqoop can parse the records before writing them to the database.
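Note that for an export the target table (sqoopexporttest here) must already exist in MySQL with a schema compatible with the exported data; Sqoop does not create it for you. Purely as an illustration, assuming the exported data has an integer id and a text column, it could be created like this:
mysql -u root -p -e "CREATE TABLE sqoop.sqoopexporttest (id INT, name VARCHAR(255));"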
Note :
I faced an issue where Sqoop's "--hive-import" command was not working. It always gave the error mentioned below:
hive.HiveImport: Exception in thread "main" java.lang.NoSuchMethodError: org.apache.thrift.EncodingUtils.setBit(BIZ)B
The description below could be one of the solutions for this issue:
Sqoop adds the HBase and Hive jars to the classpath. HBase and Hive both use Thrift. If, in your Hadoop cluster, HBase and Hive have different versions of Thrift, then Sqoop will not be able to judge which Thrift jar to use and it will not load RDBMS tables into Hive. This problem generally occurs when HBase and Hive are both installed on the box where you're executing Sqoop.
To identify whether you are hitting the above-mentioned scenario, you can check the Thrift version that HBase and Hive are using; just search for "*thrift*.jar".
If they are different, then set HBASE_HOME to something non-existent to force Sqoop not to load HBase's version of Thrift.
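For example, a quick way to check this and apply the workaround could look like the following; the non-existent path is arbitrary, and the override only needs to be in effect for the shell session in which you re-run the Hive import:
# compare the Thrift jars shipped with HBase and Hive
find $HBASE_HOME $HIVE_HOME -name "*thrift*.jar"
# if the versions differ, point HBASE_HOME at a non-existent directory and re-run the --hive-import command
export HBASE_HOME=/tmp/nonexistent-hbase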