Tuesday, 16 July 2013

What is BigData?


Beginners must be wondering: what is this new thing buzzing around these days?

I would like to take this opportunity to present BigData in a very easy, approachable format.

The image below illustrates what BigData is:

A question arises: why should we care about this? Yes, that's a fair question.

But the valid questions are: How can it be useful? Who can make real use of it? How do we adopt it? What are the solutions for it?

I would like to give some simple examples here:

• Think about how weather forecasting has been done for ages. Are astrologers doing it? No, BIGDATA. It's all the historical data forecasters have, and that data helps them predict the weather.

• Think about why Google suggests what you want to search for as soon as you start typing. BIGDATA. It's the humongous data they collect every day from user searches.

• Think about why Amazon/eBay shows you products similar to what you like. BIGDATA. It's the data stored in their back end, which constantly reveals user trends and helps them understand user preferences.

• Think about why social media sites like Twitter, Facebook and LinkedIn greet you on your home page with "People You May Know" / "Who to Follow". BIGDATA. They constantly compare your profile with others to find similar people.

I think these simple examples show how/why/what BIGDATA can do. There are plenty more cases, but the examples above will help you whenever you come across similar scenarios.

There are major problems to solve: where to store this BigData, how to make it scalable, how to process it and how to make use of it.

For all the above questions there are now solutions in the market to cope with these problems. Very popular among them are: Apache Hadoop, Cloudera CDH, Hortonworks HDP, NoSQL stores, columnar databases, graph databases, massively parallel processing (MPP) databases, Oracle Exadata, etc.

But as far as I have explored, I firmly stand with Apache Hadoop as the best solution for working with BigData. It has its own utilities that help developers, business analysts and business owners adopt BigData easily.

Sunday, 7 July 2013

Hadoop VS. RDBMS


I am posting this because I have been asked several times about databases and Hadoop. People often get confused between these two entities. At first impression some people believe Hadoop is a replacement for a database, but that is not true: at its core it is a file system. Just as Windows has NTFS and Linux has EXT3/EXT4, Hadoop has HDFS.

Below I have mentioned some major differences between a database (RDBMS) and Hadoop. An RDBMS is used for real-time data transactions (the active DB for your web application, with frequent read, write, update and delete operations), while Hadoop is used for batch processing of large data in TBs and PBs (unlike an RDBMS, it works on the write-once, read-many principle).

Hadoop provides a distributed file system (HDFS) and a cluster environment of master and slave nodes. With such an architecture, large data can be stored and processed in parallel. A variety of data can be analysed: structured (tables), unstructured (logs, email bodies, blog text) and semi-structured (media file metadata, XML, HTML).
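
As a quick illustration of storing data in HDFS and reading it back (the write-once, read-many style mentioned above), here is a minimal sketch using the standard hadoop fs commands; the file and directory names are just examples:

hadoop fs -mkdir /user/hadoop/weblogs               # create a directory in HDFS
hadoop fs -put access.log /user/hadoop/weblogs/     # write the file once into the cluster
hadoop fs -ls /user/hadoop/weblogs                  # list what is stored (and replicated) there
hadoop fs -cat /user/hadoop/weblogs/access.log      # read it back as many times as needed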

An RDBMS, as its name hints, is more suitable for relational data and faster access to individual records.

In the figure below you can find some major differences between Hadoop and an RDBMS:


Please post your queries below for further details.

Thursday, 6 June 2013

Indexing BigData

Have you been thinking about indexing Big Data?

Yes, a very exciting utility is going to be launched soon by Cloudera. Indexing structured data has proved very successful and efficient for faster search. But as data sources grow day by day, we are increasingly concerned with indexing unstructured data and searching it quickly. Cloudera is building a utility called Cloudera Search which provides the capability of fast search over unstructured data.

A very positive sign is that it is an open-source utility. It is built on the well-known search and indexing projects Apache Lucene and Solr, and integrates with Apache Flume.

A very good architectural model could be Flume + Apache Solr + HDFS.

Flume syncs all the live streaming (unstructured) data to Apache Solr for indexing, and Solr stores the large volume of indexed data in HDFS.

Just think: if you have data from a healthcare organization, with micro-level information on individual patients and their second-by-second health records, then searching becomes very difficult without indexing. If the data is indexed, even though it is unstructured, searching becomes many times faster. In this case records for a particular hospital, a particular group of patients and a particular disease can be looked up easily.
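
For instance, once such patient records are indexed, a query like the one below returns matches in near real time through Solr's standard select handler. This is only a sketch: the host, collection name and field names are assumptions for illustration, not part of any specific product setup.

curl "http://localhost:8983/solr/patient_records/select?q=disease:diabetes+AND+hospital:city_hospital&rows=10&wt=json"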

Indexing will make real-time lookups on BigData much faster.

Tuesday, 12 March 2013

Apache Sqoop Installation Guide

Hi Techies,

Here I am posting a blog on Apache Sqoop. This post contains the definition of Sqoop, an installation guide and some basic operations.

Let's start with the definition of Sqoop.

Sqoop is a Command-line interface for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

You can use Sqoop to import data from external structured datastores into Hadoop Distributed File System or related systems like Hive and HBase. Conversely, Sqoop can be used to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses.

Moreover, Sqoop integrates with Oozie, allowing you to schedule and automate import and export tasks.

Sqoop can connect to various types of databases. For example, it can talk to MySQL, Oracle and PostgreSQL. It uses JDBC to connect to them, so Sqoop needs the JDBC driver of each database it connects to.

The JDBC driver JAR for each database can be downloaded from the internet.

Now we will start with the sqoop installation part.

1. First of all, download the latest stable tarball of Sqoop from http://sqoop.apache.org/ and extract it:

tar -xvzf sqoop-1.4.1-incubating__hadoop-0.23.tar.gz

2. Now declare the SQOOP_HOME and PATH variables for Sqoop.

Specify SQOOP_HOME and add Sqoop's bin directory to PATH so that the sqoop commands can be run directly.
For example, I extracted Sqoop into the following directory, so my environment variables look like this:
export SQOOP_HOME="/home/hadoop/software/sqoop-1.4.1-incubating__hadoop-0.23"
export PATH=$PATH:$SQOOP_HOME/bin

3. Download the JDBC driver of the database you require and place it into the $SQOOP_HOME/lib folder.
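
For example, for MySQL you would drop the Connector/J JAR into that folder. The exact JAR name below is only an assumption and depends on the driver version you download:

cp mysql-connector-java-5.1.25-bin.jar $SQOOP_HOME/lib/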

4. Set up the other environment variables:



Note :
When installing Sqoop from the tarball package, you must make sure that the environment variables JAVA_HOME and HADOOP_HOME are configured correctly. The variable HADOOP_HOME should point to the root directory of the Hadoop installation. Optionally, if you intend to use any Hive or HBase related functionality, you must also make sure that they are installed and that the variables HIVE_HOME and HBASE_HOME point to the root directories of their respective installations.

export HBASE_HOME="/home/ubantu/hadoop-hbase/hbase-0.94.2"
export HIVE_HOME="/home/ubantu/hadoop-hive/hive-0.10.0"
export HADOOP_HOME="/home/ubantu/hadoop/hadoop-1.0.4"
export PATH=$HADOOP_HOME/bin/:$PATH

Now we are done with the installation part. Let's move on to some basic Sqoop operations:

List databases in MySQL:

sqoop-list-databases --connect jdbc:mysql://hostname/ --username root -P

Sqoop Import: imports data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS) or related systems such as Hive and HBase.

Import a MySQL table into HDFS when the table has a primary key:

sqoop import --connect jdbc:mysql://hostname/sqoop --username root -P --table sqooptest

In this command the various options specified are as follows:
  • --connect, --username, -P (or --password): connection parameters used to connect to the database. These are no different from the parameters you would use for a plain JDBC connection.
  • --table: the table that will be imported.
Import a MySQL table into HDFS, specifying a target directory in HDFS:

sqoop import --connect jdbc:mysql://hostname/sqoop --username root -P --table sqooptest --target-dir /user/ubantu/sqooptest/test

In this command the various options specified are as follows:
  • import: the sub-command that instructs Sqoop to initiate an import.
  • --connect, --username, -P: connection parameters, as described above.
  • --table: the table that will be imported.
  • --target-dir: the HDFS directory into which the imported files are written; you can verify them as shown below.
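
A quick check, assuming the import above completed successfully (part-m-00000 is the standard name of the first map task's output file):

hadoop fs -ls /user/ubantu/sqooptest/test
hadoop fs -cat /user/ubantu/sqooptest/test/part-m-00000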
Import a MySQL table into HBase when the table has a primary key:

Please start the HBase instance first: start-hbase.sh

sqoop import --connect jdbc:mysql://hostname/sqoop --username root -P --table sqooptest --hbase-table hbasesqoop --column-family hbasesqoopcol1 --hbase-create-table

You can then verify the imported data from the HBase shell:

scan 'hbasesqoop'

get 'hbasesqoop','1'

Import a MySQL table into Hive when the table has a primary key:

sqoop-import --connect jdbc:mysql://hostname/sqoop --username root -P --table sqooptest --hive-table hivesqoop --create-hive-table --hive-import --hive-home /home/ubantu/hadoop-hive/hive-0.10.0
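
Once the import finishes, a quick way to confirm the data landed in Hive is to query it from the Hive CLI (the table name comes from the command above):

hive -e "SELECT * FROM hivesqoop LIMIT 10;"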

Sqoop Export: exports data from HDFS (or its related systems Hive and HBase) back into an RDBMS.

Export from HDFS to RDBMS.
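
Note that Sqoop export expects the target table to already exist in the database. A minimal sketch of creating it in MySQL is shown below; the column layout is only an assumption and must match the data being exported:

mysql -u root -p -e "CREATE TABLE sqoopexporttest (id INT, name VARCHAR(64));" sqoop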

sqoop export --connect jdbc:mysql://hostname/sqoop --username root -P --export-dir /user/ubantu/sqooptest/test/import/part-m-00000 --table sqoopexporttest --input-fields-terminated-by ',' --input-lines-terminated-by '\n'

In this command the various options specified are as follows:
  • export: the sub-command that instructs Sqoop to initiate an export.
  • --connect, --username, -P: connection parameters, as described above.
  • --table: the table that will be populated.
  • --export-dir: the HDFS directory (or file) from which data will be exported.
  • --input-fields-terminated-by / --input-lines-terminated-by: the field and line delimiters of the files being exported.

Note :

I had faced an issue where Sqoop's "--hive-import" command was not working. It always gave the error mentioned below:

hive.HiveImport: Exception in thread "main" java.lang.NoSuchMethodError: org.apache.thrift.EncodingUtils.setBit(BIZ)B

The description below could be one solution to this issue:


Sqoop adds the HBase and Hive JARs to its classpath. HBase and Hive both use Thrift. If HBase and Hive in your Hadoop cluster ship different versions of Thrift, Sqoop cannot tell which one to use and will fail to load the RDBMS tables into Hive. This problem generally occurs when HBase and Hive are both installed on the box where you're executing Sqoop.
To identify whether you are in this scenario, check the Thrift version that HBase and Hive are using; just search for "*thrift*.jar".
If they are different, set HBASE_HOME to something non-existent to force Sqoop not to load HBase's version of Thrift.
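
A rough sketch of that check and workaround, using the environment variables defined earlier (the re-run of the Hive import at the end is just an example):

# Compare the Thrift JARs shipped with HBase and Hive
find $HBASE_HOME $HIVE_HOME -name "*thrift*.jar"

# If the versions differ, point HBASE_HOME at a non-existent directory
# so Sqoop skips HBase's Thrift JAR, then re-run the Hive import
export HBASE_HOME=/nonexistent
sqoop-import --connect jdbc:mysql://hostname/sqoop --username root -P --table sqooptest --hive-import --hive-table hivesqoop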