Linux Big Data and Hadoop

I recently came across Hadoop, and I found the concept really intriguing, especially after reading the business cases about Last.fm and the NY Times.

Digging further into Cassandra and MongoDB, I was introduced to the concept of Big Data. I have read many articles on Big Data and the problems it poses for the future.

Yesterday I was discussing these concepts with a DBA colleague of mine who works on DB2 (an RDBMS).

His contention is this: while the use of Big Data in the public domain is understandable (Google search, the Large Hadron Collider, Facebook, etc., where data lacks a specific structure), the same might not be applicable to regular "pro business" use. The Walmart example (from Wikipedia: http://en.wikipedia.org/wiki/Big_data#Impact) involves data with a particular structure, and a properly designed RDBMS, or even an MDBMS (OLAP), can provide a solution for it. He finds it quite difficult to see Big Data being used for business analytics the way MDBMS is.

Please help me win this debate with some really good answers to his contention.


PS: I am not really a DBA guy, so please go easy on me if I missed something.
 
I sort of agree with the DBA guy, for the same reasons he mentions. For structured transactional purposes, an RDBMS is still the better option.
 
Like the name suggests, Big Data is suited to handling really huge datasets that aren't strictly structured. Your colleague is partially right, but I don't agree with the "pro business" part. As a general rule in computing, there is no single solution to all types of problems. That is why many companies that work with humongous data pick and choose different approaches for different problems. For example, Facebook also uses MySQL heavily [http://www.facebook....ySQLatFacebook], and many people still complain that MySQL isn't scalable.
 
@sharktale1212

I agree with you that for an average business, right now, an RDBMS and OLAP may hold good.

But as data grows, current RDBMSes don't help much.

An RDBMS works on basic principles like atomicity and locking, which may not be great for handling very large data.

They do handle it to some extent, via various "workarounds", but the basic problem persists.

Moreover, there is a single point of failure in these systems.
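To make the contrast concrete, here is roughly what those RDBMS guarantees look like from a client; a minimal sketch using plain JDBC, where the connection URL and the accounts table are made-up placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class TransferExample {
  public static void main(String[] args) throws Exception {
    // Placeholder URL; any transactional RDBMS behaves the same way here
    try (Connection conn = DriverManager.getConnection(
        "jdbc:db2://localhost:50000/SAMPLE", "user", "pass")) {
      conn.setAutoCommit(false); // begin an explicit transaction
      try (PreparedStatement debit = conn.prepareStatement(
               "UPDATE accounts SET balance = balance - ? WHERE id = ?");
           PreparedStatement credit = conn.prepareStatement(
               "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {
        debit.setInt(1, 100);
        debit.setInt(2, 1);
        debit.executeUpdate();  // this row stays locked until commit/rollback
        credit.setInt(1, 100);
        credit.setInt(2, 2);
        credit.executeUpdate();
        conn.commit();          // both updates become visible atomically
      } catch (Exception e) {
        conn.rollback();        // neither update survives
        throw e;
      }
    }
  }
}
```

Those row locks and the all-or-nothing commit are exactly what becomes expensive to coordinate across hundreds of machines, which is why the systems below relax them.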

Hadoop and Cassandra aren't replacements for databases and OLAP.

They are merely tools to do things in a different way, namely the parallel execution of work, which is more difficult to achieve the traditional way.

Hadoop is basically a job execution engine with its own filesystem (HDFS).

You can pass jobs to it, and it will split and execute them for you (via map and reduce).
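To give a feel for what such a job looks like, here is the classic word-count example written against the standard Hadoop MapReduce Java API; the input and output paths are whatever HDFS directories you pass on the command line:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs in parallel on each input split, emits (word, 1) pairs
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: receives all counts for one word and sums them
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-aggregate on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The framework handles the splitting, scheduling, and shuffling of intermediate pairs between the two phases; you only write the map and reduce functions.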

I have seen this being used as a feeder system for traditional DB-based OLAP,

as DB-based OLAP doesn't scale to execute that fast.

Hadoop can use HBase to do the above work on its own, if you wish.

Please understand that all these systems are designed for parallelism and concurrency,

and not for atomicity and integrity.
 
Guys, thanks for the inputs.

#Blackprince - That's one informative post. I have a doubt though: considering the weaker focus on integrity, how does a business know the data holds true throughout? The way I see it, this can work for things like web clickstreams but is rarely applicable for, say, sales data reporting.

Also, since Hadoop seems to be more of a batch processing system, wouldn't it be better to use SQL/RDBMS for real-time querying? I might be wrong here.


#vishalrao - Thanks for the buzzwords. Wikipedia seems a bit ambivalent on scale-out versus scale-up; it seems to favour scale-out but then introduces the hypervisor as a way to actually scale up.
 
Hadoop is more of an ETL system.

Functionally, it has two parts:

1. NameNode, 2. DataNode(s)

It basically implements Google's MapReduce algorithm (the key behind the Google search engine).

It can help scan 1 TB of data in about 3.5 seconds (hardware is an important factor here; expect something like Exadata in place).

The NameNode is the most important server; it holds the metadata, i.e. how the data is multiplexed among the DataNodes.

So, in general, the NameNode should be a high-configuration/high-capability server, as its failure leads to failure of the entire cluster.

It can be called a single point of failure in the cluster, while DataNodes hold data that is multiplexed (replicated), so even the loss of a DataNode is acceptable to an extent within the cluster.
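To see the NameNode/DataNode split from a client's point of view, here is a minimal sketch using the standard HDFS Java API; the NameNode address and file path are made-up placeholders:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // fs.defaultFS points at the NameNode; this host/port is a placeholder
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
    FileSystem fs = FileSystem.get(conf);

    // Write: the NameNode chooses DataNodes; blocks are replicated across them
    Path file = new Path("/demo/hello.txt");
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeBytes("hello from an hdfs client\n");
    }

    // Read: the NameNode only returns block locations;
    // the actual bytes stream directly from the DataNodes
    try (BufferedReader in =
             new BufferedReader(new InputStreamReader(fs.open(file)))) {
      System.out.println(in.readLine());
    }
  }
}
```

Note that the NameNode never touches the file contents, only the metadata, which is why losing it is so much worse than losing a DataNode.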

Yahoo is currently the largest implementer of Hadoop, with a ~5,000-node cluster.

Basically, you can query or program it with Hive and Pig; a small Hive example follows below.
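For a taste of what that looks like, here is a minimal sketch that runs a HiveQL query over JDBC; the HiveServer address and the clickstream table are made-up placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // Register the Hive JDBC driver (needs hive-jdbc on the classpath)
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // Placeholder HiveServer2 address and database
    String url = "jdbc:hive2://hiveserver.example.com:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "user", "");
         Statement stmt = conn.createStatement()) {
      // Looks like SQL, but Hive compiles it into MapReduce jobs on the cluster
      ResultSet rs = stmt.executeQuery(
          "SELECT page, COUNT(*) AS hits FROM clickstream GROUP BY page");
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}
```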

I know I have given only bits of info, but I'm really eager to share. Will pass on more info once I am back home.

My info is mostly from the administration side of it rather than development.

I have the Hadoop certification on my list this year; it costs an uber 74,000 rupees, and that too at a time specified by Cloudera.
 
Oh, so you attended the Cloudera sessions planned by Osscube? That's really costly, man.

Eager to hear more info from you.
 
I absolutely agree with @blackprince.

My 2 cents:

Hadoop: It's a framework, doing map-and-reduce (broadly speaking). That's it.

Big Data: A concept referring to huge datasets, precisely the issues pertaining to the three V's: the Velocity, Variety and Volume at which data is growing.

Now, about your conversation with the DBA guy:

See, nowadays it's not just humans producing data; machines are producing lots of it too.

Things like sensors, mobiles, etc. are producing lots of data, maybe without your even knowing it.

For example, the GPS sensor on your smartphone sends your location to Google Latitude without asking you every time. So, to conclude, data is growing very rapidly.

Now, how can companies with a good DBMS/RDBMS/OLAP setup benefit from this?

If you are talking about small companies, who need to run big transactions (intensive queries) only rarely in a month or week, then Hadoop will definitely be an expensive solution.

But if the company in question is a mid-size or big one, doing analytics/transactions over a few million rows across multiple systems or the cloud, then Hadoop can provide that job-distribution mechanism.

See, for handling large datasets, distributing the work is the only way to go, and if you are distributing jobs, then Hadoop can help you.

Hope this makes the issue a little clearer.
 