Hadoop filesystem at Twitter

Twitter runs multiple large Hadoop clusters that are among the biggest in the world. Hadoop is at the core of our data platform and provides vast storage for analytics of user actions on Twitter. In this post, we will highlight our contributions to ViewFs, the client-side Hadoop filesystem view, and its versatile usage here.

ViewFs makes the interaction with our HDFS infrastructure as simple as a single namespace spanning all datacenters and clusters. HDFS Federation helps with scaling the filesystem to our needs for number of files and directories while NameNode High Availability helps with reliability within a namespace. These features combined add significant complexity to managing and using our several large Hadoop clusters with varying versions. ViewFs removes the need for us to remember complicated URLs by using simple paths. Configuring ViewFs itself is a complex task at our scale. Thus, we run TwitterViewFs, a ViewFs extension we developed, that dynamically generates a new configuration so we have a simple holistic filesystem view.

Hadoop at Twitter: scalability and interoperability
Our Hadoop filesystems host over 300PB of data on tens of thousands of servers. We scale HDFS by federating multiple namespaces. This approach allows us to sustain a high HDFS object count (inodes and blocks) without resorting to a single large Java heap size that would suffer from longGC pauses and the inability to use compressed oops. While this approach is great for scaling, it is not easy for us to use because each member namespace in the federation has its own URI. We use ViewFs to provide an illusion of a single namespace within a single cluster. As seen in Figure 1, under the main logical URI we create a ViewFs mount table with links to the appropriate mount point namespaces for paths beginning with /user, /tmp, and /logs, correspondingly.

Read More

Spark and Storm face new competition for real-time Hadoop processing

Real-time processing of streaming data in Hadoop typically comes down to choosing between two projects: Storm or Spark. But a third contender, which has been open-sourced from a formerly commercial-only offering, is about to enter the race, and like those components, it may have a future outside of Hadoop.

DataTorrent RTS (real-time streaming) has long been a commercial offering for live data processing apart from the family of Apache Foundation open source projects around Hadoop. But now DataTorrent (the company) is preparing to open-source the core DataTorrent RTS engine, offer it under the same Apache 2.0 licensing as its competitors, and eventually contribute it to the Apache Foundation for governance.

Built for business

Project Apex, as the open source version of DataTorrent RTS’s engine is to be called, is meant to not only compete with Storm and Spark but to be superior to them — to run faster (10 to 100 times faster than Spark, it’s claimed), to be easier to program, to better support enterprise needs like fault tolerance and scalability, and to make it easier to demonstrate the value of Hadoop to a business owner. (more…)

Read More

How to build Hue on Ubuntu Cluster

Here is a step by step guide about how to get up and running Hue.


Step1: Fetch Hue source code from github

sudo apt-get install git
git clone https://github.com/cloudera/hue.git
cd hue

Step2: Install couple of development packages separately or at once, using command below:

sudo apt-get install ant gcc g++ libkrb5-dev libmysqlclient-dev libssl-dev libsasl2-dev libsasl2-modules-gssapi-mit libsqlite3-dev libtidy-0.99-0 libxml2-dev libxslt-dev make libldap2-dev maven python-dev python-setuptools libgmp3-dev

Step4: Time to build hue

make apps

Step5: Start the development server:

./build/env/bin/hue runserver

and now visit !

Read More