Hadoop filesystem at Twitter

Twitter runs multiple large Hadoop clusters that are among the biggest in the world. Hadoop is at the core of our data platform and provides vast storage for analytics of user actions on Twitter. In this post, we will highlight our contributions to ViewFs, the client-side Hadoop filesystem view, and its versatile usage here.

ViewFs makes the interaction with our HDFS infrastructure as simple as a single namespace spanning all datacenters and clusters. HDFS Federation helps with scaling the filesystem to our needs for number of files and directories while NameNode High Availability helps with reliability within a namespace. These features combined add significant complexity to managing and using our several large Hadoop clusters with varying versions. ViewFs removes the need for us to remember complicated URLs by using simple paths. Configuring ViewFs itself is a complex task at our scale. Thus, we run TwitterViewFs, a ViewFs extension we developed, that dynamically generates a new configuration so we have a simple holistic filesystem view.

Hadoop at Twitter: scalability and interoperability
Our Hadoop filesystems host over 300PB of data on tens of thousands of servers. We scale HDFS by federating multiple namespaces. This approach allows us to sustain a high HDFS object count (inodes and blocks) without resorting to a single large Java heap size that would suffer from longGC pauses and the inability to use compressed oops. While this approach is great for scaling, it is not easy for us to use because each member namespace in the federation has its own URI. We use ViewFs to provide an illusion of a single namespace within a single cluster. As seen in Figure 1, under the main logical URI we create a ViewFs mount table with links to the appropriate mount point namespaces for paths beginning with /user, /tmp, and /logs, correspondingly.

Read More

Install apache zeppelin service in ambari

Setup the Ambari service
  • To deploy the Zeppelin service, run below on ambari server
sudo git clone https://github.com/hortonworks-gallery/ambari-zeppelin-service.git   /var/lib/ambari-server/resources/stacks/HDP/$VERSION/services/ZEPPELIN


  • Restart Ambari
#on sandbox
service ambari restart

#on non-sandbox
sudo service ambari-server restart
  • Once Ambari comes back up and the services turn green, you can click on ‘Add Service’ from the ‘Actions’ dropdown menu in the bottom left of the Ambari dashboard:


Read More

Anaconda: Free enterprise-ready Python for Big data, Predictive Analytics

125+ cross-platform tested and optimized Python packages for advanced analytics totally free, even for commercial use.


Completely free enterprise-ready Python distribution for large-scale data processing, predictive analytics, and scientific computing


Read More

The Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks. By distributing storage and computation across many servers, the resource can grow with demand while remaining economical at every size. We describe the architecture of HDFS and report on experience using HDFS to manage 40 petabytes of enterprise data at Yahoo!

8.1. Introduction

Hadoop1 provides a distributed filesystem and a framework for the analysis and transformation of very large data sets using the MapReduce [DG04] paradigm. While the interface to HDFS is patterned after the Unix filesystem, faithfulness to standards was sacrificed in favor of improved performance for the applications at hand.

Read More

Spark and Storm face new competition for real-time Hadoop processing

Real-time processing of streaming data in Hadoop typically comes down to choosing between two projects: Storm or Spark. But a third contender, which has been open-sourced from a formerly commercial-only offering, is about to enter the race, and like those components, it may have a future outside of Hadoop.

DataTorrent RTS (real-time streaming) has long been a commercial offering for live data processing apart from the family of Apache Foundation open source projects around Hadoop. But now DataTorrent (the company) is preparing to open-source the core DataTorrent RTS engine, offer it under the same Apache 2.0 licensing as its competitors, and eventually contribute it to the Apache Foundation for governance.

Built for business

Project Apex, as the open source version of DataTorrent RTS’s engine is to be called, is meant to not only compete with Storm and Spark but to be superior to them — to run faster (10 to 100 times faster than Spark, it’s claimed), to be easier to program, to better support enterprise needs like fault tolerance and scalability, and to make it easier to demonstrate the value of Hadoop to a business owner. (more…)

Read More