Twitter runs multiple large Hadoop clusters that are among the biggest in the world. Hadoop is at the core of our data platform and provides vast storage for analytics of user actions on Twitter. In this post, we will highlight our contributions to ViewFs, the client-side Hadoop filesystem view, and its versatile usage here.
ViewFs makes the interaction with our HDFS infrastructure as simple as a single namespace spanning all datacenters and clusters. HDFS Federation helps with scaling the filesystem to our needs for number of files and directories while NameNode High Availability helps with reliability within a namespace. These features combined add significant complexity to managing and using our several large Hadoop clusters with varying versions. ViewFs removes the need for us to remember complicated URLs by using simple paths. Configuring ViewFs itself is a complex task at our scale. Thus, we run TwitterViewFs, a ViewFs extension we developed, that dynamically generates a new configuration so we have a simple holistic filesystem view.
Hadoop at Twitter: scalability and interoperability
Our Hadoop filesystems host over 300PB of data on tens of thousands of servers. We scale HDFS by federating multiple namespaces. This approach allows us to sustain a high HDFS object count (inodes and blocks) without resorting to a single large Java heap size that would suffer from longGC pauses and the inability to use compressed oops. While this approach is great for scaling, it is not easy for us to use because each member namespace in the federation has its own URI. We use ViewFs to provide an illusion of a single namespace within a single cluster. As seen in Figure 1, under the main logical URI we create a ViewFs mount table with links to the appropriate mount point namespaces for paths beginning with /user, /tmp, and /logs, correspondingly.