NiFi: Thinking Differently About DataFlow

Recently a question was posed to the Apache NiFi (Incubating) Developer Mailing List about how best to use Apache NiFi to perform Extract, Transform, Load (ETL) types of tasks. The question was “Is it possible to have NiFi service setup and running and allow for multiple dataflows to be designed and deployed (running) at the same time?”

The idea here was to create several disparate dataflows that run alongside one another in parallel. Data comes from Source X and it’s processed this way. That’s one dataflow. Other data comes from Source Y and it’s processed this way. That’s a second dataflow entirely. Typically, this is how we think about dataflow when we design it with an ETL tool. And this is a pretty common question for new NiFi users. With NiFi, though, we tend to think about designing dataflows a little bit differently. Rather than having several disparate, “stovepiped” flows, the preferred approach with NiFi is to have several inputs feed into the same dataflow. Data can then be easily routed (via RouteOnAttribute, for example) to “one-off subflows” if need be.
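As a rough illustration of the "several inputs, one flow, route when needed" idea, here is a minimal Python sketch. It is a conceptual model only, not NiFi's API: the real RouteOnAttribute processor is configured in the NiFi UI with Expression Language rules, and the `route_on_attribute` function, the `source` attribute, and the relationship names below are all hypothetical.

```python
def route_on_attribute(flowfile, rules, default="unmatched"):
    """Return the name of the first rule whose predicate matches the attributes."""
    for relationship, predicate in rules.items():
        if predicate(flowfile["attributes"]):
            return relationship
    return default

# Hypothetical routing rules keyed on a made-up "source" attribute.
rules = {
    "source_x": lambda attrs: attrs.get("source") == "X",
    "source_y": lambda attrs: attrs.get("source") == "Y",
}

# All inputs feed the same flow; only the attributes differ.
flowfiles = [
    {"attributes": {"filename": "a.csv", "source": "X"}},
    {"attributes": {"filename": "b.json", "source": "Y"}},
    {"attributes": {"filename": "c.xml", "source": "Z"}},
]

routed = {}
for ff in flowfiles:
    relationship = route_on_attribute(ff, rules)
    routed.setdefault(relationship, []).append(ff["attributes"]["filename"])

print(routed)
```

The point of the sketch is that the one-off handling lives in a routing decision inside a shared flow rather than in a separate, stovepiped pipeline per source.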

One of the benefits to having several disparate dataflows, though, is that it makes it much easier to answer when someone comes to you and says “I sent you a file last week. What did you do with it?” or “How do you process data that comes from this source?” You may not know exactly what happened to a specific file that they sent you, step-by-step, because of the different decision points in the flow, but at least you have a good idea by looking at the layout of the dataflow.


Apache NiFi (aka HDF) data flow across data centers

Short Description:

This article provides a step-by-step overview of how to set up a cross-data-center data flow using Apache NiFi.


Traditionally, enterprises have dealt with data flows, or data movement, within their own data centers. But as the world has flattened and a global presence has become the norm, enterprises face the challenge of collecting and connecting data from across their global footprint. The NSA faced this problem about a decade ago and built a solution that was later named Apache NiFi.

Apache NiFi is an easy-to-use, powerful, and reliable system to process and distribute data. With NiFi, as you will see, I can build a global data flow with minimal to no coding. You can learn more about NiFi from the Apache NiFi website; it is one of the best-documented Apache projects.

The focus of this article is just one specific NiFi feature that I believe no other software product handles as well: data transfer over the “site-to-site” protocol.

Business use case

One of the classic business problems is to push data from a location with a small IT footprint to the main data center, where all the data is collected and connected. This small IT footprint could be an oil rig in the middle of the ocean, a small bank branch in a remote mountain town, a sensor on a vehicle, and so on. So your business wants a mechanism to push the data generated at various locations to, say, headquarters in a reliable fashion, with all the bells and whistles of an enterprise data flow: lineage, security, provenance, auditing, ease of operations, and so on.

The data generated at my sources comes in various formats, such as txt, csv, json, xml, audio, and image, and ranges in size from a few MBs to GBs. Because bandwidth at my source data centers is low, I want to break these files into smaller chunks, stitch them back together at the destination, and load the result into my centralized Hadoop data lake.
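The chunk-and-stitch idea can be sketched in a few lines of Python. This is only a conceptual illustration, not how NiFi implements it: the function names, the 1 KB chunk size, and the in-memory byte string standing in for a file are all assumptions made for the example.

```python
import random

def split_into_chunks(data, chunk_size):
    """Split bytes into (index, payload) pairs so order can be restored later."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (offset // chunk_size, data[offset:offset + chunk_size])
        for offset in range(0, len(data), chunk_size)
    ]

def stitch_chunks(chunks):
    """Reassemble the original bytes, regardless of the order chunks arrived in."""
    ordered = sorted(chunks, key=lambda chunk: chunk[0])
    return b"".join(payload for _, payload in ordered)

# A stand-in for a source file: 10,240 bytes of sample content.
original = bytes(range(256)) * 40

chunks = split_into_chunks(original, chunk_size=1024)

# Simulate chunks arriving out of order over a slow, unreliable link.
random.Random(1).shuffle(chunks)

restored = stitch_chunks(chunks)
print(len(chunks), restored == original)
```

Carrying an index with each chunk is what lets the destination rebuild the file even when the pieces do not arrive in the order they were sent.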

Solution Architecture

Apache NiFi (aka Hortonworks Data Flow) is a perfect tool to solve this problem. The overall architecture looks something like Fig 1.

We have Australian and Russian data centers from which we want to move data to US headquarters. An edge instance of NiFi will sit in each of the Australian and Russian data centers and act as a data acquisition point. A NiFi processing cluster in the US will then receive and process the data coming from these global locations. We will build this end-to-end flow without any coding, using just the drag-and-drop GUI.

Build the data flow

Here are the high-level steps to build the overall data flow.

Step 1) Set up a NiFi instance at the Australian data center to act as the data acquisition instance. I will create a local instance of NiFi to stand in for my Australian data center.

Step 2) Set up a NiFi instance on a CentOS-based virtual machine to act as my NiFi data processing instance. This could be a NiFi cluster as well, but in my case it will be just a single instance.

Step 3) Build the NiFi data flow for the processing instance. This will have an input port, which indicates that this instance can accept data from other NiFi instances.

Step 4) Build the NiFi data flow for the data acquisition instance. This will have a “remote process group” that talks to the NiFi data processing instance via the site-to-site protocol.

Step 5) Test the overall flow.

Attached is the document that provides detailed step-by-step instructions on how to set this up.



Accurately Measuring Model Prediction Error

When assessing the quality of a model, being able to accurately measure its prediction error is of key importance. Often, however, techniques of measuring error are used that give grossly misleading results. This can lead to the phenomenon of over-fitting where a model may fit the training data very well, but will do a poor job of predicting results for new data not used in model training. Here is an overview of methods to accurately measure model prediction error.

Measuring Error

When building prediction models, the primary goal should be to make a model that most accurately predicts the desired target value for new data. The measure of model error that is used should be one that achieves this goal. In practice, however, many modelers instead report a measure of model error that is based not on the error for new data but on the error for the very same data that was used to train the model. The use of this incorrect error measure can lead to the selection of an inferior and inaccurate model.

Naturally, any model is highly optimized for the data it was trained on. The expected error the model exhibits on new data will always be higher than the error it exhibits on the training data. As an example, we could go out and sample 100 people and create a regression model to predict an individual’s happiness based on their wealth. We can record the squared error for how well our model does on this training set of a hundred people. If we then sampled a different 100 people from the population and applied our model to this new group of people, the squared error would almost always be higher in this second case.

It is helpful to illustrate this fact with an equation. We can develop a relationship between how well a model predicts on new data (its true prediction error and the thing we really care about) and how well it predicts on the training data (which is what many modelers in fact measure).

$$\text{True Prediction Error} = \text{Training Error} + \text{Training Optimism}$$

Here, Training Optimism is basically a measure of how much worse our model does on new data compared to the training data. The more optimistic we are, the better our training error will be compared to what the true error is and the worse our training error will be as an approximation of the true error.

The Danger of Overfitting

In general, we would like to be able to make the claim that the optimism is constant for a given training set. If this were true, we could make the argument that the model that minimizes training error, will also be the model that will minimize the true prediction error for new data. As a consequence, even though our reported training error might be a bit optimistic, using it to compare models will cause us to still select the best model amongst those we have available. So we could in effect ignore the distinction between the true error and training errors for model selection purposes.

Unfortunately, this does not work. It turns out that the optimism is a function of model complexity: as complexity increases, so does optimism. Thus our relationship above for true prediction error becomes something like this:

$$\text{True Prediction Error} = \text{Training Error} + f(\text{Model Complexity})$$

How is the optimism related to model complexity? As model complexity increases (for instance, by adding parameter terms in a linear regression) the model will always do a better job fitting the training data. This is a fundamental property of statistical models [1]. In our happiness prediction model, we could use people’s middle initials as predictor variables and the training error would go down. We could use stock prices on January 1st, 1990 for a now bankrupt company, and the error would go down. We could even just roll dice to get a data series and the error would still go down. No matter how unrelated the additional factors are to a model, adding them will cause training error to decrease.

But at the same time, as we increase model complexity we can see a change in the true prediction accuracy (what we really care about). If we build a model for happiness that incorporates clearly unrelated factors such as stock ticker prices a century ago, we can say with certainty that such a model must necessarily be worse than the model without the stock ticker prices. Although the stock prices will decrease our training error (if very slightly), they conversely must also increase our prediction error on new data as they increase the variability of the model’s predictions making new predictions worse. Furthermore, even adding clearly relevant variables to a model can in fact increase the true prediction error if the signal to noise ratio of those variables is weak.
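The effect of unrelated predictors can be checked with a small simulation. The sketch below is an illustration under stated assumptions, not the article's own experiment: the data-generating process (happiness as a linear function of wealth plus Gaussian noise), the sample sizes, the seed, and all function names are made up for the example. Ordinary least squares is fitted with a hand-written solver so that only the standard library is needed.

```python
import random

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x

def fit_ols(rows, targets):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def mean_squared_error(rows, targets, beta):
    residuals = [y - sum(b * v for b, v in zip(beta, r)) for r, y in zip(rows, targets)]
    return sum(e * e for e in residuals) / len(residuals)

def make_data(n, n_noise, rng):
    """Rows are [1, wealth, noise_1, ..., noise_k]; only wealth affects happiness."""
    rows, targets = [], []
    for _ in range(n):
        wealth = rng.gauss(0, 1)
        rows.append([1.0, wealth] + [rng.gauss(0, 1) for _ in range(n_noise)])
        targets.append(2.0 + 1.5 * wealth + rng.gauss(0, 1))
    return rows, targets

rng = random.Random(0)
MAX_NOISE = 20
noise_counts = list(range(0, MAX_NOISE + 1, 5))
train_x, train_y = make_data(100, MAX_NOISE, rng)
new_x, new_y = make_data(100, MAX_NOISE, rng)

train_errors, new_errors = [], []
for n_noise in noise_counts:
    k = 2 + n_noise  # intercept + wealth + the first n_noise unrelated columns
    beta = fit_ols([r[:k] for r in train_x], train_y)
    train_errors.append(mean_squared_error([r[:k] for r in train_x], train_y, beta))
    new_errors.append(mean_squared_error([r[:k] for r in new_x], new_y, beta))

for n_noise, tr, nw in zip(noise_counts, train_errors, new_errors):
    print(f"{n_noise:2d} unrelated predictors: train MSE {tr:.3f}, new-data MSE {nw:.3f}")
```

Because each model nests the previous one, the training error cannot go up as unrelated columns are added. The error on the fresh sample has no such guarantee and will typically drift upward, although any single run can deviate from that trend.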

Let’s see what this looks like in practice. We can implement our wealth and happiness model as a linear regression. We can start with the simplest regression possible where $\text{Happiness} = a + b\,\text{Wealth} + \epsilon$ and then we can add polynomial terms to model nonlinear effects. Each polynomial term we add increases model complexity. So we could get an intermediate level of complexity with a quadratic model like $\text{Happiness} = a + b\,\text{Wealth} + c\,\text{Wealth}^2 + \epsilon$ or a high level of complexity with a higher-order polynomial like

$$\text{Happiness} = a + b\,\text{Wealth} + c\,\text{Wealth}^2 + d\,\text{Wealth}^3 + e\,\text{Wealth}^4 + f\,\text{Wealth}^5 + g\,\text{Wealth}^6 + \epsilon.$$


The figure below illustrates the relationship between the training error, the true prediction error, and optimism for a model like this. The scatter plots on top illustrate sample data with regression lines corresponding to different levels of model complexity.

Training, optimism and true prediction error.

Increasing the model complexity will always decrease the model training error. At very high levels of complexity, we should be able to in effect perfectly predict every single point in the training data set and the training error should be near 0. Similarly, the true prediction error initially falls. The linear model without polynomial terms seems a little too simple for this data set. However, once we pass a certain point, the true prediction error starts to rise. At these high levels of complexity, the additional complexity we are adding helps us fit our training data, but it causes the model to do a worse job of predicting new data.

This is a case of overfitting the training data. In this region the model training algorithm is focusing on precisely matching random chance variability in the training set that is not present in the actual population. We can see this most markedly in the model that fits every point of the training data; clearly this is too tight a fit to the training data.

Preventing overfitting is a key to building robust and accurate prediction models. Overfitting is very easy to miss when only looking at the training error curve. To detect overfitting you need to look at the true prediction error curve. Of course, it is impossible to measure the exact true prediction curve (unless you have the complete data set for your entire population), but there are many different ways that have been developed to attempt to estimate it with great accuracy. The second section of this work will look at a variety of techniques to accurately estimate the model’s true prediction error.

An Example of the Cost of Poorly Measuring Error

Let’s look at a fairly common modeling workflow and use it to illustrate the pitfalls of using training error in place of the true prediction error 2. We’ll start by generating 100 simulated data points. Each data point has a target value we are trying to predict along with 50 different parameters. For instance, this target value could be the growth rate of a species of tree and the parameters are precipitation, moisture levels, pressure levels, latitude, longitude, etc. In this case however, we are going to generate every single data point completely randomly. Each number in the data set is completely independent of all the others, and there is no relationship between any of them.

For this data set, we create a linear regression model where we predict the target value using the fifty regression variables. Since we know everything is unrelated we would hope to find an $R^2$ of 0. Unfortunately, that is not the case and instead we find an $R^2$ of 0.5. That’s quite impressive given that our data is pure noise! However, we want to confirm this result so we do an F-test. This test measures the statistical significance of the overall regression to determine if it is better than what would be expected by chance. Using the F-test we find a p-value of 0.53. This indicates our regression is not significant.
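A short calculation suggests why a value near 0.5 is about what one should expect here rather than a fluke. Assuming (the article does not state this) that the target is Gaussian noise independent of the predictors and that the regression includes an intercept, the coefficient of determination for $p$ predictors and $n$ observations follows a Beta distribution, and its mean is simply the ratio of predictors to degrees of freedom:

```latex
R^2 \sim \mathrm{Beta}\!\left(\frac{p}{2},\ \frac{n-p-1}{2}\right),
\qquad
\mathbb{E}\left[R^2\right] = \frac{p/2}{p/2 + (n-p-1)/2} = \frac{p}{n-1} = \frac{50}{99} \approx 0.505
```

Under these assumptions, fitting 50 unrelated variables to 100 points yields an $R^2$ of roughly one half on average, purely from the number of parameters.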

If we stopped there, everything would be fine; we would throw out our model, which would be the right choice (it is pure noise after all!). However, a common next step would be to throw out only the parameters that were poor predictors, keep the ones that are relatively good predictors, and run the regression again. Let’s say we kept the parameters that were significant at the 25% level, of which there are 21 in this example case. Then we rerun our regression.

In this second regression we would find:

  • An $R^2$ of 0.36
  • A p-value of $5 \times 10^{-4}$
  • 6 parameters significant at the 5% level

Again, this data was pure noise; there was absolutely no relationship in it. But from our data we find a highly significant regression, a respectable $R^2$ (which can be very high compared to those found in some fields like the social sciences) and 6 significant parameters!

This is quite a troubling result: the procedure is not an uncommon one, yet it clearly leads to incredibly misleading results. It shows how easily statistical processes can be heavily biased when care is not taken to measure error accurately.




A time series is a sequence of data points, typically consisting of successive measurements made over a time interval. Forecasting is the use of a model to predict the future based on past information. This problem differs somewhat from the more familiar purely cross-sectional problem.

In this post, I take a recent Kaggle challenge as an example and share the findings and tricks I used. The competition, Rossmann Store Sales, attracted 3,738 data scientists, making it the second most popular competition by participants ever. Rossmann, a drug store giant that operates over 3,000 stores in Europe, challenged Kagglers to forecast 6 weeks of daily sales for 1,115 stores. The data mainly comprises store index, store type, competitor store information, holiday events, promotion events, whether the store is open, customers, and sales, which is what we’re tasked to predict.

When doing time series forecasting, there are a few things specific to time series that you need to know about:

  1. time-dependent features.
  2. validating by time.

I will walk through them later.
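To make the two points above concrete, here is a minimal Python sketch on synthetic data. It is not the author's competition code: the record layout, the 7-day lag, the cutoff date, and the function names are all assumptions chosen for illustration.

```python
from datetime import date, timedelta

def add_lag_feature(records, lag_days):
    """Time-dependent feature: attach the sales value from `lag_days` earlier."""
    sales_by_day = {r["date"]: r["sales"] for r in records}
    lagged = []
    for r in records:
        earlier = r["date"] - timedelta(days=lag_days)
        # None marks days whose lagged value falls before the start of the data.
        lagged.append({**r, f"sales_lag_{lag_days}": sales_by_day.get(earlier)})
    return lagged

def split_by_time(records, cutoff):
    """Validate by time: train strictly before the cutoff, validate on or after it."""
    train = [r for r in records if r["date"] < cutoff]
    valid = [r for r in records if r["date"] >= cutoff]
    return train, valid

# Four weeks of made-up daily sales for one store, with a weekly pattern.
start = date(2015, 1, 1)
records = [
    {"date": start + timedelta(days=i), "sales": 100 + (i % 7) * 10}
    for i in range(28)
]

with_lags = add_lag_feature(records, lag_days=7)
train, valid = split_by_time(with_lags, cutoff=start + timedelta(days=21))

print(len(train), len(valid))
print(with_lags[7])
```

Splitting on a date rather than at random keeps future information out of the training set, which mirrors how the model will actually be used. Note also that when forecasting several weeks ahead, a lag shorter than the forecast horizon would not be known at prediction time, so the choice of lag has to respect the horizon.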


Getting Started with Markov Chains

There are a number of R packages devoted to sophisticated applications of Markov chains. These include msm and SemiMarkov for fitting multistate models to panel data, mstate for survival analysis applications, TPmsm for estimating transition probabilities for 3-state progressive disease models, heemod for applying Markov models to health care economic applications, HMM and depmixS4 for fitting Hidden Markov Models, and mcmc for working with Monte Carlo Markov Chains. All of these assume considerable knowledge of the underlying theory. To my knowledge, only DTMCPack and the relatively recent package markovchain were written to facilitate basic computations with Markov chains.

In this post, we’ll explore some basic properties of discrete time Markov chains using the functions provided by the markovchain package supplemented with standard R functions and a few functions from other contributed packages. “Chapter 11” of Snell’s online probability book will be our guide. The calculations displayed here illustrate some of the theory developed in this document. In the text below, section numbers refer to this document.

A large part of working with discrete time Markov chains involves manipulating the matrix of transition probabilities associated with the chain. This first section of code replicates the Oz transition probability matrix from section 11.1 and uses the plotmat() function from the diagram package to illustrate it. Then, the efficient operator %^% from the expm package is used to raise the Oz matrix to the third power. Finally, left matrix multiplication of Oz^3 by the distribution vector u = (1/3, 1/3, 1/3) gives the weather forecast three days ahead.
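The post's own code is in R and is not reproduced here. As a language-neutral check of the arithmetic, the following Python sketch performs the same two computations with exact fractions. It assumes the Land of Oz matrix as given in the referenced chapter, with states ordered Rain, Nice, Snow; the helper functions are written for this example and are not part of any package.

```python
from fractions import Fraction as F

STATES = ["Rain", "Nice", "Snow"]

# Land of Oz transition matrix: row = today's weather, column = tomorrow's.
OZ = [
    [F(1, 2), F(1, 4), F(1, 4)],
    [F(1, 2), F(0), F(1, 2)],
    [F(1, 4), F(1, 4), F(1, 2)],
]

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated multiplication."""
    n = len(m)
    result = [[F(int(i == j)) for j in range(n)] for i in range(n)]  # identity
    for _ in range(power):
        result = mat_mul(result, m)
    return result

def vec_mat(u, m):
    """Left-multiply matrix m by row vector u."""
    return [sum(u[i] * m[i][j] for i in range(len(u))) for j in range(len(m[0]))]

oz_cubed = mat_pow(OZ, 3)
u = [F(1, 3)] * 3                 # equal chance of each state today
forecast = vec_mat(u, oz_cubed)   # distribution three days ahead

for state, prob in zip(STATES, forecast):
    print(f"{state}: {prob} (about {float(prob):.3f})")
```

Using exact fractions avoids rounding questions: the three-day forecast comes out as 77/192 for rain, 19/96 for nice weather, and 77/192 for snow, which sums to 1 as a probability distribution should.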

