Retail is one of the most important business domains for data science and data mining because of its prolific data and its many optimization problems, such as setting optimal prices, discounts, recommendations, and stock levels, that can be solved with data analysis methods. The rise of omni-channel retail, which integrates marketing, customer relationship management, and inventory management across all online and offline channels, has produced a plethora of correlated data, increasing both the importance and the capabilities of data-driven decisions.

Although there are many books on data mining in general and its applications to marketing and customer relationship management in particular [BE11, AS14, PR13, etc.], most of them are structured as data scientist manuals: they focus on algorithms and methodologies and assume that human decisions play a central role in transforming analytical findings into business actions. In this article we take a more rigorous approach and provide a systematic view of econometric models and objective functions that can leverage data analysis to make more automated decisions. We describe a hypothetical revenue management platform that consumes a retailer’s data and controls different aspects of the retailer’s strategy, such as pricing, marketing, and inventory.
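As a toy illustration of the kind of objective function such a platform would optimize automatically, consider picking a profit-maximizing price from a fitted demand curve. The demand model and all coefficients below are hypothetical, not drawn from real retailer data:

```python
import numpy as np

# Hypothetical constant-elasticity demand model: demand(p) = a * p**(-e).
# The coefficients a, e and the unit cost c are illustrative, not fitted to data.
a, e, c = 1000.0, 2.0, 5.0

prices = np.linspace(6.0, 20.0, 1401)   # candidate price grid, step 0.01
demand = a * prices ** (-e)             # predicted units sold at each price
profit = (prices - c) * demand          # profit at each candidate price

best_price = prices[np.argmax(profit)]
# Theory agrees: for e > 1 the optimum is p* = c * e / (e - 1) = 10.0 here
```

A real platform would fit the demand model from transaction data and add business constraints, but the pattern of "econometric model plus objective function" is the same.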





The shortcomings and drawbacks of batch-oriented data processing were widely recognized by the Big Data community quite a long time ago. It became clear that real-time query processing and in-stream processing are an immediate need in many practical applications. In recent years this idea has gained a lot of traction, and a whole family of solutions such as Twitter’s Storm, Yahoo’s S4, Cloudera’s Impala, Apache Spark, and Apache Tez has appeared and joined the army of Big Data and NoSQL systems. This article is an effort to explore techniques used by developers of in-stream data processing systems, trace the connections of these techniques to massive batch processing and OLTP/OLAP databases, and discuss how one unified query engine can support in-stream, batch, and OLAP processing at the same time.

At Grid Dynamics, we recently faced the need to build an in-stream data processing system that crunches about 8 billion events daily while providing fault tolerance and strict transactionality, i.e., none of these events can be lost or duplicated. The system was designed to supplement and succeed an existing Hadoop-based system that suffered from high data processing latency and high maintenance costs. The requirements and the system itself were so generic and typical that we describe it below as a canonical model, like an abstract problem statement.
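One common way to get the "no loss, no duplication" guarantee is to let the transport redeliver on failure (at-least-once delivery) and have the consumer deduplicate by unique event id. The sketch below is a minimal illustration of that idea; the event ids and the in-memory seen-set are made up, and a production system would persist this state atomically with the processing results:

```python
def process_stream(events, handle):
    """Process (event_id, payload) pairs exactly once, assuming at-least-once delivery."""
    seen = set()
    for event_id, payload in events:
        if event_id in seen:
            continue          # duplicate redelivery: already processed, skip it
        handle(payload)
        seen.add(event_id)    # record the id only after successful processing

results = []
process_stream([(1, 'a'), (2, 'b'), (1, 'a'), (3, 'c')], results.append)
# results == ['a', 'b', 'c']: the redelivered event 1 was handled only once
```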


Spark and Storm face new competition for real-time Hadoop processing

Real-time processing of streaming data in Hadoop typically comes down to choosing between two projects: Storm or Spark. But a third contender, which has been open-sourced from a formerly commercial-only offering, is about to enter the race, and like those components, it may have a future outside of Hadoop.

DataTorrent RTS (real-time streaming) has long been a commercial offering for live data processing apart from the family of Apache Foundation open source projects around Hadoop. But now DataTorrent (the company) is preparing to open-source the core DataTorrent RTS engine, offer it under the same Apache 2.0 licensing as its competitors, and eventually contribute it to the Apache Foundation for governance.

Built for business

Project Apex, as the open source version of DataTorrent RTS’s engine is to be called, is meant not only to compete with Storm and Spark but to be superior to them: to run faster (10 to 100 times faster than Spark, it’s claimed), to be easier to program, to better support enterprise needs like fault tolerance and scalability, and to make it easier to demonstrate the value of Hadoop to a business owner.


Walmart Recruiting II: Sales in Stormy Weather

Predict how sales of weather-sensitive products are affected by snow and rain

Walmart operates 11,450 stores in 27 countries, managing inventory across varying climates and cultures. Extreme weather events, like hurricanes, blizzards, and floods, can have a huge impact on sales at the store and product level.

In their second Kaggle recruiting competition, Walmart challenges participants to accurately predict the sales of 111 potentially weather-sensitive products (like umbrellas, bread, and milk) around the time of major weather events at 45 of their retail locations.

Intuitively, we may expect an uptick in the sales of umbrellas before a big thunderstorm, but it’s difficult for replenishment managers to correctly predict the level of inventory needed to avoid being out-of-stock or overstocked during and after that storm. Walmart relies on a variety of vendor tools to predict sales around extreme weather events, but it’s an ad-hoc and time-consuming process that lacks a systematic measure of effectiveness.

Helping Walmart better predict sales of weather-sensitive products will keep valued customers out of the rain. It could also earn you a position at one of the most data-driven retailers in the world!

You have been provided with sales data for 111 products whose sales may be affected by the weather (such as milk, bread, and umbrellas). These 111 products are sold at 45 different Walmart locations. Some of the products may be similar items (such as milk) but have different ids in different stores/regions/suppliers. The 45 locations are covered by 20 weather stations (i.e. some of the stores are nearby and share a weather station).

The competition task is to predict the amount of each product sold around the time of major weather events. For the purposes of this competition, we have defined a weather event as any day in which more than an inch of rain or two inches of snow was observed. You are asked to predict the units sold for a window of ±3 days surrounding each storm.
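The event-window definition above translates directly into code. Here is a sketch assuming daily precipitation totals keyed by date; the toy weather records are made up for illustration:

```python
from datetime import date, timedelta

def event_windows(daily_weather, window=3):
    """daily_weather: {date: (rain_in, snow_in)} -> set of dates needing predictions."""
    days = set()
    for day, (rain, snow) in daily_weather.items():
        if rain > 1.0 or snow > 2.0:          # the competition's weather-event threshold
            for offset in range(-window, window + 1):
                days.add(day + timedelta(days=offset))
    return days

weather = {
    date(2014, 1, 10): (1.4, 0.0),   # > 1 inch of rain: a weather event
    date(2014, 1, 20): (0.2, 0.5),   # ordinary day, no window needed
}
target_days = event_windows(weather)
# 7 days (Jan 7 through Jan 13) fall in the +/- 3 day window around the storm
```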



Restaurant Revenue Prediction Kaggle solution

Predict annual restaurant sales based on objective measurements

With over 1,200 quick service restaurants across the globe, TFI is the company behind some of the world’s most well-known brands: Burger King, Sbarro, Popeyes, Usta Donerci, and Arby’s. They employ over 20,000 people in Europe and Asia and make significant daily investments in developing new restaurant sites.

Right now, deciding when and where to open new restaurants is largely a subjective process based on the personal judgement and experience of development teams. This subjective data is difficult to accurately extrapolate across geographies and cultures.

New restaurant sites take large investments of time and capital to get up and running. When the wrong location for a restaurant brand is chosen, the site closes within 18 months and operating losses are incurred.

Finding a mathematical model to increase the effectiveness of investments in new restaurant sites would allow TFI to invest more in other important business areas, like sustainability, innovation, and training for new employees. Using demographic, real estate, and commercial data, this competition challenges you to predict the annual restaurant sales of 100,000 regional locations.

TFI would love to hire an expert Kaggler like you to head up their growing data science team in Istanbul or Shanghai. You’d be tackling problems like the one featured in this competition on a global scale. See the job description here >>

TFI has provided a dataset with 137 restaurants in the training set and a test set of 100,000 restaurants. The data columns include the open date, location, city type, and three categories of obfuscated data: demographic, real estate, and commercial. The revenue column indicates the (transformed) revenue of the restaurant in a given year and is the target of the predictive analysis.
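With only 137 training rows against 100,000 test rows, overfitting is the main danger, so a heavily regularized baseline is a reasonable starting point. Below is a numpy-only ridge regression sketch on synthetic stand-in features (the real columns are obfuscated); the feature count and regularization strength are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_features = 137, 10                         # 137 mirrors the tiny training set
X = rng.normal(size=(n_train, n_features))            # stand-in for the obfuscated columns
true_w = rng.normal(size=n_features)
y = X @ true_w + rng.normal(scale=0.1, size=n_train)  # stand-in for transformed revenue

lam = 10.0                                            # illustrative regularization strength
w = np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)  # closed-form ridge
predictions = X @ w
```

The closed-form solve is practical here precisely because the training set is so small; the shrinkage from lam keeps coefficients conservative when extrapolating to a test set 700 times larger.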


Otto Group Product Classification Challenge 3rd Position solution

The Otto Group is one of the world’s biggest e-commerce companies, with subsidiaries in more than 20 countries, including Crate & Barrel (USA), (Germany) and 3 Suisses (France). We sell millions of products worldwide every day, with several thousand products being added to our product line.

A consistent analysis of the performance of our products is crucial. However, due to our diverse global infrastructure, many identical products get classified differently. Therefore, the quality of our product analysis depends heavily on the ability to accurately cluster similar products. The better the classification, the more insights we can generate about our product range.
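The competition scored submissions with multiclass log loss. A small numpy sketch of that metric follows, with the standard clipping that keeps the logarithm finite; the toy probability matrix is arbitrary:

```python
import numpy as np

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """y_true: (n,) true class indices; probs: (n, k) predicted class probabilities."""
    probs = np.clip(probs, eps, 1 - eps)
    probs = probs / probs.sum(axis=1, keepdims=True)   # renormalize rows after clipping
    n = len(y_true)
    return -np.mean(np.log(probs[np.arange(n), y_true]))

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
loss = multiclass_log_loss(np.array([0, 1]), probs)
# loss = -(log(0.7) + log(0.8)) / 2, roughly 0.29
```

Because the metric punishes confident wrong answers severely, well-calibrated probabilities matter as much as raw accuracy in this competition.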


Random Forest Approach using scikit-learn

Here’s my Python code to make the prediction. The training and testing data have been massaged so that string values are converted to integers, and the output is a single list of survived values. (I removed the passenger id from all the files because it is not used in the analysis, so remember to add the passenger id back when making a submission.) There are some discrete features in the data, e.g. sex, pclass, sibling, parch, and some non-discrete features, e.g. age, fare. I think a key improvement will be fine-tuning the boundaries of the non-discrete features. Anyway, it is a beginning; I need more insight into the data set to get a higher score. Finally, note that the submission validation has changed lately too.

from sklearn.ensemble import RandomForestClassifier
import csv
import numpy as np

# Read the training data; the first column is the survived label
csv_file_object = csv.reader(open('./train_sk.csv', 'r'))
train_header = next(csv_file_object)  # skip the header row
train_data = []
for row in csv_file_object:
    train_data.append(row)
train_data = np.array(train_data, dtype=float)

forest = RandomForestClassifier(n_estimators=100)[:, 1:], train_data[:, 0])

# Read the test data (same columns, without the label)
test_file_object = csv.reader(open('./test_sk.csv', 'r'))
test_header = next(test_file_object)  # skip the header row
test_data = []
for row in test_file_object:
    test_data.append(row)
test_data = np.array(test_data, dtype=float)

output = forest.predict(test_data).astype(int)
np.savetxt('./output.csv', output, delimiter=',', fmt='%d')



Install RabbitMQ 3.3 on CentOS 7

This is a step-by-step guide to installing RabbitMQ for the series of topics about AMQP messaging. RabbitMQ supports most Linux distributions, Mac OS, and MS Windows. I will demonstrate it on CentOS 7 as an example.

1. Install the compiler and related packages if necessary

# sudo yum install gcc glibc-devel make ncurses-devel openssl-devel autoconf

2. Update latest EPEL

# wget
# wget
# sudo rpm -Uvh remi-release-7*.rpm epel-release-7*.rpm

3. Install Erlang

# wget
# yum install -y erlang

Type “erl” to verify that Erlang is installed correctly.

4. Install Rabbit MQ

# wget

Add the necessary keys for verification by

# rpm --import

Install with command

# yum install rabbitmq-server-3.3.5-1.noarch.rpm

Issue command to turn on the web UI plugin

# sudo rabbitmq-plugins enable rabbitmq_management

Change permission

# chown -R rabbitmq:rabbitmq /var/lib/rabbitmq/

Issue following command to start the server (*1)

# /usr/sbin/rabbitmq-server

5. Setup admin user account
In /usr/sbin, create new user “mqadmin” by

# rabbitmqctl add_user mqadmin mqadmin

Issue the following command to assign the administrator role


# rabbitmqctl set_user_tags mqadmin administrator

Issue the following command to grant permissions

# rabbitmqctl set_permissions -p / mqadmin ".*" ".*" ".*"

Now you can access the web admin by http://host:15672/ (*2)
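The same credentials also work against the management HTTP API on the same port, which is handy for scripting health checks. Below is a stdlib-only Python sketch that builds the authenticated request; the host name is a placeholder and /api/overview is the standard status endpoint of the management plugin:

```python
import base64
import urllib.request

host = 'localhost'                      # placeholder: your RabbitMQ host
user, password = 'mqadmin', 'mqadmin'   # the account created above

token = base64.b64encode(f'{user}:{password}'.encode()).decode()
req = urllib.request.Request(f'http://{host}:15672/api/overview')
req.add_header('Authorization', f'Basic {token}')
# urllib.request.urlopen(req).read() would return the cluster status as JSON
```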

(*1) You may not be able to start the server with the command “service rabbitmq-server start” as documented in the official manual; please read this thread to resolve the issue:!topic/rabbitmq-users/iK3q4GLpHXY

(*2) The default guest/guest user account can only be accessed via localhost


Installing Python scikit-learn package

scikit-learn is a Python module integrating classic machine learning algorithms into the tightly-knit world of scientific Python packages (numpy, scipy, matplotlib). It aims to provide simple and efficient solutions to learning problems that are accessible to everybody and reusable in various contexts: machine learning as a versatile tool for science and engineering.

Step 1: Dependencies

First, we need to install all dependencies for it:

# yum update && yum install scipy.x86_64 numpy.x86_64 python-devel.x86_64 python-matplotlib.x86_64 python-pip.noarch gcc-c++.x86_64

Step 2: Installing scikit-learn using pip

Now, I will use the fastest way to install the package using pip:

# pip-python install scikit-learn

This command should give output very similar to this:

# warning: no files found matching ''
# warning: no files found matching '*.TXT' under directory 'sklearn/datasets'
# Installing /usr/lib64/python2.7/site-packages/scikit_learn-0.9-py2.7-
# Successfully installed scikit-learn
# Cleaning up...

Step 3: Running the examples

The final step is to run several examples provided here
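Before diving into the examples, a quick smoke test confirms the installation works: fit a 1-nearest-neighbor classifier on four points and classify a new one (the data below is arbitrary):

```python
from sklearn.neighbors import KNeighborsClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]

clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
pred = clf.predict([[0.9, 0.9]])[0]
# pred == 1: the nearest training point is [1, 1], which is labeled 1
```

If this runs without an import error and prints a sensible label, scikit-learn and its numpy/scipy dependencies are wired up correctly.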

