Restaurant Revenue Prediction Kaggle solution

Predict annual restaurant sales based on objective measurements

With over 1,200 quick service restaurants across the globe, TFI is the company behind some of the world’s most well-known brands: Burger King, Sbarro, Popeyes, Usta Donerci, and Arby’s. They employ over 20,000 people in Europe and Asia and make significant daily investments in developing new restaurant sites.

Right now, deciding when and where to open new restaurants is largely a subjective process based on the personal judgement and experience of development teams. This subjective data is difficult to accurately extrapolate across geographies and cultures.

New restaurant sites take large investments of time and capital to get up and running. When the wrong location for a restaurant brand is chosen, the site closes within 18 months and operating losses are incurred.

Finding a mathematical model to increase the effectiveness of investments in new restaurant sites would allow TFI to invest more in other important business areas, like sustainability, innovation, and training for new employees. Using demographic, real estate, and commercial data, this competition challenges you to predict the annual restaurant sales of 100,000 regional locations.

TFI would love to hire an expert Kaggler like you to head up their growing data science team in Istanbul or Shanghai. You’d be tackling problems like the one featured in this competition on a global scale. See the job description here >>

TFI has provided a training set of 137 restaurants and a test set of 100,000 restaurants. The data columns include the open date, location, city type, and three categories of obfuscated data: demographic data, real estate data, and commercial data. The revenue column indicates a (transformed) revenue of the restaurant in a given year and is the target of the predictive analysis. (more…)
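As a concrete starting point, here is a minimal baseline sketch in Python, not a reference solution: load the competition's train.csv and test.csv with pandas, fit a random forest regressor on the 37 obfuscated P-columns only, and write a submission. The column names (Id, P1…P37, revenue) are my assumption based on the competition's data page.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Use only the obfuscated numeric features P1..P37 for this baseline.
p_cols = ['P%d' % i for i in range(1, 38)]
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(train[p_cols], train['revenue'])

# One predicted revenue per test Id.
submission = pd.DataFrame({'Id': test['Id'],
                           'Prediction': model.predict(test[p_cols])})
submission.to_csv('submission.csv', index=False)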

Read More

Random Forest Approach using scikit-learn

Here’s my Python code to make the prediction. The training and test data have been massaged so that string values are converted to integers, and the output is a single list of survived values. (I removed the passenger id from all the files because it is not used in the analysis, so remember to add it back when preparing a submission.) Some features in the data are discrete, e.g. sex, pclass, sibling, parch, while others are continuous, e.g. age, fare. I think fine-tuning the boundaries of the continuous features will be key to improving the score. Anyway, it is a start; I need more insight into the data set to get a higher score. Finally, note that the submission validation has changed recently too.

from sklearn.ensemble import RandomForestClassifier
import csv
import numpy as np

# Read the training data; the first column is the label (survived).
with open('./train_sk.csv', 'r') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    train_data = np.array([row for row in reader], dtype=float)

# Fit a random forest on the feature columns against the label column.
forest = RandomForestClassifier(n_estimators=100)
forest = forest.fit(train_data[:, 1:], train_data[:, 0])

# Read the test data (features only).
with open('./test_sk.csv', 'r') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    test_data = np.array([row for row in reader], dtype=float)

# Predict and write one integer label per row.
output = forest.predict(test_data).astype(int)
np.savetxt('./output.csv', output, fmt='%d', delimiter=',')


Read More

Install RabbitMQ 3.3 on CentOS 7

This is a step-by-step guide to installing RabbitMQ 3.3.5-1 for the series of topics about AMQP messaging. RabbitMQ supports most Linux distributions, Mac OS, and MS Windows. I will demonstrate on CentOS 7 as an example.

1. Install the compiler and related packages if necessary

# sudo yum install gcc glibc-devel make ncurses-devel openssl-devel autoconf

2. Update to the latest EPEL

# wget http://dl.fedoraproject.org/pub/epel/7/x86_64/e/epel-release-7-1.noarch.rpm
# wget http://rpms.famillecollet.com/enterprise/remi-release-7.rpm
# sudo rpm -Uvh remi-release-7*.rpm epel-release-7*.rpm

3. Install Erlang

# wget http://packages.erlang-solutions.com/erlang-solutions-1.0-1.noarch.rpm
# sudo rpm -Uvh erlang-solutions-1.0-1.noarch.rpm
# yum install -y erlang

Type “erl” to verify that Erlang is installed correctly.

4. Install Rabbit MQ

# wget http://www.rabbitmq.com/releases/rabbitmq-server/v3.3.5/rabbitmq-server-3.3.5-1.noarch.rpm

Add the necessary signing key for package verification:

# rpm --import http://www.rabbitmq.com/rabbitmq-signing-key-public.asc

Install with the command:

# yum install rabbitmq-server-3.3.5-1.noarch.rpm

Issue the following command to enable the web UI plugin:

# sudo rabbitmq-plugins enable rabbitmq_management

Change the ownership of the data directory:

# chown -R rabbitmq:rabbitmq /var/lib/rabbitmq/

Issue the following command to start the server (*1):

# /usr/sbin/rabbitmq-server

5. Set up an admin user account
In /usr/sbin, create a new user “mqadmin”:

# rabbitmqctl add_user mqadmin mqadmin

Issue the following command to assign the administrator role:

# rabbitmqctl set_user_tags mqadmin administrator

Issue the following command to grant permissions:

# rabbitmqctl set_permissions -p / mqadmin ".*" ".*" ".*"

Now you can access the web admin UI at http://host:15672/ (*2).
Remark:

(*1) Starting the server with “service rabbitmq-server start”, as documented in the official manual, may fail; see this thread for a fix: https://groups.google.com/forum/#!topic/rabbitmq-users/iK3q4GLpHXY

(*2) The default guest/guest user account can only be accessed via localhost
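With the broker running and the mqadmin account in place, a quick way to smoke-test the installation is a short publish/consume round trip. Below is a minimal sketch using the pika Python client (an assumption: pip install pika, pika 1.x API; the queue name is arbitrary).

import pika

# Connect with the mqadmin account created above.
credentials = pika.PlainCredentials('mqadmin', 'mqadmin')
connection = pika.BlockingConnection(
    pika.ConnectionParameters(host='localhost', credentials=credentials))
channel = connection.channel()

# Declare a throwaway queue, publish one message, and read it back.
channel.queue_declare(queue='smoke_test')
channel.basic_publish(exchange='', routing_key='smoke_test', body='ping')
method, properties, body = channel.basic_get(queue='smoke_test', auto_ack=True)
print(body)  # b'ping'
connection.close()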

Read More

Xgboost example 1

The purpose of this Vignette is to show you how to use Xgboost to discover and understand your own dataset better.

This Vignette is not about predicting anything (see Xgboost presentation). We will explain how to use Xgboost to highlight the link between the features of your data and the outcome.

Package loading:

require(xgboost)
require(Matrix)
require(data.table)
if (!require('vcd')) install.packages('vcd') 

The vcd package is used only for one of its embedded datasets.

Preparation of the dataset

Numeric vs. categorical variables

Xgboost manages only numeric vectors.

What to do when you have categorical data?

A categorical variable has a fixed number of different values. For instance, if a variable called Colour can take only one of three values: red, blue or green, then Colour is a categorical variable.

In R, a categorical variable is called a factor.

Type ?factor in the console for more information.

To answer the question above, we will convert the categorical variables to numeric ones.
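The vignette carries out this conversion in R; purely as an illustration of the idea, here is one-hot encoding of a toy Colour column in Python with pandas (hypothetical data, not the vignette's dataset):

import pandas as pd

# Each level of the categorical column becomes its own 0/1 indicator column.
df = pd.DataFrame({'Colour': ['red', 'blue', 'green', 'red']})
print(pd.get_dummies(df, columns=['Colour']))
# Columns produced: Colour_blue, Colour_green, Colour_red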

(more…)

Read More

How to Install Docker on Ubuntu

Docker is a container-based software framework for automating deployment of applications. “Containers” are encapsulated, lightweight, and portable application modules.

Step 1: Download and Install Docker.

wget -qO- https://get.docker.com/ | sh

Step 2: Add yourself to the docker group

sudo usermod -aG docker <username>

Step 3: Log out and log back in so the group change takes effect, then start Docker

sudo service docker start

Step 4: Test your Docker installation with the hello-world image.

sudo docker run hello-world

#Hello from Docker.
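The same hello-world check can also be driven programmatically. Here is a minimal sketch using the docker-py client (an assumption: pip install docker; it talks to the local daemon, so your user must be in the docker group as set up above).

import docker

client = docker.from_env()  # connect to the local Docker daemon
# Runs the hello-world container and returns its output as bytes.
output = client.containers.run('hello-world')
print(output.decode())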


Read More

How to build Hue on Ubuntu Cluster

Here is a step-by-step guide to getting Hue up and running.


Step 1: Fetch the Hue source code from GitHub

sudo apt-get install git
 
git clone https://github.com/cloudera/hue.git
cd hue

Step 2: Install a couple of development packages, separately or all at once using the command below:

sudo apt-get install ant gcc g++ libkrb5-dev libmysqlclient-dev libssl-dev libsasl2-dev libsasl2-modules-gssapi-mit libsqlite3-dev libtidy-0.99-0 libxml2-dev libxslt-dev make libldap2-dev maven python-dev python-setuptools libgmp3-dev

Step 3: Time to build Hue

make apps

Step 4: Start the development server:

./build/env/bin/hue runserver

and now visit http://127.0.0.1:8000/ !

Read More

Installation of hadoop 2.6 on Mac OS X

After comparing different guides on the internet, I ended up with my own practical version based on the official Hadoop guide, with a manual download.

1. Required software

1) Java

Run the following command in a terminal:

$ java -version

If Java is already installed, you will see a result similar to:

java version "1.8.0_25"
Java(TM) SE Runtime Environment (build 1.8.0_25-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)

If not, the terminal will prompt you to install it, or you can download the Java JDK here.

2) SSH

First enable Remote Login in System Preferences -> Sharing.

Now check that you can ssh to the localhost without a passphrase:

$ ssh localhost

If you cannot ssh to localhost without a passphrase, execute the following commands:

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

2. Get a Hadoop distribution

You can download it from Apache Download Mirror.

3. Prepare to start the Hadoop cluster

(more…)

Read More

Hadoop on top of Mesos Cluster

This post describes how to set up Apache Hadoop to work with Apache Mesos, so that Hadoop runs on the same Mesos cluster. The instructions should work for any cluster running CentOS, or even other Linux distributions after some small changes.

Prerequisites:

  • I assume you have already installed Mesos. If not, follow the instructions here.


Step 1: Install Java

### install OpenJDK Runtime Environment (Java SE 8)

sudo yum install java-1.8.0-openjdk

Step 2: Download Hadoop 2.7.1 and extract it

wget http://apache.claz.org/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
tar xvzf hadoop-2.7.1.tar.gz
mv hadoop-2.7.1 hadoop

Read More

Marathon Installation in Mesos Cluster

This post will walk you through setting up Marathon in a Mesos cluster on CentOS 7.

Step 1: Add the repository

sudo rpm -Uvh http://repos.mesosphere.io/el/7/noarch/RPMS/mesosphere-el-repo-7-1.noarch.rpm

Step 2: Install

sudo yum -y install marathon

Step 3: Restart Services

sudo service marathon restart

Verifying Installation

If the packages were installed and configured correctly, we should be able to access the Marathon console at http://<master-ip>:8080 (where <master-ip> is any of the master IP addresses).
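Beyond the console, Marathon exposes a REST API on the same port. As a quick check, here is a minimal sketch in Python that launches a trivial app through /v2/apps (assumptions: the requests library is installed; the app id and command are arbitrary examples; replace <master-ip> as above).

import requests

app = {
    'id': 'hello-marathon',  # hypothetical app id
    'cmd': 'while true; do echo hello; sleep 10; done',
    'cpus': 0.1,
    'mem': 32,
    'instances': 1,
}
# POST the app definition; Marathon schedules it on the Mesos cluster.
resp = requests.post('http://<master-ip>:8080/v2/apps', json=app)
print(resp.status_code, resp.json())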


Read More

Mesos Installation in CentOS 7.0

This post will walk through setting up a cluster which includes Apache Mesos.

Highly-available clusters will typically have multiple master nodes and any number of slave nodes. Each master node runs Apache Mesos and ZooKeeper (to provide leader election). Running three ZooKeeper nodes allows one of them to fail while the service remains available, since ZooKeeper only needs a majority (here, two of three) of its nodes up (see ZooKeeper reliability for more information). We recommend running at least three master nodes for a highly-available configuration. Run the steps below on each master node.

Master Node Setup:

The easiest way to install Mesos is via the GitHub repositories. Alternatively, you can download the latest deb or rpm directly from the Mesosphere downloads page and install it manually.

Mesosphere has official package repositories which connect directly to the native package management tools of your favorite Linux distribution — namely apt-get and yum — to install Mesos on top of the most common Linux distributions (RedHat, CentOS, Ubuntu and Debian).

Step 1: Install Mesos and ZooKeeper on the master machines:

# Add the repository

$sudo rpm -Uvh http://repos.mesosphere.io/el/7/noarch/RPMS/mesosphere-el-repo-7-1.noarch.rpm
$sudo yum -y install mesos marathon
$sudo yum -y install mesosphere-zookeeper

Step 2: Configuration
ZooKeeper

Set /var/lib/zookeeper/myid to a unique integer between 1 and 255 on each node.

$sudo nano /var/lib/zookeeper/myid
#set 1


Start Zookeeper

$sudo systemctl start zookeeper


Step 3: Set the Mesos master addresses in the Mesos ZooKeeper setting:

On each node, replacing the IP addresses below with each master’s IP address, set /etc/mesos/zk to:

$sudo nano /etc/mesos/zk
zk://1.1.1.1:2181,2.2.2.2:2181,3.3.3.3:2181/mesos

Step 4: Disable the mesos-slave service on each master server.

$sudo systemctl stop mesos-slave.service
$sudo systemctl disable mesos-slave.service

Step 5: Restart all the master nodes using the following command.

$sudo systemctl restart mesos-master
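To verify that the masters came up and elected a leader, you can query a master's state endpoint. Here is a minimal sketch in Python (assuming the requests library is installed; replace <master-ip> with any master's IP address):

import requests

# The master reports the elected leader and registered slaves here.
state = requests.get('http://<master-ip>:5050/master/state.json').json()
print(state['leader'])  # e.g. master@1.1.1.1:5050
print(len(state['slaves']), 'slave(s) registered')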

(more…)

Read More