Lucene In-Memory Search Example and Sample Code

More sample code:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

import java.io.IOException;

public class LuceneTest {
   public static void main(String[] args) {
      // Construct a RAMDirectory to hold the in-memory representation
      // of the index.
      RAMDirectory idx = new RAMDirectory();
      try {
         // Make a writer to create the index
         IndexWriter writer =
                 new IndexWriter(idx,
                         new StandardAnalyzer(Version.LUCENE_30),
                         IndexWriter.MaxFieldLength.LIMITED);
         // Add some Document objects containing quotes
         writer.addDocument(createDocument("Theodore Roosevelt",
                 "It behooves every man to remember that the work of the " +
                         "critic is of altogether secondary importance, and that, " +
                         "in the end, progress is accomplished by the man who does " +
                         "things."));
         writer.addDocument(createDocument("Friedrich Hayek",
                 "The case for individual freedom rests largely on the " +
                         "recognition of the inevitable and universal ignorance " +
                         "of all of us concerning a great many of the factors on " +
                         "which the achievements of our ends and welfare depend."));
         writer.addDocument(createDocument("Ayn Rand",
                 "There is nothing to take a man's freedom away from " +
                         "him, save other men. To be free, a man must be free " +
                         "of his brothers."));
         writer.addDocument(createDocument("Mohandas Gandhi",
                 "Freedom is not worth having if it does not connote " +
                         "freedom to err."));
         // Optimize and close the writer to finish building the index
         writer.optimize();
         writer.close();
         // Build an IndexSearcher using the in-memory index
         Searcher searcher = new IndexSearcher(idx);
         // Run some queries
         search(searcher, "freedom");
         search(searcher, "free");
         search(searcher, "progress or achievements");
         searcher.close();
      } catch (IOException ioe) {
         // In this example we aren't really doing any I/O, so this
         // exception should never actually be thrown.
         ioe.printStackTrace();
      } catch (ParseException pe) {
         pe.printStackTrace();
      }
   }

   /**
    * Make a Document object with an un-indexed title field and an
    * indexed content field.
    */
   private static Document createDocument(String title, String content) {
      Document doc = new Document();
      // Add the title as an unindexed field...
      doc.add(new Field("title", title, Field.Store.YES, Field.Index.NO));
      // ...and the content as an indexed, analyzed field. Lucene can
      // index very large chunks of text without storing the entire
      // content verbatim in the index; here the content is stored as
      // well, so it can be printed alongside the search results.
      doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED));
      return doc;
   }

   /**
    * Searches for the given string in the "content" field.
    */
   private static void search(Searcher searcher, String queryString)
           throws ParseException, IOException {
      // Build a Query object
      QueryParser parser = new QueryParser(Version.LUCENE_30, "content",
              new StandardAnalyzer(Version.LUCENE_30));
      Query query = parser.parse(queryString);
      int hitsPerPage = 10;
      // Search for the query
      TopScoreDocCollector collector = TopScoreDocCollector.create(5 * hitsPerPage, false);
      searcher.search(query, collector);
      ScoreDoc[] hits = collector.topDocs().scoreDocs;
      int hitCount = collector.getTotalHits();
      System.out.println(hitCount + " total matching documents");
      // Examine the results to see if there were any matches
      if (hitCount == 0) {
         System.out.println(
                 "No matches were found for \"" + queryString + "\"");
      } else {
         System.out.println("Hits for \"" +
                 queryString + "\" were found in quotes by:");
         // Iterate over the matching documents
         for (int i = 0; i < hits.length; i++) {
            ScoreDoc scoreDoc = hits[i];
            int docId = scoreDoc.doc;
            float docScore = scoreDoc.score;
            System.out.println("docId: " + docId + "\t" + "docScore: " + docScore);
            Document doc = searcher.doc(docId);
            // Print the value that we stored in the "title" field. Note
            // that this Field was not indexed, but (unlike the "content"
            // field) was stored verbatim and can be retrieved.
            System.out.println("  " + (i + 1) + ". " + doc.get("title"));
            System.out.println("Content: " + doc.get("content"));
         }
      }
   }
}





Read More


I provide basic indexing and retrieval code using the PyLucene 3.0 API. Lucene in Action (2nd Ed.) covers Lucene 3.0, but the PyLucene code samples have not been updated for the 3.0 API, only the Java ones. Unfortunately, there is currently little (no?) example PyLucene code in the blogosphere. If you have links to more Lucene 3.0 tutorials and samples, please share them in the comments.


In the spirit of Lingpipe's Lucene 2.4 in 60 seconds, here are relevant PyLucene 3.0 code snippets from my biased-text-sample project, for indexing and retrieval. (more…)

Read More

Walmart Recruiting II: Sales in Stormy Weather

Predict how sales of weather-sensitive products are affected by snow and rain

Walmart operates 11,450 stores in 27 countries, managing inventory across varying climates and cultures. Extreme weather events, like hurricanes, blizzards, and floods, can have a huge impact on sales at the store and product level.

In their second Kaggle recruiting competition, Walmart challenges participants to accurately predict the sales of 111 potentially weather-sensitive products (like umbrellas, bread, and milk) around the time of major weather events at 45 of their retail locations.

Intuitively, we may expect an uptick in the sales of umbrellas before a big thunderstorm, but it’s difficult for replenishment managers to correctly predict the level of inventory needed to avoid being out-of-stock or overstock during and after that storm. Walmart relies on a variety of vendor tools to predict sales around extreme weather events, but it’s an ad-hoc and time-consuming process that lacks a systematic measure of effectiveness.

Helping Walmart better predict sales of weather-sensitive products will keep valued customers out of the rain. It could also earn you a position at one of the most data-driven retailers in the world!

You have been provided with sales data for 111 products whose sales may be affected by the weather (such as milk, bread, umbrellas, etc.). These 111 products are sold in stores at 45 different Walmart locations. Some of the products may be a similar item (such as milk) but have a different id in different stores/regions/suppliers. The 45 locations are covered by 20 weather stations (i.e. some of the stores are nearby and share a weather station).

The competition task is to predict the amount of each product sold around the time of major weather events. For the purposes of this competition, we have defined a weather event as any day in which more than an inch of rain or two inches of snow was observed. You are asked to predict the units sold for a window of ±3 days surrounding each storm.
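The event definition above translates directly into code. Here is a minimal sketch of how the qualifying weather days and their ±3-day prediction windows could be computed; the thresholds match the competition's definition, but the data layout (a date-to-precipitation dict) is illustrative, not the competition's actual file format:

from datetime import date, timedelta

def weather_event_days(observations, rain_thresh=1.0, snow_thresh=2.0):
    """Days on which more than an inch of rain or two inches of snow fell."""
    return [d for d, (rain, snow) in observations.items()
            if rain > rain_thresh or snow > snow_thresh]

def prediction_window(event_day, days=3):
    """All dates within +/- `days` of a weather event (7 days for days=3)."""
    return [event_day + timedelta(offset) for offset in range(-days, days + 1)]

# Toy observations: date -> (inches of rain, inches of snow)
obs = {
    date(2014, 1, 10): (1.4, 0.0),   # heavy rain -> event
    date(2014, 1, 11): (0.2, 0.0),   # below both thresholds -> not an event
    date(2014, 2, 3):  (0.0, 5.1),   # blizzard -> event
}

events = weather_event_days(obs)
windows = {d: prediction_window(d) for d in events}

The units-sold predictions are then required only for the dates collected in these windows, one per store/product pair.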


Read More

Restaurant Revenue Prediction Kaggle solution

Predict annual restaurant sales based on objective measurements

With over 1,200 quick service restaurants across the globe, TFI is the company behind some of the world’s most well-known brands: Burger King, Sbarro, Popeyes, Usta Donerci, and Arby’s. They employ over 20,000 people in Europe and Asia and make significant daily investments in developing new restaurant sites.

Right now, deciding when and where to open new restaurants is largely a subjective process based on the personal judgement and experience of development teams. This subjective data is difficult to accurately extrapolate across geographies and cultures.

New restaurant sites take large investments of time and capital to get up and running. When the wrong location for a restaurant brand is chosen, the site closes within 18 months and operating losses are incurred.

Finding a mathematical model to increase the effectiveness of investments in new restaurant sites would allow TFI to invest more in other important business areas, like sustainability, innovation, and training for new employees. Using demographic, real estate, and commercial data, this competition challenges you to predict the annual restaurant sales of 100,000 regional locations.

TFI would love to hire an expert Kaggler like you to head up their growing data science team in Istanbul or Shanghai. You’d be tackling problems like the one featured in this competition on a global scale. See the job description here >>

TFI has provided a dataset with 137 restaurants in the training set, and a test set of 100,000 restaurants. The data columns include the open date, location, city type, and three categories of obfuscated data: demographic data, real estate data, and commercial data. The revenue column indicates a (transformed) revenue of the restaurant in a given year and is the target of predictive analysis. (more…)

Read More

Otto Group Product Classification Challenge 3rd Position solution

The Otto Group is one of the world’s biggest e-commerce companies, with subsidiaries in more than 20 countries, including Crate & Barrel (USA), (Germany) and 3 Suisses (France). We are selling millions of products worldwide every day, with several thousand products being added to our product line.

A consistent analysis of the performance of our products is crucial. However, due to our diverse global infrastructure, many identical products get classified differently. Therefore, the quality of our product analysis depends heavily on the ability to accurately cluster similar products. The better the classification, the more insights we can generate about our product range. (more…)

Read More

Random Forest Approach using scikit-learn

Here’s my Python code to make the prediction. The training and testing data have been massaged so that string values are converted to integers, and the output is a single list of survived values. (I removed the passenger id from all the files because it is not used in the analysis, so remember to add the passenger id back when preparing a submission.) There are some discrete features in the data, e.g. sex, pclass, sibling, parch, and some non-discrete features, e.g. age, fare. I think a key improvement will be to fine-tune the boundaries of the non-discrete features. Anyway, it is a beginning; I need to gain more insight into the data set to get a higher score. Finally, note that the submission validation has changed lately too.

from sklearn.ensemble import RandomForestClassifier
import csv as csv, numpy as np

# Read the massaged training data; column 0 is the survived value
csv_file_object = csv.reader(open('./train_sk.csv', 'rb'))
train_header = csv_file_object.next()  # skip the header row
train_data = []
for row in csv_file_object:
    train_data.append(row)
train_data = np.array(train_data, dtype=float)

# Fit a random forest of 100 trees on the feature columns
forest = RandomForestClassifier(n_estimators=100)
forest = forest.fit(train_data[0::, 1::], train_data[0::, 0])

# Read the massaged test data (same columns, minus survived)
test_file_object = csv.reader(open('./test_sk.csv', 'rb'))
test_header = test_file_object.next()  # skip the header row
test_data = []
for row in test_file_object:
    test_data.append(row)
test_data = np.array(test_data, dtype=float)

# Predict and write one survived value per test row
output = forest.predict(test_data)
output = output.astype(int)
np.savetxt('./output.csv', output, delimiter=',', fmt='%d')


Read More

Installing Python scikit-learn package

scikit-learn is a Python module integrating classic machine learning algorithms in the tightly-knit world of scientific Python packages (numpy, scipy, matplotlib). It aims to provide simple and efficient solutions to learning problems that are accessible to everybody and reusable in various contexts: machine-learning as a versatile tool for science and engineering.

Step 1: Dependencies

First, we need to install all dependencies for it:

# yum update && yum install scipy.x86_64 numpy.x86_64 python-devel.x86_64 python-matplotlib.x86_64 python-pip.noarch gcc-c++.x86_64

Step 2: Installing scikit-learn using pip

Now, I will use the fastest way to install the package using pip:

# pip-python install scikit-learn

This command should give an output very similar to this:

# warning: no files found matching ''
# warning: no files found matching '*.TXT' under directory 'sklearn/datasets'
# Installing /usr/lib64/python2.7/site-packages/scikit_learn-0.9-py2.7-
# Successfully installed scikit-learn
# Cleaning up...

Step 3: Running the examples

The final step is to run several examples provided here
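Before working through those, a quick smoke test confirms the installation works end to end. This is a minimal sketch using the iris dataset bundled with scikit-learn; the estimator and parameters are just an illustrative choice:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Load the small iris dataset that ships with scikit-learn
iris = load_iris()

# Fit a modest random forest as an installation smoke test
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(iris.data, iris.target)

# Training accuracy should be close to 1.0 on this easy dataset
score = clf.score(iris.data, iris.target)
print("training accuracy:", score)

If this runs without an ImportError and prints an accuracy near 1.0, the package and its numpy/scipy dependencies are wired up correctly.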


Read More

Xgboost example 1

The purpose of this Vignette is to show you how to use Xgboost to discover and understand your own dataset better.

This Vignette is not about predicting anything (see Xgboost presentation). We will explain how to use Xgboost to highlight the link between the features of your data and the outcome.

Package loading:

if (!require('vcd')) install.packages('vcd') 

The vcd package is used only for one of its embedded datasets.

Preparation of the dataset

Numeric VS categorical variables

Xgboost manages only numeric vectors.

What to do when you have categorical data?

A categorical variable has a fixed number of different values. For instance, if a variable called Colour can have only one of these three values, red, blue or green, then Colour is a categorical variable.

In R, a categorical variable is called factor.

Type ?factor in the console for more information.

To answer the question above, we will convert the categorical variables to numeric ones.
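The usual conversion is one-hot encoding: each level of a factor becomes its own 0/1 column. The vignette does this in R (via a model matrix); as a language-neutral illustration, here is the same idea sketched in plain Python with the Colour example from above:

def one_hot(values):
    """Encode a list of categorical values as one 0/1 column per level."""
    levels = sorted(set(values))
    return [[1 if v == level else 0 for level in levels] for v in values], levels

colours = ["red", "blue", "green", "red"]
encoded, levels = one_hot(colours)
# levels is ['blue', 'green', 'red'], so "red" encodes as [0, 0, 1],
# "blue" as [1, 0, 0], and "green" as [0, 1, 0]

The resulting purely numeric matrix is what Xgboost can consume; no ordering between the original levels is implied, which is the point of encoding each level separately.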


Read More