Research projects

Machine Learning Research - Deep Learning

The purely supervised learning algorithms are meant to be read in order:

  1. Logistic Regression - using Theano for something simple
  2. Multilayer perceptron - introduction to layers
  3. Deep Convolutional Network

The unsupervised and semi-supervised learning algorithms can be read in any order (the auto-encoders can be read independently of the RBM/DBN thread):

  • Auto Encoders, Denoising Autoencoders - description of autoencoders
  • Stacked Denoising Auto-Encoders - easy steps into unsupervised pre-training for deep nets
  • Restricted Boltzmann Machines - single layer generative RBM model
  • Deep Belief Networks - unsupervised generative pre-training of stacked RBMs followed by supervised fine-tuning



Algorithms to detect and remove the surplus "clutter" around the main textual content of a web page. In addition to the actual content, Web pages consist of navigational elements, templates, and advertisements. This TextExtractor text is typically unrelated to the main content, may degrade search precision, and thus needs to be detected properly. We compare the approach to complex, state-of-the-art techniques and show that competitive accuracy can be achieved at almost no cost. Moreover, we derive a simple and plausible stochastic model for describing the TextExtractor creation process. With the help of our model, we also quantify the impact of TextExtractor removal on retrieval performance and show significant improvements over the baseline. Finally, we extend the principled approach with straightforward heuristics, achieving remarkable accuracy.
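To make the idea of separating main content from surrounding clutter concrete, here is a minimal sketch in Python. It is not the project's algorithm: the two features (words per block and the share of words inside links), the thresholds `min_words` and `max_link_density`, and all class and function names are assumptions chosen for illustration only.

```python
from html.parser import HTMLParser

# Tags that are assumed to start or end a text block (illustrative choice).
BLOCK_TAGS = {"p", "div", "li", "td", "tr", "table", "ul", "ol",
              "h1", "h2", "h3", "body"}
# Tags whose text content is never shown to the reader.
SKIP_TAGS = {"script", "style"}


class BlockSplitter(HTMLParser):
    """Splits a page into text blocks and counts linked words per block."""

    def __init__(self):
        super().__init__()
        self.blocks = []     # list of (num_words, num_linked_words, text)
        self._words = []
        self._linked = 0
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        # Close the current block, if it holds any words.
        if self._words:
            self.blocks.append(
                (len(self._words), self._linked, " ".join(self._words)))
        self._words, self._linked = [], 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link += 1
        elif tag in SKIP_TAGS:
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._linked += len(words)

    def close(self):
        super().close()
        self._flush()


def extract_main_text(html, min_words=10, max_link_density=0.33):
    """Keeps blocks that are long enough and not dominated by link text."""
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    kept = []
    for num_words, linked, text in splitter.blocks:
        link_density = linked / num_words
        if num_words >= min_words and link_density <= max_link_density:
            kept.append(text)
    return kept
```

Calling `extract_main_text(page_html)` returns the text of the blocks that pass both checks. Short, link-heavy blocks such as navigation bars are dropped, which is the intuition behind using cheap per-block features instead of a full page rendering.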


TEKI - Information Extraction

TEKI is a program that automatically identifies and extracts binary relationships from English sentences. TEKI is designed for Web-scale information extraction, where the target relations cannot be specified in advance and speed is important. TEKI uses the following code and data, which are included in the release:


The goal of the project is to develop tools that open up mobile app black boxes: for crawling data deep inside apps and for understanding what users intend while they use apps. We also aim to use this information for better search and advertising scenarios.


Prediction Engines

Research on information aggregation and prediction, including polls, probability elicitation, and prediction markets. These methods, broadly described as the wisdom of crowds, are used to forecast a range of outcomes: elections, marketing, internal corporate use, military intelligence, etc. We demonstrate advances in three areas:

(1) Combinatorial Prediction Markets: frontend, backend, and unique questions.

(2) Experimental Prediction Markets and Polling.

(3) Forecasts, Sentiment, and Data Analytics.
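As an illustration of how a prediction market can turn trades into probabilities, here is a minimal sketch of the logarithmic market scoring rule (LMSR), a widely used automated market maker. The text above does not say which mechanism the project uses, so LMSR is an assumption made for illustration, as are the liquidity parameter `b` and all function names.

```python
import math


def lmsr_cost(quantities, b=100.0):
    """LMSR cost function C(q) = b * log(sum_i exp(q_i / b))."""
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(q / b for q in quantities)
    return b * (peak + math.log(sum(math.exp(q / b - peak)
                                    for q in quantities)))


def lmsr_prices(quantities, b=100.0):
    """Current prices per outcome; they sum to 1 and read as probabilities."""
    peak = max(q / b for q in quantities)
    weights = [math.exp(q / b - peak) for q in quantities]
    total = sum(weights)
    return [w / total for w in weights]


def trade_cost(quantities, outcome, shares, b=100.0):
    """Amount a trader pays to buy `shares` of `outcome` at the current state."""
    after = list(quantities)
    after[outcome] += shares
    return lmsr_cost(after, b) - lmsr_cost(quantities, b)


if __name__ == "__main__":
    state = [0.0, 0.0]                  # shares sold so far for two outcomes
    print(lmsr_prices(state))           # prices before any trade
    print(trade_cost(state, 0, 50.0))   # cost of buying 50 shares of outcome 0
    state[0] += 50.0
    print(lmsr_prices(state))           # prices after the trade
```

Buying shares of an outcome raises its price, so the price vector acts as the market's running forecast. A larger `b` makes prices move more slowly per share traded.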


Recurrent Neural Network Language Modeling

This project focuses on advancing the state of the art in RNN language modeling. A special interest is adding side channels of information as input, to model phenomena that are not easily handled in other frameworks.
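One way to picture a side channel is as an extra feature vector fed into the recurrent layer alongside the current word. The sketch below shows a single forward step of a simple (Elman-style) RNN language model with such an input. The architecture, the class name `SideChannelRNN`, the weight names, and the random initialization are all assumptions for illustration; the project's actual model is not described here, and training is omitted.

```python
import math
import random


def softmax(scores):
    """Turns raw scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class SideChannelRNN:
    """Untrained toy RNN language model with an extra feature input."""

    def __init__(self, vocab_size, feature_size, hidden_size, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        self.hidden_size = hidden_size
        self.w_word = init(hidden_size, vocab_size)    # word -> hidden
        self.w_feat = init(hidden_size, feature_size)  # side channel -> hidden
        self.w_rec = init(hidden_size, hidden_size)    # previous hidden -> hidden
        self.w_out = init(vocab_size, hidden_size)     # hidden -> next-word scores

    def initial_state(self):
        return [0.0] * self.hidden_size

    def step(self, word_id, features, hidden):
        """Returns (next-word probabilities, new hidden state)."""
        new_hidden = []
        for i in range(self.hidden_size):
            # A one-hot word input simply selects one column of w_word.
            total = self.w_word[i][word_id]
            total += sum(w * f for w, f in zip(self.w_feat[i], features))
            total += sum(w * h for w, h in zip(self.w_rec[i], hidden))
            new_hidden.append(math.tanh(total))
        scores = [sum(w * h for w, h in zip(row, new_hidden))
                  for row in self.w_out]
        return softmax(scores), new_hidden
```

With `model = SideChannelRNN(vocab_size=5, feature_size=2, hidden_size=4)`, calling `model.step(word_id, features, hidden)` with different `features` vectors yields different next-word distributions for the same word history, which is the effect a side channel is meant to have.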



Making sense of *BIG* Web data faces the 4Vs problems: big Volume, high Velocity, high Variety, and unknown Veracity. We propose to build a virtual layer of WebSensors on the Web. A WebSensor is a programmable “focused crawler” that continuously discovers, extracts, and aggregates structured information around a topic.