Research projects

Text QA

The Question Answering component answers a question based on a given context (e.g., a paragraph of text), where the answer to the question is a segment of that context. This component allows you to answer questions based on your documentation.

Project / code based on the Google Research paper:
ALBERT: A LITE BERT FOR SELF-SUPERVISED LEARNING OF LANGUAGE REPRESENTATIONS
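As a minimal illustration (not the project's own code), extractive QA of this kind can be sketched with the Hugging Face transformers question-answering pipeline; the model name below is an assumed ALBERT checkpoint fine-tuned on SQuAD, not necessarily the one used here.

```python
# Minimal extractive QA sketch (illustrative only, not the project's own code).
# The model name is an assumption: any ALBERT checkpoint fine-tuned for
# SQuAD-style question answering would fit the same pattern.
from transformers import pipeline

qa = pipeline("question-answering", model="twmkn9/albert-base-v2-squad2")

context = (
    "ALBERT is a lite BERT architecture that shares parameters across layers "
    "and factorizes the embedding matrix to reduce model size."
)
result = qa(question="How does ALBERT reduce model size?", context=context)

# The answer is a span (segment) of the given context, plus a confidence score.
print(result["answer"], result["score"])
```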

Demo Text QA


Navioo: Named Entity Recognition (word embeddings from BERT)

Named Entity Recognition (NER) classifies tokens in text into predefined categories (tags), such as person names, quantity expressions, percentage expressions, names of locations and organizations, as well as expressions of time, currency, and others. We can recognize up to 19 entities. Navioo also features a multilingual model that is available for 104 languages. NER can be used as a knowledge extractor when you are interested in specific pieces of information in your text.
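A minimal sketch of token-level NER on BERT embeddings via the transformers pipeline; the checkpoint name is an assumption and stands in for whichever model Navioo actually serves.

```python
# Minimal NER sketch (illustrative, not Navioo's production code).
# The checkpoint below is an assumption; any BERT-based token-classification
# model fine-tuned for NER (including multilingual ones) fits the same pattern.
from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

text = "Angela Merkel visited Microsoft in Seattle and spent 2 million euros."
for entity in ner(text):
    # Each entity carries a tag (PER, ORG, LOC, ...), the matched span, and a score.
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```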

Demo NER


Article Extraction

Extract the main body of an article, including embedded media such as links, images, videos, etc., from any URL or webpage.
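The project's own extractor is not shown here; as an illustrative sketch under that assumption, the third-party newspaper3k library performs the same kind of main-body and media extraction from a URL.

```python
# Illustrative article-extraction sketch using the third-party newspaper3k
# library (an assumption; it is not the project's own extractor).
from newspaper import Article

url = "https://example.com/some-article"  # placeholder URL
article = Article(url)
article.download()
article.parse()

print(article.title)       # headline
print(article.text[:500])  # main body text, clutter removed
print(article.top_image)   # best-guess lead image
print(article.movies)      # embedded video links, if any
```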

Demo Article Extraction


Summarization

Summarization allows you to take the important, relevant points and topics from a piece of text, making it easier to consume and analyze.
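As a minimal, self-contained sketch of the idea, the toy extractive summarizer below scores each sentence by the frequency of its words and keeps the top-ranked ones; the actual project may use a different (e.g., abstractive neural) approach.

```python
# Toy extractive-summarization sketch (illustrative only): score each sentence
# by word frequency and keep the highest-scoring sentences in original order.
import re
from collections import Counter

def summarize(text: str, num_sentences: int = 2) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence: str) -> float:
        tokens = re.findall(r"\w+", sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    ranked = set(sorted(sentences, key=score, reverse=True)[:num_sentences])
    return " ".join(s for s in sentences if s in ranked)

print(summarize("Long input text goes here. It has several sentences. "
                "Only the most central sentences are kept."))
```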

Demo Summarization


Deep Learning research - Multitask Learning

Multi-Task Deep Neural Networks for Semantic Classification and Information Retrieval:

  1. Multitask Learning - The extra tasks force the network to learn complex internal representations in order to approximate both the primary and extra functions (a minimal code sketch of this shared-representation setup follows this list).
  2. MTL and Backpropagation - Adding tasks tends to increase the effective learning rate on the input-to-hidden layer.
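A minimal hard-parameter-sharing sketch in PyTorch for point 1: a shared encoder trained jointly through two task heads. The layer sizes, tasks, and loss weighting are illustrative assumptions, not the actual multi-task DNN architecture from the paper.

```python
# Minimal hard-parameter-sharing sketch in PyTorch (illustrative; the sizes,
# tasks, and loss weighting are assumptions, not the paper's actual setup).
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    def __init__(self, in_dim=128, hidden=64, n_classes=3):
        super().__init__()
        # Shared input-to-hidden layers: both tasks backpropagate through them,
        # which is what pushes the network toward richer internal representations.
        self.shared = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.primary_head = nn.Linear(hidden, n_classes)  # e.g. semantic classification
        self.auxiliary_head = nn.Linear(hidden, 1)        # e.g. relevance score for retrieval

    def forward(self, x):
        h = self.shared(x)
        return self.primary_head(h), self.auxiliary_head(h)

model = MultiTaskNet()
x = torch.randn(8, 128)
y_cls = torch.randint(0, 3, (8,))
y_rel = torch.rand(8, 1)

logits, rel = model(x)
# Joint loss: the auxiliary task acts as a regularizer on the shared layers.
loss = nn.CrossEntropyLoss()(logits, y_cls) + 0.5 * nn.MSELoss()(rel, y_rel)
loss.backward()
```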

The papers on unsupervised and semi-supervised learning can be read in any order (the auto-encoder papers can be read independently of the RBM/DBN thread):

  • Deep Multi-Task Learning with Shared Memory - two deep architectures which can be trained jointly on multiple related tasks
  • Stacked Denoising Auto-Encoders - unsupervised pre-training for deep nets
  • Learning Multiple Tasks with Deep Relationship Networks - Deep Relationship Networks (DRN) that discover the task relationship based on novel tensor normal priors over the parameter tensors of multiple task-specific layers in deep convolutional networks.
  • Deep Belief Networks - unsupervised generative pre-training of stacked RBMs followed by supervised fine-tuning


TextExtractor

Algorithms to detect and remove the surplus "clutter" around the main textual content of a web page. In addition to the actual content, web pages contain navigational elements, templates, and advertisements. This surplus text is typically unrelated to the main content, can degrade search precision, and therefore needs to be detected reliably. We compare the approach to complex, state-of-the-art techniques and show that competitive accuracy can be achieved at almost no cost. Moreover, we derive a simple and plausible stochastic model describing how this clutter is created. With the help of the model, we also quantify the impact of clutter removal on retrieval performance and show significant improvements over the baseline. Finally, we extend the principled approach with straightforward heuristics, achieving remarkable accuracy.
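A toy illustration of the "simple heuristics, almost no cost" idea: classify each text block by its length and link density. The thresholds are assumptions, and this is not the actual TextExtractor feature set or model.

```python
# Toy clutter-vs-content heuristic (illustrative; thresholds are assumptions,
# and this is not the actual TextExtractor model).
def is_main_content(block_text: str, num_links: int) -> bool:
    words = block_text.split()
    if not words:
        return False
    # Navigation, templates, and ads tend to be short and link-dense;
    # article text tends to be long with few links per word.
    link_density = num_links / len(words)
    return len(words) > 25 and link_density < 0.2

blocks = [
    ("Home | Products | About | Contact", 4),
    ("The committee announced today that the long-awaited report on urban "
     "transport will be published next month, after two years of consultations "
     "with residents, operators and city planners.", 0),
]
for text, links in blocks:
    print(is_main_content(text, links), "->", text[:40])
```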


TEKI-Information Extraction

TEKI is a program that automatically identifies and extracts binary relationships from English sentences. TEKI is designed for Web-scale information extraction, where the target relations cannot be specified in advance and speed is important. TEKI uses code and data that are included in the release.
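As an illustrative open-IE-style sketch (not TEKI's actual algorithm), binary relations can be pulled from a dependency parse as (subject, relation, object) triples; the spaCy model name below is an assumption.

```python
# Illustrative open-IE-style sketch (not TEKI's actual algorithm): pull
# (subject, relation, object) triples from a dependency parse with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed small English model

def extract_triples(sentence: str):
    doc = nlp(sentence)
    triples = []
    for token in doc:
        if token.pos_ == "VERB":
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ in ("dobj", "attr")]
            for s in subjects:
                for o in objects:
                    triples.append((s.text, token.lemma_, o.text))
    return triples

print(extract_triples("Google acquired DeepMind in 2014."))
# e.g. [('Google', 'acquire', 'DeepMind')]
```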

AppCrawler

The goal of this project is to develop tools that open up mobile-app black boxes: crawling data deep inside apps and understanding user intents while they use apps. We also aim to use this information for better search and advertising scenarios.
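As one possible sketch (assuming an Android target reachable over adb; this is not the project's actual crawler), the current screen's UI tree can be dumped and its clickable elements enumerated as candidate actions for exploration.

```python
# Illustrative Android UI-crawling sketch (assumes an Android app reachable
# over adb; this is not the project's actual crawler).
import subprocess
import xml.etree.ElementTree as ET

def dump_ui_tree(local_path="window_dump.xml"):
    # uiautomator writes the current screen's view hierarchy to the device,
    # then we pull the XML back for parsing.
    subprocess.run(["adb", "shell", "uiautomator", "dump", "/sdcard/window_dump.xml"], check=True)
    subprocess.run(["adb", "pull", "/sdcard/window_dump.xml", local_path], check=True)
    return ET.parse(local_path).getroot()

def clickable_nodes(root):
    # Each <node> carries text, resource-id, and screen bounds; clickable ones
    # are candidate actions for a depth-first exploration of the app.
    return [n.attrib for n in root.iter("node") if n.attrib.get("clickable") == "true"]

if __name__ == "__main__":
    for node in clickable_nodes(dump_ui_tree()):
        print(node.get("resource-id"), node.get("text"), node.get("bounds"))
```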


Prediction Engines

Research around information aggregation and prediction, including polls, probability elicitation, and prediction markets. These methods, broadly described as wisdom of the crowds, are applied to a range of outcomes: elections, marketing, internal corporate forecasting, military intelligence, and more. We demonstrate substantial advances in these areas.

(1) Combinatorial Prediction Markets: frontend, backend, and unique questions (see the market-maker sketch after this list).

(2) Experimental Prediction Markets and Polling.

(3) Forecasts, Sentiment, and Data Analytics.
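A standard mechanism in this space is Hanson's logarithmic market scoring rule (LMSR); the sketch below shows its cost and price functions as an illustration of how an automated market maker quotes prices, and is not claimed to be this project's engine.

```python
# LMSR market-maker sketch (Hanson's logarithmic market scoring rule), shown
# as a standard illustration; it is not claimed to be this project's engine.
import math

def lmsr_cost(quantities, b=100.0):
    # Cost function C(q) = b * log(sum_i exp(q_i / b)).
    return b * math.log(sum(math.exp(q / b) for q in quantities))

def lmsr_prices(quantities, b=100.0):
    # Instantaneous price of outcome i: exp(q_i / b) / sum_j exp(q_j / b).
    weights = [math.exp(q / b) for q in quantities]
    total = sum(weights)
    return [w / total for w in weights]

q = [0.0, 0.0]                 # outstanding shares for a two-outcome question
print(lmsr_prices(q))          # [0.5, 0.5] before any trades
cost_to_buy = lmsr_cost([10.0, 0.0]) - lmsr_cost(q)
print(round(cost_to_buy, 4))   # cost of buying 10 shares of outcome 0
print(lmsr_prices([10.0, 0.0]))
```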


Recurrent Neural Network Language Modeling

This project focuses on advancing the state of the art in RNN language modeling. A special interest is in adding side channels of information as input, to model phenomena that are not easily handled in other frameworks.
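One minimal way to sketch this idea in PyTorch is to concatenate a side-channel feature vector (e.g., topic, speaker, or document metadata) to the word embedding at every time step; the dimensions and injection point below are assumptions, not the project's model.

```python
# Sketch of an RNN language model with a side-channel input (illustrative;
# the dimensions and the way the side channel is injected are assumptions).
import torch
import torch.nn as nn

class SideChannelRNNLM(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=128, side_dim=16, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # The side channel is concatenated to the word embedding at every step.
        self.rnn = nn.LSTM(emb_dim + side_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, tokens, side):
        # tokens: (batch, seq_len) word ids; side: (batch, side_dim) features.
        emb = self.embed(tokens)
        side_seq = side.unsqueeze(1).expand(-1, emb.size(1), -1)
        hidden_states, _ = self.rnn(torch.cat([emb, side_seq], dim=-1))
        return self.out(hidden_states)  # next-word logits at each position

model = SideChannelRNNLM()
tokens = torch.randint(0, 10000, (4, 12))
side = torch.randn(4, 16)
logits = model(tokens, side)            # shape (4, 12, 10000)
```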


WebSensor

Making sense of *BIG* Web data faces the 4Vs problem: big Volume, high Velocity, high Variety, and unknown Veracity. We propose to build a virtual layer of WebSensors on the Web. A WebSensor is a programmable "focused crawler" that continuously discovers, extracts, and aggregates structured information around a topic.
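A minimal focused-crawler sketch of this idea: keep a priority queue of URLs ordered by topic relevance, fetch the best page, extract links, and push them back with the page's score. The keyword-based relevance function and seed URL are assumptions, not the WebSensor implementation.

```python
# Minimal focused-crawler sketch (illustrative; the relevance scoring and
# seed/topic below are assumptions, not the WebSensor implementation).
import heapq
import re
import urllib.parse
import urllib.request

TOPIC_TERMS = {"election", "poll", "forecast"}   # assumed topic keywords

def relevance(html: str) -> float:
    words = re.findall(r"\w+", html.lower())
    return sum(w in TOPIC_TERMS for w in words) / (len(words) or 1)

def crawl(seed: str, max_pages: int = 10):
    frontier = [(-1.0, seed)]            # max-heap via negated scores
    seen = {seed}
    while frontier and max_pages > 0:
        _, url = heapq.heappop(frontier)
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue
        max_pages -= 1
        score = relevance(html)
        yield url, score                 # aggregate structured info per topic here
        for link in re.findall(r'href="(http[^"]+)"', html):
            link = urllib.parse.urljoin(url, link)
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))

for url, score in crawl("https://example.com"):
    print(round(score, 4), url)
```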