33rd Annual ACM SIGIR Conference

Tutorials



The tutorial on Multimedia Information Retrieval has been cancelled

All Tutorials will run on Monday, July 19th



Tutorial Chair: Djoerd Hiemstra, University of Twente, The Netherlands

Full-Day



Low Cost Evaluation in Information Retrieval

Full-day

Presenters
  • Carterette (U. Delaware)
  • Kanoulas (U. Sheffield)
  • Yilmaz (Microsoft)
Abstract

Search corpora are growing larger and larger: over the last 10 years, the IR research community has moved from the several hundred thousand documents on the TREC disks, to the tens of millions of U.S. government web pages of GOV2, to the one billion general-interest web pages in the new ClueWeb09 collection. But traditional means of acquiring relevance judgments and evaluating – e.g. pooling documents to calculate average precision – do not seem to scale well to these new large collections: they require substantially more human assessment cost for the same reliability in evaluation, and if that additional cost exceeds the assessment budget, errors in evaluation are inevitable.

Some alternatives to pooling that support low-cost and reliable evaluation have recently been proposed, and a number of them have already been used in TREC and other evaluation forums (the TREC Million Query, Legal, Chemical, Web, and Relevance Feedback tracks; CLEF Patent IR; INEX). Evaluation via implicit user feedback (e.g. clicks) and crowdsourcing have also recently gained attention in the community. It is therefore important that these methodologies, the analysis they support, and their strengths and weaknesses be well understood by the IR community. Furthermore, these approaches allow small research groups to start investigating new tasks on new corpora at relatively low cost. Even groups that do not participate in TREC, CLEF, or other evaluation conferences can benefit from understanding how these methods work, how to use them, and what they mean as they build test collections for the tasks they are interested in.

The goal of this tutorial is to provide attendees with a comprehensive overview of techniques for low-cost (in terms of judgment effort) evaluation. Topics covered include alternatives to pooling, evaluation measures robust to incomplete judgments, evaluating with no relevance judgments, statistical inference of evaluation metrics, inference of relevance judgments, query selection, and techniques to test the reliability of the evaluation and the reusability of the constructed collections.

The tutorial should be of interest to a wide range of attendees. Those new to the field will come away with a solid understanding of how low-cost evaluation methods can be applied to construct inexpensive test collections and evaluate new IR technology, while those with intermediate knowledge will gain deeper insights and further understand the risks and gains of low-cost evaluation. Attendees should have a basic knowledge of the traditional (Cranfield) evaluation framework and metrics (such as average precision and nDCG), along with some basic knowledge of probability theory and statistics. More advanced concepts will be explained during the tutorial.
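
To make the judging-cost problem concrete, here is a minimal Python sketch (illustration only, not material from the tutorial) contrasting average precision, which silently treats unjudged documents as non-relevant, with a simplified version of bpref, one of the measures designed to be robust to incomplete judgments. The ranking and judgments are invented.

    def average_precision(ranking, relevant):
        """AP over a ranked list; unjudged documents count as non-relevant,
        the assumption that breaks down when judging budgets run out."""
        hits, sum_prec = 0, 0.0
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant:
                hits += 1
                sum_prec += hits / rank
        return sum_prec / max(len(relevant), 1)

    def bpref(ranking, relevant, nonrelevant):
        """Simplified bpref (Buckley & Voorhees 2004): scores each relevant
        document by how many judged non-relevant documents outrank it,
        ignoring unjudged documents entirely."""
        R, N = len(relevant), len(nonrelevant)
        if R == 0 or N == 0:
            return 0.0
        denom, n_above, total = min(R, N), 0, 0.0
        for doc in ranking:
            if doc in nonrelevant:
                n_above += 1
            elif doc in relevant:
                total += 1.0 - min(n_above, denom) / denom
        return total / R

    # Toy example: five retrieved documents, only three ever judged.
    ranking = ["d1", "d2", "d3", "d4", "d5"]
    relevant, nonrelevant = {"d1", "d4"}, {"d2"}
    print(average_precision(ranking, relevant))   # 0.75
    print(bpref(ranking, relevant, nonrelevant))  # 0.5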



Morning



Learning to Rank for Information Retrieval

Half-day - Morning

Presenter
  • Liu (Microsoft)
Abstract

This tutorial provides a comprehensive introduction to the research area of learning to rank for information retrieval. In the first part of the tutorial, we will introduce three major approaches to learning to rank, i.e., the pointwise, pairwise, and listwise approaches, analyze the relationship between the loss functions used in these approaches and the widely used IR evaluation measures, evaluate the performance of these approaches on the LETOR benchmark datasets, and demonstrate how to use these approaches in real-world ranking applications. In the second part of the tutorial, we will discuss some advanced topics regarding learning to rank, such as relational ranking, diverse ranking, semi-supervised ranking, transfer ranking, query-dependent ranking, and training data preprocessing. In the third part, we will briefly mention the recent advances in statistical learning theory for ranking, which explain the generalization ability and statistical consistency of different ranking methods. In the last part, we will conclude the tutorial and point out several future research directions.
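
To make the pairwise idea concrete, here is a minimal sketch (not the presenter's code) of a RankNet-style logistic pairwise loss with a linear scoring function, trained by plain gradient descent; the feature vectors and labels are invented toy data.

    import numpy as np

    def pairwise_logistic_loss(w, X, y):
        """RankNet-style pairwise loss for a linear scorer s(x) = w . x:
        every pair (i, j) with y[i] > y[j] contributes log(1 + exp(-(s_i - s_j)))."""
        s = X @ w
        loss, grad = 0.0, np.zeros_like(w)
        for i in range(len(y)):
            for j in range(len(y)):
                if y[i] > y[j]:
                    diff = s[i] - s[j]
                    loss += np.log1p(np.exp(-diff))
                    p = 1.0 / (1.0 + np.exp(diff))   # -d(loss)/d(diff)
                    grad -= p * (X[i] - X[j])
        return loss, grad

    # Invented toy data: four documents for one query, two features, graded labels.
    X = np.array([[0.9, 0.2], [0.3, 0.8], [0.5, 0.5], [0.1, 0.1]])
    y = np.array([2, 1, 1, 0])
    w = np.zeros(2)
    for _ in range(200):                             # plain gradient descent
        loss, grad = pairwise_logistic_loss(w, X, y)
        w -= 0.1 * grad
    print(w, loss)                                   # learned weights, final loss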


Introduction to Probabilistic Models in IR

Half-day - Morning

Presenter
  • Lavrenko (U. Edinburgh)
Abstract

Most of today's state-of-the-art retrieval models, including BM25 and language modeling, are grounded in probabilistic principles. A working understanding of these principles can help researchers understand existing retrieval models better, and also provide industrial practitioners with an understanding of how such models can be applied to real-world problems. This half-day tutorial will cover the fundamentals of the two dominant probabilistic frameworks for Information Retrieval: the classical probabilistic model and the language modeling approach. The elements of the classical framework will include the probability ranking principle, the binary independence model, the 2-Poisson model, and the widely used BM25 model. Within the language modeling framework, we will discuss various distributional assumptions and smoothing techniques. Special attention will be devoted to the event spaces and independence assumptions underlying each approach. The tutorial will outline several techniques for modeling term dependence and addressing vocabulary mismatch. We will also survey applications of probabilistic models in the domains of cross-language and multimedia retrieval. The tutorial will conclude by suggesting a set of open problems in probabilistic models of IR. Attendees should have a basic familiarity with probability and statistics. A brief refresher of basic concepts, including random variables, event spaces, conditional probabilities, and independence, will be given at the beginning of the tutorial. In addition to slides, some hands-on exercises and examples will be used throughout the tutorial.
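
For orientation, here is a toy sketch of the two scoring functions named above, BM25 and Dirichlet-smoothed query likelihood, over a three-document in-memory collection. The parameter defaults (k1=1.2, b=0.75, mu=2000) and the idf variant with +1 inside the log (to avoid negative weights) are common choices assumed here, not prescriptions from the tutorial.

    import math
    from collections import Counter

    # Invented three-document collection.
    docs = {"d1": "the quick brown fox".split(),
            "d2": "the lazy dog".split(),
            "d3": "the quick dog jumps".split()}
    N = len(docs)
    avgdl = sum(len(d) for d in docs.values()) / N
    df = Counter(t for d in docs.values() for t in set(d))  # document frequency
    cf = Counter(t for d in docs.values() for t in d)       # collection frequency
    C = sum(cf.values())                                    # collection length

    def bm25(query, doc, k1=1.2, b=0.75):
        tf = Counter(doc)
        score = 0.0
        for t in query:
            if tf[t] == 0:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
        return score

    def query_likelihood(query, doc, mu=2000):
        """Dirichlet-smoothed log P(query | doc)."""
        tf = Counter(doc)
        return sum(math.log((tf[t] + mu * cf[t] / C) / (len(doc) + mu))
                   for t in query if cf[t] > 0)

    q = "quick dog".split()
    for d in sorted(docs):
        print(d, round(bm25(q, docs[d]), 3), round(query_likelihood(q, docs[d]), 3))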


Multimedia Information Retrieval

Half-day - Morning

This tutorial has been cancelled

Presenter
  • Rueger (Open University)

Web Retrieval: The Role of Users

Half-day - Morning

Presenters
  • Baeza-Yates (Yahoo)
  • Maarek (Yahoo)
Abstract

Web retrieval methods have evolved through three major steps in the last decade or so. They started from standard document-centric IR in the early days of the Web, then made a major step forward by leveraging the structure of the Web, using link analysis techniques for both crawling and ranking. A more recent, no less important, but perhaps less conspicuous step forward has been to bring the user into the equation in two ways: (1) implicitly, through the analysis of usage data captured in query logs, and session and click information in general, the goal being to improve ranking as well as to measure users' happiness and engagement; (2) explicitly, by offering novel interactive features, the goal here being to better answer users' needs. In this tutorial, we will cover the user-related challenges associated with the implicit and explicit role of users in Web retrieval. We will review and discuss challenges associated with two types of activities, namely:

  • Usage data analysis and metrics - It is critical to monitor how users interact with Web retrieval systems, as this implicit relevance feedback, aggregated at a large scale, can approximate quite accurately the level of success of a given feature. Here we have to consider not only click statistics but also the time spent on a page, the number of actions per session, etc. (a minimal sketch of such metrics follows this list).
  • User interaction - Given the intrinsic problems posed by the Web, the key challenge for the user is to conceive a good query: one that leads to a manageable and relevant answer. The retrieval system must complete search requests fast and return relevant results, even for poorly formulated queries. Web retrieval engines thus interact with the user at two key stages, each with its own challenges. Expressing a query: human beings have needs or tasks to accomplish which are frequently not easy to express as “queries”; queries are just a reflection of those needs and are thus, by definition, imperfect. The issue here is for the engine both to assist the user in articulating this need and to capture it accurately even if the information is incomplete or poorly expressed. Interpreting results: even if the user is able to perfectly express a query, the answer might be split over thousands or millions of Web pages, or not exist at all. In this context, numerous questions need to be addressed. Examples include: How do we handle a large answer? How do we rank results? How do we select the documents that really are of interest to the user? Even a single candidate document could be large; how do we browse such documents efficiently?
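
As referenced in the first item above, here is a minimal sketch of two usage metrics: click-through rate per query-URL pair and average dwell time on clicked results. The log format and entries are invented for illustration.

    from collections import defaultdict

    # Invented log entries: (query, url, clicked, dwell_seconds).
    log = [("sigir", "a.com", True, 45),
           ("sigir", "b.com", False, 0),
           ("sigir", "a.com", True, 5),
           ("acm",   "a.com", False, 0)]

    stats = defaultdict(lambda: {"impressions": 0, "clicks": 0, "dwell": 0})
    for query, url, clicked, dwell in log:
        s = stats[(query, url)]
        s["impressions"] += 1
        if clicked:
            s["clicks"] += 1
            s["dwell"] += dwell

    for (query, url), s in stats.items():
        ctr = s["clicks"] / s["impressions"]
        avg_dwell = s["dwell"] / s["clicks"] if s["clicks"] else 0.0
        print(f"{query:5s} {url:6s} CTR={ctr:.2f} avg_dwell={avg_dwell:.0f}s")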

The goal of this tutorial is to teach the key principles and technologies behind the activities and challenges briefly outlined above, bring new understanding and insights to the attendees, and hopefully foster future research.


Information Retrieval Challenges in Computational Advertising

Half-day - Morning

Presenters
  • Broder (Yahoo)
  • Josifovski (Yahoo)
  • Gabrilovich (Yahoo)
Abstract

The aim of this tutorial is to present the state of the art in the emerging area of Computational Advertising, and to expose the participants to the main research challenges in this exciting field. The tutorial does not assume any prior knowledge of Web advertising, and will begin with a comprehensive background survey of the topic. In this tutorial, we focus on one important aspect of online advertising, namely, using the user context to retrieve relevant ads. It is essential to emphasize that in most cases the context of user actions is defined by a body of text, hence the ad matching problem lends itself to many IR methods. To a first approximation, the process of obtaining relevant ads can be reduced to conventional information retrieval, where one constructs a query that describes the user’s context, and then executes this query against a large inverted index of ads. We show how to augment the standard information retrieval approach using query expansion and text classification techniques. We demonstrate how to employ a relevance feedback assumption and use Web search results retrieved by the query. This step allows one to use the Web as a repository of relevant query-specific knowledge. We also go beyond conventional bag-of-words indexing, and construct additional features using a large external taxonomy and a lexicon of named entities obtained by analyzing the entire Web as a corpus. Computational advertising poses numerous challenges and open research problems in text summarization, natural language generation, named entity recognition, computer-human interaction, and others. The last part of the tutorial will be devoted to recent research results as well as open problems, such as automatically classifying cases when no ads should be shown, handling geographic names, context modeling for vertical portals, and using natural language generation to automatically create advertising campaigns.
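
A bare-bones illustration of that formulation: a "query" built from the page context is scored against a small inverted index of ads with tf-idf weighting. The ad texts and scoring choice are invented, and production systems add many further signals.

    import math
    from collections import Counter, defaultdict

    # Invented ad corpus.
    ads = {"ad1": "cheap flights to paris book now",
           "ad2": "paris hotels best rates",
           "ad3": "running shoes discount sale"}

    index = defaultdict(list)                        # term -> [(ad_id, tf)]
    for ad_id, text in ads.items():
        for term, tf in Counter(text.split()).items():
            index[term].append((ad_id, tf))

    def retrieve(context, k=2):
        """Score ads against a bag-of-words query built from the page context."""
        scores = defaultdict(float)
        for term, qtf in Counter(context.split()).items():
            postings = index.get(term)
            if not postings:
                continue
            idf = math.log(len(ads) / len(postings))
            for ad_id, tf in postings:
                scores[ad_id] += qtf * tf * idf
        return sorted(scores.items(), key=lambda x: -x[1])[:k]

    print(retrieve("weekend trip to paris flights and hotels"))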



Afternoon



Extraction of Open-Domain Class Attributes from Text: Building Blocks for Faceted Search

Half-day - Afternoon

Presenter
  • Marius Pasca (Google)
Abstract

Knowledge automatically extracted from text captures instances, classes of instances, and relations among them. The acquisition of class attributes (e.g., “top speed”, “body style” and “number of cylinders” for the class of “sports cars”) from text is a particularly appealing task that has received much attention recently, given its natural fit as a building block towards the far-reaching goal of constructing knowledge bases from text. This tutorial provides an overview of extraction methods developed in the area of Web-based information extraction for acquiring attributes of open-domain classes. The attributes are extracted for classes organized either as a flat set or hierarchically. The extraction methods operate over unstructured or semi-structured text available within collections of Web documents, or over the relatively more intriguing data source of anonymized search queries. The methods take advantage of weak supervision provided in the form of seed examples or small amounts of annotated data, or draw upon knowledge already encoded within human-compiled resources (e.g., Wikipedia). The more ambitious methods, which aim to acquire as many accurate attributes as possible for hundreds or thousands of classes covering a wide range of domains, must be designed to scale to Web collections. This constraint has significant consequences for the overall complexity and choice of underlying tools, if the extracted attributes are to ultimately aid information retrieval in general and Web search in particular, by producing relevant attributes for open-domain classes, along with other types of relations among instances or among classes.
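
To give a flavor of the weakly supervised, pattern-based methods surveyed, here is a toy sketch that applies an “A of C” pattern to a handful of invented queries, using class instances as seeds; real systems add scale, ranking, and noise filtering.

    import re
    from collections import Counter

    # Invented sample standing in for anonymized search queries.
    queries = ["top speed of the bugatti veyron",
               "body style of the porsche 911",
               "number of cylinders of the ferrari f430",
               "top speed of the porsche 911",
               "pictures of the eiffel tower"]
    seeds = {"bugatti veyron", "porsche 911", "ferrari f430"}  # "sports cars" seeds

    # 'A of (the) C' pattern; the greedy first group keeps multi-word attributes intact.
    pattern = re.compile(r"^(.+) of (?:the )?(.+)$")
    attributes = Counter()
    for q in queries:
        m = pattern.match(q)
        if m and m.group(2) in seeds:
            attributes[m.group(1)] += 1

    # Frequency across instances acts as a crude reliability score.
    print(attributes.most_common())
    # [('top speed', 2), ('body style', 1), ('number of cylinders', 1)]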

Bio:

Marius Pasca is a Senior Research Scientist at Google. He graduated with a Ph.D. degree in Computer Science from Southern Methodist University in Dallas, Texas and an M.Sc. degree in Computer Science from Joseph Fourier University in Grenoble, France. Current research interests include factual information extraction from unstructured text and its applications to Web search.


From Federated to Aggregated Search

Half-day - Afternoon

Presenters
  • Diaz (Yahoo)
  • Lalmas (U. Glasgow)
  • Shokouhi (Microsoft)
Abstract

Federated search refers to the brokered retrieval of content from a set of auxiliary retrieval systems instead of from a single, centralized retrieval system. Federated search tasks occur in, for example, digital libraries (where documents from several retrieval systems must be seamlessly merged) or peer-to-peer information retrieval (where documents distributed across a network of local indexes must be retrieved). In the context of web search, aggregated search refers to the integration of non-web content (e.g. images, videos, news articles, maps, tweets) into a web search result page. This is in contrast with classic web search, where users are presented with a ranked list consisting exclusively of general web documents. As in other federated search situations, the non-web content is often retrieved from auxiliary retrieval systems (e.g. image or video databases, news indexes). Although aggregated search can be seen as an instance of federated search, several aspects make aggregated search a unique and compelling research topic. These include large sources of evidence (e.g. click logs) for deciding what non-web items to return, constrained interfaces (e.g. mobile screens), and a very heterogeneous set of available auxiliary resources (e.g. images, videos, maps, news articles). Each of these aspects introduces problems and opportunities not addressed in the federated search literature. Aggregated search is an important future research direction for information retrieval. All major search engines now provide aggregated search results, and as the number of available auxiliary resources grows, deciding how to effectively surface content from each will become increasingly important. The goal of this tutorial is to provide an overview of federated search and aggregated search techniques for an intermediate information retrieval researcher. At the same time, the content will be valuable for practitioners in industry. We will take the audience through the most influential work in these areas and describe how it relates to real-world aggregated search systems. We will also list some of the new challenges encountered in aggregated search and discuss directions for future work.
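
A small sketch of the merging step at the heart of federated search: score normalization (min-max here, a deliberately simple choice; the literature covers CORI and other schemes) so that results from independent engines can be interleaved into one list. Engine names and scores are invented.

    def min_max_normalize(results):
        """Map one engine's raw scores onto [0, 1] so that lists become comparable."""
        scores = [s for _, s in results]
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0
        return [(doc, (s - lo) / span) for doc, s in results]

    # Two auxiliary engines returning (doc, score) on incompatible score scales.
    news   = [("n1", 12.4), ("n2", 9.1), ("n3", 2.0)]
    images = [("i1", 0.93), ("i2", 0.40)]

    merged = sorted(min_max_normalize(news) + min_max_normalize(images),
                    key=lambda x: -x[1])
    print(merged)   # each engine's top result normalizes to 1.0 and leads the list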


Estimating the Query Difficulty for Information Retrieval

Half-day - Afternoon

Presenters
  • David Carmel (IBM)
  • Elad Yom-Tov (IBM)
Abstract

Many information retrieval (IR) systems suffer from radical variance in performance when responding to users’ queries. Even for systems that succeed very well on average, the quality of results returned for some of the queries is poor. Thus, it is desirable that IR systems be able to identify “difficult” queries in order to handle them properly. Understanding why some queries are inherently more difficult than others is essential for IR, and a good answer to this important question will help search engines reduce the variance in performance and hence better serve their customers’ needs. The high variability in query performance has driven a new research direction in the IR field: estimating the expected quality of the search results, i.e. the query difficulty, when no relevance feedback is given. Estimating query difficulty is a significant challenge due to the numerous factors that impact retrieval performance. Many prediction methods have been proposed recently. However, as many researchers have observed, the prediction quality of state-of-the-art predictors is still too low for wide use in IR applications. The low prediction quality is due to the complexity of the task, which involves factors such as query ambiguity, missing content, and vocabulary mismatch. The goal of this tutorial is to expose participants to the current research on query performance prediction (also known as query difficulty estimation). Participants will become familiar with state-of-the-art performance prediction methods and with common evaluation methodologies for prediction quality. We will discuss the reasons that cause search engines to fail for some of the queries, and provide an overview of several approaches for estimating query difficulty. We then describe common methodologies for evaluating the prediction quality of those estimators, and some experiments conducted recently on their prediction quality, as measured over several TREC benchmarks. We will cover a few potential applications that can utilize query difficulty estimators by handling each query individually and selectively based on its estimated difficulty. Finally, we will conclude with a discussion of open issues and challenges in the field.
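
As one concrete example of a post-retrieval predictor, here is a simplified sketch of the clarity score (Cronen-Townsend et al.), which measures the KL divergence between a language model built from top-ranked documents and the collection model; the toy collection is invented and many estimation details are elided.

    import math
    from collections import Counter

    # Invented collection; in practice this is the full index.
    collection = ["the economy grew fast",
                  "fast cars and the open road",
                  "the central bank cut rates",
                  "interest rates and the economy"]
    coll_tf = Counter(t for d in collection for t in d.split())
    C = sum(coll_tf.values())

    def clarity(top_docs):
        """KL divergence between a query model estimated from top-ranked
        documents and the collection model; higher suggests a more focused,
        and typically easier, query."""
        q_tf = Counter(t for d in top_docs for t in d.split())
        Q = sum(q_tf.values())
        return sum((tf / Q) * math.log((tf / Q) / (coll_tf[t] / C))
                   for t, tf in q_tf.items())

    # A focused result set diverges from the collection; an unfocused one does not.
    print(clarity(["the central bank cut rates", "interest rates and the economy"]))
    print(clarity(collection))   # query model equals collection model -> 0.0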


Search and Browse Log Mining for Web Information Retrieval: Challenges, Methods, and Applications

Half-day - Afternoon

Presenters
  • Daxin Jiang (Microsoft Research Asia)
  • Jian Pei (Simon Fraser University)
  • Hang Li (Microsoft Research Asia)
Abstract

Huge amounts of search log data have accumulated in various search engines. Currently, a commercial search engine receives billions of queries and collects terabytes of log data on any single day. Beyond search logs, browse logs can be collected by client-side browser plug-ins, which record browse information when users grant permission. Such massive amounts of search/browse log data, on the one hand, provide great opportunities to mine the wisdom of crowds and improve search results as well as online advertising. On the other hand, designing effective and efficient methods to clean, model, and process large-scale log data also presents great challenges. In this tutorial, we focus on mining search and browse log data for Web information retrieval. We consider a Web information retrieval system as consisting of four components, namely query understanding, document understanding, query-document matching, and user understanding, and we organize the tutorial materials along these four aspects. For each aspect, we will survey the major tasks, challenges, fundamental principles, and state-of-the-art methods. The goal of this tutorial is to provide a systematic survey of large-scale search/browse log mining for the IR community. It will help IR researchers become familiar with the core challenges and promising directions in log mining. At the same time, this tutorial may also serve developers of Web information retrieval systems as a comprehensive and in-depth reference on advanced log mining techniques.
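
As a taste of the user-understanding side, here is a minimal sketch of one basic log-mining building block: segmenting each user's query stream into sessions using a 30-minute inactivity timeout (a common convention, assumed here rather than taken from the tutorial). The log entries are invented.

    from collections import defaultdict
    from datetime import datetime, timedelta

    TIMEOUT = timedelta(minutes=30)

    # Invented log: (user, timestamp, query).
    log = [("u1", datetime(2010, 7, 19, 9, 0),  "sigir 2010"),
           ("u1", datetime(2010, 7, 19, 9, 5),  "sigir 2010 tutorials"),
           ("u1", datetime(2010, 7, 19, 14, 0), "geneva restaurants"),
           ("u2", datetime(2010, 7, 19, 9, 1),  "learning to rank")]

    sessions = defaultdict(list)   # user -> list of sessions (lists of queries)
    last_seen = {}
    for user, ts, query in sorted(log):
        if user not in last_seen or ts - last_seen[user] > TIMEOUT:
            sessions[user].append([])          # inactivity gap: start a new session
        sessions[user][-1].append(query)
        last_seen[user] = ts

    for user, user_sessions in sorted(sessions.items()):
        print(user, user_sessions)
    # u1 [['sigir 2010', 'sigir 2010 tutorials'], ['geneva restaurants']]
    # u2 [['learning to rank']]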


Information Retrieval for E-Discovery

Half-day - Afternoon

Presenter
  • Lewis (Independent Consultant)
Abstract

Discovery, the process under which parties to legal cases must reveal documents relevant to the issues in dispute, is a core aspect of trials in the United States, and a lesser, though still important, aspect of the law in other countries. The explosion of electronically stored information has led to a multi-billion dollar industry for e-discovery: discovery on electronically stored rather than paper documents. I will discuss the basics of the discovery process, the scale and diversity of the materials typically searched, and the economics of identifying and reviewing potentially responsive material. I will then focus on three major IR areas of interest: search, supervised learning (including text classification and relevance feedback), and support for manual relevance assessment. For each, I will discuss technologies currently used in e-discovery, the evaluation methods applicable to measuring effectiveness, and existing research results not yet seen in commercial practice. I will also outline research directions that, if successfully pursued, would be of great interest to the e-discovery industry. A particular focus will be on areas where progress can be made without access to operational e-discovery environments or “realistic” test collections. Connections will be drawn with the use of IR in related tasks, such as enterprise search, legal investigations, intelligence analysis, historical research, truth and reconciliation commissions, and freedom of information (open records or sunshine law) requests.
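
The supervised-learning component above maps naturally onto text classification. Below is a toy sketch of a multinomial naive Bayes classifier separating responsive from non-responsive documents; the training documents are invented, and a real e-discovery pipeline would involve far more than this.

    import math
    from collections import Counter, defaultdict

    # Invented training set: documents labeled responsive (1) or non-responsive (0).
    train = [("meeting notes on the merger agreement", 1),
             ("draft contract amendments for review", 1),
             ("lunch menu for friday", 0),
             ("office holiday party schedule", 0)]

    # Multinomial naive Bayes with add-one smoothing.
    class_docs, class_tf, vocab = defaultdict(int), defaultdict(Counter), set()
    for text, label in train:
        class_docs[label] += 1
        class_tf[label].update(text.split())
        vocab.update(text.split())

    def log_posterior(text, label):
        prior = math.log(class_docs[label] / len(train))
        total = sum(class_tf[label].values())
        return prior + sum(math.log((class_tf[label][t] + 1) / (total + len(vocab)))
                           for t in text.split())

    def classify(text):
        return max(class_docs, key=lambda label: log_posterior(text, label))

    print(classify("notes on the draft merger contract"))  # -> 1 (responsive)
    print(classify("party menu"))                          # -> 0 (non-responsive)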
