\chapter{Introduction}
\label{ch:intro}

\section{Motivation}
% What is the web, what is web ranking, information retrieval

Given the vast amount of data on the Web today, search engines have become a key tool for finding information of interest. At their core, search engines are information retrieval (IR) systems. Information retrieval is the area of computer science concerned with obtaining, from a document collection, information relevant to an information need. Research in IR today spans modeling, web search, text classification, systems architecture, user interfaces, data visualization, and more \citep{baeza1999modern}. In practice, IR consists of building efficient indexes, processing user queries with high performance, and developing ranking algorithms that improve the quality of the results.

The task of a search engine is to handle a user query and retrieve results relevant to that query in the form of a search engine result page (SERP). This can be a challenging problem, especially given the amount of data processed by search engines today. For example, Google, being the top search engine by market share, processes around 3.5 billion searches per day and has an index containing hundreds of billions of documents. Moreover, the rate of content change is higher than ever, emphasizing the need for results that are up to date.

The retrieved results are ranked according to several properties. The three main properties we identify are the importance of the document source, relevance to the query, and recency. This thesis focuses on introducing recency ranking to an existing commercial search engine.

\section{Background}
% explain search engine architecture, ndcg, ltr

A traditional search engine supports three fundamental tasks: web crawling, indexing, and searching. The crawler is responsible for discovering pages and downloading them to the search engine's internal store. The content is parsed and indexed by the indexer. Finally, the search component is responsible for matching the documents to the user query and ranking them. A high-level overview of the search process is shown in Figure~\ref{fig:searchengine}.

\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{img/searchengine.pdf}
\caption{An example of a search engine architecture.}
\label{fig:searchengine}
\end{figure}

Modern search engines use learning-to-rank algorithms for the ranking task. Learning to rank is the application of machine learning in information retrieval systems in order to construct ranking models. Such a model, typically supervised, is trained on data consisting of query-document pairs, each labeled with a relevance degree. The purpose of the model is to rank documents, i.e., produce an ordering that is optimal with respect to an evaluation metric.

Commonly used evaluation metrics for ranking include mean average precision (MAP), discounted cumulative gain (DCG) and its normalized variant NDCG, precision@$n$ and NDCG@$n$ (where @$n$ denotes that the metric is evaluated only on the top $n$ documents), mean reciprocal rank, Kendall's tau, and Spearman's rho. We use NDCG@$n$, so we explain it in detail here. NDCG is a normalized version of discounted cumulative gain \citep{jarvelin2002cumulated}. Using a graded relevance scale for the documents in a search engine result set, DCG measures the usefulness, or gain, of a document based on its position in the result list. The gain is accumulated from the top of the result list to the bottom, with the gain of each result discounted at lower ranks: the graded relevance value is reduced logarithmically with the position of the result, thereby penalizing highly relevant documents that appear lower in the result list. The DCG at rank position $p$ is defined as
\[ \mathrm{DCG_{p}} = \sum_{i=1}^{p} \frac{rel_{i}}{\log_{2}(i+1)}, \]
where $rel_{i}$ is the graded relevance of the result at position $i$.

We use normalized DCG because search result lists vary in length depending on the query. In this case, the cumulative gain at each position for a chosen value of $p$ should be normalized across queries. We do this by sorting all relevant documents in the corpus by their relative relevance, producing the maximum possible DCG through position $p$, called the Ideal DCG (IDCG) through that position. NDCG is defined as
\[ \mathrm{nDCG_{p}} = \frac{\mathrm{DCG_{p}}}{\mathrm{IDCG_{p}}}, \]
and IDCG is defined as
\[ \mathrm{IDCG_{p}} = \sum_{i=1}^{|REL|} \frac{rel_{i}}{\log_{2}(i+1)}, \]
where $|REL|$ denotes the number of relevant documents in the corpus, ordered by their relevance, up to position $p$.
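
As a brief illustration with hypothetical relevance grades (not taken from our data), suppose the top three results for a query have graded relevances $3$, $1$, and $2$ at positions $1$, $2$, and $3$. Then
\[ \mathrm{DCG_{3}} = \frac{3}{\log_{2}(2)} + \frac{1}{\log_{2}(3)} + \frac{2}{\log_{2}(4)} \approx 3 + 0.63 + 1 = 4.63, \]
while the ideal ordering $(3, 2, 1)$ of the same documents gives
\[ \mathrm{IDCG_{3}} = \frac{3}{\log_{2}(2)} + \frac{2}{\log_{2}(3)} + \frac{1}{\log_{2}(4)} \approx 3 + 1.26 + 0.5 = 4.76, \]
so $\mathrm{nDCG_{3}} \approx 4.63 / 4.76 \approx 0.97$; a perfect ordering would yield $\mathrm{nDCG_{3}} = 1$.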

In an online scenario, when a user types a query into a search engine, they expect to see results in a very short time, typically a few hundred milliseconds. This makes it impossible to evaluate a complex ranking model on every document, so modern search engines use a two-phase retrieval and ranking strategy \citep{cambazoglu2010early}. In the first phase, the scoring is very simple, enabling a quick retrieval process. The second phase then uses a more complex model, since only a small subset of the overall document index remains to be ranked.
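
A minimal sketch of this two-phase strategy is shown below; the toy corpus and both scoring functions are illustrative assumptions of ours, not the actual implementation. The first phase scores every document with a cheap function and keeps only the top $k$ candidates, which the second phase re-ranks with a more expensive scorer:
\begin{verbatim}
# Minimal sketch of two-phase retrieval and ranking; the corpus layout
# and both scoring functions are illustrative stand-ins only.

def phase1_score(query_terms, doc_terms):
    # Cheap first-phase score: simple term-overlap count
    # (a stand-in for e.g. a BM25-style score).
    return len(set(query_terms) & set(doc_terms))

def phase2_score(query_terms, doc_terms):
    # Expensive second-phase score: placeholder for a learned
    # ranking model applied to a query-document feature vector.
    overlap = len(set(query_terms) & set(doc_terms))
    return overlap / (1.0 + abs(len(doc_terms) - len(query_terms)))

def search(query, corpus, k=1000, n=10):
    query_terms = query.lower().split()
    # Phase 1: score every document cheaply, keep the top-k candidates.
    candidates = sorted(corpus.items(),
                        key=lambda kv: phase1_score(query_terms, kv[1]),
                        reverse=True)[:k]
    # Phase 2: re-rank only the small candidate set with the costly model.
    reranked = sorted(candidates,
                      key=lambda kv: phase2_score(query_terms, kv[1]),
                      reverse=True)
    return [doc_id for doc_id, _ in reranked[:n]]

corpus = {"d1": "madrid weather forecast today".split(),
          "d2": "history of madrid".split(),
          "d3": "weather in barcelona".split()}
print(search("madrid weather", corpus, k=2, n=2))
\end{verbatim}
In a real system, the first phase would run over an inverted index rather than the whole corpus, and the second-phase scorer would be the learned ranking model described above.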

The query-document pairs are scored, and then ranked, based on their feature vectors. The features can be divided into three groups (a schematic example follows the list):
\begin{enumerate}
\item Query-independent or static features: features that depend only on the document, not on the query, for example PageRank \citep{page1999pagerank} or the document's length. These features can be precomputed offline when indexing the document.
\item Query-dependent or dynamic features: features that depend on both the contents of the document and the query, such as the similarity between the query terms and the document content.
\item Query features: features that depend only on the query, such as the number of words in the query.
\end{enumerate}
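
Schematically, a feature vector for a single query-document pair groups these three types as in the hypothetical example below (the feature names and values are ours, purely for illustration):
\begin{verbatim}
# Hypothetical feature vector for a single query-document pair;
# the feature names and values are illustrative only.
features = {
    # Query-independent (static): precomputed at indexing time.
    "pagerank": 0.37,
    "doc_length": 1842,
    # Query-dependent (dynamic): need both the query and the document.
    "title_similarity": 4.2,
    "body_similarity": 7.9,
    # Query features: depend only on the query.
    "query_num_terms": 3,
}
\end{verbatim}
In practice, such a representation is flattened into a fixed-order numeric vector that the learning-to-rank model consumes together with the relevance labels.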

\section{Problem Definition}
% talk about relevance vs recency and why it's difficult
The effectiveness of a search engine depends on the relevance of the result set it returns. Search engines employ ranking methods to provide the best results first. As more and more content is put on the Web, and more quickly than ever, the temporal aspect of results has become very important for modern search engines. In web search, recency ranking refers to ranking documents both by their relevance to the user query and by their freshness. A search engine can appear stale if it fails to recognize the temporal aspect of a query, which also hurts the user experience. Effectively combining temporal and relevance features in a single ranking framework is still an open question for the research community.

The connection between relevance and recency is not always clear: optimizing for only one of them might decrease the effectiveness of the search engine. It is therefore necessary to find an equilibrium. The trade-off can also depend on the user's query. For example, for a recency-sensitive query, i.e., a query for which the user expects topically relevant documents to also be fresh, recency is very important. On the other hand, for timely queries, i.e., queries that do not show a spike in their popularity over time, relevance should be more important. Figure~\ref{fig:serp} shows an example of a blend of relevant and recent results.

Even when the query type is identified, in order to incorporate both recent and relevant results, we need to determine the recency component on the document side. When dealing with general documents from the Web, this is not a trivial task, as most of the documents do not have any temporal information.

\begin{figure}
\centering
\includegraphics[width=0.7\textwidth]{img/serp.png}
\caption{An example of a search engine result page (SERP) from Google for the query \textit{Carles Puigdemont} submitted on June 29, 2018.}
\label{fig:serp}
\end{figure}

In our experiments, we ask the following research questions:
\begin{enumerate}
\item How can we determine if a document's content is recent without having information on its publication date?
\item How can we determine when queries seek recent results?
\item How can we introduce recency ranking into an existing relevance ranking architecture?
\end{enumerate}

\section{Contributions}
The contributions of this thesis are both theoretical and practical. In Chapter~\ref{ch:doc}, we present a document age prediction model, used to predict how recent a document is by extracting temporal evidence from its content. In Chapter~\ref{ch:query}, we present a query recency sensitivity classifier, used to predict the need for recent results for a query. Finally, in Chapter~\ref{ch:webranking}, we present a novel method of incorporating recency into an existing search engine with the help of these two classifiers. We also present an exhaustive evaluation of all presented models.

\chapter{Related Work}
\label{ch:relatedwork}

% Carbon Dating The Web: Estimating the Age of Web Resources: https://arxiv.org/pdf/1304.5213.pdf

% Predicting document creation times: https://dbs.ifi.uni-heidelberg.de/files/Team/aspitz/publications/Spitz_et_al_Predicting_Document_Creation_Times.pdf

In the past decade, there have been a number of papers approaching the problem of combining relevance and recency in ranking. Some of them deal with more specific domains than web search, such as mail search \citep{carmel2017promoting}, microblog retrieval \citep{efron2012query}, and news search \citep{dakka2012answering}. In the domain of mail search, a recency-only ranking is traditionally implemented. Even when time-ranked results are mixed with relevance-based results, as \cite{carmel2017promoting} report, showing duplicate results to the user is not discouraged. These two key points make mail search a different problem than web search. Moreover, when it comes to news search, it is assumed that the publication date of the document is available \citep{dakka2012answering}. For the majority of documents on the Web, this is not the case, making web search a more difficult problem.

The works that directly improve recency ranking in web search are \citep{dong2010towards,dong2010time,dai2011learning,styskin2011recency}. \citet{dong2010towards} focus on breaking-news queries and build different rankers. If a query is classified as recency-sensitive, a recency-sensitive ranker is used to rank the documents matching that query. Similar to our work, they model different time slots by building language models from different sources (news content and queries) and compare them in order to classify whether a query is recency-sensitive. Furthermore, they extract recency features from documents and define the \textit{page age} as the time between the query submission time and the page publication time, which is either the page creation time or the time it was last updated. The recency features they extract from documents are content-based and link-based; in our approach, we only extract content-based evidence. Finally, they manually annotate query-url pairs and train two separate models: a recency ranker and a relevance ranker. The key differences between this work and ours are that we do not manually annotate data but automatically produce labels, that we use a single ranking model for all queries, and that we focus on all types of queries.

Extending their existing approach, \citet{dong2010time} introduce Twitter features to help with recency ranking. However, they do not focus on the fresh content produced by tweets to determine if the query terms are recency-sensitive, but instead they employ a URL mining approach to detect fresh URLs from tweets and learn to rank them.

Instead of having separate rankers, \citet{dai2011learning} propose a machine learning framework for simultaneously optimizing recency and relevance, focusing on improving the ranking of results for queries based on their temporal profiles. They determine the temporal profile of a query by building a time series of the relevant documents' content changes. Next, they extract temporal features from the seasonal-trend decomposition of different time series. We could not use this approach, as we only keep the most updated snapshot of documents in our index.

Similarly, \citet{cheng2013fresh} model how the term distribution of results changes over time and conclude that this change is strongly correlated with the users' perception of time sensitivity. However, they only focus on improving ranking for timely queries, i.e., queries that have no major spikes in volume over time, but still favour more recently published documents. In a similar work, \citet{efron2011estimation} assume that the document publication time is known. They compute a query-specific parameter that captures recency sensitivity and is calculated from the distribution of the publication times of the top documents retrieved by a relevance-based ranker.

However, when the document publication time is not known, we have to search for other indicators of document recency and query recency sensitivity. For example, \citet{campos2016gte} use temporal expressions from web snippets related to the query to improve the ranking; in this work, we do the same. Other possible sources of temporal features include click logs and query logs. \citet{wang2012joint} learn the relevance and recency models separately from two different sets of features. Their divide-and-conquer learning approach is similar to \citep{dai2011learning}, but they omit manually annotating data and instead automatically infer labels using clickthrough data. We do not use click logs, but we do extract frequency features from our query log, similarly to \citep{metzler2009improving, lefortier2014online}.

So far, we have explained how related work has extended existing learning-to-rank algorithms to take into account both relevance and recency. Another approach to recency ranking is an aggregated search strategy: recent results are extracted from a fresh vertical (such as news articles) and subsequently integrated into the result page. The blending of results from different verticals is called result set diversification. Examples of this approach are \citep{lefortier2014online,styskin2011recency}. In these two papers, a classifier scores the recency sensitivity of a user query, and this score determines to what extent documents from the fresh vertical should be inserted. Even though we do not perform result set diversification, we share similarities with these papers. The query fresh intent detector from \citep{lefortier2014online} is similar to our query recency classifier in terms of the features used, and we have modeled our query ground-truth labels according to the distribution reported in \citep{styskin2011recency}.