SoftCorporation LLC.

How to Measure the Relevance of an Enterprise Search Engine

I want to share with you my thoughts on the problem of measuring the quality of an enterprise search engine. Actually, I should say measuring the relevance of search results (or effectiveness), but it is pretty much the same thing, assuming that the search engine is fast and indexing works, i.e. all documents are properly indexed, formats and types are recognized, and all metadata are extracted. You might think this must be a common task, but, strangely enough, looking in books, publications, and presentations, and simply searching the Internet, I could not find a satisfying answer to this particular question. After spending a fair amount of time I gave up and decided to use my own methodology, which I present here for your discussion.

First I want to talk about what we are measuring. The answer is simple - relevance. But what is relevance? According to Wikipedia: "Relevance describes how pertinent, connected, or applicable something is to a given matter. A thing is relevant if it serves as a means to a given purpose." Not bad, but how does it relate to the search engine? Again, in Wikipedia we can find: "In information science and information retrieval, relevance denotes how well a retrieved document or set of documents meets the information need of the user." Much better. This basically means that relevance can be measured in terms of how well the search engine can retrieve a document, or a set of documents, which I need. Hey, I'm going to use that!

Let's see what is out there. There are (of course) plenty of ideas published on the Internet about how to measure the relevance of public search engines. There are the Text REtrieval Conference (TREC) evaluation measures. There is the Cranfield-style evaluation of relevance. They are all pretty much based on the same idea, which has been in place since the 1960s or even earlier. There are some interesting newer methods, like the reciprocal rank measure, the rank-biased precision measure, and methods addressing novelty and diversity.

Most of the relevance evaluations that I found on the Internet propose to run a query, then look at the results and decide how relevant they are. I have an objection to this. If you try to pick a query, look at the results, and decide what is relevant and what is not, you can quickly roll into a grey area, where the perception of the results depends on the person doing the test: his (or her) education, knowledge of the subject, knowledge of the underlying collection, knowledge of the search conditions, et cetera, including the number of drinks he (or she) consumed at last night's party. I like how Wikipedia puts it on the same page: "The important thing to recognize is, however, that relevance is fundamentally a question of epistemology, not psychology." Thank God, at least psychology is in second place!

To help you realize how difficult the decision on relevance might be, I want to give you a little quiz: do you think the annual report of a large financial institution is relevant to the query "credit card"? To make it easier, I can add that many pages in that report discuss how well the company did in the credit card business. Is that document relevant? You may say yes, or you may say no. I wouldn't argue with you either way, because you may be right in both cases. Or you may be wrong... But what if I tell you that this is the only document in your collection that mentions anything about credit cards? Isn't it relevant now? It is interesting how easily our brain can switch from "not relevant" to "very relevant" in an instant, with just a little bit of additional information! Actually, I believe the proper answer to this question is: "It depends". Since it is about relevance, the answer itself is, sort of, relative. To have any opinion, I guess, we at least need to know what else is out there.

Now let's take a look at relevance measures, first of all at precision and recall, as they are used in measuring the quality of a search engine. Again from Wikipedia: "Precision is the fraction of the documents retrieved that are relevant to the user's information need." Oops, stop! So, in order to measure relevance, we have to know whether the documents retrieved are relevant to the user's information need... But, again, what does it mean to be relevant? And how relevant? What if they are just a little bit relevant? And what if they are so marginally relevant that I would rather say they are not relevant at all? To get around this particular question, TREC uses the pooling technique and a number of human assessors to judge relevance. As for recall, I'm not sure how well it serves the purpose of evaluation at all: with the number of documents in any enterprise today, almost any query will return hundreds or thousands of hits, and the reality is that nobody cares whether there are one hundred or ten thousand results. Do you know what your users care about? They care about result number one - big time! They also care about the other results on the first page. They care, but much less, about the results on the second page, and if they have gone down to the third (or a further) page of search results, it most likely means that your search engine does not work very well.
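To make the precision and recall definitions concrete, here is a minimal sketch of how both are computed over one result set. The document ids and relevance judgments are made up for illustration; note that both formulas require someone to have already decided which documents are relevant, which is exactly the problem discussed above.

```python
def precision_recall(retrieved, relevant):
    """retrieved: ordered list of doc ids returned by the engine;
    relevant: set of doc ids judged relevant to the information need."""
    hits = len(set(retrieved) & set(relevant))
    # Precision: fraction of retrieved documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: fraction of all relevant documents that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: the engine returned 4 documents, 2 of them relevant,
# while the collection contains 3 relevant documents in total.
p, r = precision_recall(["d1", "d2", "d3", "d4"], {"d1", "d3", "d9"})
print(p, r)  # p = 0.5, r ≈ 0.667
```

Notice that recall needs the full set of relevant documents in the collection, which is why it is so hard to obtain in an enterprise setting.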

I'm also not convinced that it is realistic to implement a Cranfield-style evaluation of relevance if you want to compare how different enterprise search engines work with your dataset. The Cranfield-style evaluation is very demanding. It involves assigning a relevance level to each retrieved result; once relevance levels have been assigned, the information retrieval performance measures can be used to assess the quality of a retrieval system's output. To do such an evaluation you basically have to know the entire collection and review each document in the collection in terms of its relevance to the query. This is a lot of work, and it is a very, very tricky part. And this is why I'm skeptical about this approach - you will simply have no time to do such an evaluation.

I have to say that I like the idea of graded relevance assessments, like the Discounted Cumulative Gain measure, but, again, it requires knowledge of the ideal gain vector, which you may not have.
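For reference, here is a small sketch of DCG and its normalized form, nDCG, using the common log2 discount. The graded judgments (0 = not relevant up to 3 = highly relevant) are invented for illustration; the "ideal gain vector" mentioned above is the sorted-best ordering of those judgments, which is precisely what you may not know for your collection.

```python
import math

def dcg(gains):
    """Discounted Cumulative Gain: each graded relevance value is
    discounted by log2 of its rank (rank 1 gets no discount)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(gains, ideal_gains):
    """Normalize by the DCG of the ideal (best achievable) ranking."""
    ideal = dcg(sorted(ideal_gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0

# Hypothetical graded judgments for the top 4 results of one query.
# A perfect ranking would yield nDCG = 1.0; this one is slightly off
# because a grade-0 document outranks a grade-1 document.
print(ndcg([3, 2, 0, 1], [3, 2, 1, 0]))
```

The hard part is not the arithmetic but producing the graded judgments and the ideal vector in the first place.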

Anyway, let's talk about the methodology I used. I have to say that I would rather not have had to invent it, and it is not without problems of its own, but, at the same time, I could not find anything better, it is relatively easy to implement, and it appears to work.

First, I want to mention three important points:

A. Make sure that all evaluated search engines contain the same documents in the collection we are testing. The fact of life is that you might never get exactly the same number of documents, but at least make sure it is as close as possible. I also want the number of documents to be significant. We don't want to get into a situation where all search engines return only a few results; it is hard to judge which search engine is better in that case. So, make the collection bigger. The bad news is that you may find that the more documents you have in the collection, the less relevance you will see in the results. The good news is that you can simply index a larger collection of different documents to get more distinct results in terms of relevance.

B. Establish the base of relevance. Yes, that's it. If we are talking about relevance, I want to have a base result, which is very relevant to the evaluated query, and relative to which I will judge other results. This is very important. Without a base, relevance has no meaning. I guess it is because your relevance statements in such a case are baseless, but that's just me ;-). With a base, I can measure the relevance of results in relation to it. And now the relevance judgment becomes very simple: just compare the other results to the base!

C. Invite several people with different backgrounds to do the testing. This is a no-brainer: since we still have to judge relevance (there is no way around it), we will be less subjective with more people performing the test. You only need to make sure they know how to do the testing properly.

Here we go, let's get started.

1. Pick any document from the indexed collection and make sure all our search engines have this document indexed. You should be able to verify that a document was indexed by using the exact document title, or some long phrase from the document, as a search query. This document will be our base (actually, one of our bases).
2. Study this document and come up with 2 or 3 different queries for which you would expect this document to be at the top of the search results, i.e. queries that are very relevant to this document. When picking a query, try to use words that return a fairly significant number of search results. Try not to use an exact match of the document title, as it is a very easy task for any search engine to return the document with an exact title match as number one in the results.
3. Run the selected search queries on each search engine, make sure the results are sorted by relevance, and record the position of the base document in the search results. If you cannot find the document on the first two pages of results (let's say we have 10 results per page), just record that the document was not found and put 21 as the score. Note that if none of the tested search engines can find the document on the first two pages, you probably selected a poor query for the document, or you might have too many very similar documents. If this is the case, try to pick another query, or another document.
4. Review each document in the search results that is positioned above the base document, and decide whether you consider it more relevant to the search query than the base document. Do not base your decision on a statement like "it also has these words"; look at the content of the document and decide based on the whole content. It might be time consuming, but the decision on relevance is usually easy because you have a base.
5. Compute the score for each query and each search engine based on the position of the base document in the search results. For each document above the base (from step 4), deduct 1 from the position of the base document if you have concluded that the document is equally relevant, or even more relevant, to the search query than the base document. For example, if the base document is in position 3, the document in position 1 is more relevant to the search query than the base, and the document in position 2 is less relevant than the base, the score would be 2. If both documents on top are equally relevant to the search query as the base document, the score would be 1. If both documents on top are less relevant than the base document, the score would be 3.
6. Repeat steps 1-5 for different documents. Pick large and small documents, pick different formats, pick documents that are major landing pages on your site and documents that are hidden in the depths of the navigation path, i.e. test different scenarios.
7. Do steps 1-6 with different people testing, but make sure they know the procedure and have no preference for one particular search engine. Let them test the area they are familiar with. The more tests you do, the better the result will be.
8. Put the obtained scores for each query in an Excel table, with a separate sheet for each tested search engine.
9. Now you can calculate the average score for each search engine (using the obtained scores for each base document), and it will reflect the relevance of the search engine. The average score may help you rank the tested search engines by relevance; however, I recommend obtaining the following measures to see the picture more clearly.
10. Calculate the percentage of results where the search engine scored 1, the percentage where it scored 1-5, and the percentage with a score greater than 10. In terms of relevance, the interpretation of these measurements is as follows: if you search for a target document with a query that is very relevant to it, the resulting table shows how often the search engine places the target document as the first result, within the first 5 results, or not on the first page (we agreed earlier that our search results page has 10 results).
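The scoring and aggregation from steps 5, 9 and 10 can be sketched as follows. This assumes each test run is recorded as the base document's position plus the number of results above it that the tester judged at least as relevant; the function names and sample data are made up for illustration.

```python
NOT_FOUND = 21  # base document not on the first two pages (10 results/page)

def query_score(base_position, better_or_equal_above):
    """Step 5: deduct 1 from the base position for every result above it
    that was judged equally or more relevant than the base document."""
    if base_position is None:
        return NOT_FOUND
    return base_position - better_or_equal_above

def summarize(scores):
    """Steps 9-10: average score plus the percentage buckets."""
    n = len(scores)
    return {
        "average": sum(scores) / n,
        "pct_top1": 100.0 * sum(1 for s in scores if s == 1) / n,
        "pct_top5": 100.0 * sum(1 for s in scores if 1 <= s <= 5) / n,
        "pct_beyond_page1": 100.0 * sum(1 for s in scores if s > 10) / n,
    }

# Example runs: base at position 3 with one better result above it scores
# 2 (as in the step 5 example); a base that was never found scores 21.
runs = [(3, 1), (1, 0), (7, 2), (None, 0)]
scores = [query_score(pos, above) for pos, above in runs]
print(scores, summarize(scores))
```

One sheet of scores per engine, fed through something like `summarize`, yields exactly the table described in step 10.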

This table is very easy to understand and can be presented to your management, who will pay many thousands, or maybe even millions, of dollars for the search engine, and who really wants to know why you picked this search engine and not that one.

You may ask: what if I get the target (base) document at the very top - what about the other documents in the search results? Why are we not counting them? Can I look at each document, at least on the first page? You can look, but there is a big BUT. Imagine you got 10 results for your query. Number one is perfect and the other 9 are junk, returned only because they contain words from your query. Does your search engine work well or not? In my opinion, most likely it does. You wanted to find the target document, right? You got it at the top; what else do you want? If you are concerned that the other documents are not relevant, first make sure that there are other documents in your collection that are more relevant. Quite possibly there is nothing, which means your search engine works perfectly. However, if you see that another search engine found a better document, try experimenting with that document, using it as a base document for testing. You might discover some problems with the first search engine, or with the indexing process.

In conclusion, I want to say that the problem is complex and this might not be the best way to measure relevance, but it is easy to try and you can see results immediately. I'll be happy to hear your critique and suggestions, and if you know a better method, please share it here.

Vadim Permakoff

Published: June 2011

For more information send request to: