I want to share my thoughts on the problem of measuring the quality of an enterprise search engine. More precisely, I should say measuring the relevance (or effectiveness) of search results, but it is pretty much the same thing, assuming that the search engine is fast and indexing works, i.e. all documents are properly indexed, formats and types are recognized, and metadata is extracted. You might think this is a common task but, strangely enough, after looking in books, publications, and presentations, and simply searching the Internet, I could not find a satisfying answer to this particular question. After spending a fair amount of time I gave up and decided to use my own methodology, which I want to present here for discussion.
First, I want to talk about what we are measuring. The answer is simple - relevance. But what is relevance? According to Wikipedia: "Relevance describes how pertinent, connected, or applicable something is to a given matter. A thing is relevant if it serves as a means to a given purpose." Not bad, but how does it relate to a search engine? Again, in Wikipedia we can find: "In information science and information retrieval, relevance denotes how well a retrieved document or set of documents meets the information need of the user." Much better: this basically means that relevance can be measured in terms of how well the search engine can retrieve a document the user needs. Hey, I'm going to use that!
Let's see what is out there. There are (of course) a bunch of ideas published on the Internet about how to measure the relevance of public search engines. There are the Text Retrieval Conference (TREC) evaluation measures. There is Cranfield-style evaluation of relevance. They are all pretty much based on the same idea, which has been in place since the 1960s, or even earlier. There are also some interesting newer methods, like the reciprocal rank measure, the rank-biased precision measure, and methods addressing novelty and diversity.
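For readers who have not met the two newer measures named above, here is a minimal sketch of how they are commonly defined. The function names, the persistence value, and the sample judgments are my own illustrative assumptions, not part of any particular toolkit.

```python
from typing import Sequence


def reciprocal_rank(relevant_flags: Sequence[bool]) -> float:
    """Return 1/rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def rank_biased_precision(relevant_flags: Sequence[bool],
                          persistence: float = 0.8) -> float:
    """RBP = (1 - p) * sum of p**(rank - 1) over the relevant ranks.

    `persistence` (p) models how likely a user is to move on to the
    next result; 0.8 is an arbitrary illustrative default.
    """
    # enumerate() starts at 0, which is exactly the exponent (rank - 1).
    return (1.0 - persistence) * sum(
        persistence ** index
        for index, is_relevant in enumerate(relevant_flags)
        if is_relevant
    )


# Hypothetical judgments for one query: only the third result is relevant.
flags = [False, False, True, False]
print(reciprocal_rank(flags))
print(rank_biased_precision(flags, persistence=0.5))
```

Note that both measures still need somebody to decide which results are relevant, which is exactly the difficulty discussed next.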
Most of the relevance evaluations I found on the Internet propose to run a query, then look at the results and decide how relevant they are. I have an objection to that. If you pick a query and then look at the results and decide what is relevant and what is not, you can quickly slide into that grey area where the perception of results depends on the person doing the test: his (or her) education, knowledge of the subject, knowledge of the underlying collection, knowledge of search conditions, et cetera, including the number of drinks he (or she) consumed at yesterday's party. I like how Wikipedia puts it on the same page: "The important thing to recognize is, however, that relevance is fundamentally a question of epistemology, not psychology." Thank God, at least psychology is in second place!
To help you realize how difficult the decision on relevance might be, I want to give you a little quiz: Do you think the annual report of a large financial institution is relevant to the query "credit card"? To make it easier I can add that many pages in that report talk about how well the company did in the credit card business. Is that document relevant? You may say "Yes", or you may say "No". I wouldn't argue with you either way, because you may be right in both cases. Or, you may be wrong... But what if I tell you that this is the only document in your collection that mentions anything about credit cards? Isn't it relevant now? It is interesting how easily our brain can switch from "not relevant" to "very relevant" in an instant, just with a little bit of additional information! Actually, I believe that the proper answer to this question should be: "It depends". As it is about relevance, the answer itself is, sort of, relative. To have any opinion, I guess, we need to know at least what else is there.
Now let's take a look at the relevance measures, precision and recall, as they are used in measuring the quality of a search engine. Again from Wikipedia: "Precision is the fraction of the documents retrieved that are relevant to the user's information need." Oops, stop! So, in order to measure relevance, we have to know if the documents retrieved are relevant to the user's information need... But, again, what does it mean to be relevant? How relevant? What if they are just a little bit relevant? And what if they have so little relevance that I would like to say that they are not relevant at all? To get around this particular question, TREC uses the pooling technique and a number of human assessors to judge relevance. And regarding recall, I'm not sure at all how well it serves the purpose of our evaluation: with the current number of documents at any enterprise, almost all queries will return hundreds or thousands of hits, and the reality is that nobody cares whether there are one hundred or ten thousand results. Do you know what your users care about? They care about result number one - big time! They also care about the other results on the first page. They care, but much less, about the results on the second page. If they have to go down to the third page of search results (or further), it most likely means that your search engine does not work well.
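To make the "how relevant is relevant?" problem concrete, here is a small sketch of precision computed over the first page of results. It assumes each result has already been given a graded judgment (0 = not relevant, 3 = very relevant); the grading scale, the function name, and the sample grades are hypothetical.

```python
from typing import Sequence


def precision_at_k(grades: Sequence[int], k: int, min_grade: int) -> float:
    """Fraction of the top-k retrieved results whose grade is >= min_grade.

    Divides by the number of results actually present within the cutoff,
    which is one of two common conventions (the other divides by k).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = grades[:k]
    if not top:
        return 0.0
    return sum(1 for grade in top if grade >= min_grade) / len(top)


# Hypothetical graded judgments for the first five results of one query.
grades = [3, 1, 0, 1, 2]

# Counting "a little bit relevant" (grade 1) as relevant...
lenient = precision_at_k(grades, k=5, min_grade=1)
# ...versus requiring at least grade 2.
strict = precision_at_k(grades, k=5, min_grade=2)
print(lenient, strict)
```

The same five results give a precision twice as high under the lenient threshold as under the strict one, so the number says as much about the assessor's cutoff as about the search engine.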
I'm also not convinced that it is realistic to implement Cranfield-style evaluation of relevance with your dataset. Cranfield-style evaluation is very demanding. It involves assigning a relevance level to each retrieved result, and once the relevance levels have been assigned, the information retrieval performance measures can be used to assess the quality of a retrieval system's output. To do such an evaluation you basically have to know the entire collection and review each document in the collection in terms of its relevance to the query. This is a lot of hard work and it is very tricky. And this is why I'm skeptical about this approach - you will simply have no time to do such an evaluation.
I have to say that I like the idea of graded relevance assessments, like the Discounted Cumulative Gain measure, but, again, it requires knowledge of the ideal gain vector, which you may not have.
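Here is a minimal sketch of Discounted Cumulative Gain and its normalized form, to show where the ideal gain vector comes in. It uses the common linear-gain variant (gain divided by log2 of rank + 1); the function names and the sample gains are my own assumptions.

```python
import math
from typing import Sequence


def dcg(gains: Sequence[float]) -> float:
    """Discounted Cumulative Gain: sum of gain / log2(rank + 1), rank from 1."""
    return sum(gain / math.log2(rank + 1)
               for rank, gain in enumerate(gains, start=1))


def ndcg(gains: Sequence[float], ideal_gains: Sequence[float]) -> float:
    """DCG normalized by the DCG of the best possible ordering.

    `ideal_gains` must hold the grades of ALL relevant documents in the
    collection for this query; this is the knowledge you may not have.
    """
    best_order = sorted(ideal_gains, reverse=True)[:len(gains)]
    ideal = dcg(best_order)
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Hypothetical graded judgments for the six results one engine returned.
returned = [3, 2, 3, 0, 1, 2]
# Hypothetical grades of every relevant document known for this query.
known_relevant = [3, 3, 3, 2, 2, 2, 1]

print(dcg(returned))
print(ndcg(returned, known_relevant))
```

If `known_relevant` is incomplete, the normalization is wrong, which is the practical obstacle in an enterprise collection nobody has fully reviewed.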
Anyway, let's talk about the methodology I used. I have to say that I would rather not have had to come up with it, and it is not without problems of its own, but, at the same time, I could not find anything better, it is relatively easy to implement, and it appears to work.
First, I want to mention 3 important points:
A. Make sure that all evaluated search engines contain the same documents in the collection we are testing. The fact is that you might never get exactly the same number of documents, but at least make sure it is as close as possible. I also want the number of documents to be significant. We don't want to get into a situation where all search engines return only a few results. It is hard to judge which search engine is better in such a case. So, make the collection bigger. The bad news is that you may find that the more documents you have in a collection, the less relevance you will see in results. The good news is that you can simply index a larger collection of different documents to have more distinct results in terms of relevance.
B. Establish the base of relevance. Yes, that's it. If we are talking about relevance, I want to have a base result, which is very relevant to the evaluated query, and relative to which I will judge other results. This is very important. Without a base there is no meaning of relevance. I guess it is because your relevance statements in this case are baseless, but that's just me ;-). With the base I can measure the relevance of results in relation to this base. And now the relevance judgment becomes very simple: just compare each other result to the base!
C. Invite several people with different backgrounds to do the testing. This is a no-brainer: as we still have to judge relevance (there is no way around it), we will be less subjective with more people performing the test. You only need to make sure they know how to do the testing properly.
Here we go, let's get started.
1. Pick any document from the indexed collection and make sure all our search engines have this document indexed. You should be able to verify that a document was indexed by using an exact match of the title as a search query, or some long phrase from the document. This document will be our base (actually one of our bases).
2. Study this document and come up with 2 or 3 different queries for which you would expect this document to be on top of the search results, i.e. these queries should be very relevant to this document. When picking a query, try to use words that would return a rather significant number of search results. Try not to use an exact match of the document title, as it is a very easy task for any search engine to return the document with an exactly matching title as number one in the results.
3. Run the selected search queries on each search engine, make sure the results are sorted by relevance, and record the position of the base document in the search results. If you cannot find that document on the first two pages of results (let's say we have 10 results per page), just record that the document was not found and put 21 as the score. Note that if the tested search engines cannot find that document on the first two pages, you must have selected a bad query for the document, or you might have too many very similar documents. If this is the case, try to pick another query, or a different document.
4. Review each document in the search results that is positioned above the base document, and decide whether you would consider it to be more relevant to the search query. Do not make your decision based on a statement like "it also has these words", but look at the content of the document and make your decision based on the whole content. It might be time consuming, but the decision on relevance is usually easy because you have a base.
5. Calculate the score for each query and each search engine, starting from the position of the base document in the search results. For each document above the base (from step 4) that you have concluded is equally relevant, or even more relevant, to the search query than the base document, deduct 1 from the position of the base document. For example, if the base document is in position 3, but the document in position 1 is more relevant to the search query than the base, and the document in position 2 is less relevant to the search query than the base, the score would be equal to 2. If both documents on top are as relevant to the search query as the base document, the score would equal 1. If both documents on top are less relevant to the search query than the base document, the score would be equal to 3.
6. Repeat steps 1-5 for different documents. Pick large and small documents, pick different formats, pick documents that are major landing pages on your site and documents that are hidden deep in the navigation path, i.e. test different scenarios.
7. Do steps 1-6 inviting different people for testing, but make sure they know the procedure and have no preference for one particular search engine. Let them test the area they are familiar with. The more tests you do - the better the result.
8. Put the obtained scores for each query in an Excel table and make a separate sheet for each tested search engine.
9. Now you can calculate the average score for each search engine (using the obtained scores for each base document) and it will reflect the relevance of the search engine.
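The scoring in steps 3-5 and the averaging in step 9 can be sketched in a few lines of code, if you prefer a script to a spreadsheet. The function names, the engine names, and the recorded scores below are hypothetical; only the rules (21 for "not found", deduct 1 for each equally or more relevant document above the base) come from the steps above.

```python
from typing import Dict, List, Optional, Sequence

NOT_FOUND_SCORE = 21  # step 3: base not on the first two pages of 10 results


def query_score(base_position: Optional[int],
                above_at_least_as_relevant: Sequence[bool]) -> int:
    """Score for one query on one search engine (steps 3-5).

    base_position: 1-based rank of the base document, or None if it was
        not found on the first two pages.
    above_at_least_as_relevant: one flag per result ranked above the base,
        True when the assessor judged it equally or more relevant.
    """
    if base_position is None or base_position >= NOT_FOUND_SCORE:
        return NOT_FOUND_SCORE
    if len(above_at_least_as_relevant) != base_position - 1:
        raise ValueError("need one judgment per result above the base")
    # Deduct 1 for every document that deserves to be above the base.
    return base_position - sum(above_at_least_as_relevant)


def average_score(scores: Sequence[int]) -> float:
    """Step 9: the average score for one search engine (lower is better)."""
    return sum(scores) / len(scores)


# The three cases from step 5, with the base document in position 3.
print(query_score(3, [True, False]))   # 2
print(query_score(3, [True, True]))    # 1
print(query_score(3, [False, False]))  # 3

# Hypothetical recorded scores, one list per tested search engine.
recorded: Dict[str, List[int]] = {
    "engine_a": [2, 1, 3, 21],
    "engine_b": [1, 1, 5, 2],
}
for engine, scores in recorded.items():
    print(engine, average_score(scores))
```

Note how a single "not found" (21) dominates the average for engine_a, which is one reason to look at more than the average alone.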
The average score may help you to sort the search engines by relevance; however, I recommend obtaining the following measures to see the picture more clearly.
10. Calculate the percentage of results where the search engine had a score of 1, the percentage where it had a score of 1-5, and the percentage where the score was greater than 10. In terms of relevance, the interpretation is the following: if you search for a target document with a very relevant query, the resulting table of percentages shows how often (on average) the search engine places that document as the first result, within the top five results, or not on the first page.
This table is very easy to understand and can be presented to your management, who will pay many thousands or maybe even millions of dollars for the search engine, and who really want to know why you picked this search engine and not that one.
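As a rough illustration, the three percentages from step 10 could be computed like this. The function name, the dictionary keys, and the sample scores are hypothetical.

```python
from typing import Dict, Sequence


def band_percentages(scores: Sequence[int]) -> Dict[str, float]:
    """Percent of queries with score 1, score 1-5, and score above 10."""
    if not scores:
        raise ValueError("no scores recorded")
    total = len(scores)

    def percent(count: int) -> float:
        return 100.0 * count / total

    return {
        "score_1": percent(sum(1 for s in scores if s == 1)),
        "score_1_to_5": percent(sum(1 for s in scores if 1 <= s <= 5)),
        "score_over_10": percent(sum(1 for s in scores if s > 10)),
    }


# Hypothetical scores for ten queries run against one search engine.
scores = [1, 1, 3, 7, 21, 12, 2, 1, 5, 21]
for band, value in band_percentages(scores).items():
    print(f"{band}: {value:.0f}%")
```

The bands overlap on purpose: a score of 1 is counted both as "first result" and as "within the top five", which matches how the table is read.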
You may ask: If I get the target (base) document on the very top, what about the other documents in the search results? Why are we not counting them? Can I look at each document, at least on the first page? You can look, but there is a big BUT. Imagine you got 10 results for your query. Number one is perfect and the other 9 are junk, returned only because they contain words from your query. Does your search engine work well or does it not? In my opinion, most likely, it does work well. You wanted to find the target document, right? You got it on top, what else do you want? If you are concerned that the other documents are not relevant, first make sure that there are other documents in your collection that are more relevant. Quite possibly there are none, and that means that your search engine works perfectly. However, if you see that another search engine found a better document, try to experiment with that document, using it as a base document for testing. You might discover some problems with the first search engine, or with the indexing process.
In conclusion I want to say that the problem is complex and this might not be the best way to measure the relevance, but it is easy to try and you can see the result immediately. I'll be happy to hear your critique and suggestions and, if you know a better method, please share it here.
Published: June 2011