Monday, 24 January 2011

Search Engine Relevance

What is relevant? What answer are people truly seeking when they approach an Internet search engine? As anyone who has spent time at a reference desk, information center, or has otherwise acted as an intermediary in the information seeking process knows, determining relevance is a complex art. Two people asking the exact same question might consider completely different answers relevant.
Search engines attempt to use automated scientific methods and techniques to deliver relevant results. It is a complex process that sometimes works wonderfully and other times can be frustratingly (or amusingly) far off the mark. With huge search engine databases combined with extremely high numbers of daily searches, the search engines are in an excellent position to continually adapt their relevance techniques to try to achieve ever better success.
The rapid development cycles in the Internet business create a perpetual churn rate for relevance ranking choices. Search engines rank by their relevance scores, and even for those that offer other sort options, relevance ranking is the default.

RELEVANCE AND AUDIENCE

The ability to provide relevant results depends greatly on the kind of searches submitted to a search service and the type of searcher. It is no surprise that searches performed by information professionals differ from those submitted by the general public. And advanced searchers make up only a very small proportion of a search engine's traffic. Let's take a look at the two approaches.
For an example of searching in the style of the information professional, take a look at the July 1993 NLM Technical Bulletin. It had a Gold Standard Search where searchers had a chance to match wits with NLM experts in a MEDLINE search using the MEDLARS system. The NLM strategy for the Gold Standard Search for predictors of long-term success in the maintenance of weight loss involved several steps. Some steps of the search included queries such as the following:
(*obesity or *obesity, morbid) and th&/px or weight gain
exp outcome assessment (health care) or follow-up studies or evaluation studies or exp time
(tw)Long and term or all maintain: or maintenance
ts (la) :eng: and :human: (mh)
Contrast that with the kinds of searches today's search engines must cope with. Take a peek at what some of the submitted searches look like. MetaSpy (http://www.metaspy.com) provides a list of current MetaCrawler searches that is refreshed every 15 seconds. A recent look found these queries:
harry potter
total annihilation downloads
news on victims of fraud
Netscape
divorse
Just a bit less precise, perhaps? What aspect of the Netscape search is the person interested in? A link to the main company site, an evaluation of their Web browser, current ownerships, early history, former stock performance, or...?
Admittedly, most professional searchers do not use a Gold Standard search approach when they search the Internet. With no MeSH, there is no explosion of subject terms, subheadings, or major and minor headings. And there are some very sophisticated searchers among the general public. Even so, the point is that the majority of searches submitted to search engines use few terms, do not use advanced features, and offer little or no context.

DATABASE SCOPE

In addition, the database scope differs significantly between a bibliographic database and the search engine databases. MEDLINE is a well-indexed database with a controlled vocabulary. It indexes a finite group of documents within a specific discipline. The Internet is not indexed with a controlled vocabulary. It also has a very broad range of documents. The Internet offers access to Web pages, pictures, sound files, video files, software programs, etc.
Even for the general search engine databases that limit themselves to crawling Web pages, the range of documents is immense. Some Web pages have no words on them. Others load an entire dictionary file on a single page. There are pages in all sorts of languages using a variety of diacritics and character sets. Entire Web server log files are available on a single page. Other pages contain just one word, repeated endlessly. The Web covers a multitude of subjects, document types, languages, and audience levels. Some top-level pages on a site consist of a Flash introduction, where words only exist in the images. With no words to use, such a page challenges a search engine to appropriately rank it in terms of relevance.
With this variety as the backdrop, how can a search engine provide several relevant hits to the user entering a query like Netscape or harry potter or thousands of other similar searches? And on top of all these problems come the Web publishers and the search engine positioning industry who are busy reverse engineering the search engine company's relevance technology to serve their own ends.

THE POSITIONING BATTLE

Understanding the work of the search engine positioning industry helps in interpreting the relevance ranking the search engines deliver. At the most basic level, an organization that establishes a Web site would like people to find it easily. And for ecommerce Web sites, the more customers they can attract the better.
So it is no wonder that organizations work at making sure their Web site can be found. At the most basic level, the site needs to be included in the directories and search engines. If a site is not contained in the database, no relevance ranking can pull it up. But positioning concerns go far beyond just getting a site included.
First of all, a company would like its site to be the top, most relevant hit, when someone searches on the company name. For a long time, the main search engines failed in this regard, although, in the past year, all have greatly improved. Generally, this should be a relatively simple process.
Unfortunately, the vagaries of language work against that. A company with a unique name or a unique product name has a big head start. One with a commonly used name has a tougher time. In the case of a common name, there is no single site that should obviously be listed as more relevant than the others, unless additional search terms are added.
Secondly, companies would like to rise to the top of the relevance ranking when someone searches on their product names, their industry, generic terms related to their products, etc. Here is where the positioning industry can make a difference. A travel agent might like to gain first listing anytime someone searches for airline reservations. A positioning expert would look at their site and then optimize the site by adding keywords, metatags, different titles, creating gateway pages, and using other techniques.

After the changes, the page is resubmitted to the search engine. Then the positioning techniques used should increase the ranking of the travel agent's site in the results of a search using the terms airline reservations. Of course, there are plenty of other travel agents and travel agency Web sites that are also working on optimizing their search engine ranking.
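To see why this kind of optimization can move a site up the list, consider a minimal sketch of a purely word-based scorer. The scoring rule (counting query terms in the body, with extra weight for the title), the function names, and the page text are all illustrative assumptions, not any search engine's actual algorithm.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def on_page_score(query, title, body, title_weight=3):
    """Hypothetical on-page score: query-term hits in the body plus weighted title hits."""
    terms = tokenize(query)
    body_counts = Counter(tokenize(body))
    title_counts = Counter(tokenize(title))
    return sum(body_counts[t] + title_weight * title_counts[t] for t in terms)

# Made-up page text for a travel agent, before and after "optimization".
before = {"title": "Smith Travel", "body": "We book flights and hotels for you."}
after = {
    "title": "Airline Reservations - Smith Travel",
    "body": "Airline reservations made easy. Book airline reservations online.",
}

query = "airline reservations"
score_before = on_page_score(query, before["title"], before["body"])
score_after = on_page_score(query, after["title"], after["body"])
print(score_before, score_after)
```

Under a rule like this, nothing about the agency's actual service changed between the two versions. Only the words on the page did, which is exactly what makes on-page criteria easy to game.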
So who wins the search engine ratings game? It varies day to day and search engine to search engine. And a fair amount of money is riding on it. For sites that rank well on a particular search engine, their traffic (and thus their sales) may be doubled or more. If the relevance ranking algorithms for that search engine change, and their site is no longer near the top, you can imagine the company's concern.
But what is relevant to users who enter such a search? Searchers might actually be trying to find the American Airlines reservations page and not a travel agency site. Or they may be doing a research paper about the change that online airline reservations have made to the travel industry. The problem again is that different people ask the same question but hope for different answers. And the sites that have been optimized for a high relevance score may or may not provide that answer.
Even a distinctly named company could have a difficult time rising to the top of the search results. It should be the most relevant result from a search on its name. A competitor or a disgruntled ex-employee might optimize his page to show up higher in the relevance stack than the company's own site.

LINK ANALYSIS

To combat overzealous optimization, relevance based on link analysis offers a welcome respite. Spearheaded by Google, link analysis now plays an important role in all of the major search engines. The basic principle of link analysis is to rate highest those pages that the most other pages point to using the search term. In other words, if 200 pages point to Wiselynamed.com when they make a hyperlink on the word Wiselynamed, then that site will rank higher than Wiselynamedsucks.com if only 20 pages point to that URL.
The strength of the link analysis approach is that it makes it much more difficult to optimize inappropriately, since that would require changing other people's pages. Previous relevance criteria were all determined by words and word patterns on the page itself. Link analysis looks at many other pages to see where they point.
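The basic principle can be sketched in a few lines. This is a simplified illustration only: it counts the distinct pages whose link text contains the search term, and the function name, the tuple layout, and the example.com-style source pages are assumptions made for the sketch. Real link analysis, Google's included, weighs far more than a raw count.

```python
from collections import defaultdict

def rank_by_anchor_links(links, query):
    """Rank target URLs by the number of distinct pages linking to them
    with the query word in the link text.

    `links` is an iterable of (source_page, anchor_text, target_url) tuples.
    """
    wanted = query.lower()
    sources_by_target = defaultdict(set)
    for source, anchor, target in links:
        if wanted in anchor.lower().split():
            sources_by_target[target].add(source)
    return sorted(
        ((target, len(sources)) for target, sources in sources_by_target.items()),
        key=lambda item: item[1],
        reverse=True,
    )

# Made-up link data mirroring the 200-versus-20 example.
links = [(f"page{i}.example", "Wiselynamed", "wiselynamed.com") for i in range(200)]
links += [(f"critic{i}.example", "Wiselynamed", "wiselynamedsucks.com") for i in range(20)]

ranking = rank_by_anchor_links(links, "wiselynamed")
print(ranking)
```

Note that the owner of either site controls none of the inputs here. The counts come entirely from other people's pages, which is why this signal is harder to manipulate than on-page text.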
The link analysis approach is an excellent ranking mechanism for some searches, but not for all. New sites are at a distinct disadvantage. When someone puts up a new Web site, he can submit it to several search engines, but it takes time for a search engine to crawl new sites. And even if that site is indexed, there may not be other pages linking to the new site yet.
Take, for example, the new television series Jack of All Trades. The name is not distinctive and has even been used for a character on another television show. There was an official Web site for the show, but the week the show was launched, there was no listing for the official Web site within the directories of Yahoo!, Open Directory (and, thus, Lycos, HotBot, Netscape, and AOL), LookSmart, Infoseek, or Excite.

The search engine components fare better, but are somewhat hampered by the link analysis. On Google, a phrase search for "jack +of all trades" found two sites in its top ten that discuss the show, but neither linked to the official site. AltaVista's top ten found nothing on the show. Neither did the top ten results from Fast's All the Web, Lycos, HotBot, WebCrawler, Excite, or Yahoo!'s Inktomi. Both Infoseek's and Northern Light's top ten had a link to the star's home page which in turn linked to the official site, but neither Infoseek nor Northern Light found the page themselves.
The similar technique of popularity ranking, used by Direct Hit and its partner sites, also failed on this one. Ranking sites based on what other people clicked on when they used the same or a similar search does not help when there is a new site for the search term. Yet recognizing a shortcoming with these techniques does not mean they're not useful for many other kinds of searches.
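A minimal sketch shows why click-based popularity cannot help a brand-new site. The log format, function name, and URLs below are hypothetical, and this is not a description of how Direct Hit actually computes its rankings.

```python
from collections import Counter

def rank_by_clicks(click_log, query):
    """Order URLs by how often earlier searchers clicked them for this query.

    `click_log` is an iterable of (query, clicked_url) pairs.
    """
    wanted = query.lower()
    counts = Counter(url for q, url in click_log if q.lower() == wanted)
    return [url for url, _clicks in counts.most_common()]

# Made-up click history recorded before the show's official site existed.
click_log = [
    ("jack of all trades", "old-fan-page.example"),
    ("jack of all trades", "old-fan-page.example"),
    ("jack of all trades", "idiom-dictionary.example"),
]

ranking = rank_by_clicks(click_log, "jack of all trades")
print(ranking)
# A site with no recorded clicks never enters the ranking at all.
print("official-show-site.example" in ranking)
```

Until searchers have had a chance to find and click the new site by some other route, a ranking built this way has nothing to say about it.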

PRACTICAL RELEVANCE UPDATE

In a column from last year ("Rising Relevance in Search Engines," ONLINE, May 1999, p. 84), several practical techniques for providing relevant results were described. Some of the other practical relevance techniques of last year are still very much in force. Many of the search engines now offer results from several databases. RealNames, Direct Hit, Ask Jeeves, and other databases often serve up more relevant hits than the regular relevance ranked results from the search engine database of indexed Web pages.
A comparison of search engines run for last year's article involved a test on how well they each found the Web site for R.R. Bowker at http://www.bowker.com. Only Infoseek and Google ranked that page as the number one hit on a search on the word bowker. Rerunning the search this year found significant improvements and more reliance on link analysis.
Last year, AltaVista put Joe Bowker's page in first position, but this year, with their updated relevance ranking criteria, the R.R. Bowker site's top page ranked first. Northern Light, Excite, Infoseek, Google, and Lycos also all ranked www.bowker.com in their number one slot. Fast's All the Web had it at number two, with a subsidiary page for Books in Print as the top hit. HotBot placed an older URL (http://www.reedref.com/bowker/) in the top spot--a hit drawn from the Open Directory. Choosing the link to Top 7 sites for bowker would bring up the main page, drawing results from Direct Hit. As to the Inktomi results, they point to the same URL in the Open Directory.

SPELLING AND BAD QUERIES

Many general searches contain misspellings. Watch one of the sites like MetaSpy for a while to see what I mean. Searches containing typos like divorse, tietanic, messanger, alzhimers, and marshmello are going to have problems. With searches such as these, the search engines are challenged again to somehow provide relevant results to poorly-formed queries.
One approach, seen at AltaVista, is to suggest alternate spellings, even while retrieving results that include the misspellings. However, spell checking is an imprecise art. Search on divorse in AltaVista and the prompt reads "Spell check: did you mean diverse?", while it retrieves over 500 pages with that exact misspelling.
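The ambiguity is easy to reproduce with a generic string-similarity matcher. The sketch below uses Python's standard difflib module and a tiny made-up vocabulary. It is only an illustration of the general idea and makes no claim about how AltaVista's spell check works.

```python
import difflib

# Tiny made-up vocabulary; a real engine would draw on its index or query logs.
VOCABULARY = ["divorce", "diverse", "titanic", "messenger", "marshmallow"]

def suggest_spellings(query, vocabulary=VOCABULARY, max_suggestions=3):
    """Return vocabulary words whose spelling is close to the query."""
    return difflib.get_close_matches(
        query.lower(), vocabulary, n=max_suggestions, cutoff=0.6
    )

suggestions = suggest_spellings("divorse")
print(suggestions)
```

Both divorce and diverse come back as candidates for divorse, since each differs from it by a single letter. Spelling similarity alone cannot tell which word the searcher meant.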

As evidenced by the MetaSpy examples listed earlier, many searches only use one or two words. No matter how well-developed the relevance ranking, many searchers will go away dissatisfied because they misspelled a word or entered too few general words.
Not only does AltaVista offer spelling suggestions, it also offers related searches, which are suggested phrase searches that will help narrow the search. Go and HotBot also provide such suggestions. Northern Light sorts results into folders for further refinement. Oingo and Simpli.com both have technology for suggesting alternate meanings. Excite suggests other terms to add to the search.
Excite also features a More Like This link for each hit that finds related Web pages. This is a very useful feature for the searcher who may not be able to come up with additional terms or a better way of expressing the query. This ability is also available on AltaVista, Go, AOL Netfind, and Google (where it is known as GoogleScout). These approaches are one way the search engines aim to provide better relevance by helping searchers narrow their results.

THE SINGLE-ANSWER APPROACH

The search services that provide a single answer to a query take another approach to narrowing. Ask Jeeves is probably best-known in this field. A newcomer is Fact City, which is developing its offerings. In both cases, the idea is that the searcher receives one correct answer for their question rather than a long list. Ask Jeeves typically presents several alternatives, but Fact City is planning on providing just a single answer with no intermediate clicking involved.
It certainly saves time for the user--enter question, receive the answer. Sounds like what we aim for at our reference desks. As it is in person, so it is on the Net. Some questions need only a simple, single answer. "What is the call number for this book?" On the other hand, most questions benefit from more interaction and a more detailed answer.
In an article about Fact City (http://www.digitalmass.com/columns/internet/0119.html), Michelle Johnson uses the example of a question that appears to be in the single answer category--"What's the population of Boston?" Fact City offers 574,283 and cites the 1990 Census as its source. The same question at Ask Jeeves retrieves a "Where can I find the population of the city of Boston, MA?" question linked to the exact same answer.
Now maybe I'm splitting hairs here, but the students I see at the reference desk would not be satisfied with such an answer. It is 2000, after all, and not 1990. "Don't you have anything more recent?" A quick trip to the Census Bureau's Web site, followed by a click on the Estimates link under the People category would lead directly to a July 1, 1998 figure of 555,447.
Additionally, while the 1990 Census indeed gave 574,283 for the city of Boston, a good reference interview might find that the questioner wanted the population for the greater Boston area. In that case, the Census Bureau site estimates 5,633,060 people lived in the Boston-Worcester-Lawrence Consolidated Metropolitan Statistical Area while the Boston Primary Metropolitan Statistical Area had 3,289,096 as of July 1, 1998. Those pages even give yearly figures for 1990-1998 in case someone is looking at trends.

It is not that Ask Jeeves or Fact City gives an incorrect answer. But is it relevant? It is only relevant if the searcher is looking for nothing more recent than 1990 and if the question really only refers to the city itself.

BACK TO THE SEARCH ENGINES

How relevant are the results from other search engines? For the search boston population, AltaVista's top ten hits only contained two that looked as if they might have an answer, and both of those were dead links. Searching for population of boston found much more promising results. But while the Sonaco City Guide has a more recent figure than the 1990 Census, it is not as current as the 1998 data on the Census site (and it fails to give a date or source for its figure). The Boston Historic Population Trends sounded good, but it turned out to be a dead link.
The story was much the same on the other search engines, even those that also had directory and additional database content to offer. Lycos, Excite, Northern Light, Infoseek, Yahoo!, HotBot, and Google all found pages. But they either used the older figures, were for other Bostons, or had no population numbers at all. The closest came from Google, which found a very current page on the Census Bureau's site. It actually included 1999 estimates, but only for the whole U.S. It had no links to city or metropolitan area population estimates.
The single-answer search tools do give an accurate, but out-of-date, answer. Some of the results from search engines and directories provide pages with other old data. None of them found the detailed answers available at the Census Bureau's site, at least not with searches for Boston population or population of Boston.

THE BOTTOM LINE

Spending a bit of time with the search engines and evaluating them for relevance should make librarians feel secure in their jobs. There is an incredible amount of information available on the Net. The search engines are great tools for searching some of that information. The search engines continue to boast of improved relevance and are happy to offer individual examples of how their results are extremely relevant. And on some searches they are quite right.
But for all their improvements in relevance ranking techniques, there are plenty of searches where the techniques fail significantly. The technology will continue to improve. But while the science of relevance ranking may retrieve ever better possibilities, finding accurate and comprehensive answers will remain an art for some time to come.