htsearch

ht://Dig Copyright © 1995-2004 The ht://Dig Group
Please see the file COPYING for license information.


Search Method Used

The way htsearch performs it search and applies its ranking rules are fairly complicated. This is an attempt at explaining in global terms what goes on when htsearch searches.

htsearch gets a list of (case insensitive) words from the HTML form that invoked it. If htsearch was invoked with boolean expression parsing enabled, it will do a quick syntax check on the input words. If there are syntax errors, it will display the syntax error file that is specified with the syntax_error_file attribute.

If the boolean parser was not enabled, the list of words is converted into a boolean expression by putting either "and"s or "or"s between the words. (This depends on the search type.) Phrases within double quotes (") specify that the words must occur sequentially within the document.

If a word is immediately preceeded by a field specifer (title:, heading:, author:, keyword:, descr:, link:, url:) then it will only match documents in which the word occurred within field. For example, descr:foo only matches documents containing <meta value="description" value="... foo ...">. The link: field refers to the text in the hyperlinks to a document, rather than text within the document itself. Similarly url: (will eventually) refer to the actual URL of the document, not any of its contents. The prefixes exact: and hidden: are also accepted. The former (will) cause the fuzzy search algorithm not to be applied to this word, while the latter causes the word not to be displayed in the query string of the results page.

Each of the words in the list (but not within a phrase) is now expanded using the search algorithms that were specified in the search_algorithm attribute. For example, the endings algorithm will convert a word like "person" into "person or persons". In this fashion, all the specified algorithms are used on each of the words and the result is a new boolean expression.

The next step is to perform database lookups on the words in the expression. The result of these lookups are then passed to the boolean expression parser.

The boolean expression parser is a simple recursive descent parser with an operand stack. It knows how to deal with "not", "and", "or" and parenthesis. The result of the parser will be one set of matches.
Note that the operator "not" is used as the word 'without' and is binary: You can not write "cat and not dog" or just "not dog" but you can write "cat not dog".

At this point, the matches are ranked. The rank of a match is determined by the weight of the words that caused the match and the weight of the algorithm that generated the word. Word weights are generally determined by the importance of the word in a document. For example, words in the title of a document have a much higher weight than words at the bottom of the document.

Finally, when the document ranks have been determined and the documents sorted, the resulting matches are displayed. If paged output is required, only a subset of all the matches will be displayed.


Last modified: $Date: 2004/05/28 13:15:18 $