<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<title>ht://Dig Frequently Asked Questions</title>
<link rel="stylesheet" href="css/htdig.css">
</head>
<body bgcolor="#eef7ff">
<h1>Frequently Asked Questions</h1>
<p>
ht://Dig Copyright © 1995-2004 <a href="THANKS.html">The ht://Dig Group</a><br>
Please see the file <a href="COPYING">COPYING</a> for
license information.
</p>
<hr noshade size=4>
<p class="main">This FAQ is compiled by the ht://Dig developers and the
most recent version is available at &lt;<a
href="http://www.htdig.org/FAQ.html">http://www.htdig.org/FAQ.html</a>&gt;.
Questions (and answers!) are greatly appreciated.
Please send questions and/or answers to the ht://Dig user
mailing list at: &lt;<a href="mailto:htdig-general@lists.sourceforge.net">htdig-general@lists.sourceforge.net</a>&gt;.
</p>
<h2>Questions</h2>

<h3>1. General</h3>
1.1. <a href="#q1.1">Can I search the internet with ht://Dig?</a><br>
1.2. <a href="#q1.2">Can I index the internet with ht://Dig?</a><br>
1.3. <a href="#q1.3">What's the difference between htdig and
ht://Dig?</a><br>
1.4. <a href="#q1.4">I sent mail to Andrew or Geoff or
Gilles, but I never got a response!</a><br>
1.5. <a href="#q1.5">I sent a question to the mailing list but I
never got a response!</a><br>
1.6. <a href="#q1.6">I have a great idea/patch for ht://Dig!</a><br>
1.7. <a href="#q1.7">Is ht://Dig Y2K compliant?</a><br>
1.8. <a href="#q1.8">I think I found a bug. What should I do?</a><br>
1.9. <a href="#q1.9">Does ht://Dig support phrase or near
matching?</a><br>
1.10. <a href="#q1.10">What are the practical and/or theoretical
limits of ht://Dig?</a><br>
1.11. <a href="#q1.11">Do any ISPs offer ht://Dig as part of
their web hosting services?</a><br>
1.12. <a href="#q1.12">Can I use ht://Dig on a commercial website?</a><br>
1.13. <a href="#q1.13">Why do you use a non-free product to
index PDF files?</a><br>
1.14. <a href="#q1.14">Why do you have all those SourceForge
logos on your website?</a><br>
1.15. <a href="#q1.15">My question isn't answered here. Where should I
go for help?</a><br>
1.16. <a href="#q1.16">Why do the developers get annoyed when
I e-mail questions directly to them rather than the mailing list?</a><br>
1.17. <a href="#q1.17">Why do replies to messages on the
mailing list only go to the sender and not to the list?</a><br>
1.18. <a href="#q1.18">Can I use ht://Dig to index and search
an SQL database?</a><br>

<hr noshade size=2>

<h3>2. Getting ht://Dig</h3>
2.1. <a href="#q2.1">What's the latest version of ht://Dig?</a><br>
2.2. <a href="#q2.2">Are there binary distributions of ht://Dig?</a><br>
2.3. <a href="#q2.3">Are there mirror sites for ht://Dig?</a><br>
2.4. <a href="#q2.4">Is ht://Dig available by ftp?</a><br>
2.5. <a href="#q2.5">Are patches around to upgrade between
versions?</a><br>
2.6. <a href="#q2.6">Is there a Windows 95/98/2000/NT
version of ht://Dig?</a><br>
2.7. <a href="#q2.7">Where can I find the documentation for my
version of ht://Dig?</a><br>

<hr noshade size=2>

<h3>3. Compiling</h3>
3.1. <a href="#q3.1">When I compile ht://Dig I get an error
about libht.a.</a><br>
3.2. <a href="#q3.2">I get an error about -lg</a><br>
3.3. <a href="#q3.3">I'm compiling on Digital Unix and I get
messages about "unresolved" and "db_open."</a><br>
3.4. <a href="#q3.4">I'm compiling on FreeBSD and I get lots
of messages about '___error' being unresolved.</a><br>
3.5. <a href="#q3.5">I'm compiling on HP/UX and I get a complaint about
"Large Files not supported."</a><br>
3.6. <a href="#q3.6">I'm compiling on Solaris and when I run the
programs I get complaints about not finding libstdc++.</a><br>
3.7. <a href="#q3.7">I'm compiling on IRIX and I'm having
database problems when I run the program.</a><br>
3.8. <a href="#q3.8">I'm compiling with gcc 3.2 and getting
all sorts of warnings/errors about ostream and such.</a><br>

<hr noshade size=2>

<h3>4. Configuration</h3>
4.1. <a href="#q4.1">How come I can't index my site?</a><br>
4.2. <a href="#q4.2">How can I change the output format of
htsearch?</a><br>
4.3. <a href="#q4.3">How do I index pages that start with '~'?</a><br>
4.4. <a href="#q4.4">Can I use multiple databases?</a><br>
4.5. <a href="#q4.5">OK, I can use multiple databases. Can I
merge them into one?</a><br>
4.6. <a href="#q4.6">Wow, ht://Dig eats up a lot of disk
space. How can I cut down?</a><br>
4.7. <a href="#q4.7">Can I use SSI or other CGIs in my
htsearch results?</a><br>
4.8. <a href="#q4.8">How do I index Word, Excel, PowerPoint
or PostScript documents?</a><br>
4.9. <a href="#q4.9">How do I index PDF files?</a><br>
4.10. <a href="#q4.10">How do I index documents in other
languages?</a><br>
4.11. <a href="#q4.11">How do I get rotating banner ads in
search results?</a><br>
4.12. <a href="#q4.12">How do I index numbers in documents?</a><br>
4.13. <a href="#q4.13">How can I call htsearch from a hypertext
link, rather than from a search form?</a><br>
4.14. <a href="#q4.14">How do I restrict a search to only meta
keywords entries in documents?</a><br>
4.15. <a href="#q4.15">Can I use meta tags to prevent htdig from
indexing certain files?</a><br>
4.16. <a href="#q4.16">How do I get htsearch to use the star image
in a different directory than the default /htdig?</a><br>
4.17. <a href="#q4.17">How do I get htdig or htsearch to rewrite
URLs in the search results?</a><br>
4.18. <a href="#q4.18">What are all the options in
htdig.conf, and are there others?</a><br>
4.19. <a href="#q4.19">How do I get more than 10 pages of
10 search results from htsearch?</a><br>
4.20. <a href="#q4.20">How do I restrict a search to only
certain subdirectories or documents?</a><br>
4.21. <a href="#q4.21">How can I allow people to search
while the index is updating?</a><br>
4.22. <a href="#q4.22">How can I get htdig to ignore the
robots.txt file or meta robots tags?</a><br>
4.23. <a href="#q4.23">How can I get htdig not to index
some directories, but still follow links?</a><br>
4.24. <a href="#q4.24">How can I get rid of duplicates in
search results?</a><br>
4.25. <a href="#q4.25">How can I change the scores in
search results, and what are the defaults?</a><br>
4.26. <a href="#q4.26">How can I get htdig not to index
JavaScript code or CSS?</a><br>

<hr noshade size=2>

<h3>5. Troubleshooting</h3>
5.1. <a href="#q5.1">I can't seem to index more than X documents
in a directory.</a><br>
5.2. <a href="#q5.2">I can't index PDF files.</a><br>
5.3. <a href="#q5.3">When I run "rundig," I get a message about
"DATABASE_DIR" not being found.</a><br>
5.4. <a href="#q5.4">When I run htmerge, it stops with an "out
of diskspace" message.</a><br>
5.5. <a href="#q5.5">I have problems running rundig from cron
under Linux.</a><br>
5.6. <a href="#q5.6">When I run htmerge, it stops with an
"Unexpected file type" message.</a><br>
5.7. <a href="#q5.7">When I run htsearch, I get lots of Internal
Server Errors (#500).</a><br>
5.8. <a href="#q5.8">I'm having problems with indexing words
with accented characters.</a><br>
5.9. <a href="#q5.9">When I run htmerge, it stops with a
"Word sort failed" message.</a><br>
5.10. <a href="#q5.10">When htsearch has a lot of matches, it runs
extremely slowly.</a><br>
5.11. <a href="#q5.11">When I run htsearch, it gives me a count of
matches, but doesn't list the matching documents.</a><br>
5.12. <a href="#q5.12">I can't seem to index documents with names
like left_index.html with htdig.</a><br>
5.13. <a href="#q5.13">I get Premature End of Script Headers errors
when running htsearch.</a><br>
5.14. <a href="#q5.14">I get Segmentation faults when running
htdig, htsearch or htfuzzy.</a><br>
5.15. <a href="#q5.15">Why does htdig 3.1.3 mangle URL parameters
that contain bare "&amp;" characters?</a><br>
5.16. <a href="#q5.16">When I run htmerge, it stops with an
"Unable to open word list file '.../db.wordlist'" message.</a><br>
5.17. <a href="#q5.17">When using Netscape, htsearch always returns the
"No match" page.</a><br>
5.18. <a href="#q5.18">Why doesn't htdig follow links to other
pages in JavaScript code?</a><br>
5.19. <a href="#q5.19">When I run htsearch from the web server,
it returns a bunch of binary data.</a><br>
5.20. <a href="#q5.20">Why are the betas of 3.2 so slow at indexing?</a><br>
5.21. <a href="#q5.21">Why does htsearch use ";" instead of
"&amp;" to separate URL parameters for the page buttons?</a><br>
5.22. <a href="#q5.22">Why does htsearch show the
"&amp;" character as "&amp;amp;" in search results?</a><br>
5.23. <a href="#q5.23">I get Internal Server or Unrecognized
character errors when running htsearch.</a><br>
5.24. <a href="#q5.24">I took some settings out of
my htdig.conf but they're still set.</a><br>
5.25. <a href="#q5.25">When I run htdig on my site,
it misses entire directories.</a><br>
5.26. <a href="#q5.26">What do all the numbers and symbols
in the htdig -v output mean?</a><br>
5.27. <a href="#q5.27">Why is htdig rejecting some of the
links in my documents?</a><br>
5.28. <a href="#q5.28">When I run htdig or htmerge, I get a
"DB2 problem...: missing or empty key value specified" message.</a><br>
5.29. <a href="#q5.29">When I run htdig on my site,
it seems to go on and on without ending.</a><br>
5.30. <a href="#q5.30">Why does htsearch no longer recognize
the -c option when run from the web server?</a><br>
5.31. <a href="#q5.31">I've set a config attribute exactly
as documented but it seems to have no effect.</a><br>
5.32. <a href="#q5.32">When I run htsearch, it gives a page
with an "Unable to read configuration file" message.</a><br>
5.33. <a href="#q5.33">How can I find out which version
of ht://Dig I have installed?</a><br>
5.34. <a href="#q5.34">When running htdig, I get "Error (0):
PDF file is damaged - attempting to reconstruct xref table..."</a><br>
5.35. <a href="#q5.35">When running htdig on Mandrake Linux,
I get "host not found" and "no server running" errors.</a><br>
5.36. <a href="#q5.36">When I run htsearch, it gives me the
list of matching documents, but no header or footer.</a><br>
5.37. <a href="#q5.37">When I index files with doc2html.pl,
it fails with the "UNABLE to convert" error.</a><br>
5.38. <a href="#q5.38">Why do my searches find search terms
in pathnames, or how do I prevent matching filenames?</a><br>
5.39. <a href="#q5.39">I set up an external parser but I still
can't index Word/Excel/PowerPoint/PDF documents.</a><br>

<hr noshade size=4>
<h2>Answers</h2>

<h3>1. General</h3>
<strong>1.1. <a name="q1.1">Can I search the internet with
ht://Dig?</a></strong><br>
<p>No, ht://Dig is a system for indexing and searching a
finite (not necessarily small) set of sites or an intranet. It
is not meant to replace any of the many internet-wide search
engines.</p>

<strong>1.2. <a name="q1.2">Can I index the internet with
ht://Dig?</a></strong><br>
<p>No, as above, ht://Dig is not meant as an
internet-wide search engine. While there is
<em>theoretically</em> nothing to stop you from indexing as
much as you wish, practical considerations (e.g. time, disk
space, memory, etc.) will limit this.</p>

<strong>1.3. <a name="q1.3">What's the difference between htdig and
ht://Dig?</a></strong><br>
<p>The complete ht://Dig package consists of several programs, one of
which is called "htdig." This program performs the "digging" or
indexing of the web pages. Of course an index doesn't do you much good
without a program to sort it, search through it, etc.</p>

<strong>1.4. <a name="q1.4">I sent mail to Andrew or Geoff
or Gilles, but I never got a response!</a></strong><br>
<p>Andrew no longer does much work on ht://Dig. He has started a
company, called <a href="http://www.contigo.com/">Contigo
Software</a> and is quite busy with that. To contact any of the
current developers, send mail to &lt;<a
href="mailto:htdig-dev@lists.sourceforge.net">htdig-dev</a>&gt;.
This list is intended primarily for the discussion of current
and future development of the software.</p>

<p>Geoff and Gilles are currently the maintainers of
ht://Dig, but they are both volunteers. So while they do
read all the e-mail they receive, they may not respond
immediately. Questions about ht://Dig in general, and especially
questions or requests for help in configuring the software,
should be posted to the &lt;<a
href="mailto:htdig-general@lists.sourceforge.net">htdig-general</a>&gt;
mailing list. When posting a followup to a message on the
list, you should use the "reply to all" or "group reply"
feature of your mail program, to make sure the mailing list
address is included in the reply, rather than replying only
to the author of the message.
See also question <a href="#q1.16">1.16</a> and the
<a href="http://www.htdig.org/mailarchive.html">mailing list</a>
page.</p>

<strong>1.5. <a name="q1.5">I sent a question to the mailing list but I
never got a response!</a></strong><br>
<p>Development of ht://Dig is done by volunteers. Since we all
have other jobs, it may take a while before someone gets back
to you. Please be patient and don't hound the volunteers with
direct or repeated requests. If you don't get a response after
3 or 4 days, then a reminder may help.
See also question <a href="#q1.16">1.16</a>.</p>

<strong>1.6. <a name="q1.6">I have a great idea/patch for
ht://Dig!</a></strong><br>
<p>Great! Development of ht://Dig continues through suggestions
and improvements from users. If you have an idea (or even better,
a patch), please send it to the ht://Dig mailing list so others
can use it. For suggestions on how to submit patches, please check
the <a href="dev/patches.html">Guidelines for
Patch Submissions</a>. If you'd like to make a feature request,
you can do so through the <a href="bugs.html">ht://Dig bug
database</a>.</p>

<strong>1.7. <a name="q1.7">Is ht://Dig Y2K compliant?</a></strong><br>
<p>
ht://Dig should be Y2K compliant since it never <em>stores</em> dates as
two-digit years. Under ht://Dig's copyright (GPL), there is no warranty
whatsoever as permitted by law. If you would like an iron-clad,
legally-binding guarantee, feel free to check the source code
itself. Versions prior to 3.1.2 did have a problem with the parsing
of the Last-Modified header returned by the HTTP server, which will
cause incorrect dates to be stored for documents modified after
February 28, 2000 (yes, it didn't recognize 2000 as a leap year).
Versions prior to 3.1.5 didn't correctly handle servers that return
two-digit years in the Last-Modified header, for years after 99.
These problems are fixed in the current release.
If you discover something else, please let us know!
</p>

<strong>1.8. <a name="q1.8">I think I found a bug. What should I
do?</a></strong><br>
<p>Well, there are probably bugs out there. You have two options
for bug-reporting. You can either mail the ht://Dig mailing list
at &lt;<a href="mailto:htdig-general@lists.sourceforge.net">htdig-general@lists.sourceforge.net</a>&gt; or
better yet, report it to the <a href="bugs.html">bug
database</a>, which ensures it won't
become lost amongst all of the other mail on the list.
Please try to include as much information as possible, including
the version of ht://Dig (see question <a href="#q5.33">5.33</a>),
the OS, and anything else that might be helpful.
Often, running the programs with one "-v" or more
(e.g. "-vvv") gives useful debugging information.
If you are unsure whether the problem is a bug or a configuration
problem, you should discuss the problem on
&lt;<a href="mailto:htdig-general@lists.sourceforge.net">htdig-general</a>&gt;
(after carefully reading the FAQ and searching the
<a href="http://www.htdig.org/mailarchive.html">mail archive</a>
and <a href="#q2.5">patch archive</a>,
of course)
to sort out what it is. The mailing list has a wider audience, so
you're more likely to get help with configuration problems there
than by reporting them to the bug database.
</p>

<p>Whether reporting problems to the bug database or mailing
list, we cannot stress enough the importance of
<strong>always</strong> indicating <strong>which version of
ht://Dig you are running</strong>.
See question <a href="#q5.33">5.33</a>. There
are still a lot of users, ISPs and software distributors using
older versions, and there have been a lot of bug fixes and
new features added in recent versions. Knowing which version
you're running is absolutely essential in helping to find a
solution. If you're unsure if your version is current, or what
fixes and features have been added in more recent versions,
please see the <a href="RELEASE.html">
release notes</a>. See also question <a href="#q2.1">2.1</a>.</p>

<strong>1.9. <a name="q1.9">Does ht://Dig support phrase or near
matching?</a></strong><br>
<p>Phrase searching has been added for the 3.2 release,
which is currently in the beta phase
(<a href="http://www.htdig.org/files/htdig-3.2.0b6.tar.gz">3.2.0b6</a>
as of this writing). Near or proximity matching will probably be added
in a future beta.
</p>

<strong>1.10. <a name="q1.10">What are the practical and/or theoretical
limits of ht://Dig?</a></strong><br>
<p>The code itself doesn't put any real limit on the number of
pages. There are several sites in the hundreds of thousands
of pages. As for practical limits, it depends a lot on how
many pages you plan on indexing. Some operating systems limit
files to 2 GB in size, which can become a problem with a large
database. There are also slightly different limits to each of
the programs. Right now htmerge performs a sort on the words
indexed. Most sort programs use a fair amount of RAM and
temporary disk space as they assemble the sorted list. The
htdig program stores a fair amount of information about the
URLs it visits, in part to only index a page once. This takes
a fair amount of RAM. With cheap RAM, it never hurts to throw
more memory at indexing larger sites. In a pinch, swap will
work, but it obviously really slows things down.</p>

<p>The 3.2 development code helps with many of these
limitations. In particular, it generates the databases on the
fly, which means you don't have to sort them before
searching. Additionally, the new databases are compressed
significantly, making them usually around 50% the size of
those in previous versions.</p>

<strong>1.11. <a name="q1.11">Do any ISPs offer ht://Dig as part of
their web hosting services?</a></strong><br>
<p>Yes. A list of such ISPs is <a href="isp.html">available
here</a>.
</p>

<strong>1.12. <a name="q1.12">Can I use ht://Dig on a
commercial website?</a></strong><br>
<p>Sure! The <a href="COPYING">GNU Library General Public License (LGPL)</a> has no
restrictions on use. So you are free to use ht://Dig however you
want on your website, personal files, etc. The license only
restricts distribution. So if you're planning on a
commercial software product that includes ht://Dig, you will
have to provide source code including any modifications upon
request.
</p>

<strong>1.13. <a name="q1.13">Why do you use a non-free
product to index PDF files?</a></strong><br>
<p>
We don't. You <em>can</em> use the "acroread"
program to index PDF files, but this is no longer
recommended. Initially this program was the only reliable
way to extract data from PDF files. However, the <a
href="http://www.foolabs.com/xpdf/">xpdf package</a> is a
reliable, free software package for indexing and viewing PDF
files. See question <a href="#q4.9">4.9</a> for details on
using xpdf to index PDF files. We do not advocate using
acroread any longer because it is a proprietary product.
Additionally it is no longer reliable at extracting data.
</p>

<strong>1.14. <a name="q1.14">Why do you have all those SourceForge
logos on your website?</a></strong><br>
<p><a href="http://sourceforge.net/">SourceForge</a> is a
new service for open source software. You can host your
project on SourceForge servers and use many of their
services like bug-tracking and the like. The ht://Dig
project currently uses SourceForge for a mirror of the main
website at <a
href="http://htdig.sourceforge.net/">htdig.sourceforge.net</a>
as well as a mirror of ht://Dig releases and contributed
work.
</p>

<strong>1.15. <a name="q1.15">My question isn't answered here.
Where should I go for help?</a></strong><br>
<p>
Before you go anywhere else, think of other ways of phrasing your
question. Many times people have questions that are very similar to
other FAQs, and while we try to phrase the queries in the FAQ closely to
the most common questions, we obviously can't get them all! The next
place to check is the documentation itself. In particular, take a
look at the list of configuration attributes, particularly the list <a
href="cf_byname.html">by name</a> and <a
href="cf_byprog.html">by program</a>. There are a
lot of them, but chances are there's something that might fit your needs.
You should also take a close look at all of
<a href="htsearch.html">htsearch</a>'s
documentation, especially the section "HTML form" which describes
all the CGI input parameters available for controlling the search,
including limiting the search to certain subdirectories.
You can find the answer yourself to almost all "how can I..."
questions by exploring what the various configuration attributes
and search form input parameters can do.
Also have a look at our collection of
<a href="http://www.htdig.org/contrib/guides.html">Contributed Guides</a>
for help on things like
<a href="http://www.htdig.org/files/contrib/guides/htmlhelp.html">HTML
forms</a> and CGI, tutorials on installing, configuring, using, and
internationalizing ht://Dig, as well as using PHP with htsearch.
</p>
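<p>
As a rough illustration of the "HTML form" input parameters mentioned
above, the following sketch shows a search form that limits results to
one subdirectory. The <code>words</code>, <code>method</code> and
<code>restrict</code> input names are htsearch parameters covered in its
documentation; the <code>/cgi-bin/htsearch</code> path and the
<code>www.example.com</code> URL are assumptions and will likely differ
on your installation.
</p>
<pre>
&lt;!-- Hypothetical form: adjust the action path to wherever htsearch is installed. --&gt;
&lt;form method="get" action="/cgi-bin/htsearch"&gt;
  &lt;!-- Only return documents whose URL matches this pattern (example URL). --&gt;
  &lt;input type="hidden" name="restrict" value="http://www.example.com/docs/"&gt;
  &lt;!-- Require all of the search words to match. --&gt;
  &lt;input type="hidden" name="method" value="and"&gt;
  Search the documentation:
  &lt;input type="text" size="30" name="words" value=""&gt;
  &lt;input type="submit" value="Search"&gt;
&lt;/form&gt;
</pre>
<p>
Check the htsearch documentation for your version before relying on any
of these inputs, since the set of supported parameters may vary between
releases.
</p>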
<p>
Finally, if you've exhausted all the online documentation, there's the
<a href="mailto:htdig-general@lists.sourceforge.net">htdig-general</a> mailing list.
There are hundreds of users subscribed and chances are good that someone
has had a similar problem before or can suggest a solution.
</p>

<strong>1.16. <a name="q1.16">Why do the developers get annoyed when
I e-mail questions directly to them rather than the mailing list?</a></strong><br>
<p>The <a href="mailto:htdig-general@lists.sourceforge.net">htdig-general</a>
mailing list exists for dealing with questions about the
software, its installation, configuration, and problems with
it. E-mailing the developers directly circumvents this forum
and its benefits. Most annoyingly, it puts the onus on an
individual to answer, even if that individual is not the best or
most qualified person to answer. This is not a one-man show. It
also circumvents the <a href="http://www.htdig.org/mailarchive.html">archiving
mechanism</a> of the mailing list,
so not only do subscribers not see these private messages
and replies, but future users who may run into the exact same
problems won't see them. Remember that the developers are all
volunteers, and they don't work for free for your benefit alone.
They volunteer for the benefit of the whole ht://Dig user
community, so don't expect extra support from them outside of
that community. See also questions <a href="#q1.4">1.4</a>
and <a href="#q1.5">1.5</a>.</p>

<p>Note also that when you reply to a message on the list, you
should make sure the reply gets on the list as well, provided your
reply is still on-topic. See question <a href="#q1.17">1.17</a>
below.</p>

<strong>1.17. <a name="q1.17">Why do replies to messages on the
mailing list only go to the sender and not to the list?</a></strong><br>
<p>The simple answer is that, unlike some mailing lists, the
lists on SourceForge don't force replies back on the list. This
is actually a good thing, because you can reply to the sender
directly if you want to, or you can use your mail program's
"reply to all" capability (sometimes called "group reply")
to reply to the mailing list as well. It does mean you have to
think before you post a reply, but some would argue that this
is a good thing too. There are some compelling reasons to try to
keep on-topic discussions on the list, though (see questions
<a href="#q1.16">1.16</a> and <a href="#q1.4">1.4</a> above).</p>

<p>The technical answer is
<a href="http://sourceforge.net/docman/display_doc.php?docid=6693&amp;group_id=1">
SourceForge's policy on Reply-To: munging</a>, where you'll
find all the gory details about the pros and cons of the two
common ways of setting up a mailing list, and why SourceForge
turns off Reply-To munging. It so happens that the ht://Dig
maintainers agree with SourceForge's policy on this, even if
we did have a say in the matter. So, counterarguments to this
policy are rather moot, and it would be better not to waste
any more mailing list bandwidth debating them. (We've heard
all the arguments anyway.)</p>

<strong>1.18. <a name="q1.18">Can I use ht://Dig to index and search
an SQL database?</a></strong><br>
<p>You can if your database has a web-based front end that can
be "spidered" by ht://Dig. The requirement is that every search
result must resolve to a unique URL which can be accessed via
HTTP. The htdig program uses these URLs, which you feed it via
the <a href="attrs.html#start_url">start_url</a> attribute, to
fetch and index each page of information. The search results
will then give a list of URLs for all pages that match the
search terms. If you don't have such a front end to your
database, or the search results must be given as something
other than URLs, then ht://Dig is probably not the best way of
dealing with this problem: you may be better off using an SQL
query engine that works directly on your own database, rather
than building a separate ht://Dig database for searching.</p>

<p>Ted Stresen-Reuter had the following tips: "In my case,
because I like htdig's ability to rank results (and that
ranking can be modified), I created an index page that simply
walks through each record and indexes each record (with
<em>next</em> and <em>previous</em> links so the spider can
read all the records). And then I do one other thing: I make
the <code>&lt;title&gt;</code> tag start with the unique ID
of each record. Then, when I'm parsing the search results, I
do a lookup on the database using the title tag as the key."</p>
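<p>A record page following Ted's approach might look roughly like the
sketch below. The record ID (1042), the file names and the record text
are all made up for illustration; only the idea of starting the title
with the unique ID and linking to the neighbouring records comes from
the tip above.</p>
<pre>
&lt;html&gt;
&lt;head&gt;
&lt;!-- The title starts with the record's unique ID (1042 is hypothetical). --&gt;
&lt;title&gt;1042 Widget, blue, large&lt;/title&gt;
&lt;/head&gt;
&lt;body&gt;
&lt;p&gt;The full text of record 1042 goes here so htdig can index it.&lt;/p&gt;
&lt;!-- Links to the neighbouring records let the spider reach every record. --&gt;
&lt;a href="record-1041.html"&gt;previous&lt;/a&gt;
&lt;a href="record-1043.html"&gt;next&lt;/a&gt;
&lt;/body&gt;
&lt;/html&gt;
</pre>
<p>Whatever post-processes the search results can then take the leading
number from each result's title and use it as the lookup key in the
database.</p>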

<hr noshade size=2>

<h3>2. Getting ht://Dig</h3>
<strong>2.1. <a name="q2.1">What's the latest version of ht://Dig?</a></strong><br>
<p>The latest version is 3.1.6 as of this writing. A beta
version of the 3.2 code,
<a href="http://www.htdig.org/files/htdig-3.2.0b6.tar.gz">3.2.0b6</a>,
is also available, for those who wish to test it.
You can find out about the latest version by reading the
<a href="RELEASE.html">release
notes</a>.</p>

<p><strong>Note</strong> that if you're running any version
older than 3.1.5 (including 3.2.0b1) on a public web site,
you should upgrade immediately, as older versions have a
rather serious security hole which is explained in detail in
this <a
href="http://www.htdig.org/htdig-dev/2000/02/0272.html">advisory</a>
which was sent to the Bugtraq mailing list.
Another slightly less serious, but still troubling security hole
exists in 3.1.5 and older (including 3.2.0b3 and older), so you
should upgrade if you're running one of these. You can view details
on this vulnerability from the
<a href="http://www.securityfocus.com/bid/3410">Bugtraq mailing list</a>.
If you're unsure of which version you're running, see question
<a href="#q5.33">5.33</a>.</p>

<strong>2.2. <a name="q2.2">Are there binary distributions of
ht://Dig?</a></strong><br>
<p>We're trying to get consistent binary distributions for
popular platforms. Contributed binary releases will go in <a
href="http://www.htdig.org/files/contrib/binaries/">
the contributed binaries section</a>
and contributions should be mentioned to the <a
href="mailto:htdig-general@lists.sourceforge.net">htdig-general</a>
mailing list.</p>

<p>Anyone who would like to make consistent binary
distributions of ht://Dig should at least sign up for the <a
href="mailing.html">htdig-announce mailing list</a>.</p>

<strong>2.3. <a name="q2.3">Are there mirror sites for ht://Dig?</a></strong><br>
<p>Yes, see our <a href="mirrors.html">mirrors
listing</a>. If you'd like to mirror the site, please see
the <a href="howto-mirror.html">mirroring guide</a>.</p>

<strong>2.4. <a name="q2.4">Is ht://Dig available by ftp?</a></strong><br>
<p>Yes. You can find the current versions and several older
versions at various &lt;<a
href="mirrors.html">mirror sites</a>&gt;
as well as the other locations mentioned in the <a
href="where.html">download page</a>.</p>

<strong>2.5. <a name="q2.5">Are patches around to upgrade between
versions?</a></strong><br>
<p>Most versions are also distributed as a patch to the previous
version's source code. The most recent exception to this was
version 3.1.0b1. Since this version switched from the GDBM
database to DB2, the new database package needed to be shipped
with the distribution. This made the potential patch almost as large
as the regular distribution. Update patches resumed with version
3.1.0b2. You can also find archives of patches submitted to
the htdig mailing lists, to fix specific bugs or add features,
at Joe Jah's <a href="ftp://ftp.ccsf.org/htdig-patches/">
htdig-patches ftp site</a>.</p>

<strong>2.6. <a name="q2.6">Is there a Windows 95/98/2000/NT
version of ht://Dig?</a></strong><br>
<p>The ht://Dig package can be built on the Win32 platform when
using the Cygwin package. For details, see the contributed guide,
<a href="http://www.htdig.org/files/contrib/guides/Installing_on_Win32.html">
<em>Idiot's Guide to Installing ht://Dig on Win32</em></a>.
</p>
<p>
As of the <a href="http://www.htdig.org/files/htdig-3.2.0b5.tar.gz">3.2.0b5</a>
beta release, there is also native Win32 support, thanks to
Neal Richter. (Installation docs will be written soon...)
</p>

<strong>2.7. <a name="q2.7">Where can I find the documentation for my
|
|
version of ht://Dig?</a></strong><br>
|
|
<p>The documentation for the most recent stable release is always
|
|
posted at <a href="http://www.htdig.org/">www.htdig.org</a>.
|
|
The documentation for the latest beta release can be found at
|
|
<a href="http://www.htdig.org/dev/htdig-3.2/">http://www.htdig.org/dev/htdig-3.2/</a>.
|
|
In all releases, the documentation is included in the
|
|
<strong>htdoc</strong> subdirectory of the source distribution, so
|
|
you always have access to the documentation for your current version.
|
|
</p>
|
|
|
|
<hr noshade size=2>
|
|
|
|
<h3>3. Compiling</h3>
|
|
<strong>3.1. <a name="q3.1">When I compile ht://Dig I get an error about
|
|
libht.a</a></strong><br>
|
|
<p>This usually indicates that either libstdc++ is not installed or
|
|
is installed incorrectly. To get libstdc++ or any other GNU tool,
|
|
check
|
|
<a
|
|
href="ftp://ftp.gnu.org/gnu/">ftp://ftp.gnu.org/gnu/</a>.
|
|
Note that the most recent versions of gcc come with
|
|
libstdc++ included and are available from <a
|
|
href="http://gcc.gnu.org/">http://gcc.gnu.org/</a>.</p>
|
|
|
|
<strong>3.2. <a name="q3.2">I get an error about -lg</a></strong><br>
|
|
<p>This is due to a bug in the Makefile.config.in of version
|
|
3.1.0b1. Remove all flags "-ggdb" in Makefile.config.in. Then
|
|
type "./config.status" to rebuild the Makefiles and
|
|
recompile. This bug is fixed in version 3.1.0b2.</p>
|
|
|
|
<strong>3.3. <a name="q3.3">I'm compiling on Digital Unix and I get
|
|
messages about "unresolved" and "db_open."</a></strong><br>
|
|
<p>Answer contributed by George Adams
|
|
<learningapache@my-dejanews.com></p>
|
|
|
|
<p>What you're seeing are problems related to the Berkeley DB
|
|
library. htdig needs a fairly modern version of db, which is
|
|
why it ships with one that works. (see that -L../db-2.4.14/dist
|
|
line? That's where htdig's db library is).<br>
|
|
|
|
The solution is to modify the c++ command so it explicitly
|
|
references the correct libdb.a. You can do this by replacing
|
|
the "-ldb" directive in the c++ command with
|
|
"../db-2.4.14/dist/libdb.a". This problem has been resolved as of
|
|
version 3.1.0.</p>
|
|
|
|
<strong>3.4. <a name="q3.4">I'm compiling on FreeBSD and I get lots
|
|
of messages about '___error' being unresolved.</a></strong><br>
|
|
<p>Answer contributed by Laura Wingerd <laura@perforce.com><br>
|
|
I got a clean build of htdig-3.1.2 on FreeBSD 2.2.8 by taking
|
|
-D_THREAD_SAFE out of CPPFLAGS, and setting LIBS to null, in
|
|
db/dist/configure.</p>
|
|
|
|
<strong>3.5. <a name="q3.5">I'm compiling on HP/UX and I get a complaint about
|
|
"Large Files not supported."</a></strong><br>
|
|
<p>The db/ package, included with ht://Dig, seems to be unable to complete
|
|
on HP/UX 10.20 in particular. After running the top-level configure
|
|
script, cd into db/dist and type:</p>
|
|
<code>./configure --disable-bigfile</code>
|
|
<p>Then continue with the normal compilation.</p>
|
|
|
|
<strong>3.6. <a name="q3.6">I'm compiling on Solaris and when I run the
|
|
programs I get complaints about not finding libstdc++.</a></strong><br>
|
|
<p>Answer contributed by Adam Rice <adam@newsquest.co.uk></p>
|
|
<p>The problem is that the Solaris loader can't find the library. The
|
|
best thing to do is set the LD_RUN_PATH environment variable <em>during compile</em>
|
|
to the directory where libstdc++.so.2.8.1.1 lives. This tells the linker
|
|
to search that directory at runtime.
|
|
</p>
|
|
|
|
<p>Note that LD_RUN_PATH is not to be confused with LD_LIBRARY_PATH.
|
|
The latter is parsed at run-time, while LD_RUN_PATH essentially
|
|
compiles in a library path into the executable, so that it doesn't
|
|
need a LD_LIBRARY_PATH setting to find its libraries. This allows
|
|
you to avoid all the complexities of setting an environment
|
|
variable for a CGI program run from the server. If all else fails,
|
|
you can always run your programs from wrapper shell scripts that
|
|
set the LD_LIBRARY_PATH environment variable appropriately.</p>
|
|
|
|
<p>Note also that while this answer is specific to Solaris, it may
|
|
work for other OSes too, so you may want to give it a try. However,
|
|
not all versions of the <code>ld</code> program on all OSes support
|
|
the LD_RUN_PATH environment variable, even if these systems support
|
|
shared libraries. Try "<code>man ld</code>" on your system to
|
|
find out the best way of setting the runtime search path for shared
|
|
libraries. If <code>ld</code> doesn't support LD_RUN_PATH, but does
|
|
support the <code>-R</code> option, you can add one or more of these
|
|
options to LIBDIRS in Makefile.config before running make on a 3.1.x
|
|
release. (For a 3.2 beta release, you can add these options to the
|
|
LDFLAGS environment variable before you run ./configure.)</p>
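<p>As a rough sketch of the approach described above, assuming
libstdc++ lives in /usr/local/lib (a hypothetical path, substitute
your own), a build under a Bourne-type shell might look like this:</p>
<pre>
# /usr/local/lib is an assumed location for libstdc++
LD_RUN_PATH=/usr/local/lib
export LD_RUN_PATH
./configure
make
</pre>
<p>For a 3.2 beta release on a system where <code>ld</code> supports
<code>-R</code> but not LD_RUN_PATH, the equivalent sketch would be:</p>
<pre>
# same assumed library directory as above
LDFLAGS=-R/usr/local/lib ./configure
make
</pre>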
|
|
|
|
<strong>3.7. <a name="q3.7">I'm compiling on IRIX and I'm having
|
|
database problems when I run the program.</a></strong><br>
|
|
<p>
|
|
It is not entirely clear why these problems occur, though
|
|
they seem to only happen when older compilers are
|
|
used. Several people have reported that the problems go away
|
|
when using the latest version of <a href="http://gcc.gnu.org/">gcc</a>.
|
|
</p>
|
|
|
|
<strong>3.8. <a name="q3.8">I'm compiling with gcc 3.2 and getting
|
|
all sorts of warnings/errors about ostream and such.</a></strong><br>
|
|
<p>
|
|
With versions before 3.2.0b5,
|
|
you should use the following command to configure the ht://Dig
|
|
package so it can be built with gcc 3.2:
|
|
<pre>
CXXFLAGS=-Wno-deprecated CPPFLAGS=-Wno-deprecated ./configure
</pre>
|
|
</p>
|
|
|
|
<hr noshade size=2>
|
|
|
|
<h3>4. Configuration</h3>
|
|
<strong>4.1. <a name="q4.1">How come I can't index my site?</a></strong><br>
|
|
<p>There are a variety of reasons ht://Dig won't index a
|
|
site. To get to the bottom of things, it's advisable to turn on
|
|
some debugging output from the htdig program. When running from
|
|
the command-line, try "-vvv" in addition to any other
|
|
flags. This will add debugging output, including the responses
|
|
from the server.</p>
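<p>For instance, assuming your configuration file is
/usr/local/htdig/conf/htdig.conf (a hypothetical location), a verbose
run that saves the debugging output to a file for later inspection
might look like this:</p>
<pre>
# the config file path and log file name are assumptions
htdig -vvv -c /usr/local/htdig/conf/htdig.conf > /tmp/htdig.log 2>&1
</pre>
<p>Note that this performs a real dig, and that the log file can grow
quite large for a big site.</p>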
|
|
<p>See also questions <a href="#q5.25">5.25</a>,
|
|
<a href="#q5.27">5.27</a>, <a href="#q5.16">5.16</a> and
|
|
<a href="#q5.18">5.18</a>.</p>
|
|
|
|
<strong>4.2. <a name="q4.2">How can I change the output format of htsearch?</a></strong><br>
|
|
<p>Answer contributed by: Malki Cymbalista <Malki.Cymbalista@weizmann.ac.il></p>
|
|
|
|
<p>You can change the output format of htsearch by creating different
|
|
header, footer and result files that specify how you want the output
|
|
to look. You then create a configuration file that specifies which
|
|
files to use. In the html document that links to the search, you
|
|
specify which configuration file to use.</p>
|
|
|
|
<p>So the configuration file would have the lines:</p>
|
|
<pre>
search_results_header: ${common_dir}/ccheader.html
search_results_footer: ${common_dir}/ccfooter.html
template_map: Long long builtin-long \
              Short short builtin-short \
              Default default ${common_dir}/ccresult.html
template_name: Default
</pre>
|
|
<p>You would also put into the configuration file any other lines from the
|
|
default configuration file that apply to htsearch.</p>
|
|
|
|
<p>The files ${common_dir}/ccheader.html and
|
|
${common_dir}/ccfooter.html and ${common_dir}/ccresult.html would be
|
|
tailored to give the output in the desired format.</p>
|
|
|
|
<p>Assuming your configuration file is called cc.conf, the html file that
|
|
links to the search has to set the config parameter equal to cc. The
|
|
following line would do it:<br>
|
|
<code><input type="hidden" name="config" value="cc"></code></p>
|
|
|
|
<p><strong>Note:</strong> Don't just add the line above to your
|
|
<a href="hts_form.html">search form</a>
|
|
without first checking whether there is already a similar
|
|
line giving the config attribute a different value. The sample
|
|
search.html form that comes with the package includes a line
|
|
like this already, giving "config" the default value of "htdig".
|
|
If it's there, modify it instead of adding another definition.
|
|
The config input parameter doesn't need to be hidden either, and
|
|
you may want to define it as a pull-down list to select different
|
|
databases (see question <a href="#q4.4">4.4</a>).</p>
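<p>As a sketch of such a pull-down list, assuming two configuration
files named htdig.conf and cc.conf (the option labels are made up for
illustration), the search form could contain:</p>
<pre>
<select name="config">
<option value="htdig" selected>Whole site
<option value="cc">Computing Center
</select>
</pre>
<p>This replaces the hidden "config" input rather than being added
alongside it.</p>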
|
|
|
|
<strong>4.3. <a name="q4.3">How do I index pages that start with '~'?</a></strong><br>
|
|
<p>
|
|
ht://Dig should index pages starting with '~' as if it were another
|
|
web browser. If you are having problems with this, check your server
|
|
log files to see what file the server is attempting to return.
|
|
</p>
|
|
|
|
<strong>4.4. <a name="q4.4">Can I use multiple databases?</a></strong><br>
|
|
<p>Yes, though you may find it easier to have one larger
|
|
database and use restrict or exclude fields on searches. To use
|
|
multiple databases, you will need a config file for each
|
|
database. Then each file will set the
|
|
<a href="attrs.html#database_dir">database_dir</a> or
|
|
<a href="attrs.html#database_base">database_base</a> attribute to
|
|
change the name of the databases. The config file is selected
|
|
by the <strong>config</strong> input field in the search form.
|
|
<br>See also questions <a href="#q4.2">4.2</a> and
|
|
<a href="#q4.20">4.20</a>.</p>
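<p>A minimal sketch of two such configuration files follows. The
file names, directories and start URLs are all hypothetical, and each
file would also carry whatever other attributes you normally set:</p>
<pre>
# sales.conf (hypothetical)
database_dir: /var/lib/htdig/sales
start_url: http://www.example.com/sales/

# support.conf (hypothetical)
database_dir: /var/lib/htdig/support
start_url: http://www.example.com/support/
</pre>
<p>A search form would then pass "sales" or "support" as the value of
the <strong>config</strong> input field.</p>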
|
|
|
|
<strong>4.5. <a name="q4.5">OK, I can use multiple databases. Can I
|
|
merge them into one?</a></strong><br>
|
|
<p>As of version 3.1.0, you can do this with the -m option to
|
|
<a href="htmerge.html">htmerge</a>.</p>
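<p>For example, assuming two hypothetical configuration files
main.conf and other.conf, a command along these lines would merge the
databases described by other.conf into those described by main.conf
(check the htmerge documentation for your version before relying on
it):</p>
<pre>
# both config file paths are assumptions
htmerge -c /usr/local/htdig/conf/main.conf -m /usr/local/htdig/conf/other.conf
</pre>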
|
|
|
|
<strong>4.6. <a name="q4.6">Wow, ht://Dig eats up a lot of disk
|
|
space. How can I cut down?</a></strong><br>
|
|
<p>There are several ways to cut down on disk space. One is
|
|
not to use the "-a" option, which creates work copies of the
|
|
databases. Naturally this essentially doubles the disk
|
|
usage. If you don't need to index and search at the same time, you can
|
|
ignore this flag.</p>
|
|
|
|
<p>If you are running 3.2.0b5 or higher and don't have
|
|
<a href="dev/htdig-3.2/attrs.html#wordlist_compress_zlib">compression</a>
|
|
turned on, then turning that on will also save considerable space.</p>
|
|
|
|
<p>Changing configuration variables can also help cut
|
|
down on disk usage. Decreasing
|
|
<a href="attrs.html#max_head_length">max_head_length</a> and
|
|
<a href="attrs.html#max_meta_description_length">max_meta_description_length</a>
|
|
will cut down on the size of the excerpts stored (in fact, if you
|
|
don't have
|
|
<a href="attrs.html#use_meta_description">use_meta_description</a>
|
|
set, you can set
|
|
max_meta_description_length to 0!).</p>
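<p>As an illustration only, a configuration that trims the stored
excerpts might contain the lines below. The value 200 is an arbitrary
example, and the second line assumes you do not have
use_meta_description set:</p>
<pre>
max_head_length: 200
max_meta_description_length: 0
</pre>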
|
|
|
|
<p>If you are running 3.2.0b6 or higher, you can turn off
|
|
<a href="dev/htdig-3.2/attrs.html#store_phrases">store_phrases</a>. This cuts the
|
|
database size by about 60%, at the expense of severely limiting
|
|
the effectiveness of phrase searches. It also reduces digging time
|
|
slightly.</p>
|
|
|
|
<p>Other techniques include removing the db.wordlist file and adding
|
|
more words to the <a href="attrs.html#bad_words">bad_words</a>
|
|
file.</p>
|
|
|
|
<p>The University of Leipzig has published
|
|
<a href="http://wortschatz.uni-leipzig.de/html/wliste.html">
|
|
word lists</a> containing the 100, 1000 and 10000 most often used
|
|
words in English, German, French and Dutch. No copyrights or
|
|
restrictions seem to be applied to the downloadable files. These
|
|
can be very handy when putting together a bad_words file. Thanks
|
|
to Peter Asemann for this tip.</p>
|
|
|
|
<strong>4.7. <a name="q4.7">Can I use SSI or other CGIs in my
|
|
htsearch results?</a></strong><br>
|
|
<p>Not really. Apache will not parse CGI output for SSI
|
|
statements (See the <a
|
|
href="http://www.apache.org/docs/misc/FAQ.html#ssi-part-iii">Apache
|
|
FAQ</a>). Thus, SSI markup in htsearch output is not processed,
and htsearch results cannot include other
CGIs. However, it is possible to do it the other way round:
|
|
you can have the htsearch results included in your dynamic
|
|
page.
|
|
</p>
|
|
<p>
|
|
The Apache project has mentioned that this will be a
|
|
feature added to the Apache 2.0 version, currently in development.
|
|
</p>
|
|
|
|
<p>The easiest approach in the meantime is using SSI with
|
|
the help of the <a
|
|
href="attrs.html#script_name">script_name</a> configuration
|
|
file attribute. See the <code>contrib/scriptname</code>
|
|
directory for a small example using SSI.</p>
|
|
|
|
<p>For CGI and PHP, you need a "wrapper" script to
|
|
do that. For perl script examples, see the files in
|
|
<code>contrib/ewswrap</code>. The PHP guide (see <a
|
|
href="http://www.htdig.org/contrib/guides.html">contributed
|
|
guides</a>) not only describes a wrapper script for PHP, but
|
|
also offers a step by step tutorial to the basics of
|
|
ht://Dig and is well worth reading.
|
|
For other alternatives, see question <a href="#q4.11">4.11</a>.
|
|
</p>
|
|
|
|
<strong>4.8. <a name="q4.8">How do I index Word, Excel, PowerPoint
|
|
or PostScript documents?</a></strong><br>
|
|
<p>This must be done with an
|
|
<a href="attrs.html#external_parsers">external parser or converter</a>.
|
|
A sample of such an external converter is the
|
|
contrib/doc2html/doc2html.pl Perl script.
|
|
It will parse Word, PostScript, PDF and other documents, when used
|
|
with the appropriate document to text converters. It uses catdoc to
|
|
parse Word documents, and ps2ascii to parse PostScript files. The
|
|
comments in the Perl script and accompanying documentation
|
|
indicate where you can obtain these converters.</p>
|
|
|
|
<p>Versions of htdig before 3.1.4 don't support external converters,
|
|
so you have to use an external parser script such as
|
|
contrib/parse_doc.pl (or better yet, upgrade htdig if you can).
|
|
External converter scripts are simpler to write and maintain than a
|
|
full external parser, as they just convert input documents to
|
|
text/plain or text/html, and pass that back to htdig to be parsed.
|
|
Parsing is more consistent across document types with external
|
|
converters, because the final work is done by htdig's internal
|
|
parsers. External parser scripts tend to be hacks that don't
|
|
recognize a lot of the parsing attributes in your htdig.conf, so
|
|
they have to be hacked some more when you change your attributes.</p>
|
|
|
|
<p>The most recent versions of parse_doc.pl, conv_doc.pl and
|
|
the doc2html package are available on our <a
|
|
href="http://www.htdig.org/files/contrib/parsers/">web site</a>.<br>
|
|
See below for an example of doc2html.pl, or see the comments in
|
|
conv_doc.pl and parse_doc.pl, or the documentation for doc2html
|
|
for examples of their usage.
|
|
For help with troubleshooting, see questions
|
|
<a href="#q5.37">5.37</a> and <a href="#q5.39">5.39</a>.</p>
|
|
|
|
<strong>4.9. <a name="q4.9">How do I index PDF files?</a></strong><br>
|
|
<p>This too can be done with an
|
|
<a href="attrs.html#external_parsers">external parser or converter</a>,
|
|
in combination with the pdftotext program that is part of the
|
|
<a href="http://www.foolabs.com/xpdf/">xpdf</a> 0.90 package. A
|
|
sample of such a converter is the doc2html.pl Perl
|
|
script. It uses pdftotext to parse PDF documents, then processes
|
|
the text into external parser records.
|
|
The most recent version of doc2html.pl is available on our <a
|
|
href="http://www.htdig.org/files/contrib/parsers/">web
|
|
site</a>.</p>
|
|
|
|
<p>For example, you could put this in your configuration file:</p>
|
|
<pre>
<a href="attrs.html#external_parsers">external_parsers</a>: application/msword->text/html /usr/local/bin/doc2html.pl \
                  application/postscript->text/html /usr/local/bin/doc2html.pl \
                  application/pdf->text/html /usr/local/bin/doc2html.pl
</pre>
|
|
<p>You would also need to configure the script to indicate where all
|
|
of the document to text converters are installed. See the DETAILS
|
|
file that comes with doc2html for more information.</p>
|
|
|
|
<p>Versions of htdig before 3.1.4 don't support external converters,
|
|
so you have to use an external parser script such as
|
|
contrib/parse_doc.pl (or better yet, upgrade htdig if you can).
|
|
See question <a href="#q4.8">4.8</a> above.</p>
|
|
|
|
<p>Whether you use this external parser or converter, or acroread
|
|
with the <a href="attrs.html#pdf_parser">pdf_parser</a> attribute,
|
|
to successfully index PDF files be sure to set the <a
|
|
href="attrs.html#max_doc_size">max_doc_size</a> attribute to
|
|
a value larger than the size of your largest PDF file. PDF
|
|
documents can not be parsed if they are truncated.</p>
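<p>For example, assuming your largest PDF file is a little under
5&nbsp;MB (an assumption for illustration), the configuration file
could contain:</p>
<pre>
# value is in bytes; choose one above the size of your largest PDF
max_doc_size: 5000000
</pre>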
|
|
|
|
<p>This also raises the questions of why two different
|
|
methods of indexing PDFs are supported, and which method
|
|
is preferred. The built-in PDF support, which uses acroread
|
|
to convert the PDF to PostScript, was the first method which
|
|
was provided. It had a few problems with it: acroread is not
|
|
open source, it is not supported on all systems on which
|
|
ht://Dig can run, and for some PDFs, the PostScript that
|
|
acroread generated was very difficult to parse into indexable
|
|
text. Also, the built-in PDF support expected PDF documents to
|
|
use the same character encoding as is defined in your current
|
|
<a href="attrs.html#locale">locale</a>, which isn't always the
|
|
case. The external converters, which use pdftotext, were developed
|
|
to overcome these problems. xpdf 0.90 is free software, and its
|
|
pdftotext utility works very well as an indexing tool.
|
|
It also converts various PDF encodings to the Latin 1 set.
|
|
It is the opinion of the developers that this is the
|
|
preferred method. However, some users still prefer to stick
|
|
with acroread, as it works well for them, and is a little
|
|
easier to set up if you've already installed Acrobat.</p>
|
|
|
|
<p>Also, pdftotext still has some difficulty handling text in
|
|
landscape orientation, even with its new -raw option in 0.90,
|
|
so if you need to index such text in PDFs, you may still get
|
|
better results with acroread. The pdf_parser attribute has been
|
|
removed from the 3.2 beta releases of htdig, so to use acroread
|
|
with htdig 3.2.0b5 or other 3.2 betas, use the acroconv.pl
|
|
external converter script from our <a
|
|
href="http://www.htdig.org/files/contrib/parsers/">web site</a>.</p>
|
|
|
|
<p>See also question <a href="#q5.2">5.2</a> below and
|
|
question <a href="#q1.13">1.13</a> above.
|
|
See questions <a href="#q5.37">5.37</a> and <a href="#q5.39">5.39</a>
|
|
for troubleshooting tips.</p>
|
|
|
|
<strong>4.10. <a name="q4.10">How do I index documents in other
|
|
languages?</a></strong><br>
|
|
<p>The first and most important thing you must do,
|
|
to allow ht://Dig to properly support international
|
|
characters, is to define the correct locale for the
|
|
language and country you wish to support. This is done
|
|
by setting the <a href="attrs.html#locale">locale</a>
|
|
attribute (see question <a href="#q5.8">5.8</a>). The
|
|
next step is to configure ht://Dig to use dictionary and
|
|
affix files for the language of your choice. These can
|
|
be the same dictionary and affix files as are used by the
|
|
ispell software. A collection of these is available from
|
|
Geoff Kuenning's
|
|
<a href="http://fmg-www.cs.ucla.edu/geoff/ispell-dictionaries.html">
|
|
International Ispell Dictionaries page</a>, and we're slowly
|
|
building a collection of word lists on our <a
|
|
href="http://www.htdig.org/files/contrib/wordlists/">web site</a>.</p>
|
|
<p>For example, if you install German dictionaries in common/german,
|
|
you could use these lines in your configuration file:</p>
|
|
<pre>
<a href="attrs.html#locale">locale</a>: de_DE
lang_dir: ${<a href="attrs.html#common_dir">common_dir</a>}/german
<a href="attrs.html#bad_word_list">bad_word_list</a>: ${lang_dir}/bad_words
<a href="attrs.html#endings_affix_file">endings_affix_file</a>: ${lang_dir}/german.aff
<a href="attrs.html#endings_dictionary">endings_dictionary</a>: ${lang_dir}/german.0
<a href="attrs.html#endings_root2word_db">endings_root2word_db</a>: ${lang_dir}/root2word.db
<a href="attrs.html#endings_word2root_db">endings_word2root_db</a>: ${lang_dir}/word2root.db
</pre>
|
|
<p>
|
|
You can build the endings database with <code>htfuzzy endings</code>.
|
|
(This command may actually take days to complete, for
|
|
releases older than 3.1.2. Current releases use faster regular
|
|
expression matching, which will speed this up by a few orders
|
|
of magnitude.) Note that the "*.0" files are not part of
|
|
the ispell dictionary distributions, but are easily made by
|
|
concatenating the partial dictionaries and sorting to remove
|
|
duplicates (e.g.: "<code>cat * | sort | uniq > lang.0</code>"
|
|
in most cases). You will also need to redefine the synonyms
|
|
file if you wish to use the synonyms search algorithm. This
|
|
file is not included with most of the dictionaries, nor is the
|
|
<a href="attrs.html#bad_words">bad_words</a> file.</p>
|
|
|
|
<p>If you put all the language-specific
|
|
dictionaries and configuration files in separate directories,
|
|
and set all the attribute definitions accordingly in each
|
|
search config file to access the appropriate files, you can
|
|
have a multilingual setup where the user selects the language
|
|
by selecting the "config" input parameter value. In addition
|
|
to the attributes given in the example above, you may also
|
|
want custom settings for these language-specific attributes:
|
|
<a href="attrs.html#date_format">date_format</a>,
|
|
<a href="attrs.html#iso_8601">iso_8601</a>,
|
|
<a href="attrs.html#method_names">method_names</a>,
|
|
<a href="attrs.html#no_excerpt_text">no_excerpt_text</a>,
|
|
<a href="attrs.html#no_next_page_text">no_next_page_text</a>,
|
|
<a href="attrs.html#no_prev_page_text">no_prev_page_text</a>,
|
|
<a href="attrs.html#nothing_found_file">nothing_found_file</a>,
|
|
<a href="attrs.html#page_list_header">page_list_header</a>,
|
|
<a href="attrs.html#prev_page_text">prev_page_text</a>,
|
|
<a href="attrs.html#search_results_wrapper">search_results_wrapper</a>
|
|
(or <a href="attrs.html#search_results_header">search_results_header</a>
|
|
and <a href="attrs.html#search_results_footer">search_results_footer</a>),
|
|
<a href="attrs.html#sort_names">sort_names</a>,
|
|
<a href="attrs.html#synonym_db">synonym_db</a>,
|
|
<a href="attrs.html#synonym_dictionary">synonym_dictionary</a>,
|
|
<a href="attrs.html#syntax_error_file">syntax_error_file</a>,
|
|
<a href="attrs.html#template_map">template_map</a>, and of course
|
|
<a href="attrs.html#database_dir">database_dir</a> or
|
|
<a href="attrs.html#database_base">database_base</a> if you
|
|
maintain multiple databases for sites of different languages.
|
|
You could also change the definition of
|
|
<a href="attrs.html#common_dir">common_dir</a>, rather than
|
|
making up a lang_dir attribute as above, as many language-specific
|
|
files are defined relative to the common_dir setting.</p>
|
|
|
|
<p>If you're running version 3.1.6 of ht://Dig, you may also
|
|
be interested in the <strong>accents</strong> fuzzy match
|
|
algorithm in the
|
|
<a href="attrs.html#search_algorithm">search_algorithm</a>
|
|
attribute, which lets you treat accented and unaccented letters
|
|
as equivalent in words. Note that if you use the accents algorithm,
|
|
you need to rebuild the accents database each time you update your
|
|
word database, using <code>"htfuzzy accents"</code>. This command
|
|
isn't in the default rundig script, so you may want to add it there.
|
|
The accents fuzzy match algorithm is also in the 3.2 beta releases.
|
|
There are also the
|
|
<a href="attrs.html#boolean_keywords">boolean_keywords</a> and
|
|
<a href="attrs.html#boolean_syntax_errors">boolean_syntax_errors</a>
|
|
attributes in 3.1.6 for changing other language-specific messages
|
|
in htsearch.</p>
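<p>As a sketch of enabling the accents algorithm, the configuration
file could list it in search_algorithm. The weights shown here are
illustrative, not recommended values:</p>
<pre>
search_algorithm: exact:1 accents:0.7 endings:0.1
</pre>
<p>The accents database would then be rebuilt after each update of the
word database, for example with a line like this added near the end of
rundig (the configuration file path is an assumption):</p>
<pre>
htfuzzy -c /usr/local/htdig/conf/htdig.conf accents
</pre>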
|
|
|
|
<p>Current versions of ht://Dig only support 8-bit
|
|
characters, so languages such as Chinese and Japanese, which
|
|
require 16-bit characters, are not currently supported.</p>
|
|
|
|
<p>Didier Lebrun has written a guide for configuring htdig to
|
|
support French, entitled
|
|
<a href="http://www.quartier-rural.org/dl/elucu/htdig-vf/lisezmoi.html">
|
|
Comment installer et configurer HtDig pour la langue française</a>.
|
|
His "kit de francisation" is also available on
|
|
<a
|
|
href="http://www.htdig.org/files/contrib/wordlists/">our
|
|
web site</a>.</p>
|
|
|
|
<p>See also question <a href="#q4.2">4.2</a> for tips on customizing
|
|
htsearch, and question <a href="#q4.6">4.6</a> for tips on where to find
|
|
bad_words files.</p>
|
|
|
|
<strong>4.11. <a name="q4.11">How do I get rotating banner ads in
|
|
search results?</a></strong><br>
|
|
<p>While htsearch doesn't currently provide a means of doing
|
|
SSI on its output, or calling other CGI scripts, it does have
|
|
the capability of using environment variables in templates.</p>
|
|
|
|
<p>The easiest way to get rotating banners in htsearch is
|
|
to replace htsearch with a wrapper script that sets an
|
|
environment variable to the banner content, or whatever
|
|
dynamically generated content you want. Your script can then
|
|
call the real htsearch to do the work. The wrapper script can be
|
|
written as a shell script, or in Perl, C, C++, or whatever you
|
|
like. You'd then need to reference that environment variable
|
|
in header.html (or wrapper.html if that's what you're using),
|
|
to indicate where the dynamic content should be placed.</p>
|
|
|
|
<p>If the dynamic content is generated by a CGI script, your new
|
|
wrapper script which calls this CGI would then have to strip out
|
|
the parts that you don't want embedded in the output (headers,
|
|
some tags) so that only the relevant content gets put into the
|
|
environment variable you want. You'd also have to make sure
|
|
this CGI script doesn't grab the POST data or get confused by
|
|
the QUERY_STRING contents intended for htsearch. Your script
|
|
should not take anything out of, or add anything to, the
|
|
QUERY_STRING environment variable.</p>
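<p>The wrapper approach described above can be sketched as a small
shell script. Everything named here is hypothetical: pick-banner
stands for whatever program prints your banner HTML, and the real
htsearch binary is assumed to live in /usr/local/htdig/bin:</p>
<pre>
#!/bin/sh
# Hypothetical wrapper installed in cgi-bin in place of htsearch.
# pick-banner is a made-up helper that prints one banner's HTML.
# Its input comes from /dev/null so it cannot consume the POST data
# that htsearch needs.
BANNER=`/usr/local/bin/pick-banner < /dev/null`
export BANNER
# QUERY_STRING is left untouched; hand control to the real htsearch.
exec /usr/local/htdig/bin/htsearch
</pre>
<p>header.html (or wrapper.html) would then reference the BANNER
variable at the point where the banner should appear; see the template
documentation for the variable syntax supported by your version.</p>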
|
|
|
|
<p>An alternative approach is to have a cron job that periodically
|
|
regenerates a different header.html or wrapper.html with the
|
|
new banner ad, or changes a link to a different pre-generated
|
|
header.html or wrapper.html file. For other alternatives, see
|
|
question <a href="#q4.7">4.7</a>.</p>
|
|
|
|
<strong>4.12. <a name="q4.12">How do I index numbers in documents?</a></strong><br>
|
|
<p>By default, htdig doesn't treat numbers without letters
|
|
as words, so it doesn't index them.
|
|
To change this behavior, you must set the
|
|
<a href="attrs.html#allow_numbers">allow_numbers</a>
|
|
attribute to true, and rebuild your index from scratch using
|
|
rundig or htdig with the -i option, so that bare numbers get
|
|
added to the index.</p>
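<p>In the configuration file, that is simply:</p>
<pre>
allow_numbers: true
</pre>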
|
|
|
|
<strong>4.13. <a name="q4.13">How can I call htsearch from a hypertext
|
|
link, rather than from a search form?</a></strong><br>
|
|
<p>If you change the search.html form to use the GET method
|
|
rather than POST, you can see the URLs complete with all the
|
|
arguments that htsearch needs for a query. Here is an example:<br>
|
|
<code>
|
|
http://www.grommetsRus.com/cgi-bin/htsearch?config=htdig&restrict=&exclude=&method=and&format=builtin-long&words=grapple+grommets
|
|
</code>
|
|
which can actually be simplified to:<br>
|
|
<code>
|
|
http://www.grommetsRus.com/cgi-bin/htsearch?method=and&words=grapple+grommets
|
|
</code>
|
|
with the current defaults. The "&" character acts as a
|
|
separator for the input parameters, while the "+" character
|
|
acts as a space character within an input parameter.
|
|
In versions 3.1.5 or 3.2.0b2, or later, you can use a semicolon
|
|
character ";" as a parameter separator, rather than "&", for
|
|
HTML 4.0 compliance.
|
|
Most non-alphanumeric characters should be hex-encoded following
|
|
the convention for URL encoding (e.g. "%" becomes "%25", "+"
|
|
becomes "%2B", etc). Any htsearch input parameter that you'd
|
|
use in a search form can be added to the URL in this way.
|
|
This can be embedded into an <a href="..."> tag.
|
|
<br>See also question <a href="#q5.21">5.21</a>.</p>
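<p>As a small illustration, assuming version 3.1.5 or later so that
";" can be used as the separator, and assuming htsearch is installed
under /cgi-bin on the same server, such a link could look like:</p>
<pre>
<a href="/cgi-bin/htsearch?method=and;words=grapple+grommets">Find grapple grommets</a>
</pre>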
|
|
|
|
<strong>4.14. <a name="q4.14">How do I restrict a search to only meta
|
|
keywords entries in documents?</a></strong><br>
|
|
<p>First of all, you do <strong>not</strong> do this by using the
|
|
"keywords" field in the search form. This seems to be a
|
|
frequent cause of confusion. The "keywords" input parameter
|
|
to htsearch has absolutely nothing to do with searching meta
|
|
keywords fields. It actually predates the addition of meta
|
|
keyword support in 3.1.x. A better choice of name for the
|
|
parameter would have been "requiredwords", because that's what
|
|
it really means - a list of words that are all required to be
|
|
found somewhere in the document, in addition to the words the
|
|
user specifies in the search form.</p>
|
|
|
|
<p>As of 3.2.0b5, the most direct way to search for a particular
|
|
meta keyword is to specify the word as "keyword:<word>".
|
|
Similarly, "title:", "heading:", and "author:" restrict searches
|
|
to the respective fields. To search for words in the body of the
|
|
text, use "text:".</p>
|
|
|
|
<p>To restrict all search terms to meta keywords only, you can set all
|
|
<a href="attrs.html#heading_factor">factors</a> other than
|
|
keywords_factor to 0, and for 3.1.x, you
|
|
must then reindex your documents. In the 3.2 betas, you can
|
|
change factors at search time without needing to reindex.
|
|
As of 3.2.0b5, it is possible to restrict
|
|
the search in the query itself. Note that changing the scoring
|
|
factors in this way will only alter the scoring of search results,
|
|
and shift the low or zero scores to the end of the results when
|
|
sorting by score (as is done by default). For versions before
|
|
3.2.0b5, the results with scores
|
|
of zero aren't actually removed from the search results.</p>
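<p>A sketch of the factor settings described above is shown below.
The value 100 is arbitrary, and only three factors are listed; the
full set of factor attributes differs between 3.1.x and the 3.2 betas,
so check the attribute documentation for your version and set every
other factor to 0 as well:</p>
<pre>
keywords_factor: 100
text_factor: 0
title_factor: 0
</pre>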
|
|
|
|
<strong>4.15. <a name="q4.15">Can I use meta tags to prevent htdig from
|
|
indexing certain files?</a></strong><br>
|
|
<p>Yes, in each HTML file you want to exclude, add the following
|
|
between the <HEAD> and </HEAD> tags:</p>
|
|
<blockquote>
|
|
<META NAME="robots" CONTENT="noindex, follow">
|
|
</blockquote>
|
|
<p>Doing so will allow htdig to still follow links to other documents,
|
|
but will prevent this document from being put into the index itself.
|
|
You can also use "nofollow" to prevent following of links. See
|
|
the section on <a href="meta.html">Recognized META information</a>
|
|
for more details. For documents produced automatically by MhonArc,
|
|
you can have that line inserted automatically by putting it in the
|
|
MhonArc resource file, in the sections IDXPGBEGIN and TIDXPGBEGIN.</p>
|
|
|
|
<p>You can also use the
|
|
<a href="attrs.html#noindex_start">noindex_start</a> and
|
|
<a href="attrs.html#noindex_end">noindex_end</a> attributes to
|
|
define one set of tags which will mark sections to be stripped out
|
|
of documents, so they don't get indexed, or you can mark sections
|
|
with the non-DTD <noindex> and </noindex> tags.
|
|
The noindex_start and noindex_end attributes can also be used to
|
|
suppress in-line JavaScript code that wasn't properly enclosed in
|
|
HTML comment tags (see question <a href="#q4.26">4.26</a>).
|
|
In 3.1.6, you can also put a section between <noindex follow>
|
|
and </noindex> tags to turn off indexing of text but still
|
|
allow htdig to follow links.</p>
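<p>For example, a page could mark a section to be skipped like this
(the page content is invented for illustration):</p>
<pre>
<p>Regular text that htdig will index.</p>
<noindex>
<p>Navigation menu that should stay out of the index.</p>
</noindex>
</pre>
<p>If you would rather use markers of your own, a sketch with
hypothetical comment markers would be to put
<code><!--begin-menu--></code> and <code><!--end-menu--></code> around
the section in your documents, and then set:</p>
<pre>
noindex_start: <!--begin-menu-->
noindex_end: <!--end-menu-->
</pre>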
|
|
|
|
<p>If you require much more elaborate schemes for avoiding indexing
|
|
certain parts of your HTML files, especially if you don't have
|
|
control over these files and can't add tags to them, you can
|
|
set up htdig's
|
|
<a href="attrs.html#external_parsers">external_parsers</a> attribute
|
|
with an external converter that will preprocess the HTML before
|
|
it's parsed and indexed by htdig. Examples of this are the
|
|
unhypermail.sh script in our
|
|
<a href="http://www.htdig.org/files/contrib/parsers/">contributed parsers</a>
|
|
and the ungeoify.sh script in our
|
|
<a href="http://www.htdig.org/files/contrib/scripts/">contributed scripts</a>.
|
|
By preprocessing the HTML, you can strip out parts you don't want, or
|
|
you can add or change tags wherever they're needed, if you're willing
|
|
to put in the effort to learn awk/sed/perl enough to do the job.</p>
|
|
|
|
<strong>4.16. <a name="q4.16">How do I get htsearch to use the star image
|
|
in a different directory than the default /htdig?</a></strong><br>
|
|
<p>You must set either the
|
|
<a href="attrs.html#image_url_prefix">image_url_prefix</a> attribute,
|
|
or both <a href="attrs.html#star_blank">star_blank</a> and
|
|
<a href="attrs.html#star_image">star_image</a> in your
|
|
htdig.conf, to refer to the URL path for these files. You should
|
|
also set this URL path similarly in common/header.html and
|
|
common/wrapper.html, as they also refer to the star.gif file.
|
|
If you want to relocate other graphics, such as the buttons or
|
|
the ht://Dig logo, you should change all references to these
|
|
in htdig.conf and common/*.html.</p>
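<p>For instance, here is a sketch that assumes the image files were
copied to a hypothetical /images/search directory on your web server.
To move all the images at once:</p>
<pre>
image_url_prefix: /images/search
</pre>
<p>or, to relocate only the star images:</p>
<pre>
star_image: /images/search/star.gif
star_blank: /images/search/star_blank.gif
</pre>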
|
|
|
|
<strong>4.17. <a name="q4.17">How do I get htdig or htsearch to rewrite
|
|
URLs in the search results?</a></strong><br>
|
|
<p>This can be done by using the <a
|
|
href="attrs.html#url_part_aliases">url_part_aliases</a>
|
|
configuration file attribute. You have to set up different
|
|
configuration files for htdig and htsearch, to define a
|
|
different setting of this attribute for each one.</p>
|
|
|
|
<p>A large number of users insist on ignoring that last point
|
|
and try to make do with just one definition, either for htdig
|
|
or htsearch, or sometimes for both. This seems to stem from
|
|
a fundamental misunderstanding of how this attribute works,
|
|
so perhaps a clarification is needed. The url_part_aliases
|
|
attribute uses a two stage process. In the first stage, htdig
|
|
encodes the URLs as they go into the database, by using the
|
|
pairs in url_part_aliases going from left to right. In the
|
|
second stage, htsearch decodes the encoded URLs taken from the
|
|
database, by using the pairs in url_part_aliases going from
|
|
right to left. If you have the same value for url_part_aliases
|
|
in htdig and htsearch, you end up with the same URLs in the
|
|
end. If you modify the first string (the from string) in
|
|
the pairs listed in url_part_aliases for htsearch, then when
|
|
htsearch decodes the URLs it ends up rewriting part of them.</p>
|
|
|
|
<p>While you might think that if you don't use url_part_aliases
|
|
in htdig, then you can use it in htsearch to alter unencoded
|
|
URLs, the reality is that if you don't encode parts of URLs
|
|
using url_part_aliases, they still get encoded automatically
|
|
by the <a href="attrs.html#common_url_parts">common_url_parts</a>
|
|
attribute. This helps to reduce the size of your databases. So,
|
|
trying to use url_part_aliases only in htsearch doesn't work
|
|
because there are no unencoded URLs in the database, so the
|
|
right hand strings in the pairs you define won't match anything.</p>
|
|
|
|
<p>You also can't put two different definitions of the
|
|
url_part_aliases attribute in a single configuration file, as
|
|
some users have attempted. When you define an attribute twice,
|
|
the second definition merely overrides the first. Pay close
|
|
attention to the description and examples for
|
|
<a href="attrs.html#url_part_aliases">url_part_aliases</a>.
|
|
You must put one definition of this attribute in your
|
|
configuration file for htdig, htmerge (or htpurge) and htnotify,
|
|
and a different definition of it in your configuration file
|
|
for htsearch.</p>
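<p>A minimal sketch of the two definitions follows. It assumes the
documents are indexed through an internal host name but should appear
in search results under a public one. Both host names and the
<code>*site</code> token are made up for illustration:</p>
<pre>
# In the configuration file used by htdig, htmerge (or htpurge) and htnotify:
url_part_aliases: http://internal.example.com/ *site

# In the configuration file used by htsearch:
url_part_aliases: http://www.example.com/ *site
</pre>
<p>With these settings, htdig would store <code>*site</code> in place of
the internal prefix, and htsearch would expand <code>*site</code> to the
public prefix when it displays results.</p>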
|
|
|
|
<strong>4.18. <a name="q4.18">What are all the options in
|
|
htdig.conf, and are there others?</a></strong><br>
|
|
<p>In ht://Dig's terminology, the settings in its configuration
|
|
files are called <a href="attrs.html">configuration attributes</a>,
|
|
to distinguish them from <a href="htdig.html">command line
|
|
options</a>, <a href="hts_form.html">CGI input parameters</a>
|
|
and <a href="hts_templates.html">template variables</a>. There are
|
|
many, many attributes that can be set to control almost all
|
|
aspects of indexing, searching, customization of output and
|
|
internationalization. All attributes have a built-in default
|
|
setting, and only a subset of these appear in the sample htdig.conf
|
|
file. See the documentation for all default values for attributes
|
|
not overridden in the configuration file, and for help on using
|
|
any of them.
|
|
See also question <a href="#q1.15">1.15</a>.</p>
|
|
|
|
<strong>4.19. <a name="q4.19">How do I get more than 10 pages of
|
|
10 search results from htsearch?</a></strong><br>
|
|
<p>There are two attributes that control the number of matches per
|
|
page and the total number of pages. The number of matches per page
|
|
can be set in your configuration file, using the
|
|
<a href="attrs.html#matches_per_page">matches_per_page</a> attribute,
|
|
or in your <a href="hts_form.html">search form</a>, using the
|
|
<strong>matchesperpage</strong> input parameter.</p>
|
|
|
|
<p>The number of pages is controlled by the
|
|
<a href="attrs.html#maximum_pages">maximum_pages</a> attribute in
|
|
your search configuration file.
|
|
The current default for maximum_pages is 10 because the ht://Dig
|
|
package comes with 10 images, with numbers 1 through 10, for
|
|
use as page list buttons. If we increased the limit, we'd have
|
|
to field a whole lot more questions from users irate because
|
|
only the first 10 buttons are graphics, and the rest are text.
|
|
If you want more than 10 pages of results, change maximum_pages,
|
|
but you may also want to set the
|
|
<a href="attrs.html#page_number_text">page_number_text</a> and
|
|
<a href="attrs.html#no_page_number_text">no_page_number_text</a>
|
|
attributes in your search configuration file to nothing, or
|
|
remove them, to use text rather than images for the links to
|
|
other pages.</p>
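<p>As an illustration, the following search configuration sketch allows
up to 30 pages of 20 matches each, with text links to the other pages.
The numbers are arbitrary examples, not recommendations:</p>
<pre>
matches_per_page: 20
maximum_pages: 30
# Empty values, so plain text is used instead of the button images
page_number_text:
no_page_number_text:
</pre>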
|
|
|
|
<p>In versions of htsearch before 3.1.4, maximum_pages
|
|
limited only the number of page list buttons, and not the
|
|
actual number of pages. This was changed because there was no
|
|
means of limiting the total number of pages, but this ended up
|
|
frustrating users who wanted the ability to have more pages than
|
|
buttons. In 3.2.0b3 and 3.1.6 we introduced a
|
|
<a href="attrs.html#maximum_page_buttons">maximum_page_buttons</a>
|
|
attribute for this purpose.</p>
|
|
|
|
<strong>4.20. <a name="q4.20">How do I restrict a search to only
|
|
certain subdirectories or documents?</a></strong><br>
|
|
<p>That depends on whether you want to protect certain parts of
|
|
your site from prying eyes, or just limit the scope of search
|
|
results to certain relevant areas. For the latter, you just need
|
|
to set the <strong>restrict</strong> or <strong>exclude</strong>
|
|
input parameter in the <a href="hts_form.html">search form</a>.
|
|
This can be done using hidden input fields containing preset
|
|
values, text input fields, select lists, radio buttons or
|
|
checkboxes, as you see fit. If you use select lists, you can
|
|
propagate the choices to select lists in the follow-up search
|
|
forms using the
|
|
<a href="attrs.html#build_select_lists">build_select_lists</a>
|
|
configuration attribute.
|
|
The University at Albany has a good description of how to use
|
|
the <strong>restrict</strong> or <strong>exclude</strong> input
|
|
parameters: <a href="http://www.albany.edu/its/web/search/">
|
|
Constructing a local search using ht://Dig Search forms</a>.
|
|
<br>To include a hex encoded character (such as a %20 for a space)
|
|
in a restrict or exclude string, the '%' must again be encoded.
|
|
For example, to match a filename containing a space, the URL must
|
|
contain %20, and so the CGI parameter passed to htsearch must
|
|
contain %2520. The %25 encodes the '%'. (Note that this is only
|
|
necessary for CGI input parameters, not for the corresponding
|
|
configuration attributes in your htdig.conf file, as attributes
|
|
aren't subjected to the same hex decoding step as parameters are.)
|
|
<br>See also question <a href="#q4.4">4.4</a>.</p>
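<p>A search form with preset limits might look like the sketch below.
The host name, paths and CGI location are hypothetical and must be
adapted to your site:</p>
<pre>
&lt;form method="get" action="/cgi-bin/htsearch"&gt;
  &lt;input type="hidden" name="restrict" value="http://www.example.com/manuals/"&gt;
  &lt;input type="hidden" name="exclude" value="/manuals/old/"&gt;
  &lt;input type="text" name="words"&gt;
  &lt;input type="submit" value="Search"&gt;
&lt;/form&gt;
</pre>
<p>In a hand-written URL, a restriction to a hypothetical directory
named "my files" would be passed as
<code>restrict=/my%2520files/</code>, following the double encoding
rule described above.</p>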
|
|
|
|
<p>If you wish to keep secure and non-secure areas on
|
|
your site separate, and avoid having unauthorized users
|
|
see documents from secure areas in their search results,
|
|
that takes a bit more effort. You certainly can't rely on
|
|
the <strong>restrict</strong> and <strong>exclude</strong>
|
|
parameters, or even the <strong>config</strong> parameter,
|
|
as any parameter in a search form can also be overridden
|
|
by the user in a URL with CGI parameters. The safest
|
|
option would be to host the secure and non-secure areas on
|
|
separate servers with independent installations of htsearch,
|
|
each with its own ht://Dig database, but that is often too
|
|
costly or impractical an option. The next best thing is to
|
|
host them on the same site, but make sure that everything
|
|
is very clearly separated to prevent any leakage of secure
|
|
data. You should maintain separate databases for the secure
|
|
and public areas of your site, by setting up different htdig
|
|
configuration files for each area. Use different settings
|
|
of the <a href="attrs.html#start_url">start_url</a>,
|
|
<a href="attrs.html#limit_urls_to">limit_urls_to</a>
|
|
and <a href="attrs.html#database_dir">database_dir</a>
|
|
configuration attributes, and possibly even different
|
|
<a href="attrs.html#common_dir">common_dir</a> settings as well.
|
|
Make sure your database_dir, and even your common_dir, are not
|
|
in any directories accessible from the web server. Run htdig
|
|
and htmerge (or rundig) with each separate configuration file,
|
|
to build your two databases.</p>
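<p>The two configuration files could be laid out roughly as follows.
This is only a sketch: the URLs and directory names are assumptions
chosen for illustration.</p>
<pre>
# /etc/htdig/public/htdig.conf
start_url:     http://www.example.com/
limit_urls_to: ${start_url}
exclude_urls:  /cgi-bin/ .cgi /secure/
database_dir:  /var/htdig/db/public

# /etc/htdig/secure/htdig.conf
start_url:     http://www.example.com/secure/
limit_urls_to: ${start_url}
database_dir:  /var/htdig/db/secure
</pre>
<p>Each database is then built with its own configuration file:</p>
<pre>
htdig -i -c /etc/htdig/public/htdig.conf
htmerge -c /etc/htdig/public/htdig.conf
htdig -i -c /etc/htdig/secure/htdig.conf
htmerge -c /etc/htdig/secure/htdig.conf
</pre>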
|
|
|
|
<p>The tricky part is to make sure your htsearch program is
|
|
secure. You don't want to use the same htsearch for the secure
|
|
and public sites, because otherwise the public site could
|
|
access the configuration for the secure database, making its
|
|
data publicly accessible. You must either compile two separate
|
|
versions of htsearch, with different settings of the CONFIG_DIR
|
|
<em>make</em> variable, or you must make a simple wrapper
|
|
script for htsearch that overrides the compiled-in CONFIG_DIR
|
|
setting by a different setting of the CONFIG_DIR environment
|
|
variable. Make sure the CONFIG_DIR for the secure area is
|
|
not a subdirectory of the CONFIG_DIR for the public area.
|
|
In this way, you can maintain separate directories of config
|
|
files for the public and secure sites, so that the secure
|
|
config files are not accessible from the public htsearch.</p>
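<p>A wrapper of this kind might look like the following sketch. The
configuration directory and the location of the real htsearch binary
are hypothetical, so adjust them to your installation:</p>
<pre>
#!/bin/sh
# Point htsearch at the secure site's configuration directory,
# overriding the compiled-in CONFIG_DIR setting.
CONFIG_DIR=/etc/htdig/secure
export CONFIG_DIR
# Hand the request over to the real htsearch binary. The CGI
# environment and standard input are passed along unchanged.
exec /usr/local/htdig/bin/htsearch
</pre>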
|
|
|
|
<p>Put the htsearch binary or wrapper script for the secure site
|
|
in a different ScriptAlias'ed cgi-bin directory than the public
|
|
one, and protect the secure cgi-bin with a .htaccess file or
|
|
in your server configuration. Alternatively, you can put the
|
|
secure program, let's call it htssearch, in the same cgi-bin,
|
|
but protect that one CGI program in your server configuration,
|
|
e.g.:</p>
|
|
<pre>
|
|
&lt;Location /cgi-bin/htssearch&gt;
  AuthType Basic
  AuthName ....
  AuthUserFile ...
  AuthGroupFile ...
  &lt;Limit GET POST&gt;
    require group foo
  &lt;/Limit&gt;
&lt;/Location&gt;
|
|
</pre>
|
|
<p>This describes the setup for an Apache server. You'd need to
|
|
work out an equivalent configuration for your server if you're
|
|
not running Apache.</p>
|
|
|
|
<strong>4.21. <a name="q4.21">How can I allow people to search
|
|
while the index is updating?</a></strong><br>
|
|
<p>Answer contributed by Avi Rappoport &lt;avirr@searchtools.com&gt;</p>
|
|
<p>If you have enough disk space for two copies of the index
|
|
database, use -a with the htdig and htmerge processes. This will
|
|
make use of a copy of the index database with the extension
|
|
".work", and update the copy instead of the originals.
|
|
This way, htsearch can use those originals while the update is
|
|
going on. When it's done, you can move the .work versions to
|
|
replace the originals, and htsearch will use them. The current
|
|
rundig script will do this for you if you supply the -a flag
|
|
to it. However, rundig builds the database from scratch each
|
|
time you run it. If you want to update an alternate copy of
|
|
the database, see the
|
|
<a href="http://www.htdig.org/files/contrib/scripts/rundig.sh">contributed
|
|
rundig.sh script</a>.</p>
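<p>Done by hand, a full rebuild into the alternate copy could be
sketched as follows, assuming a hypothetical configuration file path
and database directory. Check the rundig script shipped with your
version for the exact steps it performs:</p>
<pre>
# Build the new index in the .work files while htsearch keeps
# using the originals.
htdig -i -a -c /etc/htdig/htdig.conf
htmerge -a -c /etc/htdig/htdig.conf

# When both have finished, move each .work file over its original.
cd /var/htdig/db &amp;&amp; for f in *.work; do mv -f "$f" "${f%.work}"; done
</pre>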
|
|
|
|
<strong>4.22. <a name="q4.22">How can I get htdig to ignore the
|
|
robots.txt file or meta robots tags?</a></strong><br>
|
|
<p>You can't, and you shouldn't. The
|
|
<a href="http://www.robotstxt.org/wc/norobots.html">
|
|
Standard for Robot Exclusion</a> exists for a very good reason,
|
|
and any well behaved indexing engine or spider should conform to it.
|
|
If you have a problem with a robots.txt file, you really should
|
|
take it up with the site's webmaster. If they don't have a problem
|
|
with you indexing their site, they shouldn't mind setting up a
|
|
User-agent entry in their robots.txt file with a name you both
|
|
agree on. The user agent setting that htdig uses for matching
|
|
entries in robots.txt can be changed via the
|
|
<a href="attrs.html#robotstxt_name">robotstxt_name</a> attribute in
|
|
your config file.</p>
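<p>For example, if you and the webmaster agree on the (made-up) name
"exampledig", your configuration file would contain:</p>
<pre>
robotstxt_name: exampledig
</pre>
<p>and the site's robots.txt could admit that one robot while keeping
all others out:</p>
<pre>
User-agent: exampledig
Disallow:

User-agent: *
Disallow: /
</pre>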
|
|
|
|
<p>If you have a problem with a robots meta tag in a document
|
|
(see question <a href="#q4.15">4.15</a>) you should take it up
|
|
with the author or maintainer of that page. These tags are an
|
|
all or nothing deal, as they can't be set up to allow some engines
|
|
and disallow others. If htdig encounters them, it has to give the
|
|
page's creator the benefit of the doubt and honour them. If
|
|
exceptions to the rule are wanted, this should be done with a
|
|
robots.txt file rather than a meta tag.</p>
|
|
|
|
<strong>4.23. <a name="q4.23">How can I get htdig not to index
|
|
some directories, but still follow links?</a></strong><br>
|
|
<p>You can simply add the directory name to your robots.txt file
|
|
or to the <a href="attrs.html#exclude_urls">exclude_urls</a>
|
|
attribute in your configuration, but that will exclude all files
|
|
under that directory. If you want the files in that directory to
|
|
be indexed, you have a couple of options. You can add an index.html
|
|
file to the directory, that will include a robots meta tag
|
|
(see question <a href="#q4.15">4.15</a>) to prevent indexing,
|
|
and will contain links to all your files in this directory.
|
|
The drawback of this is that you must maintain the index.html
|
|
file yourself, as it won't be automatically updated as new
|
|
files are added to the directory.</p>
|
|
|
|
<p>The other technique you can use, if you want the directory
|
|
index to be made by the web server, is to get the server to
|
|
insert the robots meta tag into the index page it generates.
|
|
In Apache, this is done using the
|
|
<a href="http://httpd.apache.org/docs/mod/mod_autoindex.html#headername">HeaderName</a>
|
|
and <a href="http://httpd.apache.org/docs/mod/mod_autoindex.html#indexoptions">IndexOptions</a>
|
|
directives in the directory's <strong>.htaccess</strong> file.
|
|
For example:</p>
|
|
<pre> HeaderName .htrobots
|
|
IndexOptions FancyIndexing SuppressHTMLPreamble
|
|
</pre>
|
|
<p>and in the .htrobots file:</p>
|
|
<pre>&lt;HTML&gt;&lt;head&gt;
&lt;META NAME="robots" CONTENT="noindex, follow"&gt;
&lt;title&gt;Index of /this/dir&lt;/title&gt;
&lt;/head&gt;
|
|
</pre>
|
|
|
|
<p>If you don't mind getting just one copy of each directory,
|
|
but want to suppress the multiple copies generated by Apache's
|
|
FancyIndexing option, you can either turn off FancyIndexing or
|
|
you can add "?D=A ?D=D ?M=A ?M=D ?N=A ?N=D ?S=A ?S=D" to
|
|
the <a href="attrs.html#bad_querystr">bad_querystr</a> attribute
|
|
(without the quotes) to suppress the alternately sorted views of
|
|
the directory. For Apache 2.x, you'd use "C=D C=M C=N C=S O=A O=D"
|
|
instead in your bad_querystr setting.</p>
|
|
|
|
<strong>4.24. <a name="q4.24">How can I get rid of duplicates in
|
|
search results?</a></strong><br>
|
|
<p>This depends on the cause of the duplicate documents. htdig
|
|
does keep track of the URLs it visits, so it never puts the
|
|
same URL more than once in the database. So, if you have
|
|
duplicate documents in your search results, it's because the
|
|
same document appears under different URLs. Sometimes the
|
|
URLs vary only slightly, and in subtle ways, so you may have
|
|
to look hard to find out what the variation is. Here are some
|
|
common reasons, each requiring a different solution.</p>
|
|
|
|
<ul>
|
|
<li>You're indexing a case insensitive web
|
|
server (e.g. an NT based server), but the
|
|
<a href="attrs.html#case_sensitive">case_sensitive</a> attribute is
|
|
still set to true. In this case, if htdig encounters two URLs
|
|
pointing to the same document, but the case of the letters in
|
|
one is different than the other (even if it's only 1 letter),
|
|
it will not treat them as the same URL. Setting case_sensitive to
false makes htdig treat such URLs as identical.<br><br>
|
|
<li>You have symbolic links (or hard links) to some of
|
|
these documents, so they can be reached by several URLs.
|
|
The solution here is to build an exclude list of URLs that
|
|
are actually symbolic links, and put these in
|
|
<a href="attrs.html#exclude_urls">exclude_urls</a>
|
|
(or in your robots.txt file). You can automate this using a
|
|
technique similar to the find command in question
|
|
<a href="#q5.25">5.25</a> which builds the start_url list, but
|
|
adding a -type l to find symbolic links.<br><br>
|
|
<li>You have copies of the same documents in different
|
|
locations. This is similar to the symbolic link problem above,
|
|
but harder to fix automatically.<br><br>
|
|
<li>The duplicate URLs result from CGI, SSI or other dynamic pages
|
|
that give the same content even though there may be variations in
|
|
the query string or other parts of the URL. The approach to
|
|
fix this is similar to the fix above, but may be less easy
|
|
to automate, depending on what the variations are. You can
|
|
add patterns to exclude_urls or bad_querystr to get rid of
|
|
unwanted variations. These are especially important to bring
|
|
under control, because in some cases, if left unchecked, they
|
|
can result in an <em>infinite virtual hierarchy</em> which htdig
|
|
will never be able to finish indexing. For example, in a CGI-based
|
|
calendar, htdig could go on following next month or next
|
|
year links to infinity, but this can be stopped by adding a
|
|
stop year to <a href="attrs.html#bad_querystr">bad_querystr</a>.
|
|
<br><br>Another common example happens when htdig hits a link
|
|
to an SSI page and the URL has an extra trailing slash. This
|
|
can happen with either .shtml pages or .html pages that use
|
|
the XBitHack. The trailing slash causes the URL to be misinterpreted
|
|
as a directory URL, and any relative URLs in the document are added
|
|
to the URL, creating longer and longer URLs that still lead to the
|
|
same SSI document. There are two things you can do:<ol>
|
|
<li>hunt down the pages with the incorrect links, i.e.
|
|
search for ".shtml/" or ".html/" in URLs in your documents,
|
|
and fix these links; or
|
|
<li>add .shtml/ and .html/ to your
|
|
<a href="attrs.html#exclude_urls">exclude_urls</a>
|
|
setting to get htdig to ignore these defective links.
|
|
</ol>The second option is easier, but you run the risk that htdig
|
|
will miss some SSI pages if the only links to them have the trailing
|
|
slash, so you may want to try hunting down the links anyway.
|
|
<br><br>See also question <a href="#q5.29">5.29</a>.<br><br>
|
|
<li>The duplicates result from session IDs in PHP or other dynamic
|
|
pages that give the same content even though the ID changes during
|
|
the indexing process. This can lead not only to duplicates, but
|
|
also to URLs that become unusable because of expired session IDs.
|
|
Session IDs are the bane of search engines, and you should avoid
|
|
using them if at all possible. If getting rid of them altogether
|
|
isn't an option, then you can at least remove them while indexing,
|
|
using the <a href="attrs.html#url_rewrite_rules">url_rewrite_rules</a>
|
|
attribute. This will only work if htdig can access the documents
|
|
without a session ID, as htdig rewrites the URL before fetching the
|
|
document, and htsearch presents the rewritten URL (without session
|
|
ID) in search results.
|
|
</ul>
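<p>Several of the fixes listed above come down to a few lines in the
configuration file. The following sketch assumes a case insensitive
server, stray trailing-slash links to SSI pages, and a CGI calendar
with a hypothetical "year" query parameter. Adapt the patterns to the
URLs you actually see:</p>
<pre>
# Treat URLs that differ only in letter case as the same URL
case_sensitive: false

# Ignore CGI URLs and the defective .shtml/ and .html/ links
exclude_urls: /cgi-bin/ .cgi .shtml/ .html/

# Stop following the calendar once it reaches this year
bad_querystr: year=2031
</pre>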
|
|
|
|
<strong>4.25. <a name="q4.25">How can I change the scores in
|
|
search results, and what are the defaults?</a></strong><br>
|
|
<p>The scores are calculated mostly by htdig at indexing time,
|
|
with some tweaking done by htsearch at search time. There are
|
|
a number of <a href="attrs.html">configuration attributes</a>,
|
|
all called <em>&lt;something&gt;</em><strong>_factor</strong>,
|
|
which can control the scoring calculations. In addition, the
|
|
location of words within the document has an effect on score,
|
|
as word scores are also multiplied by a varying location
|
|
factor somewhere in between 1000 for words near the start
|
|
and 1 for words near the end of the document. As of yet,
|
|
there is no way to change this factor. For any of the scoring
|
|
factors you can configure, and which are used by htdig, you
|
|
will have to reindex your documents so the new factors take
|
|
effect. The default values for these scoring factors, as well as
|
|
information about whether they're used by htdig or htsearch,
|
|
are all listed in the <a href="attrs.html">configuration
|
|
attributes documentation</a>. Malcolm Austen has written some
|
|
<a href="http://wwwsearch.ox.ac.uk/scores.html">notes on page
|
|
scores</a> for 3.1.x which you may find helpful.</p>
|
|
|
|
<p>Note that the above applies to the 3.1.x releases, while
|
|
in the 3.2 beta releases, all scores are calculated at search
|
|
time with no weight being put on the location of words within
|
|
the document.</p>
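<p>As a sketch, the following settings would give more weight to words
in titles and less to words in META keywords and descriptions. The
values are arbitrary illustrations, so compare them against the
defaults listed in the attribute documentation before using them:</p>
<pre>
title_factor: 200
keywords_factor: 50
meta_description_factor: 25
</pre>
<p>If the documentation lists an attribute as used by htdig, rerun the
indexing (for example with "htdig -i" followed by htmerge) after
changing it.</p>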
|
|
|
|
<strong>4.26. <a name="q4.26">How can I get htdig not to index
|
|
JavaScript code or CSS?</a></strong><br>
|
|
<p>The HTML parser in htdig recognizes and parses only HTML,
|
|
which is all there should be within an HTML file. If your HTML
|
|
files contain in-line JavaScript code or Cascading Style Sheets
|
|
(CSS), these in-line codes, which are clearly not HTML, should
|
|
be enclosed within an HTML comment tag so they are hidden
|
|
from view from the HTML parser, or for that matter from any
|
|
web client that is not JavaScript-aware or CSS-aware. See
|
|
<a href="http://www.mcli.dist.maricopa.edu/show/interact/js_b.html">
|
|
Behind the Scenes with JavaScript</a> for a description of the
|
|
technique, which applies equally well to in-line style sheets.
|
|
If fixing up all non-HTML compliant JavaScript or CSS code in
|
|
your HTML files is not an option, then see question
|
|
<a href="#q4.15">4.15</a> for an alternate technique.</p>
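<p>A short illustration of the comment technique, using a made-up
function and style rule:</p>
<pre>
&lt;script type="text/javascript"&gt;
&lt;!--
function greet() { alert("hello"); }
// --&gt;
&lt;/script&gt;

&lt;style type="text/css"&gt;
&lt;!--
body { background: #ffffff; }
--&gt;
&lt;/style&gt;
</pre>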
|
|
|
|
<p>The HTML parser in htdig 3.1.6 tries skipping over bare
|
|
in-line JavaScript code in HTML, unlike previous versions,
|
|
but a small bug in the parser causes it to be thrown off by a
"&lt;" sign in the JavaScript, and it may then miss the closing
&lt;/script&gt; tag. This can be fixed by applying this
|
|
<a href="ftp://ftp.ccsf.org/htdig-patches/3.1.6/JavaScript.0">
|
|
patch</a>.</p>
|
|
|
|
<hr noshade size=2>
|
|
|
|
<h3>5. Troubleshooting</h3>
|
|
<strong>5.1. <a name="q5.1">I can't seem to index more than X documents
|
|
in a directory.</a></strong><br>
|
|
<p>This usually has to do with the default document size
|
|
limit. If you set <a href="attrs.html#max_doc_size">
|
|
max_doc_size</a> in your config file to
|
|
something large enough to read in the directory index (try 100000 for
|
|
100K) this should fix this problem. Of course this will require
|
|
more memory to read the larger file. Don't set it to a value
|
|
larger than the amount of memory you have, and never more than
|
|
about 2 billion, the maximum value of a signed 32-bit integer.
|
|
If htdig is missing entire directories, see question
|
|
<a href="#q5.25">5.25</a>.</p>
|
|
|
|
<strong>5.2. <a name="q5.2">I can't index PDF files.</a></strong><br>
|
|
<p>As above, this usually has to do with the default document
|
|
size. What happens is ht://Dig will read in part of a PDF file
|
|
and try to index it. This usually fails. Try setting
|
|
<a href="attrs.html#max_doc_size">max_doc_size</a>
|
|
in your config file to a larger value than the
|
|
size of your largest PDF file. Don't go overboard, though, as
|
|
you don't want to overflow a 32-bit integer (about 2 billion),
|
|
and you don't want to allocate much more memory than you need
|
|
to store the largest document.</p>
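<p>For example, if your largest PDF file is a little under 5 MB (an
assumed figure, so check your own files), the setting could be:</p>
<pre>
# Allow documents of up to about 5 MB to be read in full
max_doc_size: 5000000
</pre>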
|
|
|
|
<p>There is a bug in Adobe Acrobat Reader version 4, in its
|
|
handling of the -pairs option, which causes a segmentation
|
|
violation when using it with htdig 3.1.2 or earlier. There is
|
|
a workaround for this as of version 3.1.3 - you must remove
|
|
the -pairs option from your pdf_parser definition, if it's
|
|
there. However, acroread version 4 is still very unstable (on
|
|
Linux, anyway) so it is not recommended as a PDF parser. An
|
|
alternative is to use an external converter with the xpdf 0.90
|
|
package installed on your system, as described in question <a
|
|
href="#q4.9">4.9</a> above.</p>
|
|
|
|
<strong>5.3. <a name="q5.3">When I run "rundig," I get a message about
|
|
"DATABASE_DIR" not being found.</a></strong><br>
|
|
<p>This is due to a bug in the Makefile.in file in version
|
|
3.1.0b1. The easiest fix is to edit the rundig file and change
|
|
the line "TMPDIR=@DATABASE_DIR@" to set TMPDIR to a directory
|
|
with a large amount of temporary disk space for htmerge. This
|
|
bug is fixed in version 3.1.0b2.</p>
|
|
|
|
<strong>5.4. <a name="q5.4">When I run htmerge, it stops with an "out
|
|
of diskspace" message.</a></strong><br>
|
|
<p>This means that htmerge has run out of temporary disk space
|
|
for sorting. Either in your "rundig" script (if you run htmerge
|
|
through that) or before you run htmerge, set the variable TMPDIR
|
|
to a temp directory with lots of space.</p>
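<p>From a Bourne-style shell this could look like the following sketch,
where /var/tmp and the configuration file path stand in for whatever
suits your system:</p>
<pre>
TMPDIR=/var/tmp
export TMPDIR
htmerge -c /etc/htdig/htdig.conf
</pre>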
|
|
|
|
<strong>5.5. <a name="q5.5">I have problems running rundig from cron
|
|
under Linux.</a></strong><br>
|
|
<p>This problem commonly occurs on Red Hat Linux 5.0 and 5.1,
|
|
because of a bug in vixie-cron. It causes htmerge to fail with a
|
|
"Word sort failed" error. It's fixed in Red Hat 5.2.
|
|
You can install vixie-cron-3.0.1-26.{arch}.rpm from a 5.2
|
|
distribution to fix the problem on 5.0 or 5.1. A quick fix for
|
|
the problem is to change the first line of rundig to "#!/bin/ash"
|
|
which will run the script through the ash shell, but this doesn't
|
|
solve the underlying problem.</p>
|
|
|
|
<strong>5.6. <a name="q5.6">When I run htmerge, it stops with an
|
|
"Unexpected file type" message.</a></strong><br>
|
|
<p>Often this is because the databases are corrupt. Try removing
|
|
them and rebuilding. If this doesn't work, some have found that
|
|
the solution for question <a href="#q3.2">3.2</a> works for this
|
|
as well. This should be fixed in versions 3.1.x and later.</p>
|
|
|
|
<strong>5.7. <a name="q5.7">When I run htsearch, I get lots of Internal
|
|
Server Errors (#500).</a></strong><br>
|
|
<p>If you are running under Solaris, see <a href="#q3.6">3.6</a>.
|
|
The solution for Solaris may also work for other OSes that use shared
|
|
libraries in non-standard locations, so refer to question 3.6 if
|
|
you suspect a shared library problem. In any case, check your web
|
|
server error logs to see the cause of the internal server errors.
|
|
If it's not a problem with shared libraries, there's a good chance
|
|
that the error logs will still contain useful error messages that
|
|
will help you figure out what the problem is.
|
|
<br>See also questions <a href="#q5.13">5.13</a> and
|
|
<a href="#q5.23">5.23</a>.</p>
|
|
|
|
<strong>5.8. <a name="q5.8">I'm having problems with indexing words
|
|
with accented characters.</a></strong><br>
|
|
<p>
|
|
Most of the time, this is caused by either not setting or
|
|
incorrectly setting the <a
|
|
href="attrs.html#locale">locale</a> attribute. The default locale
|
|
for most systems is the "portable" locale, which strips
|
|
everything down to standard ASCII. Most systems expect
|
|
something like <code>locale: en_US</code> or
|
|
<code>locale: fr_FR</code>. Locale files are often found in
|
|
<code>/usr/share/locale</code> or the <tt>$LANGUAGE</tt>
|
|
environment variable. See also question <a href="#q4.10">4.10</a>.
|
|
</p>
|
|
|
|
<p>Setting the locale correctly seems to be a frequent source of
|
|
frustration for ht://Dig users, so here are a few pointers which
|
|
some have found useful. First of all, if you don't have any luck
|
|
with the settings of the <a href="attrs.html#locale">locale</a>
|
|
attribute that you try, make sure you use a locale that is
|
|
defined on your system. As mentioned above, these are usually
|
|
installed in <code>/usr/share/locale</code>, so look there
|
|
for a directory named for the locale you want to use. If
|
|
you don't find it, but find something close, try that locale
|
|
name. Note that the locale may not have to be specific to the
|
|
language you're indexing, as long as it uses the same character
|
|
set. E.g. most western European languages use the ISO-8859-1
|
|
Latin 1 character set, so on most systems the locales for
|
|
all these languages define the same character types table
|
|
and can be used interchangeably. Some systems, however,
|
|
define only the accented letters used for a given language,
|
|
so "your mileage may vary." The important thing is that the
|
|
directory for your locale definition <strong>must</strong>
|
|
have a file named <code>LC_CTYPE</code> in it. For example,
|
|
on many Linux distributions, a language-specific locale like
|
|
<code>fr</code> won't contain this file, but country-specific
|
|
locales like <code>fr_FR</code> or <code>fr_CA</code> will. If
|
|
you don't find any appropriate locales installed on your system,
|
|
try obtaining and installing the locale definition files from
|
|
your OS distribution. Also, once you've set your locale, you need
|
|
to reindex all your documents in order for the locale to take
|
|
effect in the word database. This means rerunning the "rundig"
|
|
script, or running "htdig -i" and htmerge (or htpurge in the 3.2
|
|
betas).</p>
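<p>Putting these steps together for French, as a sketch that assumes
the locale directory and configuration file path shown here: first
confirm that the locale has a character type table,</p>
<pre>
ls /usr/share/locale/fr_FR/LC_CTYPE
</pre>
<p>then set the attribute in your configuration file,</p>
<pre>
locale: fr_FR
</pre>
<p>and finally reindex from scratch:</p>
<pre>
htdig -i -c /etc/htdig/htdig.conf
htmerge -c /etc/htdig/htdig.conf
</pre>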
|
|
|
|
<p>Note also that some UNIX systems and libc5-based Linux
|
|
systems just don't have a working implementation of locales,
|
|
so you may not be able to get locales working at all on certain
|
|
systems. The
|
|
<a href="http://www.htdig.org/files/contrib/other/testlocale.c">testlocale.c</a>
|
|
program on our web site can let you see the LC_CTYPE tables
|
|
for any locale, to aid in finding one that works. Carefully
|
|
follow the directions in the program's comments to know how to
|
|
use it and what to look for in its output.</p>
|
|
|
|
<strong>5.9. <a name="q5.9">When I run htmerge, it stops with a
|
|
"Word sort failed" message.</a></strong><br>
|
|
<p>There are three common causes of this. First of all, the sort
|
|
program may be running out of temporary file space. Fix this
|
|
by freeing up some space where sort puts its temporary files,
|
|
or change the setting of the TMPDIR environment variable to a
|
|
directory on a volume with more space. A second common problem
|
|
is on systems with a BSD version of the sort program (such as
|
|
FreeBSD or NetBSD). This program uses the -T option as a record
|
|
separator rather than an alternate temporary directory. On these
|
|
systems, you must remove the TMPDIR environment variable from
|
|
rundig, or change the code in htmerge/words.cc not to use the
|
|
-T option. A third cause is the cron program on Red Hat Linux
|
|
5.0 or 5.1. (See question <a href="#q5.5">5.5</a> above.)</p>
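<p>To check the first cause, you can look at the free space where sort
is likely to put its temporary files. This is only a sketch: the helper
name is hypothetical, and the fallback to /tmp is an assumption (some
sort implementations default to a different directory).</p>

```shell
# Print the directory sort is expected to use for temporary files:
# $TMPDIR if it is set and non-empty, otherwise /tmp (an assumed default).
sort_tmp_dir() {
    printf '%s\n' "${TMPDIR:-/tmp}"
}

# Show the free space, in kilobytes, on the volume holding that directory.
df -k "$(sort_tmp_dir)"
```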
|
|
|
|
<strong>5.10. <a name="q5.10">When htsearch has a lot of matches, it runs
|
|
extremely slowly.</a></strong><br>
|
|
<p>When you run htsearch with no customization, on a
|
|
large database, and it gets a lot of hits, it tends to
|
|
take a long time to process those hits. Some users with
|
|
large databases have reported much higher performance,
|
|
for searches that yield lots of hits, by setting the <a
|
|
href="attrs.html#backlink_factor">backlink_factor</a> attribute
|
|
in htdig.conf to 0, and sorting by score. The scores calculated
|
|
this way aren't quite as good, but htsearch can process hits
|
|
much faster when it doesn't need to look up the db.docdb record
|
|
for each hit, just to get the backlink count, date or title,
|
|
either for scoring or for sorting. This affects versions
|
|
3.1.0b3 and up. In version 3.2, currently under development,
|
|
the databases will be structured differently, so it should
|
|
perform searches more quickly.</p>
|
|
|
|
<p>In version 3.1.6, the date range selection code also slows
|
|
down htsearch for the same reason. Unfortunately, a small bug
|
|
crept into the code so that even if you don't set any of the
|
|
date range input parameters (startyear, endyear, etc.), and
|
|
you set backlink_factor and date_factor to 0, htsearch still
|
|
looks at the date in the db.docdb record for each hit. You can
|
|
avoid this either by setting startyear to 1969 and endyear to
|
|
2038 in your config file, or by applying this
|
|
<a href="ftp://ftp.ccsf.org/htdig-patches/3.1.6/timet_enddate.1">
|
|
patch</a>.</p>
|
|
|
|
<strong>5.11. <a name="q5.11">When I run htsearch, it gives me a count of
|
|
matches, but doesn't list the matching documents.</a></strong><br>
|
|
<p>This most commonly happens when you run htsearch while the
|
|
database is currently being rebuilt or updated by htdig.
|
|
If htdig and htmerge have run to completion, and the problem still
|
|
occurs, this is usually an indication of a corrupted database. If
|
|
it's finding matches, it's because it found the matching
|
|
words in db.words.db. However, it isn't finding the document
|
|
records themselves in db.docdb, which would suggest that either
|
|
db.docdb, or db.docs.index (which maps document IDs used in
|
|
db.words.db to URLs used to look up records in db.docdb), is
|
|
incomplete or messed up. You'll likely need to rebuild your
|
|
database from scratch if it's corrupted. Older versions of
|
|
ht://Dig were susceptible to database corruption of this
|
|
sort. Versions 3.1.2 and later are much more stable.</p>
|
|
|
|
<p>Another possible cause of this problem is unreadable result
|
|
template files. If you define external template files via the
|
|
<a href="attrs.html#template_map">template_map</a> attribute,
|
|
rather than using the builtin-short or builtin-long templates,
|
|
and the file names are incorrect or the files do not have
|
|
read permission for the user ID under which htsearch runs,
|
|
then htsearch won't be able to display the results. Also,
|
|
all directories leading up to these template files must be
|
|
searchable (i.e. executable) by htsearch, or it won't be able
|
|
to open the files. This is the opposite problem of that described
|
|
in question <a href="#q5.36">5.36</a>. If htsearch displays
|
|
nothing at all, you may have both problems.</p>
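<p>The permission checks described above can be sketched as a small
shell function. The function name is hypothetical, and the checks apply
to whichever user runs it, so it only tells you something useful when
run under the same user ID as htsearch (i.e. the web server's user).</p>

```shell
# Report a template file that is not readable, and any directory
# leading up to it that is not searchable, for the current user.
check_template_access() {
    file=$1
    status=0
    if [ ! -r "$file" ]; then
        echo "not readable: $file"
        status=1
    fi
    dir=$(dirname "$file")
    while :; do
        if [ ! -x "$dir" ]; then
            echo "not searchable: $dir"
            status=1
        fi
        # Stop once we have checked the top of the path.
        case $dir in
            /|.) break ;;
        esac
        dir=$(dirname "$dir")
    done
    return "$status"
}
```

<p>It prints nothing and returns 0 when the file and every directory
above it are accessible.</p>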
|
|
|
|
<strong>5.12. <a name="q5.12">I can't seem to index documents with names
|
|
like left_index.html with htdig.</a></strong><br>
|
|
<p>There is a bug in the implementation of the <a
|
|
href="attrs.html#remove_default_doc">remove_default_doc</a>
|
|
attribute in htdig versions 3.1.0, 3.1.1 and 3.1.2, which causes
|
|
it to match more than it should. The default value for this
|
|
attribute is "index.html", so any URL in which the filename ends
|
|
with this string (rather than matches it entirely) will have
|
|
the filename stripped off. This is fixed in version 3.1.3.</p>
|
|
|
|
<strong>5.13. <a name="q5.13">I get Premature End of Script Headers errors
|
|
when running htsearch.</a></strong><br>
|
|
<p>This happens when htsearch dies before putting out a
|
|
"Content-Type" header. If you are running Apache under Solaris,
|
|
or another system that may be using shared libraries in non-standard
|
|
locations,
|
|
first try the solution described in question <a href="#q3.6">3.6</a>.
|
|
If that doesn't work, or you're running on another system, try
|
|
running "htsearch -vvv" directly from the command line to see where
|
|
and why it's failing. It should prompt you for the search words,
|
|
as well as the format.
|
|
<br>If it works from the command line, but not from the web
|
|
server, it's almost certainly a web server configuration problem.
|
|
Check your web server's error log for any information related to
|
|
htsearch's failure. One increasingly common problem is Apache
|
|
configurations which expect all CGI scripts to be Perl,
|
|
rather than binary executables or other scripts, so they use
"perl-script" rather than "cgi-script".
|
|
<br>See also questions <a href="#q5.7">5.7</a>,
|
|
<a href="#q5.14">5.14</a> and <a href="#q5.23">5.23</a>.</p>
|
|
|
|
<strong>5.14. <a name="q5.14">I get Segmentation faults when running
|
|
htdig, htsearch or htfuzzy.</a></strong><br>
|
|
<p>Despite a great deal of debugging of these programs, we haven't
|
|
been able to completely eliminate all such problems on all platforms.
|
|
If you're running htsearch or htfuzzy on a BSDI system, a common
cause of core dumps is a conflict between the GNU regex
|
|
code bundled in htdig 3.1.2 and later, and the BSD C or C++ library.
|
|
The solution is to use the BSD library's own rx code instead,
|
|
using version 3.1.6 or newer as summarized by Joe Jah:</p>
|
|
<ul>
|
|
<li> ./configure --with-rx
|
|
<li> make
|
|
</ul>
|
|
<p>This solution may work on some other platforms as well (we haven't
|
|
heard one way or the other), but will definitely not work on some
|
|
platforms. For instance, on libc5-based Linux systems, the bundled
|
|
regex code works fine by default, but using libc5's regex code
|
|
causes core dumps.</p>
|
|
|
|
<p>Users of Cobalt Raq or Qube servers have complained of
|
|
segmentation faults in htdig. Apparently this is due to problems
|
|
in their C++ libraries, which are fixed in their experimental
|
|
compiler and libraries. The following commands should install
|
|
the packages you need:</p>
|
|
<blockquote>
|
|
rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/binutils-2.8.1-3C1.mips.rpm<br>
|
|
rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/egcs-1.0.2-9.mips.rpm<br>
|
|
rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/egcs-c++-1.0.2-9.mips.rpm<br>
|
|
rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/egcs-g77-1.0.2-9.mips.rpm<br>
|
|
rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/egcs-objc-1.0.2-9.mips.rpm<br>
|
|
rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/libstdc++-2.8.0-9.mips.rpm<br>
|
|
rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/libstdc++-devel-2.8.0-9.mips.rpm<br>
|
|
rpm -Uvh --force ftp://ftp.cobaltnet.com/pub/products/current/RPMS/gcc-2.7.2-C2.mips.rpm
|
|
</blockquote>
|
|
<p>If you have the libg++ package installed, you may have to remove it
before installing libstdc++, because these packages conflict.
|
|
Be sure to do a "make clean" before a "make", to remove any object
|
|
files compiled with the old compiler and headers.</p>
|
|
|
|
<p>For other causes of segmentation faults, or in other programs,
|
|
getting a stack backtrace after the fault can be useful in narrowing
|
|
down the problem. E.g.: try "gdb /path/to/htsearch /path/to/core",
|
|
then enter the command "bt". You can also try running the program
|
|
directly under the debugger, rather than attempting a post-mortem
|
|
analysis of the core dump. Options to the program can be given on
|
|
gdb's "run" command, and after the program is suspended on fault,
|
|
you can use the "bt" command. This may give you enough information
|
|
to find and fix the problem yourself, or at least it may help others
|
|
on the htdig mailing list to point out what to do next.</p>
|
|
|
|
<strong>5.15. <a name="q5.15">Why does htdig 3.1.3 mangle URL parameters
|
|
that contain bare "&" characters?</a></strong><br>
|
|
<p>This is a known bug in 3.1.3, and is fixed with this
|
|
<a href="ftp://ftp.ccsf.org/htdig-patches/3.1.3/HTML.cc.0">
|
|
patch</a>. You can apply the patch by changing to the main
|
|
source directory for htdig-3.1.3, and using the command
|
|
"patch -p0 < /path/to/HTML.cc.0". This is
|
|
also fixed as of version 3.1.4.</p>
|
|
|
|
<strong>5.16. <a name="q5.16">When I run htmerge, it stops with an
|
|
"Unable to open word list file '.../db.wordlist'" message.</a></strong><br>
|
|
<p>The most common cause of this error is that htdig did not
|
|
manage to index any documents, and so it did not create a word
|
|
list. You should repeat the htdig or rundig command with the
|
|
-vvv option to see where and why it is failing.
|
|
See question <a href="#q4.1">4.1</a>.</p>
|
|
|
|
<strong>5.17. <a name="q5.17">When using Netscape, htsearch always returns the
|
|
"No match" page.</a></strong><br>
|
|
<p>Check your search form. Chances are there is a hidden input
|
|
field with no value defined. For example, one user had<br>
|
|
<code><input type=hidden name=restrict></code>
|
|
|
|
in his search form, instead of<br>
|
|
|
|
<code><input type=hidden name=restrict value=""></code>
|
|
|
|
The problem is that Netscape sets the missing value to a default of " "
|
|
(two spaces), rather than an empty string. For the restrict parameter,
|
|
this is a problem, because htsearch won't likely find any URLs with two
|
|
spaces in them. Other input parameters may similarly pose a problem.
|
|
</p>
|
|
|
|
<p>Another possibility, if you're running 3.2.0b1 or 3.2.0b2, is
|
|
that you need to make the db.words.db_weakcmpr file writeable by
|
|
the user ID under which the web server runs. This is a bug, and
|
|
is fixed in the 3.2.0b5 beta.</p>
|
|
|
|
|
|
<strong>5.18. <a name="q5.18">Why doesn't htdig follow links to other
|
|
pages in JavaScript code?</a></strong><br>
|
|
<p>There probably isn't any indexing tool in existence
|
|
that follows JavaScript links, because they don't know how
|
|
to initiate JavaScript events. Realistically, it would take a
|
|
full JavaScript parser in order to be able to figure out all the
|
|
possible URLs that the code could generate, something that's way
|
|
beyond the means of any search engine. You have a few options:</p>
|
|
<ul>
|
|
<li>Add "backup" links using plain HTML <a href=...> tags to
|
|
all the pages that could be accessed through JavaScript,
|
|
<li>Add <link> tags to point to all these pages (see
|
|
<a href="http://www.w3.org/TR/html4/struct/links.html#h-12.3.3">Links
|
|
and search engines</a> in W3C's HTML 4.0 Specification - requires
|
|
htdig 3.1.3 or greater, but then <em>everyone</em> should be running
|
|
3.1.6 or greater anyway),
|
|
<li>Compose a list of all the unreachable documents, or write
|
|
a program to do so, and feed that list as part of htdig's
|
|
<a href="attrs.html#start_url">start_url</a> attribute.
|
|
See also question <a href="#q5.25">5.25</a>.
|
|
</ul>
|
|
|
|
<strong>5.19. <a name="q5.19">When I run htsearch from the web server,
|
|
it returns a bunch of binary data.</a></strong><br>
|
|
<p>Your server is returning the contents of the htsearch binary.
|
|
Common causes of this are:</p>
|
|
<ul>
|
|
<li>no execute permission on the htsearch binary,
|
|
<li>the binary won't run on this system (it may be compiled
|
|
for the wrong system type), or
|
|
<li>the web server doesn't recognize the file as a CGI
|
|
(for Apache, you must have a ScriptAlias directive for the
|
|
program or the directory in which it's installed, or define
|
|
a cgi-script handler for some suffix, e.g. .cgi, and add that
|
|
suffix to the program file name).
|
|
</ul>
|
|
<p>By default, Apache is usually configured with one cgi-bin
|
|
directory as ScriptAlias, so all your CGI programs must go in
|
|
there, or have a .cgi suffix on them. Your configuration may
|
|
differ, however.</p>
|
|
|
|
<strong>5.20. <a name="q5.20">Why are the betas of 3.2 so
|
|
slow at indexing?</a></strong><br>
|
|
<p>
|
|
As the release notes for these versions suggest, they are
|
|
somewhat unoptimized and are made available for testing.
|
|
Since the 3.2 code indexes all locations of words to support
|
|
phrase searching and other advanced methods, this additional
|
|
data slows down the indexer. To compensate, the code has a
|
|
cache configured by the
|
|
<a href="dev/htdig-3.2/attrs.html#wordlist_cache_size">wordlist_cache_size</a>
|
|
attribute.
|
|
As of this writing, the word database code will slow down
|
|
considerably when the cache fills up. Setting the cache as
|
|
large as possible provides considerable performance
|
|
improvement. Development is in progress to improve cache
|
|
performance.
|
|
For 3.2.0b6 and higher, see also the
|
|
<a href="dev/htdig-3.2/attrs.html#store_phrases">store_phrases</a> attribute,
|
|
which can turn off support for phrase searches, improving the speed.
|
|
</p>
|
|
|
|
<strong>5.21. <a name="q5.21">Why does htsearch use ";" instead of
|
|
"&" to separate URL parameters for the page buttons?</a></strong><br>
|
|
<p>In versions 3.1.5 and 3.2.0b2, and later, htsearch was
|
|
changed to use a semicolon character ";" as a parameter
|
|
separator for page button URLs, rather than "&", for HTML
|
|
4.0 compliance. It now allows both the "&" and the ";" as
|
|
separators for input parameters, because the CGI specification
|
|
still uses the "&". This change may cause some PHP or CGI
|
|
wrapper scripts to stop working, but these scripts should be
|
|
similarly changed to recognize both separator characters.
|
|
For the definitive reference on this issue, please refer to
|
|
section B.2.2 of W3C's HTML 4.0 Specification,
|
|
<a href="http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.2">
|
|
Ampersands in URI attribute values</a>. We're all a little
|
|
tired of arguing about it. If you don't like the standard, you
|
|
can change the Display::createURL() code yourself to ignore it.
|
|
<br>See also question <a href="#q4.13">4.13</a>.</p>
|
|
|
|
<p>If you want to try working within the new standard, you may
|
|
find it helpful to know that recent versions of CGI.pm will
|
|
allow either the ampersand or semicolon as a parameter separator,
|
|
which should fix any Perl scripts that use this library.
|
|
In PHP, you can simply set the following in your php.ini file
|
|
to allow either separator:</p>
|
|
<pre>arg_separator.input = ";&"
|
|
</pre>
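<p>For wrapper scripts written in shell, accepting both separators can
be as simple as translating each of them to a newline. The following is
a minimal sketch with a made-up function name. It does not decode
URL-encoded characters in the parameters.</p>

```shell
# Split a query string into one name=value pair per line,
# treating both ";" and "&" as parameter separators.
split_query() {
    printf '%s\n' "$1" | tr ';&' '\n\n'
}
```

<p>For example, <code>split_query "$QUERY_STRING"</code> prints each
parameter on its own line, whichever separator the URL used.</p>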
|
|
|
|
<strong>5.22. <a name="q5.22">Why does htsearch show the
|
|
"&" character as "&amp;" in search results?</a></strong><br>
|
|
<p>In version 3.1.5, htsearch was fixed to properly
|
|
re-encode the characters &, <, >, and "
|
|
into SGML entities. However, the default value for the
|
|
<a href="attrs.html#translate_amp">translate_amp</a>,
|
|
<a href="attrs.html#translate_lt_gt">translate_lt_gt</a>
|
|
and <a href="attrs.html#translate_quot">translate_quot</a>
|
|
attributes is still false, so these entities don't get converted
|
|
by htdig. If you set these three attributes to true in your
|
|
htdig.conf and reindex, the problem will go away.</p>
|
|
|
|
<p>In the 3.2 betas there was a bug in the HTML parser that
|
|
caused it to fail when attempting to translate the "&amp;"
|
|
entity. This has been fixed in 3.2.0b3. The translate_* attributes
|
|
are gone as of 3.2.0b2.</p>
|
|
|
|
<strong>5.23. <a name="q5.23">I get Internal Server or Unrecognized
|
|
character errors when running htsearch.</a></strong><br>
|
|
<p>An increasingly common problem is Apache configurations
|
|
which expect all CGI scripts to be Perl, rather than binary
|
|
executables or other scripts, so they use "perl-script"
rather than "cgi-script". The fix is to create a separate
directory for non-Perl CGI scripts, and define it as such in
your httpd.conf file. You should define it the same way as your
existing cgi-bin directory, but use "cgi-script" instead of
"perl-script". In any case, you should check your web server's
|
|
error log for any information related to htsearch's failure.
|
|
<br>See also questions <a href="#q5.7">5.7</a>,
|
|
<a href="#q5.14">5.14</a> and <a href="#q5.13">5.13</a>.</p>
|
|
|
|
<strong>5.24. <a name="q5.24">I took some settings out of
|
|
my htdig.conf but they're still set.</a></strong><br>
|
|
<p>All configuration file attributes have compiled-in, default
|
|
values. Taking an attribute out of the file is not the same
|
|
thing as setting it to an empty string, a 0, or a value of
|
|
false. See question <a href="#q4.18">4.18</a>.</p>
|
|
|
|
<strong>5.25. <a name="q5.25">When I run htdig on my site,
|
|
it misses entire directories.</a></strong><br>
|
|
<p>First of all, htdig doesn't look at directories itself. It
|
|
is a spider, and it follows hypertext links in HTML documents.
|
|
If htdig seems to be missing some documents or entire directory
|
|
sub-trees of your site, it is most likely because there are
|
|
no HTML links to these documents or directories. (See also
|
|
question <a href="#q5.18">5.18</a>.) If htdig does
|
|
not come across at least one hypertext link to a document
|
|
or directory, and it's not explicitly listed in the
|
|
<a href="attrs.html#start_url">start_url</a> attribute, then
|
|
this document or directory is essentially hidden from view
|
|
to htdig, or to any web browser or spider for that matter.
|
|
You can only get htdig to index directories, without providing
|
|
your own files with links to the contents of these directories,
|
|
by using your web server's automatic index generation feature.
|
|
In Apache, this is done with the mod_autoindex module, which
|
|
is usually compiled-in by default, and is enabled with the
|
|
"Indexes" option for a given directory hierarchy. For example,
|
|
you can put these directives in your Apache configuration:</p>
|
|
<pre>
|
|
<Directory "/path/to/your/document/root">
|
|
Options Indexes FollowSymLinks Includes ExecCGI
|
|
</Directory>
|
|
</pre>
|
|
<p>This will cause Apache to automatically generate an index
|
|
for any directory that does not have an index.html or other
|
|
"DirectoryIndex" file in it. Other web servers will have
|
|
similar features, which you should look for in your server
|
|
documentation.</p>
|
|
|
|
<p>As an alternative to relying on the web server's autoindex
|
|
feature, you can compose a list of all the unreachable
|
|
documents, or write a program to do so, and feed that list as
|
|
part of htdig's <a href="attrs.html#start_url">start_url</a>
|
|
attribute. Here is an example of a simple shell script to make
|
|
a file of URLs you can use with a configuration entry like
|
|
<code>start_url: `/path/to/your/file`</code>:</p>
|
|
<pre>
|
|
find /path/to/your/document/root -type f -name \*.html -print | \
|
|
sed -e 's|/path/to/your/document/root/|http://www.yourdomain.com/|' > \
|
|
/path/to/your/file
|
|
</pre>
|
|
<p>Other reasons why htdig might be missing portions of your
|
|
site might be that they fall outside the bounds specified
|
|
by the <a href="attrs.html#limit_urls_to">limit_urls_to</a>
|
|
attribute (which takes on the value of start_url by default),
|
|
they are explicitly excluded using the
|
|
<a href="attrs.html#exclude_urls">exclude_urls</a> attribute,
|
|
or they are disallowed by a robots.txt file (see the
|
|
<a href="htdig.html">htdig</a> documentation for notes about
|
|
robot exclusion) or by a robots meta tag (see question
|
|
<a href="#q4.15">4.15</a>). If htdig seems to be missing the
|
|
last part of a large directory or document, see question
|
|
<a href="#q5.1">5.1</a>. For reasons why htdig may be rejecting
|
|
some links to parts of your site, see question
|
|
<a href="#q5.27">5.27</a>.</p>
|
|
|
|
<strong>5.26. <a name="q5.26">What do all the numbers and symbols
|
|
in the htdig -v output mean?</a></strong><br>
|
|
<p>Output from htdig -v typically looks like this:</p>
|
|
<pre>
|
|
23000:35506:2:http://xxx.yyy.zz/index.html: ***-+****--++***+ size = 4056
|
|
</pre>
|
|
<p>The first number is the number of documents parsed so far,
|
|
the second is the DocID for this document, and the third is
|
|
the hop count of the document (number of hops from one of the
|
|
start_url documents). After the URL, it shows a "*" for a link
|
|
in the document that it already visited (or at least queued
|
|
for retrieval), a "+" for a new link it just queued, and a
|
|
"-" for a link it rejected for any of a number of reasons.
|
|
To find out what those reasons are, you need to run htdig
|
|
with at least 3 "v" options, i.e. -vvv. If there are no "*",
|
|
"+" or "-" symbols after the URL, it doesn't mean the document
|
|
was not parsed or was empty, but only that no links to other
|
|
documents were found within it.</p>
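<p>If you want to summarize such lines rather than read them by eye,
the fields can be picked apart in the shell. This sketch assumes the
line has exactly the layout shown in the sample above. The function
names are hypothetical.</p>

```shell
# Count how many times one character occurs in a string.
# The character is given as an octal escape that tr understands.
count_char() {
    printf '%s' "$1" | tr -cd "$2" | wc -c | tr -d ' '
}

# Break one line of "htdig -v" output into its fields and tally the
# link symbols that follow the URL.
summarize_dig_line() {
    line=$1
    parsed=${line%%:*}; rest=${line#*:}   # documents parsed so far
    docid=${rest%%:*};  rest=${rest#*:}   # DocID of this document
    hops=${rest%%:*};   rest=${rest#*:}   # hop count
    url=${rest%%: *}                      # URL ends at the first ": "
    symbols=${rest#*: }
    symbols=${symbols%% size = *}         # drop the trailing size field
    echo "parsed=$parsed docid=$docid hops=$hops url=$url"
    # \052 is "*", \053 is "+", \055 is "-"
    echo "seen=$(count_char "$symbols" '\052')" \
         "new=$(count_char "$symbols" '\053')" \
         "rejected=$(count_char "$symbols" '\055')"
}
```

<p>Run on the sample line above, this reports 10 links already seen,
4 newly queued and 3 rejected.</p>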
|
|
|
|
<strong>5.27. <a name="q5.27">Why is htdig rejecting some of the
|
|
links in my documents?</a></strong><br>
|
|
<p>When htdig parses documents and finds hypertext links to
|
|
other documents (hrefs), it may reject them for any of several
|
|
reasons. To find out what those reasons are, you need to run
|
|
htdig with at least 3 "v" options, i.e. -vvv. Here are the
|
|
meanings of some of the messages you might see at this verbosity
|
|
level.</p>
|
|
<dl>
|
|
<dt>Not an http or relative link!</dt>
|
|
<dd>In versions 3.1.5 and earlier, only "http://" URLs, or
|
|
URLs relative to those, are allowed.</dd>
|
|
<dt>Item in the exclude list: item # <em>n</em></dt>
|
|
<dd>A substring of the URL matches one of the items in the
|
|
<a href="attrs.html#exclude_urls">exclude_urls</a>
|
|
attribute. The given item number will indicate which
|
|
pattern matched, starting at 1. The 3.2.0 betas do not
|
|
give the item number.</dd>
|
|
<dt>Extension is invalid!</dt>
|
|
<dd>The file name extension or suffix matches one of those
|
|
listed in the
|
|
<a href="attrs.html#bad_extensions">bad_extensions</a>
|
|
attribute.</dd>
|
|
<dt>Extension is not valid!</dt>
|
|
<dd>The file name extension or suffix does not match one of those
|
|
listed in the
|
|
<a href="attrs.html#valid_extensions">valid_extensions</a>
|
|
attribute, if any are specified.</dd>
|
|
<dt>Invalid Querystring! <em>or</em><br>item in bad query list</dt>
|
|
<dd>The URL contains a query string which matches one of those
|
|
listed in the
|
|
<a href="attrs.html#bad_querystr">bad_querystr</a>
|
|
attribute.</dd>
|
|
<dt>URL not in the limits!</dt>
|
|
<dd>No substring of the URL entirely matches one of the items in the
|
|
<a href="attrs.html#limit_urls_to">limit_urls_to</a>
|
|
attribute. The purpose of this attribute is to keep htdig
|
|
from attempting to index the entire World Wide Web.</dd>
|
|
<dt>forbidden by server robots.txt!</dt>
|
|
<dd>A substring of the URL matches one of the items disallowed
|
|
in the server's robots.txt file. See
|
|
<a href="http://www.robotstxt.org/wc/norobots.html">
|
|
A Standard for Robot Exclusion</a>. This message exists
|
|
only in the 3.2.0 betas. In 3.1.5 and earlier, this condition
|
|
is only caught later, resulting in the message
|
|
"robots.txt: discarding '<em>URL</em>'" from htdig, and a
|
|
later "Deleted: no excerpt" message from htmerge.</dd>
|
|
<dt>url rejected: (level 2)</dt>
|
|
<dd>No substring of the URL entirely matches one of the items in the
|
|
<a href="attrs.html#limit_normalized">limit_normalized</a>
|
|
attribute. All the other rejections above will be indicated
|
|
as level 1. The 3.2.0 betas give the much more meaningful
|
|
message 'not in "limit_normalized" list!'</dd>
|
|
</dl>
|
|
|
|
<p>Another possibility, if none of the error messages above appear
|
|
for some of the links you think htdig should be accepting, is that
|
|
htdig isn't even finding the links at all. First, make sure you're
|
|
not making false assumptions about how htdig finds these. It only
|
|
reads links in HTML code, and not JavaScript, and it doesn't read
|
|
directories unless the HTTP server is feeding it directory listings.
|
|
You will need to take a close look at the htdig -vvv (or -vvvv)
|
|
output to see what htdig is finding, in and around the areas where
|
|
the desired links are supposed to be found in your HTML code, to see
|
|
if it's actually finding them.
|
|
See also question <a href="#q5.25">5.25</a>.</p>
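<p>With a large site, the -vvv output gets long, so it can help to tally
how often each rejection message occurs. The following sketch assumes
you have saved htdig's verbose output to a file and that each message
appears literally, as listed above. The function name and the file
layout are assumptions.</p>

```shell
# Count occurrences of each known rejection message in a saved
# "htdig -vvv" log, printing only the messages that actually occur.
tally_rejections() {
    log=$1
    for msg in \
        'Not an http or relative link!' \
        'Item in the exclude list' \
        'Extension is invalid!' \
        'Extension is not valid!' \
        'Invalid Querystring!' \
        'item in bad query list' \
        'URL not in the limits!' \
        'forbidden by server robots.txt!' \
        'url rejected: (level 2)'
    do
        # grep -c prints 0 and exits non-zero when nothing matches.
        n=$(grep -c -F -- "$msg" "$log" || true)
        if [ "$n" -gt 0 ]; then
            printf '%d\t%s\n' "$n" "$msg"
        fi
    done
    return 0
}
```

<p>For example, after <code>htdig -vvv &gt; dig.log 2&gt;&amp;1</code>, running
<code>tally_rejections dig.log</code> shows which rules are rejecting
the most links.</p>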
|
|
|
|
<strong>5.28. <a name="q5.28">When I run htdig or htmerge, I get a
|
|
"DB2 problem...: missing or empty key value specified" message.</a></strong><br>
|
|
<p>The most common cause of this error is that htdig or
|
|
htmerge rejected any documents that had been put in the
|
|
database, leaving an empty database. You need to find out the
|
|
reasons for the rejection of these documents. See questions
|
|
<a href="#q4.1">4.1</a>, <a href="#q5.25">5.25</a> and
|
|
<a href="#q5.27">5.27</a>.</p>
|
|
|
|
<strong>5.29. <a name="q5.29">When I run htdig on my site,
|
|
it seems to go on and on without ending.</a></strong><br>
|
|
<p>There are some things that can cause htdig to run on without
|
|
ending, especially when indexing dynamic content (ASP, PHP,
|
|
SSI or CGI pages). This usually involves htdig getting caught
|
|
in an <em>infinite virtual hierarchy</em>. A sure sign of
|
|
this is if the current size of your database is much larger
|
|
than the total size of the site you are indexing, or if in the
|
|
verbose output of htdig (see question <a href="#q4.1">4.1</a>)
|
|
you see the same URLs come up again and again with only subtle
|
|
variations. In any case, you must figure out the reason htdig
|
|
keeps revisiting the same documents using different URLs, as
|
|
explained in question <a href="#q4.24">4.24</a>, and set your
|
|
<a href="attrs.html#exclude_urls">exclude_urls</a> and
|
|
<a href="attrs.html#bad_querystr">bad_querystr</a> attributes
|
|
appropriately to stop htdig from going down those paths.
|
|
</p>
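<p>One quick way to spot URLs that differ only in their query strings
is to strip the query strings and count what is left. This is a sketch
only: it assumes you have already extracted a plain list of URLs, one
per line, from htdig's verbose output, and it will not catch loops that
repeat path components instead. The function name is made up.</p>

```shell
# Read URLs on standard input, drop everything from "?" onward, and
# print "count url" for each base URL seen more than once, most
# frequent first.
repeated_base_urls() {
    sed -e 's/[?].*$//' | sort | uniq -c | sort -rn |
        awk '$1 > 1 { print $1, $2 }'
}
```

<p>Base URLs with high counts are good candidates for a pattern in
bad_querystr or exclude_urls.</p>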
|
|
|
|
<strong>5.30. <a name="q5.30">Why does htsearch no longer recognize
|
|
the -c option when run from the web server?</a></strong><br>
|
|
<p>This was a security hole in 3.1.5 and older, and 3.2.0b3 and
|
|
older releases of ht://Dig. (See question <a href="#q2.1">2.1</a>.)
|
|
There's a compile-time macro you can set in htsearch.cc to disable
|
|
this security fix, but that's a bad idea because it reopens the hole.
|
|
This should only be done as a last resort, when all other avenues
|
|
fail. The -c option was only intended for testing htsearch from the
|
|
command line, and not for use when calling htsearch on the web server.
|
|
Unfortunately, far too many users have needlessly latched onto this
|
|
option for CGI scripts. The preferred ways of specifying the config
|
|
file are as follows, in order of preference:</p>
|
|
<ol>
|
|
<li>use the "config" input parameter in your
|
|
<a href="hts_form.html">search form</a>
|
|
(see question <a href="#q4.2">4.2</a>).
|
|
<li>if you need to get at files outside the default CONFIG_DIR, use a
|
|
wrapper script that redefines the CONFIG_DIR environment variable,
|
|
then use the config input parameter as above
|
|
(see question <a href="#q4.20">4.20</a>).
|
|
<li>use a wrapper script to force htsearch to use a specific config
|
|
file using the -c option. This is especially for cases where you
|
|
want to prevent the user from selecting other config files in your
|
|
CONFIG_DIR using the config input parameter. This should
|
|
be done by using the GET method to call the wrapper script, and in
|
|
this script you must unset the REQUEST_METHOD environment variable
|
|
and pass "$QUERY_STRING" as a single argument to htsearch.
|
|
(This safely gets around htsearch's test which disables -c.)
|
|
<li>configure and compile different htsearch binaries with different
|
|
compile-time definitions of CONFIG_DIR, so you can avoid wrapper
|
|
scripts altogether.
|
|
<li>define ALLOW_INSECURE_CGI_CONFIG in htsearch.cc and recompile
|
|
htsearch if all other approaches above fail for you.
|
|
</ol>
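<p>A wrapper script along the lines of the third option might look like
the following sketch. Both paths are placeholders, and the variable and
function names are made up for illustration. Adapt them to your own
installation.</p>

```shell
#!/bin/sh
# Hypothetical locations; adjust both for your installation.
HTSEARCH=${HTSEARCH:-/path/to/cgi-bin/htsearch}
HTDIG_CONF=${HTDIG_CONF:-/path/to/conf/htdig.conf}

run_htsearch() {
    # With REQUEST_METHOD unset, htsearch no longer sees a CGI request,
    # so its test that disables -c does not apply.
    unset REQUEST_METHOD
    # The quotes keep the whole query string as a single argument.
    "$HTSEARCH" -c "$HTDIG_CONF" "$QUERY_STRING"
}

# In the real wrapper, end the script with a call to run_htsearch.
```

<p>The search form must call the wrapper with the GET method, so that
the web server puts the parameters in QUERY_STRING.</p>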
|
|
|
|
<strong>5.31. <a name="q5.31">I've set a config attribute exactly
|
|
as documented but it seems to have no effect.</a></strong><br>
|
|
<p>There are a few fairly common reasons why this might happen:</p>
|
|
<ol>
|
|
<li>You may have a typo. Spelling matters, so make sure the attribute
|
|
name is spelled exactly as it is in the
|
|
<a href="attrs.html">documentation</a>. Misspelled attribute
|
|
definitions are silently ignored. This is because you're allowed
|
|
to make up your own attribute definitions for use by other attribute
|
|
definitions, as <strong>${myownattribute}</strong>. Also remember
|
|
to put the colon ("<strong>:</strong>") separator between the
|
|
attribute name and value in your definition.
|
|
<li>The attribute isn't supported in your version of the software.
|
|
The <a href="attrs.html">documented configuration attributes</a>
|
|
on the www.htdig.org web site are for the most recent
|
|
<strong>stable</strong> release. See questions
|
|
<a href="#q2.1">2.1</a> and <a href="#q2.7">2.7</a> for details.
|
|
If you're running an older version, or even a more recent beta
|
|
release, you may not have the same set of attributes to work with.
|
|
Consult the appropriate documentation, or upgrade to the current
|
|
release.
|
|
<li>You're not modifying the right configuration file. The default
|
|
configuration file is specified when you first configure ht://Dig
|
|
before compiling, but other configuration files can be specified
|
|
at run time, using the -c command-line option for most programs,
|
|
or the <strong>config</strong> input parameter for htsearch
|
|
(see question <a href="#q4.2">4.2</a>).
|
|
<li>You've got more than one definition of the attribute. Only the
|
|
last occurrence of an attribute in the configuration file is the
|
|
definition that's used for that attribute, overriding earlier
|
|
definitions. This also applies for nested configuration files that
|
|
are loaded in via the <a href="attrs.html#include">include</a>
|
|
directive, so check for other definitions in all included files.
|
|
Similarly for htsearch, look out for multiple definitions of input
|
|
parameters in your search forms, as mentioned in question
|
|
<a href="#q4.2">4.2</a> - these don't override each other but they
|
|
get combined with a Ctrl-A as separator, which may not be what you
|
|
want either.
|
|
<li>Your attribute definition is being "swallowed up" by an
|
|
incomplete multi-line definition above it. Remember that when a line
|
|
of an attribute definition ends with a single backslash
|
|
("<strong>\</strong>") before the end of the line (without any
|
|
space after the backslash), then the following line is appended to
|
|
it as a continuation of the same attribute definition. For an
|
|
attribute definition that spans several lines, all lines but the
|
|
last must end with a backslash. If you want a backslash to go into
|
|
the attribute definition literally, it must be doubled-up, as
|
|
<strong>\\</strong>.
|
|
<li>On a similar note, make sure your attribute definitions are all
|
|
terminated by a newline character. Beware of text editors that do
|
|
word wrapping. It may look like two separate lines on the screen,
|
|
when in fact you've got two attribute definitions on the same long
|
|
line, so the second is swallowed up as part of the first.
|
|
<li>Your attribute definition is being overridden by an htsearch
|
|
<a href="hts_form.html">CGI input parameter</a>. For example,
|
|
<a href="attrs.html#template_name">template_name</a> is ignored
|
|
if the <strong>format</strong> input parameter is defined. The
|
|
<a href="attrs.html#allow_in_form">allow_in_form</a> attribute
|
|
can define any number of new CGI input parameters that override
|
|
the attributes of the same name in your config file.
|
|
<li>Your attribute definition is being ignored or overridden
|
|
by a related attribute. Watch out for unexpected interactions
|
|
between different attributes. For instance, characters in
|
|
<a href="attrs.html#valid_punctuation">valid_punctuation</a>
|
|
are stripped out of words, so those characters may
|
|
not have the effect you want if you've added them to
|
|
<a href="attrs.html#extra_word_characters">extra_word_characters</a>
|
|
or
|
|
<a href="attrs.html#prefix_match_character">prefix_match_character</a>.
|
|
Also,
|
|
<a href="attrs.html#search_results_wrapper">search_results_wrapper</a>
|
|
will override
|
|
<a href="attrs.html#search_results_header">search_results_header</a>
|
|
and
|
|
<a href="attrs.html#search_results_footer">search_results_footer</a>,
|
|
but only if you've set up the wrapper file correctly.
|
|
<li>Watch out for possible "latent effects" of some attributes. For
|
|
example, when you change attributes used by htdig, they won't have
|
|
an immediate effect on entries already in the database, so you would
|
|
have to reindex your site before they take effect. Similarly,
|
|
attributes that affect how htfuzzy builds some of its databases
|
|
don't take effect until those databases are rebuilt. Another, more
|
|
subtle latent effect occurs with releases 3.1.6 and 3.2 betas:
|
|
when you interrupt htdig (e.g. with Control-C or a kill command),
|
|
it stores the list of currently queued URLs in db.log, in your
|
|
database directory, so that the next time you invoke htdig it can
|
|
resume the interrupted dig. A side-effect of this file is that if
|
|
you change some attributes like limit_urls_to or exclude_urls before
|
|
restarting, the URLs in the file are still taken as-is, having been
|
|
checked against the old settings of limit_urls_to or exclude_urls
|
|
before being queued. This might explain one reason htdig seems to
|
|
ignore your new settings of these.
|
|
</ol>
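<p>As a rough illustration of the "swallowed up" problem described
above, here is a hypothetical configuration fragment. The attribute
values are made up for the example, and the two variants are
alternatives, not meant to appear together in one file:</p>
<pre>
# Variant 1 (intended): exclude_urls spans two lines, and
# max_doc_size is a separate definition.
exclude_urls:   /cgi-bin/ .cgi \
                /private/
max_doc_size:   5000000

# Variant 2 (broken): the stray backslash at the end of the
# exclude_urls line makes the next line a continuation of it,
# so max_doc_size is never defined.
exclude_urls:   /cgi-bin/ .cgi \
max_doc_size:   5000000
</pre>
<p>If an attribute seems to be ignored, it is worth checking the last
few characters of the definitions just above it for this kind of
trailing backslash.</p>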
|
|
|
|
<strong>5.32. <a name="q5.32">When I run htsearch, it gives a page
|
|
with an "Unable to read configuration file" message.</a></strong><br>
|
|
<p>The most common causes of this error are:</p>
|
|
<ul>
|
|
<li>Your configuration file name is misspelled in the "config"
|
|
input parameter of your search form, or you have two definitions
|
|
of this parameter (see question <a href="#q4.2">4.2</a>).
|
|
<li>You didn't install your configuration file in the directory
|
|
defined by the CONFIG_DIR compile-time Makefile variable
|
|
(see also question <a href="#q4.20">4.20</a>). This is where
|
|
htsearch will look for the configuration file specified by the
|
|
"config" input parameter.
|
|
<li>The configuration file is not readable by the user ID under
|
|
which your web server, and thus htsearch, runs. Similarly,
|
|
if the directories from CONFIG_DIR up to the root directory
|
|
are not executable by this same user ID, htsearch won't be
|
|
able to access the configuration files.
|
|
</ul>
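<p>One way to check the third point is to list the permissions of the
configuration file and of every directory above it. The following is
only a sketch: the function name show_perms and the path used in the
example are hypothetical, so substitute your own CONFIG_DIR and
configuration file name.</p>
<pre>
# Sketch: list the permissions of a file and of each directory
# above it, up to the root directory.
show_perms() {
    p=$1
    while :; do
        ls -ld "$p"
        case $p in
            /|.) break ;;   # stop at the root (or at "." for relative paths)
        esac
        p=$(dirname "$p")
    done
}

# Example (hypothetical path):
#   show_perms /etc/htdig/htdig.conf
</pre>
<p>The file needs read permission, and each directory needs execute
permission, for the user ID under which your web server runs.</p>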
|
|
|
|
<strong>5.33. <a name="q5.33">How can I find out which version
|
|
of ht://Dig I have installed?</a></strong><br>
|
|
<p>You should always check which version of ht://Dig you're
|
|
running, before you report any problems, or even if you
|
|
suspect a problem. You can find out the version number of an
|
|
installed ht://Dig package by running the command:</p>
|
|
<blockquote>
|
|
<code>htdig -\? | head</code>
|
|
</blockquote>
|
|
<p>(or use "more" if you don't have a "head" command). The
|
|
full version number appears on the third line of output,
|
|
after "This program is part of ht://Dig", and it should also
|
|
include the snapshot date if you're running a pre-release
|
|
snapshot. Always include this full version number with any
|
|
bug report or problem report on a mailing list. You can save
|
|
yourself and others a lot of grief by being certain of which
|
|
version you're running, especially if you've installed more than
|
|
one. If you're running ht://Dig from an RPM package, you should
|
|
also report the package version and release number, which you
|
|
can determine with the command "<code>rpm -q htdig</code>",
|
|
and mention where you obtained the package. This will alert
|
|
us to the idiosyncrasies and/or patches in a particular RPM
|
|
package. Also, if you've applied any patches yourself (see
|
|
question <a href="#q2.5">2.5</a>) please mention which ones.
|
|
See also question <a href="#q1.8">1.8</a>, on reporting bugs
|
|
or configuration problems.</p>
|
|
|
|
<strong>5.34. <a name="q5.34">When running htdig, I get "Error (0):
|
|
PDF file is damaged - attempting to reconstruct xref table..."</a></strong><br>
|
|
<p>This message comes from the pdftotext utility, when a PDF file
|
|
has been truncated. Find the largest PDF file on the site you're
|
|
indexing, and set max_doc_size to at least that size (see question
|
|
<a href="#q5.2">5.2</a>). If you need to track down which PDF is
|
|
causing the error, try running "htdig -i -v > log.txt 2>&1" so you
|
|
can see which URL is being indexed when the error occurs. The output
|
|
redirects in that command combine stdout (where htdig's output goes)
|
|
and stderr (where pdftotext's error messages go) into one output
|
|
stream. If you're using acroread to index PDF files, the error
|
|
message for a truncated PDF file is simply "Could not repair file."
|
|
It's also possible to get errors like this from PDF files that are
|
|
smaller than max_doc_size, if they're already truncated or corrupted
|
|
on the server.</p>
|
|
|
|
<strong>5.35. <a name="q5.35">When running htdig on Mandrake Linux,
|
|
I get "host not found" and "no server running" errors.</a></strong><br>
|
|
<p>The default htdig.conf configuration in Mandrake's RPM package
|
|
of htdig very stupidly enables the
|
|
<a href="attrs.html#local_urls_only">local_urls_only</a> attribute
|
|
by default, which means you can only index a limited set of files
|
|
on the local server. Anything else, where htdig would normally fall
|
|
back to using HTTP, will fail. To make matters worse, they put a very
|
|
misleading comment above that attribute setting, which throws users
|
|
off track. This attribute is useful in certain circumstances where
|
|
you never want htdig to fall back to HTTP, but enabling it by default
|
|
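was a very bad judgement call on Mandrake's part. To fix this, set
local_urls_only to false in htdig.conf (or remove that setting) and
run htdig again.</p>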
|
|
|
|
<strong>5.36. <a name="q5.36">When I run htsearch, it gives me the
|
|
list of matching documents, but no header or footer.</a></strong><br>
|
|
<p>The header and footer typically contain the followup search
|
|
form, an indication of the total number of matches, and buttons
|
|
to other pages of matches if the results don't fit on one
|
|
page. If these don't show up, it could be that in attempting
|
|
to customize these (see question <a href="#q4.2">4.2</a>),
|
|
you removed them or rendered them unusable. Even if you didn't
|
|
customize them, make sure you installed the
|
|
<a href="attrs.html#search_results_header">search_results_header</a>
|
|
and
|
|
<a href="attrs.html#search_results_footer">search_results_footer</a>
|
|
files (or the
|
|
<a href="attrs.html#search_results_wrapper">search_results_wrapper</a>
|
|
file) in the correct location (where you told ht://Dig they'd be
|
|
when you configured prior to compiling). Also make sure they
|
|
have read permission for the user ID under which htsearch runs,
|
|
and all directories leading up to these template files are
|
|
searchable (i.e. executable) by htsearch, or it won't be able
|
|
to open the files.</p>
|
|
|
|
<p>This is the opposite of the problem described in question
|
|
<a href="#q5.11">5.11</a>. If htsearch displays nothing at
|
|
all, you may have both problems or you may have no matches or
|
|
a boolean query syntax error and the
|
|
<a href="attrs.html#nothing_found_file">nothing_found_file</a>
|
|
or <a href="attrs.html#syntax_error_file">syntax_error_file</a>
|
|
is missing or unreadable.</p>
|
|
|
|
<strong>5.37. <a name="q5.37">When I index files with doc2html.pl,
|
|
it fails with the "UNABLE to convert" error.</a></strong><br>
|
|
<p>This is an indication that doc2html.pl wasn't configured
|
|
properly. Carefully follow all the directions for installation
|
|
in the DETAILS file that comes with the script. In addition to
|
|
installing doc2html.pl, you must:</p>
|
|
<ul>
|
|
<li>Install xpdf and check that pdftotext and pdfinfo work from
|
|
the command line,
|
|
<li>Configure pdf2html.pl to use pdftotext and pdfinfo and check
|
|
that it works from the command line,
|
|
<li>Configure doc2html.pl to use pdf2html.pl and check that it
|
|
works from the command line:
|
|
<pre>doc2html.pl /full/path/to/sample/filename.pdf "application/pdf" url</pre>
|
|
</ul>
|
|
<p>You should repeat a similar set of steps to configure and test
|
|
doc2html.pl for other document types, such as Word, RTF and
Excel. See also questions <a href="#q4.8">4.8</a>,
|
|
<a href="#q4.9">4.9</a> and <a href="#q5.39">5.39</a>.</p>
|
|
|
|
<strong>5.38. <a name="q5.38">Why do my searches find search terms
|
|
in pathnames, or how do I prevent matching filenames?</a></strong><br>
|
|
<p>htdig doesn't normally add the URL components to the index
|
|
itself, but when you index a directory where the filenames are
|
|
used as link description text (such as an automatic DirectoryIndex
|
|
created by Apache's mod_autoindex) then these link descriptions
|
|
get indexed, carrying the weight assigned to them by the
|
|
<a href="attrs.html#description_factor">description_factor</a>
|
|
attribute. Thus, a search for a filename will match this link
|
|
description, and the file will show up in search results.
|
|
To avoid that, make sure your DirectoryIndexes don't get indexed
|
|
as detailed in question <a href="#q4.23">4.23</a>.</p>
|
|
|
|
<p>Conversely, there is no way to force htdig to index URL
|
|
components so that a search for a file name will yield a match
|
|
on that file, unless you index an HTML file (or several) containing
|
|
links to all the files you want, where the link description text
|
|
does contain the full URL or the pathname components you want.</p>
|
|
|
|
<strong>5.39. <a name="q5.39">I set up an external parser but I still
|
|
can't index Word/Excel/PowerPoint/PDF documents.</a></strong><br>
|
|
<p>You probably need to carefully re-read and follow questions
|
|
<a href="#q4.8">4.8</a>, <a href="#q4.9">4.9</a>,
|
|
<a href="#q5.25">5.25</a> and <a href="#q5.27">5.27</a>.
|
|
When you can't index documents with an external parser or converter,
|
|
there are three main issues, or points of failure, that you need
|
|
to resolve. You need to figure out at which of the three stages the
|
|
process is failing, and focus on that stage to get to the bottom of
|
|
why it's not working at that stage. You need to run htdig with
|
|
anywhere from 1 to 4 -v options, to get the debugging output you
|
|
need to see where it's failing and why. This may be an iterative
|
|
process, if htdig is failing at more than one stage: you might fix
|
|
one problem only to run into another.</p>
|
|
|
|
<ol>
|
|
<li>Is htdig actually finding links to the PDF, Word, etc. documents
|
|
you want to index? Make sure you're not making false assumptions
|
|
about how htdig finds these (questions <a href="#q5.25">5.25</a>
|
|
and <a href="#q5.18">5.18</a>), and then find out how htdig is
|
|
looking at the links in your HTML files to see if it's ignoring
|
|
or rejecting links to your externally parsed documents (questions
|
|
<a href="#q4.1">4.1</a> and <a href="#q5.27">5.27</a>).<br><br>
|
|
<li>If it is finding and accepting the links to these documents, is
|
|
it correctly fetching them and passing them on to the appropriate
|
|
external converter to be able to index them? Look at htdig -vvv
|
|
output, around the time it tries to fetch one of these, and see
|
|
what it does next. Does the file size look right? Are there any
|
|
error messages around there? If the external converter isn't even
|
|
being called, take a close look at your
|
|
<a href="attrs.html#external_parsers">external_parsers</a>
|
|
attribute setting to make sure it's correct (see question
|
|
<a href="#q5.31">5.31</a>).<br><br>
|
|
<li>If it is attempting to convert them, is the external converter
|
|
doing what it should, to feed some indexable text back into htdig's
|
|
parser? You can also try htdig -vvvv (4 -v options) to see if it's
|
|
actually parsing individual words from any of these. If this is
|
|
too much output to wade through, try setting
|
|
<a href="attrs.html#start_url">start_url</a> to the URL
|
|
of a single document that you want to test, so you can look in
|
|
detail at what htdig does with it. You can also try running the
|
|
external converter manually on one of these documents to see
|
|
what it spits out. See question <a href="#q5.37">5.37</a>.
|
|
Make sure your documents actually contain indexable text. Some
|
|
PDFs are nothing but scanned images of pages, so it looks like
|
|
text but it's just images with no computer-readable text.
|
|
</ol>
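<p>To narrow things down along the lines of the steps above, a test
configuration might look something like the following sketch. The
script path and the URL are hypothetical examples, so adjust them to
your own installation:</p>
<pre>
# Hypothetical test settings: one converter script, one test document.
external_parsers: application/pdf-&gt;text/html /usr/local/bin/doc2html.pl \
                  application/msword-&gt;text/html /usr/local/bin/doc2html.pl
start_url:        http://www.example.com/docs/sample.pdf
</pre>
<p>You could then run "htdig -i -vvv &gt; log.txt 2&gt;&amp;1" and
read through log.txt to see whether the document is fetched, whether
the converter is called, and whether any words come back from it.</p>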
|
|
|
|
<br>
|
|
|
|
<hr noshade size=4>
|
|
Last modified: $Date: 2004/05/28 13:15:16 $
|
|
<br>
|
|
<a href="http://sourceforge.net/">
|
|
<img src="http://sourceforge.net/sflogo.php?group_id=4593&type=1" width="88" height="31" border="0" alt="SourceForge Logo"></a>
|
|
</body>
|
|
</html>
|