<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<title>
ht://Dig: Features and System requirements
</title>
</head>
<body bgcolor="#eef7ff">
<h1>
Features and System requirements
</h1>
<p>
ht://Dig Copyright © 1995-2004 <a href="THANKS.html">The ht://Dig Group</a><br>
Please see the file <a href="COPYING">COPYING</a> for
license information.
</p>
<hr noshade>
<h2>
Features
</h2>
<p>
Here are some of the major features of ht://Dig. They are in
no particular order.
</p>
<blockquote>
<dl>
<dt>
<strong><img src="bdot.gif" width=9 height=9 alt="*">
Intranet searching</strong>
</dt>
<dd>
ht://Dig can search through many servers
on a network by acting as a WWW browser.
</dd>
<dt>
<strong><img src="bdot.gif" width=9 height=9 alt="*">
It is free</strong>
</dt>
<dd>
The whole system is released under the
<a href="COPYING">GNU Library General Public License (LGPL)</a>.
</dd>
<dt>
<strong><img src="bdot.gif" width=9 height=9 alt="*">
Robot exclusion is supported</strong>
</dt>
<dd>
The <a href="http://www.robotstxt.org/wc/norobots.html">
Standard for Robot Exclusion</a> is
<a href="meta.html#robots">supported by ht://Dig.</a>
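<p>
Besides a site-wide <code>robots.txt</code> file, an individual page
can typically opt out with a robots meta tag. The fragment below is only
an illustrative sketch; see <a href="meta.html#robots">the meta tag
documentation</a> for the values ht://Dig actually honors.
</p>
<pre>
&lt;!-- Ask robots not to index this page or follow its links --&gt;
&lt;meta name="robots" content="noindex, nofollow"&gt;
</pre>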
</dd>
<dt>
<strong><img src="bdot.gif" width=9 height=9 alt="*">
Boolean expression searching</strong>
</dt>
<dd>
Searches can be arbitrarily complex using boolean
expressions.
</dd>
<dt>
<strong><img src="bdot.gif" width=9 height=9 alt="*">
Phrase searching</strong>
</dt>
<dd>
A phrase can be searched for by enclosing it in quotes.
Phrase searches can be combined with word searches, as in
<code>Linux and "high quality"</code>.
</dd>
<dt>
<strong><img src="bdot.gif" width=9 height=9 alt="*">
Configurable search results</strong>
</dt>
<dd>
The output of a search can easily be tailored to your
needs by providing HTML templates.
</dd>
<dt>
<strong><img src="bdot.gif" width=9 height=9 alt="*">
Fuzzy searching</strong>
</dt>
<dd>
Searches can be performed using various
<a href="attrs.html#search_algorithm">configurable algorithms</a>.
Currently the following algorithms are
supported (in any combination):
<ul>
<li>
exact
</li>
<li>
soundex
</li>
<li>
metaphone
</li>
<li>
common word endings
</li>
<li>
synonyms
</li>
<li>
accent stripping
</li>
<li>
substring and prefix
</li>
<li>
regular expressions
</li>
<li>
simple spelling corrections
</li>
</ul>
</dd>
<dt>
<strong><img src="bdot.gif" width=9 height=9 alt="*">
Searching of many file formats</strong>
</dt>
<dd>
Both HTML documents and plain text files can be
searched directly by ht://Dig itself. There is also a
<a href="attrs.html#external_parsers">mechanism
to allow external programs ("external parsers")</a> to be used
while building the database so that arbitrary file formats
can be searched. <br>
</dd>
<dt>
<strong><img src="bdot.gif" width=9 height=9 alt="*">
Document retrieval using many transport services</strong>
</dt>
<dd>
Several transport services can be handled by ht://Dig,
including http://, ftp:// and file:///.
There is also a
<a href="attrs.html#external_protocols">mechanism
to allow external programs ("external protocols")</a> to be used
while building the database so that arbitrary transport
services can be used. <br>
</dd>
<dt>
<strong><img src="bdot.gif" width=9 height=9 alt="*">
Keywords can be added to HTML documents</strong>
</dt>
<dd>
Any number of <a href="meta.html">keywords</a>
can be added to HTML documents
which will not show up when the document is viewed.
This is used to make a document more likely to be found
and also to make it appear higher in the list of
matches.
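<p>
As a minimal sketch, keywords might be added in the document's
<code>&lt;head&gt;</code> with a standard <code>keywords</code> meta tag.
The title and keyword values here are purely illustrative; see
<a href="meta.html">the meta tag documentation</a> for the tag names
ht://Dig recognizes.
</p>
<pre>
&lt;head&gt;
&lt;title&gt;Campus library hours&lt;/title&gt;
&lt;!-- Illustrative keywords only; browsers do not display them --&gt;
&lt;meta name="keywords" content="library, opening hours, circulation desk"&gt;
&lt;/head&gt;
</pre>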
</dd>
<dt>
<strong><img src="bdot.gif" width=9 height=9 alt="*">
Email notification of expired documents</strong>
</dt>
<dd>
Special meta information can be added to HTML documents
which can be used to
<a href="notification.html">notify the maintainer</a> of those
documents at a certain time. It is handy to get
reminded when to remove the "New" images from a certain
page, for example.
</dd>
<dt>
<strong><img src="bdot.gif" width=9 height=9 alt="*">
A protected server can be indexed</strong>
</dt>
<dd>
ht://Dig can be told to use a specific
<a href="attrs.html#authorization">username and password</a>
when it retrieves documents. This can be used
to index a server or parts of a server that are
protected by a username and password.
</dd>
<dt>
<strong><img src="bdot.gif" width=9 height=9 alt="*">
Searches on subsections of the database</strong>
</dt>
<dd>
It is easy to set up a search which only returns
documents whose
<a href="hts_form.html#restrict">URL matches a certain pattern.</a>
This becomes very useful for people who want to make their
own data searchable without having to use a separate
search engine or database.
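<p>
For example, a search form limited to one part of a site could pass the
pattern as a hidden <code>restrict</code> field. This is only a sketch:
the <code>/cgi-bin/htsearch</code> location and the
<code>www.example.com/~jsmith/</code> pattern are assumptions that depend
on your installation.
</p>
<pre>
&lt;form method="get" action="/cgi-bin/htsearch"&gt;
&lt;!-- Only return documents whose URL matches this pattern --&gt;
&lt;input type="hidden" name="restrict" value="http://www.example.com/~jsmith/"&gt;
&lt;input type="text" name="words"&gt;
&lt;input type="submit" value="Search"&gt;
&lt;/form&gt;
</pre>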
</dd>
<dt>
<strong><img src="bdot.gif" width=9 height=9 alt="*">
Full source code included</strong>
</dt>
<dd>
The search engine comes with full source code. The
whole system is released under the terms and conditions
of the <a href="COPYING">GNU Library General Public License (LGPL) version
2.0</a>.
</dd>
<dt>
<strong><img src="bdot.gif" width=9 height=9 alt="*">
The depth of the search can be limited</strong>
</dt>
<dd>
Instead of limiting the search to a set of machines, it
can also be restricted to documents that are a certain
number of <a href="attrs.html#max_hop_count">"mouse-clicks"</a>
away from the start document.
</dd>
<dt>
<strong><img src="bdot.gif" width=9 height=9 alt="*">
Full support for the ISO-Latin-1 character set</strong>
</dt>
<dd>
Both SGML entities like '&amp;agrave;' and ISO-Latin-1
characters can be indexed and searched.
</dd>
</dl>
</blockquote>
<hr size="4" noshade>
<h1>
Requirements to build ht://Dig
</h1>
<p>
ht://Dig was developed under Unix using C++.
</p>
<p>
For this reason, you will need a Unix machine, a C compiler
and a C++ compiler. (The C compiler is needed to compile some
of the GNU libraries.)
</p>
<p>
Unfortunately, we only have access to a couple of different
Unix machines. ht://Dig has been tested on these machines:
</p>
<ul>
<!--
<li>
Sun Solaris 2.5 SPARC (using gcc/g++ 2.7.2)
</li>
<li>
Sun SunOS 4.1.4 SPARC (using gcc/gcc 2.7.0)
</li>
<li>
HP/UX A.09.01 (using gcc/g++ 2.6.0)
</li>
<li>
IRIX 5.3 (SGI C++ compiler. Don't know the version)
</li>
<li>
Debian Linux 2.0 (using egcs 1.1b)
</li>
-->
<li>
FreeBSD 4.6 (using gcc 2.95.3) <!-- lha -->
</li>
<li>
Mandrake Linux 8.2 (using gcc 3.2) <!-- lha -->
</li>
<li>
Debian, 2.2.19 kernel (using gcc 2.95.4) <!-- lha -->
</li>
<li>
Debian on an Alpha <!-- lha -->
</li>
<li>
RedHat 7.3, 8.0 <!-- Jim Cole -->
</li>
<li>
Sun Solaris 2.8 = SunOS 5.8 (using gcc 3.1) <!-- lha -->
</li>
<li>
Sun Solaris 2.8 = SunOS 5.8 (using Sun's cc / g++ 3.1) <!-- lha -->
</li>
<li>
Mac OS X 10.2 (using gcc) <!-- Jim Cole -->
</li>

</ul>
<p>
There are reports of ht://Dig working on a number of other platforms.
</p>
<h3>
libstdc++
</h3>
<p>
If you plan on using g++ to compile ht://Dig, you have to make
sure that libstdc++ has been installed. Unfortunately, libstdc++ is a
separate package from gcc/g++. You can get libstdc++ from the
<a href="ftp://ftp.gnu.org/pub/gnu/">GNU software archive</a>.
</p>

<!-- The current Makefiles don't use include...
<h3>
Berkeley 'make'
</h3>
<p>
The building relies heavily on the make program. The problem
with this is that not all make programs are the same. The
requirement for the make program is that it understands the
'include' statement as in
</p>
<blockquote>
<code>include somefile otherfile</code>
</blockquote>
<p>
The Berkeley 4.4 make program doesn't use this syntax, instead
it wants
</p>
<blockquote>
<code>.include "somefile"</code><br>
<code>.include "otherfile"</code>
</blockquote>
<p>
and hence it cannot be used to build ht://Dig.
</p>
<p>
If your make program doesn't understand the right 'include'
syntax, it is best if you get and install
<a href="ftp://ftp.gnu.org/pub/gnu/">gnumake</a> before you try
to compile everything. The alternative is to change all the
Makefiles.
</p>
-->
<hr noshade>
<h1>
Disk space requirements
</h1>
<p>
The search engine will require lots of disk space to store
its databases. Unfortunately, there is no exact formula to
compute the space requirements. It depends on the number of
documents you are going to index but also on the various
options you use.
</p>
<p>As a temporary measure, 3.2 betas use a very inefficient
database structure to enable phrase searching. This will be
fixed before the release of 3.2.0. Currently, indexing a site of
around 10,000 documents gives a database of around 400 MB using the
default setting for
<a href="attrs.html#max_doc_size">maximum document size</a> and storing the
<a href="attrs.html#max_head_length">first 50,000 bytes of each document</a>
to enable context to be displayed.
</p>
<!-- To give you an idea of the space
requirements, here is what I have deduced from our own
database size at San Diego State University.
</p>
<p>
If you keep around the wordlist database (for update digging
instead of initial digging) I found that multiplying the
number of documents covered by 12,000 will come pretty close
to the space required.
</p>
<p>
We have about 13,000 documents:
</p>
<pre>
13,000
12,000 x
===========
156,000,000
</pre>
or about 150 MB.
<p>
Without the wordlist database, the factor drops down to about
7500:
</p>
<pre>
13,000
7,500 x
===========
97,500,000
</pre>
or about 93 MB.
-->
<p>
Keep in mind that we keep at most 50,000 bytes of each
document. This may seem like a lot, but most documents aren't very
big and it gives us a big enough chunk to almost always show
an excerpt of the matches.
</p>
<p>
You may find that if you store most of each document, the
databases are almost the same size as, or even larger than, the
documents themselves! Remember that if you're storing a
significant portion of each document (say 50,000 bytes as
above), you have that requirement, plus the size of the word
database and all the additional information about each document
(size, URL, date, etc.) required for searching.
</p>
<hr size="4" noshade>

Last modified: $Date: 2004/05/28 13:15:19 $

</body>
</html>