kpilot/conduits/docconduit/bmkSpecification.txt

KPilot PalmDoc Conduit bookmark Specification
=============================================

(c) 2003 Reinhold Kainhofer, reinhold@kainhofer.com

This document is licensed under the FDL (Free Documentation License)
as published by the FSF. Any version of the FDL can be applied
at your convenience.


The PalmDoc conduit has three ways to indicate bookmarks for a text:
  -) Inline tags of the form <* bookmarkname *>
  -) Endtags of the form <bookmarkname> at the end of the document
  -) Regular expressions in a separate textname.bmk file
     (textname.bmk is the filename of the text with the .txt replaced by .bmk)


In the design of the .bmk file, I tried to stay close to the
syntax of MakeDocJ bookmark files, but it turned out that I
needed to extend the syntax a little. Also, MakeDocJ uses Java
RegExps, while the PalmDoc conduit uses QRegExp, which differs
slightly (especially concerning the ^ and $ patterns as well as
backreferences). So if you used MakeDocJ, the .bmk file syntax
will be quite familiar, but you will still have to adapt your
bookmark files for TQt regular expressions instead of Java
regular expressions.


1) INLINE TAGS

Whenever a tag of the form <* someText *> appears in the text,
this sequence is removed from the text, and a bookmark is set
there with the bookmark name "someText" (the part between the
<* and the *>).
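
The inline tag handling described above can be sketched in a few lines
of Python. This is only an illustration, not the conduit's actual code
(the conduit uses QRegExp): the function name extract_inline_bookmarks
is hypothetical, and I assume that whitespace directly inside the
<* and *> is not part of the bookmark name and that bookmark offsets
refer to the text after the tags have been removed.

```python
import re

# Assumed tag shape: "<*", optional whitespace, the name, optional whitespace, "*>".
INLINE_TAG = re.compile(r"<\*\s*(.*?)\s*\*>")


def extract_inline_bookmarks(text):
    """Remove all inline tags; return (clean_text, [(offset, name), ...]).

    Offsets are positions in the cleaned text (an assumption of this sketch).
    """
    bookmarks = []
    pieces = []
    clean_len = 0
    last_end = 0
    for match in INLINE_TAG.finditer(text):
        chunk = text[last_end:match.start()]
        pieces.append(chunk)
        clean_len += len(chunk)
        # The bookmark sits where the tag used to be.
        bookmarks.append((clean_len, match.group(1)))
        last_end = match.end()
    pieces.append(text[last_end:])
    return "".join(pieces), bookmarks
```

For instance, calling it on "Once <* Start *>upon a time<* End *>."
yields the text "Once upon a time." together with the bookmarks
"Start" and "End".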


2) ENDTAGS

If the text ends with tags of the form <someText>, the string
in angle brackets is used as bookmark name, and wherever it
appears in the text, a bookmark is set.
After the > any amount of whitespace is allowed, but no other
characters like letters, numbers, or punctuation. Also, no line
break may occur inside the angle brackets. The conduit searches the
text from the end, and if it finds a line break inside a <...>
sequence, that sequence and everything before it is assumed to belong
to the text and does not form a bookmark tag.
Between endtags any amount of whitespace (spaces, tabs, line
feeds etc.) is allowed.

As an example, assume you have a text ending in:
... the bad guy was punished, and they lived happily
ever after!
<Tag with
line feed>
       <bad guy> <princess>
<married>

The conduit starts at the end and ignores all whitespace between
the tags, so it finds the tags "married", "princess", and "bad guy".
The "Tag with line feed" has a line feed, so it is assumed to belong
to the text.
Assume now you have a text ending in:
... the bad guy was punished, and they lived happily
ever after!
<bad guy> The End <princess>
<married>

Here, only "married" and "princess" are found as bookmarks. Because
of the letters before the "princess" tag, the search for the
bookmarks ends at the letter "d" of "The End" (the conduit starts
from the end and moves backward until it finds some text which
cannot be seen as an endtag).
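
The backward search for endtags can be sketched as follows. This is a
rough Python illustration under my own assumptions, not the conduit's
implementation: the function name find_endtags is hypothetical, the
opening < is taken to be the nearest one before the closing >, and only
the tag names are collected (setting the bookmarks at the places where
each name occurs in the text is left out).

```python
def find_endtags(text):
    """Scan backward from the end; return (body, names).

    names lists the endtag names in the order found (last tag first);
    body is the text in front of the first accepted endtag.
    """
    names = []
    pos = len(text)
    while True:
        # Skip whitespace between (and after) endtags.
        end = pos
        while end > 0 and text[end - 1].isspace():
            end -= 1
        # Anything other than ">" means we have reached the real text.
        if end == 0 or text[end - 1] != ">":
            break
        start = text.rfind("<", 0, end - 1)
        if start == -1:
            break
        name = text[start + 1:end - 1]
        # A line break inside <...> means the sequence belongs to the text.
        if "\n" in name or "\r" in name:
            break
        names.append(name)
        pos = start
    return text[:pos], names
```

Applied to the two example endings above, this sketch returns the names
"married", "princess", "bad guy" for the first text and "married",
"princess" for the second.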


3) REGULAR EXPRESSIONS IN A SEPARATE FILE

This is by far the most complex way to specify bookmarks, but
it is also the most powerful.
If you have a text with filename "My fairy tale.txt", the
bookmarks will be specified in a file called "My fairy tale.bmk"
(just the text filename with the .txt replaced by .bmk). This
file contains the bookmark definitions, one in each line. Lines
starting with a # are seen as comments, and empty lines are also
ignored.


In the .bmk file, each bookmark line has one of the following forms
(I will explain all fields later on). Fields in [..] are optional:

bmkName
bmkPosition, bmkName
+, bmkPatternRegExp[, bmkNameAsString[, firstIncludedBmk[, lastIncludedBmk]]]
+, bmkPatternRegExp[, bmkNameIndexOfSubexpression[, firstIncludedBmk[, lastIncludedBmk]]]
-, bmkPatternRegExp[, bmkNameAsString]
-, bmkPatternRegExp[, bmkNameIndexOfSubexpression]

  If the first field is a string, it is used both as the bookmark
name and as the pattern to search for.
  If the first field is a number, it gives the position of the
bookmark, and the second field is the name of the bookmark.
  If the first field is either + or -, the second field gives
a regular expression that is used to find the position of the
bookmark. If the first field is a -, the search is done only
once and only the first match will be added as a bookmark. If
the first field is a +, the search is repeated until the regular
expression can no longer be found (the fourth and fifth fields
can be used to include only a certain range of hits).
  If there is a third field, and it is a string, it gives the name
of the bookmark as a regular expression (i.e. \1 is replaced by the
first subexpression of the search, where subexpressions are
specified by round brackets in the regexp of the second field).
If there is a third field, and it is a number, it gives the index
of the subexpression of bmkPatternRegExp that is used as the
bookmark name. If there is no third field, the whole matched text
will be used as the bookmark name.
  The optional fourth and fifth fields can be used to set bookmarks
only after the first few occurrences of the regexp in the text, and
to stop the search after the expression has been found a certain
number of times.
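
To make the field rules above more concrete, here is a small Python
sketch that sorts a .bmk line into one of the three kinds. It is an
illustration only: the function name classify_bmk_line and the
dictionary keys are hypothetical, and the simple split on commas is my
own simplification (this specification does not say how a comma inside
a regular expression is handled, so such patterns would be split
incorrectly by this sketch).

```python
def classify_bmk_line(line):
    """Classify one .bmk line; return None for comments and empty lines."""
    text = line.rstrip("\r\n")
    if not text.strip() or text.startswith("#"):
        return None
    fields = text.split(",")  # naive split, see the assumption above
    first = fields[0].strip()
    if first in ("+", "-") and len(fields) >= 2:
        # The pattern is kept verbatim: spaces in it can be significant.
        entry = {"kind": "regexp", "all_matches": first == "+",
                 "pattern": fields[1]}
        if len(fields) > 2:
            third = fields[2].strip()
            if third.isdigit():
                entry["name_group"] = int(third)   # index of subexpression
            else:
                entry["name_template"] = third     # name with \1 etc.
        if first == "+":
            if len(fields) > 3:
                entry["first_hit"] = int(fields[3])
            if len(fields) > 4:
                entry["last_hit"] = int(fields[4])
        return entry
    if first.isdigit() and len(fields) >= 2:
        return {"kind": "position", "offset": int(first),
                "name": ",".join(fields[1:]).strip()}
    # Otherwise the whole line is both the bookmark name and the search text.
    return {"kind": "literal", "name": text.strip()}
```

A line such as "55, Bookmark at offset 55" is classified as a position
entry with offset 55, while "frog princess" becomes a literal entry.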


If the PDB->PC sync is set up to store the bookmarks in a bookmark file,
it will create a file "My fairy tale.bm" (no "k") with entries of the form
position,bmkName
The .bmk file will be used if it exists, but if no .bmk file exists, the .bm file
will be used. This way you can override the bookmark settings, while
at the same time the PDB->TXT sync does not destroy your possibly
existing .bmk file.
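
The precedence between the two files can be sketched like this in
Python. The function name find_bookmark_file is hypothetical, and the
sketch assumes that both files live next to the text file and differ
from it only in the extension.

```python
from pathlib import Path


def find_bookmark_file(text_path):
    """Return the bookmark file to use for a text file, or None.

    A hand-written .bmk file wins over a .bm file written by the sync.
    """
    base = Path(text_path)
    for suffix in (".bmk", ".bm"):
        candidate = base.with_suffix(suffix)
        if candidate.exists():
            return candidate
    return None
```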


Examples:

1) Imagine you have a line like:
frog princess
In this case, the text is searched for "frog princess", and a
bookmark is set whenever "frog princess" occurs in the text.
The name of each of these bookmarks will be "frog princess".

2) A bookmark line:
55, Bookmark at offset 55
Here, a bookmark will be set at offset 55 (55th character of
the text), and it will have the name "Bookmark at offs" (truncated
to 16 characters).

3) A bookmark line
-,Chapter \d+
causes a bookmark to be set at the first occurrence of "Chapter XXX",
where XXX denotes one or more digits. The bookmark name will be
"Chapter XXX" (XXX replaced by the actual digits).

4) A bookmark line
+,Chapter \d+
causes bookmarks to be set wherever "Chapter XXX" (XXX being one
or more digits) appears in the text. The bookmark name will again
be "Chapter XXX", but the search does not stop after the first hit.

5) A bookmark line
+,\n\s*(Chapter \d+)\D+, 1
causes a bookmark to be set whenever a new line starts with
"Chapter XXX" (whitespace is allowed before the "Chapter"), and
uses the first subexpression in (..) as the bookmark name. If you
have a passage
     Chapter 15: here it starts
The regular expression will match, so a bookmark will be set there
and the subexpression "Chapter 15" (which matches the (Chapter \d+) )
will be used as bookmark text.

6) A bookmark line
+,\n\s*Part (\d+),\1\. part
sets a bookmark whenever a line starts with "Part XXX". The XXX
will be stored as the first matched subexpression. The third field
"\1\. part" is the regular expression for the bookmark name, where
\1 is replaced by the first matched subexpression of the search (XXX
in this case). So if a line starts with "   Part 17: ", the bookmark
name will be "17. part".

7) A bookmark line
+,Table (\d+): ,\1\. Tabelle,5,25
will match whenever "Table XXX: " appears in the text, and the bookmark
name will be "XXX. Tabelle". However, the fourth field means that the
first four hits are ignored (the 5th hit is the first hit to be included
as a bookmark), and the fifth field means that all further hits after the
25th will be ignored, too.

8) In law texts, I use a regular expression
+,\n *(§\.? *\d+[a-z]?\.?) +, 1
to search for all paragraphs starting like "§. 15. " or "  §23 ", and set
a bookmark there using only the part from the § to the last digit or the
full stop after the last digit (the pattern between the (), in our two
cases the bookmark names will be "§. 15." and "§23" ).
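
The regular expression examples 3) to 7) can be tried out with the
following Python sketch. Please take it as an approximation only: the
conduit uses QRegExp, while this sketch uses Python's re module, whose
syntax differs in some details. The names regexp_bookmarks and
expand_name are hypothetical, and I assume here that the bookmark is
placed at the start of the match and that a backslash in front of a
non-digit character in the name field simply yields that character.

```python
import re


def expand_name(template, match):
    """Replace \\1..\\9 by subexpressions; \\x (non-digit) becomes x."""
    def repl(m):
        token = m.group(1)
        if token.isdigit():
            return match.group(int(token)) or ""
        return token
    return re.sub(r"\\(.)", repl, template)


def regexp_bookmarks(text, pattern, name=None, first_hit=1,
                     last_hit=None, all_matches=True):
    """Return [(offset, bookmark_name), ...] for a +/- bookmark line.

    name: None (whole match), an int (subexpression index) or a template.
    first_hit/last_hit: 1-based range of hits to keep (fields 4 and 5).
    all_matches: True for "+", False for "-" (first match only).
    """
    bookmarks = []
    for hit, match in enumerate(re.finditer(pattern, text), start=1):
        if last_hit is not None and hit > last_hit:
            break
        if hit >= first_hit:
            if name is None:
                label = match.group(0)
            elif isinstance(name, int):
                label = match.group(name)
            else:
                label = expand_name(name, match)
            bookmarks.append((match.start(), label))
        if not all_matches:
            break
    return bookmarks
```

With the line of example 6), a text containing "   Part 17: " at the
start of a line gives the bookmark name "17. part"; with the line of
example 7) and a text containing thirty tables, only the 5th to the
25th hit are kept.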