<appendix id="regular-expressions">
<appendixinfo>
<authorgroup>
<author>&Anders.Lund; &Anders.Lund.mail;</author>
<!-- TRANS:ROLES_OF_TRANSLATORS -->
</authorgroup>
</appendixinfo>

<title>Regular Expressions</title>

<synopsis>This appendix contains a brief but hopefully sufficient
introduction to the world of <emphasis>regular
expressions</emphasis>. It documents regular expressions in the form
available within &kate;, which is not compatible with the regular
expressions of Perl, nor with those of, for example,
<command>grep</command>.</synopsis>

<sect1>

<title>Introduction</title>

<para><emphasis>Regular expressions</emphasis> provide us with a way
to describe some possible contents of a text string in a way
understood by a small piece of software, so that it can investigate
whether a text matches and, in the case of advanced applications,
save pieces of the matching text.</para>

<para>An example: Say you want to search a text for paragraphs that
start with either of the names <quote>Henrik</quote> or
<quote>Pernille</quote> followed by some form of the verb
<quote>say</quote>.</para>

<para>With a normal search, you would start out searching for the
first name, <quote>Henrik</quote>, maybe followed by <quote>sa</quote>,
like this: <userinput>Henrik sa</userinput>. While looking for
matches, you would have to discard those not at the beginning of a
paragraph, as well as those in which the word starting with the
letters <quote>sa</quote> was not <quote>says</quote>,
<quote>said</quote> or similar. And then, of course, repeat all of
that with the next name...</para>

<para>With regular expressions, that task could be accomplished with a
single search, and with a greater degree of precision.</para>

<para>To achieve this, regular expressions define rules for
expressing in detail a generalization of a string to match. Our
example, which we might literally express like this: <quote>A line
starting with either <quote>Henrik</quote> or <quote>Pernille</quote>
(possibly following up to 4 blanks or tab characters) followed by a
whitespace followed by <quote>sa</quote> and then either
<quote>ys</quote> or <quote>id</quote></quote> could be expressed with
the following regular expression:</para>

<para><userinput>^[ \t]{0,4}(Henrik|Pernille) sa(ys|id)</userinput></para>
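
<para>As an illustration only, the expression can be tried outside
&kate;. The following minimal sketch uses Python's
<literal>re</literal> module, which is an assumption of this example
and not the engine documented here; its dialect differs in some
details. The sample lines are invented.</para>

```python
import re

# The expression from the text, written as a Python raw string.
PATTERN = re.compile(r"^[ \t]{0,4}(Henrik|Pernille) sa(ys|id)")

# Hypothetical sample lines, invented for this illustration.
lines = [
    "Henrik says hello",
    "\tPernille said no",
    "Then Henrik said yes",
    "Henrik sang a song",
]

# Keep only the lines in which the expression finds a match.
matching = [line for line in lines if PATTERN.search(line)]
print(matching)
```

<para>Only the first two lines are kept: the third does not start with
one of the names, and in the fourth <quote>sa</quote> is followed by
neither <quote>ys</quote> nor <quote>id</quote>.</para>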

<para>The above example demonstrates all four major concepts of modern
regular expressions, namely:</para>

<itemizedlist>
<listitem><para>Patterns</para></listitem>
<listitem><para>Assertions</para></listitem>
<listitem><para>Quantifiers</para></listitem>
<listitem><para>Back references</para></listitem>
</itemizedlist>

<para>The caret (<literal>^</literal>) starting the expression is an
assertion, being true only if the following matching string is at the
start of a line.</para>

<para>The strings <literal>[ \t]</literal> and
<literal>(Henrik|Pernille) sa(ys|id)</literal> are patterns. The first
one is a <emphasis>character class</emphasis> that matches either a
blank or a (horizontal) tab character; the other pattern contains
first a subpattern matching either <literal>Henrik</literal>
<emphasis>or</emphasis> <literal>Pernille</literal>, then a piece
matching the exact string <literal> sa</literal> and finally a
subpattern matching either <literal>ys</literal>
<emphasis>or</emphasis> <literal>id</literal>.</para>

<para>The string <literal>{0,4}</literal> is a quantifier saying
<quote>anywhere from 0 up to 4 of the previous</quote>.</para>

<para>Because regular expression software supporting the concept of
<emphasis>back references</emphasis> saves the entire matching part of
the string as well as sub-patterns enclosed in parentheses, given some
means of access to those references, we could get our hands on
the whole match (when searching a text document in an editor with a
regular expression, that is often marked as selected), the
name found, or the last part of the verb.</para>
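
<para>How the saved pieces are accessed depends on the software. As a
hedged sketch, Python's <literal>re</literal> module (an assumption of
this example, not the interface of &kate;) exposes them as numbered
groups; the input string is invented.</para>

```python
import re

match = re.search(
    r"^[ \t]{0,4}(Henrik|Pernille) sa(ys|id)",
    "  Pernille said no",
)

whole = match.group(0)   # the entire matching part, including leading blanks
name = match.group(1)    # text matched by the first pair of parentheses
ending = match.group(2)  # text matched by the second pair of parentheses
print(repr(whole), name, ending)
```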

<para>All together, the expression will match where we wanted it to,
and only there.</para>

<para>The following sections will describe in detail how to construct
and use patterns, character classes, assertions, quantifiers and
back references, and the final section will give a few useful
examples.</para>

</sect1>

<sect1 id="regex-patterns">

<title>Patterns</title>

<para>Patterns consist of literal strings and character
classes. Patterns may contain sub-patterns, which are patterns enclosed
in parentheses.</para>

<sect2>
<title>Escaping characters</title>

<para>In patterns as well as in character classes, some characters
have a special meaning. To literally match any of those characters,
they must be marked or <emphasis>escaped</emphasis> to let the regular
expression software know that it should interpret such characters in
their literal meaning.</para>

<para>This is done by prefixing the character with a backslash
(<literal>\</literal>).</para>


<para>The regular expression software will silently ignore escaping a
character that does not have any special meaning in the context, so
escaping for example a <quote>j</quote> (<userinput>\j</userinput>) is
safe. If you are in doubt whether a character could have a special
meaning, you can therefore escape it safely.</para>

<para>Escaping of course includes the backslash character itself; to
match one literally, you would write
<userinput>\\</userinput>.</para>
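
<para>A small sketch of escaping, using Python's <literal>re</literal>
module as a stand-in (an assumption of this example). Note one
difference from the behavior described above: recent Python versions
reject an escaped letter without a meaning, such as
<literal>\j</literal>, instead of ignoring the backslash. The sample
strings are invented.</para>

```python
import re

# "." is special in a pattern, so it is escaped to match a literal dot.
dot_match = re.search(r"example\.com", "see example.com today")

# A literal backslash is matched by writing two backslashes in the pattern.
backslash_match = re.search(r"C:\\temp", "path is C:\\temp here")

# re.escape() adds the needed backslashes to an arbitrary literal string.
escaped = re.escape("1+1=2")
print(dot_match.group(0), backslash_match.group(0), escaped)
```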

</sect2>

<sect2>
<title>Character Classes and abbreviations</title>

<para>A <emphasis>character class</emphasis> is an expression that
matches one of a defined set of characters. In regular expressions,
character classes are defined by putting the legal characters for the
class in square brackets, <literal>[]</literal>, or by using one of
the abbreviated classes described below.</para>

<para>Simple character classes just contain one or more literal
characters, for example <userinput>[abc]</userinput> (matching either
of the letters <quote>a</quote>, <quote>b</quote> or <quote>c</quote>)
or <userinput>[0123456789]</userinput> (matching any digit).</para>

<para>Because letters and digits have a logical order, you can
abbreviate those by specifying ranges of them:
<userinput>[a-c]</userinput> is equal to <userinput>[abc]</userinput>
and <userinput>[0-9]</userinput> is equal to
<userinput>[0123456789]</userinput>. Combining these constructs, for
example <userinput>[a-fynot1-38]</userinput>, is completely legal (the
last one would match, of course, either of
<quote>a</quote>,<quote>b</quote>,<quote>c</quote>,<quote>d</quote>,
<quote>e</quote>,<quote>f</quote>,<quote>y</quote>,<quote>n</quote>,<quote>o</quote>,<quote>t</quote>,
<quote>1</quote>,<quote>2</quote>,<quote>3</quote> or
<quote>8</quote>).</para>

<para>As capital letters are different characters from their
non-capital equivalents, to create a caseless character class matching
<quote>a</quote> or <quote>b</quote>, in any case, you need to write it
<userinput>[aAbB]</userinput>.</para>

<para>It is of course possible to create a <quote>negative</quote>
class matching <quote>anything but</quote>. To do so, put a caret
(<literal>^</literal>) at the beginning of the class:</para>

<para><userinput>[^abc]</userinput> will match any character
<emphasis>but</emphasis> <quote>a</quote>, <quote>b</quote> or
<quote>c</quote>.</para>
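
<para>The classes above can be checked with a short sketch. It uses
Python's <literal>re</literal> module, which is an assumption of this
example and not the engine documented here; the test strings are
invented.</para>

```python
import re

text = "a1b2c3 xyz"

# [abc] matches one of the listed letters; [a-c] is the same class as a range.
listed = re.findall(r"[abc]", text)
ranged = re.findall(r"[a-c]", text)

# [^abc] matches any single character except "a", "b" or "c".
negated = re.findall(r"[^abc]", "abcd")

# The combined class from the text: ranges a-f and 1-3 plus y, n, o, t, 8.
combined = re.findall(r"[a-fynot1-38]", "gh8zt")
print(listed, ranged, negated, combined)
```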

<para>In addition to literal characters, some abbreviations are
defined, making life still a bit easier:

<variablelist>

<varlistentry>
<term><userinput>\a</userinput></term>
<listitem><para>This matches the <acronym>ASCII</acronym> bell character (BEL, 0x07).</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>\f</userinput></term>
<listitem><para>This matches the <acronym>ASCII</acronym> form feed character (FF, 0x0C).</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>\n</userinput></term>
<listitem><para>This matches the <acronym>ASCII</acronym> line feed character (LF, 0x0A, Unix newline).</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>\r</userinput></term>
<listitem><para>This matches the <acronym>ASCII</acronym> carriage return character (CR, 0x0D).</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>\t</userinput></term>
<listitem><para>This matches the <acronym>ASCII</acronym> horizontal tab character (HT, 0x09).</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>\v</userinput></term>
<listitem><para>This matches the <acronym>ASCII</acronym> vertical tab character (VT, 0x0B).</para></listitem>
</varlistentry>
<varlistentry>
<term><userinput>\xhhhh</userinput></term>

<listitem><para>This matches the Unicode character corresponding to
the hexadecimal number hhhh (between 0x0000 and 0xFFFF). \0ooo (&ie;,
\zero ooo) matches the <acronym>ASCII</acronym>/Latin-1 character
corresponding to the octal number ooo (between 0 and
0377).</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>.</userinput> (dot)</term>
<listitem><para>This matches any character (including newline).</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>\d</userinput></term>
<listitem><para>This matches a digit. Equal to <literal>[0-9]</literal>.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>\D</userinput></term>
<listitem><para>This matches a non-digit. Equal to <literal>[^0-9]</literal> or <literal>[^\d]</literal>.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>\s</userinput></term>
<listitem><para>This matches a whitespace character. Practically equal to <literal>[ \t\n\r]</literal>.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>\S</userinput></term>
<listitem><para>This matches a non-whitespace character. Practically equal to <literal>[^ \t\r\n]</literal>, and equal to <literal>[^\s]</literal>.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>\w</userinput></term>
<listitem><para>Matches any <quote>word character</quote>, in this case any letter or digit. Note that
the underscore (<literal>_</literal>) is not matched, unlike in Perl regular expressions.
Equal to <literal>[a-zA-Z0-9]</literal>.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>\W</userinput></term>
<listitem><para>Matches any non-word character, that is, anything but letters or digits.
Equal to <literal>[^a-zA-Z0-9]</literal> or <literal>[^\w]</literal>.</para></listitem>
</varlistentry>


</variablelist>

</para>

<para>The abbreviated classes can be put inside a custom class; for
example, to match a word character, a blank or a dot, you could write
<userinput>[\w \.]</userinput>.</para>
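
<para>A few of the abbreviations in use, sketched with Python's
<literal>re</literal> module (an assumption of this example). Its
dialect is not identical to the one documented here: in particular,
Python's <literal>\w</literal> also matches the underscore. The sample
strings are invented.</para>

```python
import re

text = "Room 42, floor 3"

digits = re.findall(r"\d+", text)      # runs of digits
words = re.findall(r"\w+", text)       # runs of word characters
pieces = re.split(r"\s+", "a  b\tc")   # split on runs of whitespace

# An abbreviated class inside a custom class: word character, blank or dot.
custom = re.findall(r"[\w \.]+", "v1.2 ok!")

# Difference from the dialect described above: Python's \w matches "_".
underscore = re.fullmatch(r"\w", "_") is not None
print(digits, words, pieces, custom, underscore)
```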

<note> <para>The POSIX notation of classes, <userinput>[:&lt;class
name&gt;:]</userinput> is currently not supported.</para> </note>

<sect3>
<title>Characters with special meanings inside character classes</title>

<para>The following characters have a special meaning inside the
<quote>[]</quote> character class construct, and must be escaped to be
literally included in a class:</para>

<variablelist>
<varlistentry>
<term><userinput>]</userinput></term>
<listitem><para>Ends the character class. Must be escaped unless it is the very first character in the
class (may follow an unescaped caret).</para></listitem>
</varlistentry>
<varlistentry>
<term><userinput>^</userinput> (caret)</term>
<listitem><para>Denotes a negative class, if it is the first character. Must be escaped to match literally if it is the first character in the class.</para></listitem>
</varlistentry>
<varlistentry>
<term><userinput>-</userinput> (dash)</term>
<listitem><para>Denotes a logical range. Must always be escaped within a character class.</para></listitem>
</varlistentry>
<varlistentry>
<term><userinput>\</userinput> (backslash)</term>
<listitem><para>The escape character. Must always be escaped.</para></listitem>
</varlistentry>

</variablelist>

</sect3>

</sect2>

<sect2>

<title>Alternatives: matching <quote>one of</quote></title>

<para>If you want to match one of a set of alternative patterns, you
can separate those with <literal>|</literal> (vertical bar character).</para>

<para>For example, to find either <quote>John</quote> or <quote>Harry</quote> you would use the expression <userinput>John|Harry</userinput>.</para>

</sect2>

<sect2>

<title>Sub Patterns</title>

<para><emphasis>Sub patterns</emphasis> are patterns enclosed in
parentheses, and they have several uses in the world of regular
expressions.</para>

<sect3>

<title>Specifying alternatives</title>

<para>You may use a sub pattern to group a set of alternatives within
a larger pattern. The alternatives are separated by the character
<quote>|</quote> (vertical bar).</para>

<para>For example, to match either of the words <quote>int</quote>,
<quote>float</quote> or <quote>double</quote>, you could use the
pattern <userinput>int|float|double</userinput>. If you only want to
find one if it is followed by some whitespace and then some letters,
put the alternatives inside a subpattern:
<userinput>(int|float|double)\s+\w+</userinput>.</para>
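
<para>Why the parentheses matter can be seen in a short sketch, again
using Python's <literal>re</literal> module as an assumed stand-in for
the engine documented here. Without the sub pattern, the trailing
<literal>\s+\w+</literal> belongs to the last alternative only. The
sample text is invented.</para>

```python
import re

source = "int count; double ratio; integer; float"

# Grouped: one of the three words, then whitespace, then word characters.
grouped = [m.group(0)
           for m in re.finditer(r"(int|float|double)\s+\w+", source)]

# Ungrouped: "int", or "float", or "double" followed by whitespace and a word.
ungrouped = re.findall(r"int|float|double\s+\w+", source)
print(grouped)
print(ungrouped)
```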

</sect3>

<sect3>

<title>Capturing matching text (back references)</title>

<para>If you want to use a back reference, use a sub pattern to have
the desired part of the pattern remembered.</para>

<para>For example, if you want to find two occurrences of the same
word separated by a comma and possibly some whitespace, you could
write <userinput>(\w+),\s*\1</userinput>. The sub pattern
<literal>\w+</literal> would find a chunk of word characters, and the
entire expression would match if those were followed by a comma, 0 or
more whitespace characters and then an equal chunk of word characters. (The
string <literal>\1</literal> references <emphasis>the first sub pattern
enclosed in parentheses</emphasis>.)</para>
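
<para>The back reference can be sketched as follows with Python's
<literal>re</literal> module (an assumption of this example, not the
engine documented here); the sample strings are invented.</para>

```python
import re

repeated = re.compile(r"(\w+),\s*\1")

# The same word on both sides of the comma: the expression matches.
hit = repeated.search("well, well, what now")

# Different words on both sides of every comma: no match.
miss = repeated.search("one, two, three")

print(hit.group(0), "/", hit.group(1), "/", miss)
```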

<!-- <para>See also <link linkend="backreferences">Back references</link>.</para> -->

</sect3>

<sect3 id="lookahead-assertions">
<title>Lookahead Assertions</title>

<para>A lookahead assertion is a sub pattern, starting with either
<literal>?=</literal> or <literal>?!</literal>.</para>

<para>For example, to match the literal string <quote>Bill</quote> but
only if not followed by <quote> Gates</quote>, you could use this
expression: <userinput>Bill(?! Gates)</userinput>. (This would find
<quote>Bill Clinton</quote> as well as <quote>Billy the kid</quote>,
but silently ignore the other matches.)</para>
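
<para>A minimal sketch of this negative lookahead, using Python's
<literal>re</literal> module as an assumed stand-in; the sample text
simply joins the three names mentioned above.</para>

```python
import re

text = "Bill Clinton, Billy the kid, Bill Gates"

# Start offsets of every "Bill" that is not followed by " Gates".
starts = [m.start() for m in re.finditer(r"Bill(?! Gates)", text)]

# The lookahead itself is not part of the match: only "Bill" is returned.
found = re.findall(r"Bill(?! Gates)", text)
print(starts, found)
```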

<para>Sub patterns used for assertions are not captured.</para>

<para>See also <link linkend="assertions">Assertions</link>.</para>

</sect3>

</sect2>

<sect2 id="special-characters-in-patterns">
<title>Characters with a special meaning inside patterns</title>

<para>The following characters have a special meaning inside a pattern, and
must be escaped if you want to literally match them:

<variablelist>

<varlistentry>
<term><userinput>\</userinput> (backslash)</term>
<listitem><para>The escape character.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>^</userinput> (caret)</term>
<listitem><para>Asserts the beginning of the string.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>$</userinput></term>
<listitem><para>Asserts the end of the string.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>()</userinput> (left and right parentheses)</term>
<listitem><para>Denote sub patterns.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>{}</userinput> (left and right curly braces)</term>
<listitem><para>Denote numeric quantifiers.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>[]</userinput> (left and right square brackets)</term>
<listitem><para>Denote character classes.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>|</userinput> (vertical bar)</term>
<listitem><para>Logical OR. Separates alternatives.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>+</userinput> (plus sign)</term>
<listitem><para>Quantifier, 1 or more.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>*</userinput> (asterisk)</term>
<listitem><para>Quantifier, 0 or more.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>?</userinput> (question mark)</term>
<listitem><para>An optional character. Can be interpreted as a quantifier, 0 or 1.</para></listitem>
</varlistentry>

</variablelist>

</para>

</sect2>

</sect1>

<sect1 id="quantifiers">
<title>Quantifiers</title>

<para><emphasis>Quantifiers</emphasis> allow a regular expression to
match a specified number or range of numbers of either a character,
character class or sub pattern.</para>

<para>Quantifiers are enclosed in curly brackets (<literal>{</literal>
and <literal>}</literal>) and have the general form
<literal>{[minimum-occurrences][,[maximum-occurrences]]}</literal>.
</para>

<para>The usage is best explained by example:

<variablelist>

<varlistentry>
<term><userinput>{1}</userinput></term>
<listitem><para>Exactly 1 occurrence.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>{0,1}</userinput></term>
<listitem><para>Zero or 1 occurrences.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>{,1}</userinput></term>
<listitem><para>The same, with less work ;)</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>{5,10}</userinput></term>
<listitem><para>At least 5 but at most 10 occurrences.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>{5,}</userinput></term>
<listitem><para>At least 5 occurrences, no maximum.</para></listitem>
</varlistentry>

</variablelist>

</para>

<para>Additionally, there are some abbreviations:

<variablelist>

<varlistentry>
<term><userinput>*</userinput> (asterisk)</term>
<listitem><para>Similar to <literal>{0,}</literal>, find any number of occurrences.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>+</userinput> (plus sign)</term>
<listitem><para>Similar to <literal>{1,}</literal>, at least 1 occurrence.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>?</userinput> (question mark)</term>
<listitem><para>Similar to <literal>{0,1}</literal>, zero or 1 occurrence.</para></listitem>
</varlistentry>

</variablelist>

</para>
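
<para>The quantifiers listed above can be exercised with a small
sketch. It relies on Python's <literal>re</literal> module (an
assumption of this example, not the engine documented here) and on a
hypothetical helper, <literal>whole</literal>, that reports whether an
expression matches an entire string.</para>

```python
import re

def whole(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

checks = [
    whole(r"a{1}", "a"),           # exactly one occurrence
    whole(r"a{0,1}", ""),          # zero or one: the empty string is fine
    whole(r"a{5,10}", "aaaaaaa"),  # seven occurrences lie within 5..10
    whole(r"a{5,10}", "aaaa"),     # four occurrences are too few
    whole(r"a{5,}", "a" * 50),     # at least five, no maximum
    whole(r"a*", ""),              # like {0,}
    whole(r"a+", ""),              # like {1,}: needs at least one
    whole(r"a?", "aa"),            # like {0,1}: two are too many
]
print(checks)
```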

<sect2>

<title>Greed</title>

<para>When using quantifiers with no maximum, regular expressions
default to matching as much of the searched string as possible, commonly
known as <emphasis>greedy</emphasis> behavior.</para>

<para>Modern regular expression software provides the means of
<quote>turning off greediness</quote>, though in a graphical
environment it is up to the interface to provide you with access to
this feature. For example, a search dialog providing a regular
expression search could have a check box labeled <quote>Minimal
matching</quote>, and it ought to indicate whether greediness is the
default behavior.</para>
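
<para>The difference between greedy and minimal matching can be
sketched with Python's <literal>re</literal> module. This is an
assumption of the example: there, minimal matching is requested by
appending <literal>?</literal> to a quantifier, which is not
necessarily how the software described here exposes the feature. The
sample text is invented.</para>

```python
import re

text = 'say "one" and "two"'

# Greedy: .* takes as much as possible, up to the last quotation mark.
greedy = re.search(r'".*"', text).group(0)

# Minimal: .*? takes as little as possible, up to the next quotation mark.
minimal = re.search(r'".*?"', text).group(0)
print(greedy)
print(minimal)
```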

</sect2>

<sect2>
<title>In context examples</title>

<para>Here are a few examples of using quantifiers:</para>

<variablelist>

<varlistentry>
<term><userinput>^\d{4,5}\s</userinput></term>
<listitem><para>Matches the digits in <quote>1234 go</quote> and <quote>12345 now</quote>, but neither in <quote>567 eleven</quote>
nor in <quote>223459 somewhere</quote>.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>\s+</userinput></term>
<listitem><para>Matches one or more whitespace characters.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>(bla){1,}</userinput></term>
<listitem><para>Matches all of <quote>blablabla</quote> and the <quote>bla</quote> in <quote>blackbird</quote> or <quote>tabla</quote>.</para></listitem>
</varlistentry>

<varlistentry>
<term><userinput>/?&gt;</userinput></term>
<listitem><para>Matches <quote>/&gt;</quote> in <quote>&lt;closeditem/&gt;</quote> as well as
<quote>&gt;</quote> in <quote>&lt;openitem&gt;</quote>.</para></listitem>
</varlistentry>

</variablelist>
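
<para>The first and third entries of the list above can be verified
with a short sketch, assuming Python's <literal>re</literal> module as
a stand-in for the engine documented here. The sample strings are the
ones quoted in the list.</para>

```python
import re

digits = re.compile(r"^\d{4,5}\s")
samples = ["1234 go", "12345 now", "567 eleven", "223459 somewhere"]

# Keep the samples that start with 4 or 5 digits followed by whitespace.
matched = [s for s in samples if digits.search(s)]

# (bla){1,} matches one or more consecutive repetitions of "bla".
bla = [m.group(0)
       for m in re.finditer(r"(bla){1,}", "blablabla blackbird tabla")]
print(matched, bla)
```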

</sect2>

</sect1>

<sect1 id="assertions">
<title>Assertions</title>

<para><emphasis>Assertions</emphasis> allow a regular expression to
match only under certain controlled conditions.</para>

<para>An assertion does not need a character to match; rather, it
investigates the surroundings of a possible match before acknowledging
it. For example, the <emphasis>word boundary</emphasis> assertion does
not try to find a non-word character opposite a word character at its
position; instead it makes sure that there is not a word
character. This means that the assertion can match where there is no
character, &ie; at the ends of a searched string.</para>

<para>Some assertions actually do have a pattern to match, but the
part of the string matching that will not be a part of the result of
the match of the full expression.</para>

<para>Regular expressions as documented here support the following
assertions:

<variablelist>

<varlistentry>
<term><userinput>^</userinput> (caret: beginning of
string)</term>
<listitem><para>Matches the beginning of the searched
string.</para> <para>The expression <userinput>^Peter</userinput> will
match at <quote>Peter</quote> in the string <quote>Peter, hey!</quote>
but not in <quote>Hey, Peter!</quote>.</para> </listitem>
</varlistentry>

<varlistentry>
<term><userinput>$</userinput> (end of string)</term>
<listitem><para>Matches the end of the searched string.</para>

<para>The expression <userinput>you\?$</userinput> will match at the
last <quote>you</quote> in the string <quote>You didn't do that, did you?</quote> but
nowhere in <quote>You didn't do that, right?</quote>.</para>

</listitem>
</varlistentry>

<varlistentry>
<term><userinput>\b</userinput> (word boundary)</term>
<listitem><para>Matches if there is a word character at one side and not a word character at the
other.</para>
<para>This is useful for finding word ends, for example both ends to find
a whole word. The expression <userinput>\bin\b</userinput> will match
at the separate <quote>in</quote> in the string <quote>He came in
through the window</quote>, but not at the <quote>in</quote> in
<quote>window</quote>.</para></listitem>

</varlistentry>

<varlistentry>
<term><userinput>\B</userinput> (non word boundary)</term>
<listitem><para>Matches wherever <quote>\b</quote> does not.</para>
<para>That means that it will match, for example, within words: The expression
<userinput>\Bin\B</userinput> will match at the <quote>in</quote> in <quote>window</quote> but not in <quote>integer</quote> or <quote>I'm in love</quote>.</para>
</listitem>
</varlistentry>

<varlistentry>
<term><userinput>(?=PATTERN)</userinput> (Positive lookahead)</term>
<listitem><para>A lookahead assertion looks at the part of the string following a possible match.
The positive lookahead will prevent the string from matching if the text following the possible match
does not match the <emphasis>PATTERN</emphasis> of the assertion, but the text matched by that will
not be included in the result.</para>
<para>The expression <userinput>handy(?=\w)</userinput> will match at <quote>handy</quote> in
<quote>handyman</quote> but not in <quote>That came in handy!</quote>.</para>
</listitem>
</varlistentry>

<varlistentry>
<term><userinput>(?!PATTERN)</userinput> (Negative lookahead)</term>

<listitem><para>The negative lookahead prevents a possible match from being
acknowledged if the following part of the searched string does match
its <emphasis>PATTERN</emphasis>.</para>
<para>The expression <userinput>const \w+\b(?!\s*&amp;)</userinput>
will match at <quote>const char</quote> in the string <quote>const
char* foo</quote> while it cannot match <quote>const QString</quote>
in <quote>const QString&amp; bar</quote> because the
<quote>&amp;</quote> matches the negative lookahead assertion
pattern.</para>
</listitem>
</varlistentry>

</variablelist>

</para>
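
<para>Several of the assertions above can be checked with a short
sketch. It assumes Python's <literal>re</literal> module as a stand-in
for the engine documented here, and reuses the sample strings quoted
in the list.</para>

```python
import re

sentence = "He came in through the window"

# \b: only the free-standing "in" is found, not the one inside "window".
whole_word_starts = [m.start() for m in re.finditer(r"\bin\b", sentence)]

# \B: the "in" must have word characters on both sides.
inside_window = re.search(r"\Bin\B", "window") is not None
inside_integer = re.search(r"\Bin\B", "integer") is not None

# ^ and $ anchor the match to the start and end of the searched string.
starts_with_peter = re.search(r"^Peter", "Hey, Peter!") is not None
ends_with_you = re.search(r"you\?$", "You didn't do that, did you?") is not None

# Positive lookahead: "handy" only when a word character follows.
handyman = re.search(r"handy(?=\w)", "handyman").group(0)
plain_handy = re.search(r"handy(?=\w)", "That came in handy!")

print(whole_word_starts, inside_window, inside_integer)
print(starts_with_peter, ends_with_you, handyman, plain_handy)
```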

</sect1>

<!-- TODO sect1 id="backreferences">

<title>Back References</title>

<para></para>

</sect1 -->

</appendix>