<appendix id="regular-expressions">
|
|
<appendixinfo>
|
|
<authorgroup>
|
|
<author
|
|
>&Anders.Lund; &Anders.Lund.mail;</author>
|
|
<othercredit role="translator"
|
|
><firstname
|
|
>Malcolm</firstname
|
|
><surname
|
|
>Hunter</surname
|
|
><affiliation
|
|
><address
|
|
><email
|
|
>malcolm.hunter@gmx.co.uk</email
|
|
></address
|
|
></affiliation
|
|
><contrib
|
|
>Conversion to British English</contrib
|
|
></othercredit
|
|
>
|
|
</authorgroup>
|
|
</appendixinfo>
|
|
|
|
<title
|
|
>Regular Expressions</title>
|
|
|
|
<synopsis
|
|
>This appendix contains a brief but hopefully sufficient introduction to the world of <emphasis
>regular expressions</emphasis
>. It documents regular expressions in the form available within &kate;, which is not compatible with the regular expressions of Perl, nor with those of, for example, <command
>grep</command
>.</synopsis>
|
|
|
|
<sect1>
|
|
|
|
<title
|
|
>Introduction</title>
|
|
|
|
<para
|
|
><emphasis
|
|
>Regular Expressions</emphasis
|
|
> provide us with a way to describe some possible contents of a text string in a way understood by a small piece of software, so that it can investigate whether a text matches and, in the case of advanced applications, with the means of saving pieces of the matching text.</para>
|
|
|
|
<para
|
|
>An example: Say you want to search a text for paragraphs that start with either of the names <quote
|
|
>Henrik</quote
|
|
> or <quote
|
|
>Pernille</quote
|
|
> followed by some form of the verb <quote
|
|
>say</quote
|
|
>.</para>
|
|
|
|
<para
|
|
>With a normal search, you would start out searching for the first name, <quote
|
|
>Henrik</quote
|
|
> maybe followed by <quote
|
|
>sa</quote
|
|
> like this: <userinput
|
|
>Henrik sa</userinput
|
|
>, and while looking for matches, you would have to discard those not being the beginning of a paragraph, as well as those in which the word starting with the letters <quote
|
|
>sa</quote
|
|
> was not either <quote
|
|
>says</quote
|
|
>, <quote
|
|
>said</quote
|
|
> or so. And then, of course, you would repeat all of that with the next name...</para>
|
|
|
|
<para
|
|
>With regular expressions, that task could be accomplished with a single search, and with a greater degree of precision.</para>
|
|
|
|
<para
|
|
>To achieve this, regular expressions define rules for expressing in detail a generalisation of a string to match. Our example, which we might literally express like this: <quote
|
|
>A line starting with either <quote
|
|
>Henrik</quote
|
|
> or <quote
|
|
>Pernille</quote
|
|
> (possibly following up to 4 blanks or tab characters) followed by a whitespace character followed by <quote
|
|
>sa</quote
|
|
> and then either <quote
|
|
>ys</quote
|
|
> or <quote
|
|
>id</quote
|
|
></quote
|
|
> could be expressed with the following regular expression:</para
|
|
> <para
|
|
><userinput
|
|
>^[ \t]{0,4}(Henrik|Pernille) sa(ys|id)</userinput
|
|
></para>
|
|
|
|
<para
|
|
>The above example demonstrates all four major concepts of modern Regular Expressions, namely:</para>
|
|
|
|
<itemizedlist>
|
|
<listitem
|
|
><para
|
|
>Patterns</para
|
|
></listitem>
|
|
<listitem
|
|
><para
|
|
>Assertions</para
|
|
></listitem>
|
|
<listitem
|
|
><para
|
|
>Quantifiers</para
|
|
></listitem>
|
|
<listitem
|
|
><para
|
|
>Back references</para
|
|
></listitem>
|
|
</itemizedlist>
|
|
|
|
<para
|
|
>The caret (<literal
|
|
>^</literal
|
|
>) starting the expression is an assertion, being true only if the following matching string is at the start of a line.</para>
|
|
|
|
<para
|
|
>The strings <literal
|
|
>[ \t]</literal
|
|
> and <literal
|
|
>(Henrik|Pernille) sa(ys|id)</literal
|
|
> are patterns. The first one is a <emphasis
|
|
>character class</emphasis
|
|
> that matches either a blank or a (horizontal) tab character; the other pattern contains first a subpattern matching either <literal
|
|
>Henrik</literal
|
|
> <emphasis
|
|
>or</emphasis
|
|
> <literal
|
|
>Pernille</literal
|
|
>, then a piece matching the exact string <literal
|
|
> sa</literal
|
|
> and finally a subpattern matching either <literal
|
|
>ys</literal
|
|
> <emphasis
|
|
>or</emphasis
|
|
> <literal
|
|
>id</literal
|
|
>.</para>
|
|
|
|
<para
|
|
>The string <literal
|
|
>{0,4}</literal
|
|
> is a quantifier saying <quote
|
|
>anywhere from 0 up to 4 of the previous</quote
|
|
>.</para>
|
|
|
|
<para
|
|
>Because regular expression software supporting the concept of <emphasis
|
|
>back references</emphasis
|
|
> saves the entire matching part of the string as well as the parts matched by sub-patterns enclosed in parentheses, given some means of access to those references, we could get our hands on the whole match (when searching a text document in an editor with a regular expression, that is often marked as selected), the name found, or the last part of the verb.</para>
|
|
|
|
<para
|
|
>Altogether, the expression will match where we wanted it to, and only there.</para>
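<para>As a rough illustration, the example expression can be tried out in Python. This is only a sketch: it uses Python's <literal>re</literal> module, which is not the engine used by &kate; and differs from it in some details, and the sample lines are invented for the demonstration.</para>
<programlisting><![CDATA[
```python
import re

# The example expression from this introduction.
pattern = re.compile(r"^[ \t]{0,4}(Henrik|Pernille) sa(ys|id)")

# Hypothetical sample lines, invented for this sketch.
lines = [
    "Henrik says hello",
    "\tPernille said no",
    "Then Henrik said yes",   # the name is not at the start of the line
    "Henrik sang a song",     # the verb is neither "says" nor "said"
]

results = []
for line in lines:
    match = pattern.search(line)
    if match:
        # group(0) is the whole match; group(1) and group(2) are the
        # sub-patterns in parentheses (the name and the verb ending).
        results.append((match.group(0), match.group(1), match.group(2)))
    else:
        results.append(None)

print(results)
```
]]></programlisting>
<para>Only the first two lines match. For each of them the whole match, the name and the verb ending are available separately, which is what the back references give access to.</para>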
|
|
|
|
<para
|
|
>The following sections will describe in detail how to construct and use patterns, character classes, assertions, quantifiers and back references, and the final section will give a few useful examples.</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="regex-patterns">
|
|
|
|
<title
|
|
>Patterns</title>
|
|
|
|
<para
|
|
>Patterns consist of literal strings and character classes. Patterns may contain sub-patterns, which are patterns enclosed in parentheses.</para>
|
|
|
|
<sect2>
|
|
<title
|
|
>Escaping characters</title>
|
|
|
|
<para
|
|
>In patterns as well as in character classes, some characters have a special meaning. To literally match any of those characters, they must be marked or <emphasis
|
|
>escaped</emphasis
|
|
> to let the regular expression software know that it should interpret such characters in their literal meaning.</para>
|
|
|
|
<para
|
|
>This is done by preceding the character with a backslash (<literal
|
|
>\</literal
|
|
>).</para>
|
|
|
|
|
|
<para
|
|
>The regular expression software will silently ignore escaping a character that does not have any special meaning in the context, so escaping for example a <quote
|
|
>j</quote
|
|
> (<userinput
|
|
>\j</userinput
|
|
>) is safe. If you are in doubt whether a character could have a special meaning, you can therefore escape it safely.</para>
|
|
|
|
<para
|
|
>Escaping of course includes the backslash character itself; to match one literally, you would write <userinput
|
|
>\\</userinput
|
|
>.</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2>
|
|
<title
|
|
>Character Classes and abbreviations</title>
|
|
|
|
<para
|
|
>A <emphasis
|
|
>character class</emphasis
|
|
> is an expression that matches one of a defined set of characters. In Regular Expressions, character classes are defined by putting the legal characters for the class in square brackets, <literal
|
|
>[]</literal
|
|
>, or by using one of the abbreviated classes described below.</para>
|
|
|
|
<para
|
|
>Simple character classes just contain one or more literal characters, for example <userinput
|
|
>[abc]</userinput
|
|
> (matching either of the letters <quote
|
|
>a</quote
|
|
>, <quote
|
|
>b</quote
|
|
> or <quote
|
|
>c</quote
|
|
>) or <userinput
|
|
>[0123456789]</userinput
|
|
> (matching any digit).</para>
|
|
|
|
<para
|
|
>Because letters and digits have a logical order, you can abbreviate those by specifying ranges of them: <userinput
|
|
>[a-c]</userinput
|
|
> is equal to <userinput
|
|
>[abc]</userinput
|
|
> and <userinput
|
|
>[0-9]</userinput
|
|
> is equal to <userinput
|
|
>[0123456789]</userinput
|
|
>. Combining these constructs, for example <userinput
|
|
>[a-fynot1-38]</userinput
|
|
> is completely legal (the last one would match, of course, any of <quote
|
|
>a</quote
|
|
>,<quote
|
|
>b</quote
|
|
>,<quote
|
|
>c</quote
|
|
>,<quote
|
|
>d</quote
|
|
>, <quote
|
|
>e</quote
|
|
>,<quote
|
|
>f</quote
|
|
>,<quote
|
|
>y</quote
|
|
>,<quote
|
|
>n</quote
|
|
>,<quote
|
|
>o</quote
|
|
>,<quote
|
|
>t</quote
|
|
>, <quote
|
|
>1</quote
|
|
>,<quote
|
|
>2</quote
|
|
>,<quote
|
|
>3</quote
|
|
> or <quote
|
|
>8</quote
|
|
>).</para>
|
|
|
|
<para
|
|
>As capital letters are different characters from their non-capital equivalents, to create a caseless character class matching <quote
|
|
>a</quote
|
|
> or <quote
|
|
>b</quote
|
|
>, in any case, you need to write it <userinput
|
|
>[aAbB]</userinput
|
|
>.</para>
|
|
|
|
<para
|
|
>It is of course possible to create a <quote
>negative</quote
> class matching <quote
>anything but</quote
>. To do so, put a caret (<literal
>^</literal
>) at the beginning of the class: </para>
|
|
|
|
<para
|
|
><userinput
|
|
>[^abc]</userinput
|
|
> will match any character <emphasis
|
|
>but</emphasis
|
|
> <quote
|
|
>a</quote
|
|
>, <quote
|
|
>b</quote
|
|
> or <quote
|
|
>c</quote
|
|
>.</para>
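<para>The classes described so far can be checked with a short Python sketch. It assumes Python's <literal>re</literal> module, which is not the engine used by &kate; but treats these simple classes the same way; the test strings are made up.</para>
<programlisting><![CDATA[
```python
import re

# A class combining ranges and single characters.
combined = re.compile(r"[a-fynot1-38]")
matched = [ch for ch in "abcdefynot1238" if combined.fullmatch(ch)]
rejected = [ch for ch in "gxz4567" if combined.fullmatch(ch)]

# Both cases must be listed explicitly inside a class.
caseless = re.compile(r"[aAbB]")
case_hits = caseless.findall("A bB cC a")

# A caret at the start negates the class.
negated = re.compile(r"[^abc]")
negated_hits = negated.findall("aXbYcZ")

print(matched, rejected, case_hits, negated_hits)
```
]]></programlisting>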
|
|
|
|
<para
|
|
>In addition to literal characters, some abbreviations are defined, making life still a bit easier: <variablelist>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>\a</userinput
|
|
></term>
|
|
<listitem
|
|
><para
|
|
>This matches the <acronym
|
|
>ASCII</acronym
|
|
> bell character (BEL, 0x07).</para
|
|
></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>\f</userinput
|
|
></term>
|
|
<listitem
|
|
><para
|
|
>This matches the <acronym
|
|
>ASCII</acronym
|
|
> form feed character (FF, 0x0C).</para
|
|
></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>\n</userinput
|
|
></term>
|
|
<listitem
|
|
><para
|
|
>This matches the <acronym
|
|
>ASCII</acronym
|
|
> line feed character (LF, 0x0A, Unix newline).</para
|
|
></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>\r</userinput
|
|
></term>
|
|
<listitem
|
|
><para
|
|
>This matches the <acronym
|
|
>ASCII</acronym
|
|
> carriage return character (CR, 0x0D).</para
|
|
></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>\t</userinput
|
|
></term>
|
|
<listitem
|
|
><para
|
|
>This matches the <acronym
|
|
>ASCII</acronym
|
|
> horizontal tab character (HT, 0x09).</para
|
|
></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>\v</userinput
|
|
></term>
|
|
<listitem
|
|
><para
|
|
>This matches the <acronym
|
|
>ASCII</acronym
|
|
> vertical tab character (VT, 0x0B).</para
|
|
></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>\xhhhh</userinput
|
|
></term>
|
|
|
|
<listitem
|
|
><para
|
|
>This matches the Unicode character corresponding to the hexadecimal number hhhh (between 0x0000 and 0xFFFF). \0ooo (&ie;, \zero ooo) matches the <acronym
|
|
>ASCII</acronym
|
|
>/Latin-1 character corresponding to the octal number ooo (between 0 and 0377).</para
|
|
></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>.</userinput
|
|
> (dot)</term>
|
|
<listitem
|
|
><para
|
|
>This matches any character (including newline).</para
|
|
></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>\d</userinput
|
|
></term>
|
|
<listitem
|
|
><para
|
|
>This matches a digit. Equal to <literal
|
|
>[0-9]</literal
|
|
></para
|
|
></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>\D</userinput
|
|
></term>
|
|
<listitem
|
|
><para
|
|
>This matches a non-digit. Equal to <literal
|
|
>[^0-9]</literal
|
|
> or <literal
|
|
>[^\d]</literal
|
|
></para
|
|
></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>\s</userinput
|
|
></term>
|
|
<listitem
|
|
><para
|
|
>This matches a whitespace character. Practically equal to <literal
|
|
>[ \t\n\r]</literal
|
|
></para
|
|
></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>\S</userinput
|
|
></term>
|
|
<listitem
|
|
><para
|
|
>This matches a non-whitespace. Practically equal to <literal
|
|
>[^ \t\r\n]</literal
|
|
>, and equal to <literal
|
|
>[^\s]</literal
|
|
></para
|
|
></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>\w</userinput
|
|
></term>
|
|
<listitem
|
|
><para
|
|
>Matches any <quote
|
|
>word character</quote
|
|
> - in this case any letter or digit. Note that underscore (<literal
|
|
>_</literal
|
|
>) is not matched, unlike in Perl regular expressions, where it is. Equal to <literal
|
|
>[a-zA-Z0-9]</literal
|
|
></para
|
|
></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>\W</userinput
|
|
></term>
|
|
<listitem
|
|
><para
|
|
>Matches any non-word character - anything but letters or numbers. Equal to <literal
|
|
>[^a-zA-Z0-9]</literal
|
|
> or <literal
|
|
>[^\w]</literal
|
|
></para
|
|
></listitem>
|
|
</varlistentry>
|
|
|
|
|
|
</variablelist>
|
|
|
|
</para>
|
|
|
|
<para
|
|
>The abbreviated classes can be put inside a custom class. For example, to match a word character, a blank or a dot, you could write <userinput
|
|
>[\w \.]</userinput
|
|
></para
|
|
>
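<para>A few of the abbreviated classes in action, as a sketch in Python. Python's <literal>re</literal> module is assumed here only for illustration; it is not the engine used by &kate;, and one difference is visible below: in Python, <literal>\w</literal> also matches the underscore. The sample strings are invented.</para>
<programlisting><![CDATA[
```python
import re

text = "tel: 42\t7"

digits = re.findall(r"\d", text)          # same as [0-9]
spaces = re.findall(r"\s", text)          # the blank and the tab
non_space = re.findall(r"\S", "a b\tc")   # everything but whitespace

# An abbreviated class inside a custom class: word character, blank or dot.
custom = re.findall(r"[\w \.]", "a.b c!d")

# A difference to keep in mind: Python's \w also matches the underscore.
underscore = re.fullmatch(r"\w", "_") is not None

print(digits, spaces, non_space, custom, underscore)
```
]]></programlisting>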
|
|
|
|
<note
|
|
> <para
|
|
>The POSIX notation of classes, <userinput
|
|
>[:&lt;class name&gt;:]</userinput
|
|
> is currently not supported.</para
|
|
> </note>
|
|
|
|
<sect3>
|
|
<title
|
|
>Characters with special meanings inside character classes</title>
|
|
|
|
<para
|
|
>The following characters have a special meaning inside the <quote
|
|
>[]</quote
|
|
> character class construct, and must be escaped to be literally included in a class:</para>
|
|
|
|
<variablelist>
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>]</userinput
|
|
></term>
|
|
<listitem
|
|
><para
|
|
>Ends the character class. Must be escaped unless it is the very first character in the class (may follow an unescaped caret)</para
|
|
></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>^</userinput
|
|
> (caret)</term>
|
|
<listitem
|
|
><para
|
|
>Denotes a negative class, if it is the first character. Must be escaped to match literally if it is the first character in the class.</para
|
|
></listitem
|
|
>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>-</userinput
|
|
> (dash)</term>
|
|
<listitem
|
|
><para
|
|
>Denotes a logical range. Must always be escaped within a character class.</para
|
|
></listitem>
|
|
</varlistentry>
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>\</userinput
|
|
> (backslash)</term>
|
|
<listitem
|
|
><para
|
|
>The escape character. Must always be escaped.</para
|
|
></listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
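<para>The following Python sketch builds a class from the four special characters, each one escaped, and shows how the dash changes meaning when it is not escaped. Python's <literal>re</literal> module is an assumption made for the example, and the test strings are invented.</para>
<programlisting><![CDATA[
```python
import re

# A class containing the four special characters, each one escaped.
specials = re.compile(r"[\]\^\-\\]")
found = specials.findall(r"a]b^c-d\e")

# Unescaped, the dash denotes a range: [a-c] is a, b or c.
as_range = re.findall(r"[a-c]", "a-b-c")

# Escaped, the dash is a literal: [a\-c] is a, dash or c.
as_literal = re.findall(r"[a\-c]", "a-b-c")

print(found, as_range, as_literal)
```
]]></programlisting>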
|
|
|
|
</sect3>
|
|
|
|
</sect2>
|
|
|
|
<sect2>
|
|
|
|
<title
|
|
>Alternatives: matching <quote
|
|
>one of</quote
|
|
></title>
|
|
|
|
<para
|
|
>If you want to match one of a set of alternative patterns, you can separate those with <literal
|
|
>|</literal
|
|
> (vertical bar character).</para>
|
|
|
|
<para
|
|
>For example, to find either <quote
>John</quote
> or <quote
>Harry</quote
>, you would use the expression <userinput
|
|
>John|Harry</userinput
|
|
>.</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2>
|
|
|
|
<title
|
|
>Sub Patterns</title>
|
|
|
|
<para
|
|
><emphasis
|
|
>Sub patterns</emphasis
|
|
> are patterns enclosed in parentheses, and they have several uses in the world of regular expressions.</para>
|
|
|
|
<sect3>
|
|
|
|
<title
|
|
>Specifying alternatives</title>
|
|
|
|
<para
|
|
>You may use a sub pattern to group a set of alternatives within a larger pattern. The alternatives are separated by the character <quote
|
|
>|</quote
|
|
> (vertical bar).</para>
|
|
|
|
<para
|
|
>For example, to match any of the words <quote
|
|
>int</quote
|
|
>, <quote
|
|
>float</quote
|
|
> or <quote
|
|
>double</quote
|
|
>, you could use the pattern <userinput
|
|
>int|float|double</userinput
|
|
>. If you only want to find one if it is followed by some whitespace and then some letters, put the alternatives inside a subpattern: <userinput
|
|
>(int|float|double)\s+\w+</userinput
|
|
>.</para>
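<para>A small Python sketch of the two expressions, assuming Python's <literal>re</literal> module rather than the engine used by &kate;. The source line is invented.</para>
<programlisting><![CDATA[
```python
import re

source = "int count; float ratio; return double; double total"

# The plain alternation finds every occurrence of the three words.
plain = re.findall(r"int|float|double", source)

# With the alternatives grouped in a sub pattern, a word is only found
# if it is followed by whitespace and then some word characters.
# finditer is used so that the whole match is reported, not the group.
grouped = [m.group(0)
           for m in re.finditer(r"(int|float|double)\s+\w+", source)]

print(plain)
print(grouped)
```
]]></programlisting>
<para>The <quote>double</quote> directly followed by a semicolon is found by the plain alternation but not by the grouped expression.</para>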
|
|
|
|
</sect3>
|
|
|
|
<sect3>
|
|
|
|
<title
|
|
>Capturing matching text (back references)</title>
|
|
|
|
<para
|
|
>If you want to use a back reference, use a sub pattern to have the desired part of the pattern remembered.</para>
|
|
|
|
<para
|
|
>For example, if you want to find two occurrences of the same word separated by a comma and possibly some whitespace, you could write <userinput
|
|
>(\w+),\s*\1</userinput
|
|
>. The sub pattern <literal
|
|
>\w+</literal
|
|
> would find a chunk of word characters, and the entire expression would match if those were followed by a comma, zero or more whitespace characters and then an equal chunk of word characters. (The string <literal
|
|
>\1</literal
|
|
> references <emphasis
|
|
>the first sub pattern enclosed in parentheses</emphasis
|
|
>)</para>
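<para>The back reference can be tried out in Python as follows. This is a sketch that assumes Python's <literal>re</literal> module, which happens to use the same <literal>\1</literal> notation; the sample sentences are invented.</para>
<programlisting><![CDATA[
```python
import re

# A chunk of word characters, a comma, optional whitespace,
# and then the same chunk again.
repeated = re.compile(r"(\w+),\s*\1")

hit = repeated.search("well, well, what now")
miss = repeated.search("well, then, what now")

print(hit.group(0), "|", hit.group(1), "|", miss)
```
]]></programlisting>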
|
|
|
|
<!-- <para
|
|
>See also <link linkend="backreferences"
|
|
>Back references</link
|
|
>.</para
|
|
> -->
|
|
|
|
</sect3>
|
|
|
|
<sect3 id="lookahead-assertions">
|
|
<title
|
|
>Lookahead Assertions</title>
|
|
|
|
<para
|
|
>A lookahead assertion is a sub pattern, starting with either <literal
|
|
>?=</literal
|
|
> or <literal
|
|
>?!</literal
|
|
>.</para>
|
|
|
|
<para
|
|
>For example to match the literal string <quote
|
|
>Bill</quote
|
|
> but only if not followed by <quote
|
|
> Gates</quote
|
|
>, you could use this expression: <userinput
|
|
>Bill(?! Gates)</userinput
|
|
>. (This would find <quote
|
|
>Bill Clinton</quote
|
|
> as well as <quote
|
|
>Billy the kid</quote
|
|
>, but silently ignore the other matches.)</para>
|
|
|
|
<para
|
|
>Sub patterns used for assertions are not captured.</para>
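<para>A short Python sketch of the negative lookahead, assuming Python's <literal>re</literal> module, which uses the same <literal>(?!...)</literal> notation. It also shows that the assertion does not add a captured sub pattern.</para>
<programlisting><![CDATA[
```python
import re

bill = re.compile(r"Bill(?! Gates)")

names = ["Bill Clinton", "Billy the kid", "Bill Gates"]
found = [bool(bill.search(name)) for name in names]

# The lookahead is an assertion, not a capturing sub pattern:
# the match is just "Bill" and there are no captured groups.
match = bill.search("Bill Clinton")
whole, groups = match.group(0), match.groups()

print(found, whole, groups)
```
]]></programlisting>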
|
|
|
|
<para
|
|
>See also <link linkend="assertions"
|
|
>Assertions</link
|
|
></para>
|
|
|
|
</sect3>
|
|
|
|
</sect2>
|
|
|
|
<sect2 id="special-characters-in-patterns">
|
|
<title
|
|
>Characters with a special meaning inside patterns</title>
|
|
|
|
<para
|
|
>The following characters have a special meaning inside a pattern, and must be escaped if you want to literally match them: <variablelist>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>\</userinput
|
|
> (backslash)</term>
|
|
<listitem
|
|
><para
|
|
>The escape character.</para
|
|
></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>^</userinput
|
|
> (caret)</term>
|
|
<listitem
|
|
><para
|
|
>Asserts the beginning of the string.</para
|
|
></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>$</userinput
|
|
></term>
|
|
<listitem
|
|
><para
|
|
>Asserts the end of string.</para
|
|
></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>()</userinput
|
|
> (left and right parentheses)</term>
|
|
<listitem
|
|
><para
|
|
>Denotes sub patterns.</para
|
|
></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>{}</userinput
|
|
> (left and right curly braces)</term>
|
|
<listitem
|
|
><para
|
|
>Denotes numeric quantifiers.</para
|
|
></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>[]</userinput
|
|
> (left and right square brackets)</term>
|
|
<listitem
|
|
><para
|
|
>Denotes character classes.</para
|
|
></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>|</userinput
|
|
> (vertical bar)</term>
|
|
<listitem
|
|
><para
|
|
>logical OR. Separates alternatives.</para
|
|
></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>+</userinput
|
|
> (plus sign)</term>
|
|
<listitem
|
|
><para
|
|
>Quantifier, 1 or more.</para
|
|
></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>*</userinput
|
|
> (asterisk)</term>
|
|
<listitem
|
|
><para
|
|
>Quantifier, 0 or more.</para
|
|
></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>?</userinput
|
|
> (question mark)</term>
|
|
<listitem
|
|
><para
|
|
>An optional character. Can be interpreted as a quantifier, 0 or 1.</para
|
|
></listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
|
|
|
|
</para>
|
|
|
|
</sect2>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="quantifiers">
|
|
<title
|
|
>Quantifiers</title>
|
|
|
|
<para
|
|
><emphasis
|
|
>Quantifiers</emphasis
|
|
> allow a regular expression to match a specified number, or range of numbers, of occurrences of a character, character class or sub pattern.</para>
|
|
|
|
<para
|
|
>Quantifiers are enclosed in curly brackets (<literal
|
|
>{</literal
|
|
> and <literal
|
|
>}</literal
|
|
>) and have the general form <literal
|
|
>{[minimum-occurrences][,[maximum-occurrences]]}</literal
|
|
> </para>
|
|
|
|
<para
|
|
>The usage is best explained by example: <variablelist>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>{1}</userinput
|
|
></term>
|
|
<listitem
|
|
><para
|
|
>Exactly 1 occurrence</para
|
|
></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>{0,1}</userinput
|
|
></term>
|
|
<listitem
|
|
><para
|
|
>Zero or 1 occurrences</para
|
|
></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>{,1}</userinput
|
|
></term>
|
|
<listitem
|
|
><para
|
|
>The same, with less work;)</para
|
|
></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>{5,10}</userinput
|
|
></term>
|
|
<listitem
|
|
><para
|
|
>At least 5 but maximum 10 occurrences.</para
|
|
></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>{5,}</userinput
|
|
></term>
|
|
<listitem
|
|
><para
|
|
>At least 5 occurrences, no maximum.</para
|
|
></listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
|
|
|
|
</para>
|
|
|
|
<para
|
|
>Additionally, there are some abbreviations: <variablelist>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>*</userinput
|
|
> (asterisk)</term>
|
|
<listitem
|
|
><para
|
|
>similar to <literal
|
|
>{0,}</literal
|
|
>, find any number of occurrences.</para
|
|
></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>+</userinput
|
|
> (plus sign)</term>
|
|
<listitem
|
|
><para
|
|
>similar to <literal
|
|
>{1,}</literal
|
|
>, at least 1 occurrence.</para
|
|
></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>?</userinput
|
|
> (question mark)</term>
|
|
<listitem
|
|
><para
|
|
>similar to <literal
|
|
>{0,1}</literal
|
|
>, zero or 1 occurrence.</para
|
|
></listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
|
|
|
|
</para>
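<para>The quantifier forms listed above can be compared with a small Python sketch. Python's <literal>re</literal> module is assumed for the illustration; the helper <literal>accepted</literal> and the sample strings are hypothetical and exist only in this example.</para>
<programlisting><![CDATA[
```python
import re

# Strings of 0, 1, 2, 5, 10 and 11 times the letter "a".
samples = ["", "a", "aa", "a" * 5, "a" * 10, "a" * 11]

def accepted(pattern):
    """Return the lengths of the samples that the pattern matches in full."""
    return [len(s) for s in samples if re.fullmatch(pattern, s)]

exactly_one = accepted(r"a{1}")
zero_or_one = accepted(r"a{0,1}")
five_to_ten = accepted(r"a{5,10}")
five_or_more = accepted(r"a{5,}")

# The abbreviations behave like their curly bracket forms.
star_same = accepted(r"a*") == accepted(r"a{0,}")
plus_same = accepted(r"a+") == accepted(r"a{1,}")
mark_same = accepted(r"a?") == accepted(r"a{0,1}")

print(exactly_one, zero_or_one, five_to_ten, five_or_more)
print(star_same, plus_same, mark_same)
```
]]></programlisting>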
|
|
|
|
<sect2>
|
|
|
|
<title
|
|
>Greed</title>
|
|
|
|
<para
|
|
>When using quantifiers with no maximum, regular expressions default to matching as much of the searched string as possible, commonly known as <emphasis
|
|
>greedy</emphasis
|
|
> behaviour.</para>
|
|
|
|
<para
|
|
>Modern regular expression software provides the means of <quote
|
|
>turning off greediness</quote
|
|
>, though in a graphical environment it is up to the interface to provide you with access to this feature. For example a search dialogue providing a regular expression search could have a check box labelled <quote
|
|
>Minimal matching</quote
|
|
>, and it ought to indicate whether greediness is the default behaviour.</para>
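<para>The difference between greedy and minimal matching can be seen in the following Python sketch. It assumes Python's <literal>re</literal> module, where minimal matching is requested by putting a question mark after the quantifier; that notation belongs to Python and is not described in this appendix as a feature of &kate;. The sample text is invented.</para>
<programlisting><![CDATA[
```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy (the default): .* grabs as much as it can, so the match runs
# from the first "<" to the last ">".
greedy = re.search(r"<.*>", text).group(0)

# Minimal: in Python, a "?" after the quantifier turns greediness off,
# so the match stops at the first ">".
minimal = re.search(r"<.*?>", text).group(0)

print(greedy)
print(minimal)
```
]]></programlisting>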
|
|
|
|
</sect2>
|
|
|
|
<sect2>
|
|
<title
|
|
>In context examples</title>
|
|
|
|
<para
|
|
>Here are a few examples of using quantifiers:</para>
|
|
|
|
<variablelist>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>^\d{4,5}\s</userinput
|
|
></term>
|
|
<listitem
|
|
><para
|
|
>Matches the digits, together with the whitespace character that follows them, in <quote
|
|
>1234 go</quote
|
|
> and <quote
|
|
>12345 now</quote
|
|
>, but neither in <quote
|
|
>567 eleven</quote
|
|
> nor in <quote
|
|
>223459 somewhere</quote
|
|
></para
|
|
></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>\s+</userinput
|
|
></term>
|
|
<listitem
|
|
><para
|
|
>Matches one or more whitespace characters</para
|
|
></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>(bla){1,}</userinput
|
|
></term>
|
|
<listitem
|
|
><para
|
|
>Matches all of <quote
|
|
>blablabla</quote
|
|
> and the <quote
|
|
>bla</quote
|
|
> in <quote
|
|
>blackbird</quote
|
|
> or <quote
|
|
>tabla</quote
|
|
></para
|
|
></listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>/?&gt;</userinput
></term>
<listitem
><para
>Matches <quote
>/&gt;</quote
> in <quote
>&lt;closeditem/&gt;</quote
> as well as <quote
>&gt;</quote
> in <quote
>&lt;openitem&gt;</quote
|
|
>.</para
|
|
></listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
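<para>The examples in this list can be reproduced with a Python sketch. Python's <literal>re</literal> module is assumed for the illustration and is not the engine used by &kate;; the helper <literal>first_match</literal> is hypothetical.</para>
<programlisting><![CDATA[
```python
import re

def first_match(pattern, text):
    """Return the text of the first match, or None if there is none."""
    match = re.search(pattern, text)
    return match.group(0) if match else None

# Four or five digits at the start, followed by a whitespace character.
four_or_five = [first_match(r"^\d{4,5}\s", s)
                for s in ["1234 go", "12345 now",
                          "567 eleven", "223459 somewhere"]]

# One or more repetitions of the sub pattern "bla".
bla = [first_match(r"(bla){1,}", s)
       for s in ["blablabla", "blackbird", "tabla"]]

# An optional slash followed by ">".
tag_end = [first_match(r"/?>", s) for s in ["<closeditem/>", "<openitem>"]]

print(four_or_five, bla, tag_end)
```
]]></programlisting>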
|
|
|
|
</sect2>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="assertions">
|
|
<title
|
|
>Assertions</title>
|
|
|
|
<para
|
|
><emphasis
|
|
>Assertions</emphasis
|
|
> allow a regular expression to match only under certain controlled conditions.</para>
|
|
|
|
<para
|
|
>An assertion does not need a character to match; rather, it investigates the surroundings of a possible match before acknowledging it. For example, the <emphasis
>word boundary</emphasis
> assertion does not try to find a non-word character next to a word character at its position; instead, it makes sure that there is not a word character on one of the sides. This means that the assertion can match where there is no character, &ie; at the ends of a searched string.</para>
|
|
|
|
<para
|
|
>Some assertions actually do have a pattern to match, but the part of the string matching it will not be a part of the result of the match of the full expression.</para>
|
|
|
|
<para
|
|
>Regular expressions as documented here support the following assertions: <variablelist>
|
|
|
|
<varlistentry
|
|
>
|
|
<term
|
|
><userinput
|
|
>^</userinput
|
|
> (caret: beginning of string)</term
|
|
>
|
|
<listitem
|
|
><para
|
|
>Matches the beginning of the searched string.</para
|
|
> <para
|
|
>The expression <userinput
|
|
>^Peter</userinput
|
|
> will match at <quote
|
|
>Peter</quote
|
|
> in the string <quote
|
|
>Peter, hey!</quote
|
|
> but not in <quote
|
|
>Hey, Peter!</quote
|
|
> </para
|
|
> </listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>$</userinput
|
|
> (end of string)</term>
|
|
<listitem
|
|
><para
|
|
>Matches the end of the searched string.</para>
|
|
|
|
<para
|
|
>The expression <userinput
|
|
>you\?$</userinput
|
|
> will match at the last you in the string <quote
|
|
>You didn't do that, did you?</quote
|
|
> but nowhere in <quote
|
|
>You didn't do that, right?</quote
|
|
></para>
|
|
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>\b</userinput
|
|
> (word boundary)</term>
|
|
<listitem
|
|
><para
|
|
>Matches if there is a word character at one side and not a word character at the other.</para>
|
|
<para
|
|
>This is useful for finding word ends, for example both ends in order to find a whole word. The expression <userinput
|
|
>\bin\b</userinput
|
|
> will match at the separate <quote
|
|
>in</quote
|
|
> in the string <quote
|
|
>He came in through the window</quote
|
|
>, but not at the <quote
|
|
>in</quote
|
|
> in <quote
|
|
>window</quote
|
|
>.</para
|
|
></listitem>
|
|
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>\B</userinput
|
|
> (non word boundary)</term>
|
|
<listitem
|
|
><para
|
|
>Matches wherever <quote
|
|
>\b</quote
|
|
> does not.</para>
|
|
<para
|
|
>That means that it will match for example within words: The expression <userinput
|
|
>\Bin\B</userinput
|
|
> will match in <quote
|
|
>window</quote
|
|
> but not in <quote
|
|
>integer</quote
|
|
> or <quote
|
|
>I'm in love</quote
|
|
>.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>(?=PATTERN)</userinput
|
|
> (Positive lookahead)</term>
|
|
<listitem
|
|
><para
|
|
>A lookahead assertion looks at the part of the string following a possible match. The positive lookahead will prevent the string from matching if the text following the possible match does not match the <emphasis
|
|
>PATTERN</emphasis
|
|
> of the assertion, but the text matched by that will not be included in the result.</para>
|
|
<para
|
|
>The expression <userinput
|
|
>handy(?=\w)</userinput
|
|
> will match at <quote
|
|
>handy</quote
|
|
> in <quote
|
|
>handyman</quote
|
|
> but not in <quote
|
|
>That came in handy!</quote
|
|
></para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
<varlistentry>
|
|
<term
|
|
><userinput
|
|
>(?!PATTERN)</userinput
|
|
> (Negative lookahead)</term>
|
|
|
|
<listitem
|
|
><para
|
|
>The negative lookahead prevents a possible match from being acknowledged if the following part of the searched string does match its <emphasis
|
|
>PATTERN</emphasis
|
|
>.</para>
|
|
<para
|
|
>The expression <userinput
|
|
>const \w+\b(?!\s*&amp;)</userinput
> will match at <quote
>const char</quote
> in the string <quote
>const char* foo</quote
> while it can not match <quote
>const QString</quote
> in <quote
>const QString&amp; bar</quote
> because the <quote
>&amp;</quote
|
|
> matches the negative lookahead assertion pattern.</para>
|
|
</listitem>
|
|
</varlistentry>
|
|
|
|
</variablelist>
|
|
|
|
</para>
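<para>The assertion examples from this section can be checked with a Python sketch. It assumes Python's <literal>re</literal> module, which uses the same notation for these assertions but is not the engine used by &kate;; the helper <literal>spans</literal> is hypothetical.</para>
<programlisting><![CDATA[
```python
import re

def spans(pattern, text):
    """Return the (start, end) positions of all matches."""
    return [m.span() for m in re.finditer(pattern, text)]

start = [bool(re.search(r"^Peter", s))
         for s in ["Peter, hey!", "Hey, Peter!"]]
end = [bool(re.search(r"you\?$", s))
       for s in ["You didn't do that, did you?",
                 "You didn't do that, right?"]]

# \b: only the separate word "in" matches, not the "in" inside "window".
whole_word = spans(r"\bin\b", "He came in through the window")

# \B: only an "in" with word characters on both sides matches.
inside = [bool(re.search(r"\Bin\B", s))
          for s in ["window", "integer", "I'm in love"]]

# Lookaheads check what follows without including it in the match.
handy = [bool(re.search(r"handy(?=\w)", s))
         for s in ["handyman", "That came in handy!"]]
const_hits = [bool(re.search(r"const \w+\b(?!\s*&)", s))
              for s in ["const char* foo", "const QString& bar"]]

print(start, end, whole_word, inside, handy, const_hits)
```
]]></programlisting>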
|
|
|
|
</sect1>
|
|
|
|
<!-- TODO sect1 id="backreferences">
|
|
|
|
<title
|
|
>Back References</title>
|
|
|
|
<para
|
|
></para>
|
|
|
|
</sect1 -->
|
|
|
|
</appendix>
|