
<appendix id="regular-expressions">
<appendixinfo>
<authorgroup>
<author
>&Anders.Lund; &Anders.Lund.mail;</author>
<othercredit role="translator"
><firstname
>Malcolm</firstname
><surname
>Hunter</surname
><affiliation
><address
><email
>malcolm.hunter@gmx.co.uk</email
></address
></affiliation
><contrib
>Conversion to British English</contrib
></othercredit
>
</authorgroup>
</appendixinfo>

<title
>Regular Expressions</title>

<synopsis
>This appendix contains a brief but hopefully sufficient
introduction to the world of <emphasis
>regular
expressions</emphasis
>. It documents regular expressions in the form
available within &kate;, which is not compatible with the regular
expressions of Perl, nor with those of, for example,
<command
>grep</command
>.</synopsis>

<sect1>

<title
>Introduction</title>

<para
><emphasis
>Regular Expressions</emphasis
> provide us with a way to describe the possible contents of a text string in a form understood by a small piece of software, so that it can investigate whether a text matches and, in the case of advanced applications, save pieces of the matching text.</para>

<para
>An example: say you want to search a text for paragraphs that start with either of the names <quote
>Henrik</quote
> or <quote
>Pernille</quote
> followed by some form of the verb <quote
>say</quote
>.</para>

<para
>With a normal search, you would start out searching for the first name, <quote
>Henrik</quote
> maybe followed by <quote
>sa</quote
> like this: <userinput
>Henrik sa</userinput
>, and while looking for matches, you would have to discard those that are not at the beginning of a paragraph, as well as those in which the word starting with the letters <quote
>sa</quote
> was not <quote
>says</quote
>, <quote
>said</quote
> or the like. And then, of course, repeat all of that with the next name...</para>

<para
>With Regular Expressions, that task could be accomplished with a single search, and with a greater degree of precision.</para>

<para
>To achieve this, Regular Expressions define rules for expressing in detail a generalisation of a string to match. Our example, which we might literally express like this: <quote
>A line starting with either <quote
>Henrik</quote
> or <quote
>Pernille</quote
> (possibly following up to 4 blanks or tab characters) followed by a space followed by <quote
>sa</quote
> and then either <quote
>ys</quote
> or <quote
>id</quote
></quote
> could be expressed with the following regular expression:</para
> <para
><userinput
>^[ \t]{0,4}(Henrik|Pernille) sa(ys|id)</userinput
></para>

<para
>The above example demonstrates all four major concepts of modern Regular Expressions, namely:</para>

<itemizedlist>
<listitem
><para
>Patterns</para
></listitem>
<listitem
><para
>Assertions</para
></listitem>
<listitem
><para
>Quantifiers</para
></listitem>
<listitem
><para
>Back references</para
></listitem>
</itemizedlist>

<para
>The caret (<literal
>^</literal
>) starting the expression is an assertion, being true only if the following matching string is at the start of a line.</para>

<para
>The strings <literal
>[ \t]</literal
> and <literal
>(Henrik|Pernille) sa(ys|id)</literal
> are patterns. The first one is a <emphasis
>character class</emphasis
> that matches either a blank or a (horizontal) tab character; the other pattern contains first a subpattern matching either <literal
>Henrik</literal
> <emphasis
>or</emphasis
> <literal
>Pernille</literal
>, then a piece matching the exact string <literal
> sa</literal
> and finally a subpattern matching either <literal
>ys</literal
> <emphasis
>or</emphasis
> <literal
>id</literal
>.</para>

<para
>The string <literal
>{0,4}</literal
> is a quantifier saying <quote
>anywhere from 0 up to 4 of the previous</quote
>.</para>

<para
>Because regular expression software supporting the concept of <emphasis
>back references</emphasis
> saves the entire matching part of the string as well as sub-patterns enclosed in parentheses, given some means of access to those references, we could get our hands on the whole match (when searching a text document in an editor with a regular expression, that is often marked as selected), the name found, or the last part of the verb.</para>

<para
>Altogether, the expression will match where we wanted it to, and only there.</para>
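<para>The walk-through above can be tried out with a short script. The following is only a sketch: it assumes Python and its <literal>re</literal> module as a stand-in for the editor's own engine, and the sample text is invented for the illustration. The <literal>re.MULTILINE</literal> flag is used so that the caret asserts the start of each line.</para>

<programlisting><![CDATA[
```python
import re

# The expression from the introduction. re.MULTILINE makes ^ assert the
# start of every line rather than only the start of the whole text.
PATTERN = re.compile(r"^[ \t]{0,4}(Henrik|Pernille) sa(ys|id)", re.MULTILINE)

# Hypothetical sample text, invented for this illustration.
TEXT = (
    "Henrik says hello.\n"
    "\tPernille said nothing.\n"
    "Then Henrik said goodbye.\n"
    "Henrik sat down.\n"
)

matches = list(PATTERN.finditer(TEXT))
for m in matches:
    # group(0) is the whole match; group(1) and group(2) are the two
    # sub-patterns enclosed in parentheses.
    print(repr(m.group(0)), m.group(1), m.group(2))
```
]]></programlisting>

<para>In this sketch only the first two lines of the sample text match: the third fails the caret assertion and the fourth fails the final sub-pattern.</para>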

<para
>The following sections will describe in detail how to construct and use patterns, character classes, assertions, quantifiers and back references, and the final section will give a few useful examples.</para>

</sect1>

<sect1 id="regex-patterns">

<title
>Patterns</title>

<para
>Patterns consist of literal strings and character classes. Patterns may contain sub-patterns, which are patterns enclosed in parentheses.</para>

<sect2>
<title
>Escaping characters</title>

<para
>In patterns as well as in character classes, some characters have a special meaning. To literally match any of those characters, they must be marked or <emphasis
>escaped</emphasis
> to let the regular expression software know that it should interpret such characters in their literal meaning.</para>

<para
>This is done by preceding the character with a backslash (<literal
>\</literal
>).</para>


<para
>The regular expression software will silently ignore escaping a character that does not have any special meaning in the context, so escaping for example a <quote
>j</quote
> (<userinput
>\j</userinput
>) is safe. If you are in doubt whether a character could have a special meaning, you can therefore escape it safely.</para>

<para
>Escaping of course includes the backslash character itself: to literally match one, you would write <userinput
>\\</userinput
>.</para>
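<para>As a rough illustration of escaping, here is a minimal sketch that assumes Python's <literal>re</literal> module rather than the editor's own engine; the sample strings are hypothetical. Note one difference from the behaviour described above: Python does not silently ignore an unnecessary escape of a letter such as <literal>\j</literal>, but reports an error.</para>

<programlisting><![CDATA[
```python
import re

# A backslash in front of a special character makes it literal.
literal_dot = re.compile(r"3\.14")   # \. matches only a real dot
any_char = re.compile(r"3.14")       # an unescaped dot matches any character

# A literal backslash is written as two backslashes in the expression.
backslash = re.compile(r"\\")

dot_hits = [bool(literal_dot.search(s)) for s in ("3.14", "3x14")]
any_hits = [bool(any_char.search(s)) for s in ("3.14", "3x14")]

# Hypothetical path containing two backslash characters.
backslash_count = len(backslash.findall("C:\\temp\\file"))

print(dot_hits, any_hits, backslash_count)
```
]]></programlisting>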

</sect2>

<sect2>
<title
>Character Classes and abbreviations</title>

<para
>A <emphasis
>character class</emphasis
> is an expression that matches one of a defined set of characters. In Regular Expressions, character classes are defined by putting the legal characters for the class in square brackets, <literal
>[]</literal
>, or by using one of the abbreviated classes described below.</para>

<para
>Simple character classes just contain one or more literal characters, for example <userinput
>[abc]</userinput
> (matching either of the letters <quote
>a</quote
>, <quote
>b</quote
> or <quote
>c</quote
>) or <userinput
>[0123456789]</userinput
> (matching any digit).</para>

<para
>Because letters and digits have a logical order, you can abbreviate those by specifying ranges of them: <userinput
>[a-c]</userinput
> is equal to <userinput
>[abc]</userinput
> and <userinput
>[0-9]</userinput
> is equal to <userinput
>[0123456789]</userinput
>. Combining these constructs, for example <userinput
>[a-fynot1-38]</userinput
>, is completely legal (the last one would match, of course, any of <quote
>a</quote
>,<quote
>b</quote
>,<quote
>c</quote
>,<quote
>d</quote
>, <quote
>e</quote
>,<quote
>f</quote
>,<quote
>y</quote
>,<quote
>n</quote
>,<quote
>o</quote
>,<quote
>t</quote
>, <quote
>1</quote
>,<quote
>2</quote
>,<quote
>3</quote
> or <quote
>8</quote
>).</para>

<para
>As capital letters are different characters from their non-capital equivalents, to create a caseless character class matching <quote
>a</quote
> or <quote
>b</quote
>, in any case, you need to write it <userinput
>[aAbB]</userinput
>.</para>

<para
>It is of course possible to create a <quote
>negative</quote
> class matching <quote
>anything but</quote
>. To do so, put a caret (<literal
>^</literal
>) at the beginning of the class: </para>

<para
><userinput
>[^abc]</userinput
> will match any character <emphasis
>but</emphasis
> <quote
>a</quote
>, <quote
>b</quote
> or <quote
>c</quote
>.</para>
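<para>The classes discussed so far can be checked one character at a time. This is a sketch only, assuming Python's <literal>re</literal> module as a stand-in; the helper <literal>chars_matching</literal> and the sample characters are hypothetical names introduced for the illustration.</para>

<programlisting><![CDATA[
```python
import re

def chars_matching(char_class, candidates):
    """Return the characters of `candidates` that the class matches."""
    return "".join(c for c in candidates if re.fullmatch(char_class, c))

sample = "abcxyzAB0189"   # hypothetical sample characters

simple = chars_matching(r"[abc]", sample)
ranged = chars_matching(r"[a-c]", sample)
combined = chars_matching(r"[a-fynot1-38]", sample)
caseless = chars_matching(r"[aAbB]", sample)
negative = chars_matching(r"[^abc]", sample)

print(simple, ranged, combined, caseless, negative)
```
]]></programlisting>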

<para
>In addition to literal characters, some abbreviations are defined, making life still a bit easier: <variablelist>

<varlistentry>
<term
><userinput
>\a</userinput
></term>
<listitem
><para
>This matches the <acronym
>ASCII</acronym
> bell character (BEL, 0x07).</para
></listitem>
</varlistentry>

<varlistentry>
<term
><userinput
>\f</userinput
></term>
<listitem
><para
>This matches the <acronym
>ASCII</acronym
> form feed character (FF, 0x0C).</para
></listitem>
</varlistentry>

<varlistentry>
<term
><userinput
>\n</userinput
></term>
<listitem
><para
>This matches the <acronym
>ASCII</acronym
> line feed character (LF, 0x0A, Unix newline).</para
></listitem>
</varlistentry>

<varlistentry>
<term
><userinput
>\r</userinput
></term>
<listitem
><para
>This matches the <acronym
>ASCII</acronym
> carriage return character (CR, 0x0D).</para
></listitem>
</varlistentry>

<varlistentry>
<term
><userinput
>\t</userinput
></term>
<listitem
><para
>This matches the <acronym
>ASCII</acronym
> horizontal tab character (HT, 0x09).</para
></listitem>
</varlistentry>

<varlistentry>
<term
><userinput
>\v</userinput
></term>
<listitem
><para
>This matches the <acronym
>ASCII</acronym
> vertical tab character (VT, 0x0B).</para
></listitem>
</varlistentry>
<varlistentry>
<term
><userinput
>\xhhhh</userinput
></term>

<listitem
><para
>This matches the Unicode character corresponding to the hexadecimal number hhhh (between 0x0000 and 0xFFFF). \0ooo (&ie;, \zero ooo) matches the <acronym
>ASCII</acronym
>/Latin-1 character corresponding to the octal number ooo (between 0 and 0377).</para
></listitem>
</varlistentry>

<varlistentry>
<term
><userinput
>.</userinput
> (dot)</term>
<listitem
><para
>This matches any character (including newline).</para
></listitem>
</varlistentry>

<varlistentry>
<term
><userinput
>\d</userinput
></term>
<listitem
><para
>This matches a digit. Equal to <literal
>[0-9]</literal
></para
></listitem>
</varlistentry>

<varlistentry>
<term
><userinput
>\D</userinput
></term>
<listitem
><para
>This matches a non-digit. Equal to <literal
>[^0-9]</literal
> or <literal
>[^\d]</literal
></para
></listitem>
</varlistentry>

<varlistentry>
<term
><userinput
>\s</userinput
></term>
<listitem
><para
>This matches a whitespace character. Practically equal to <literal
>[ \t\n\r]</literal
></para
></listitem>
</varlistentry>

<varlistentry>
<term
><userinput
>\S</userinput
></term>
<listitem
><para
>This matches a non-whitespace. Practically equal to <literal
>[^ \t\r\n]</literal
>, and equal to <literal
>[^\s]</literal
></para
></listitem>
</varlistentry>

<varlistentry>
<term
><userinput
>\w</userinput
></term>
<listitem
><para
>Matches any <quote
>word character</quote
> - in this case any letter or digit. Note that underscore (<literal
>_</literal
>) is not matched, unlike in Perl regular expressions. Equal to <literal
>[a-zA-Z0-9]</literal
></para
></listitem>
</varlistentry>

<varlistentry>
<term
><userinput
>\W</userinput
></term>
<listitem
><para
>Matches any non-word character - anything but letters or numbers. Equal to <literal
>[^a-zA-Z0-9]</literal
> or <literal
>[^\w]</literal
></para
></listitem>
</varlistentry>


</variablelist>

</para>

<para
>The abbreviated classes can be put inside a custom class. For example, to match a word character, a blank or a dot, you could write <userinput
>[\w \.]</userinput
>.</para
>
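<para>The abbreviated classes can be exercised with a few lines of code. This sketch assumes Python's <literal>re</literal> module, which differs from the description above in one respect: its <literal>\w</literal> also matches the underscore. The explicit class <literal>[a-zA-Z0-9]</literal> is therefore used to show the behaviour documented here; all sample strings are hypothetical.</para>

<programlisting><![CDATA[
```python
import re

# Shorthand classes can be mixed with literal characters inside brackets.
word_blank_dot = re.compile(r"[\w \.]+")

# The word class as documented in this appendix, spelled out explicitly
# because Python's own \w additionally matches the underscore.
documented_word = re.compile(r"[a-zA-Z0-9]+")

digits = re.findall(r"\d+", "Room 101, floor 3")
non_space = re.findall(r"\S+", "tab\tseparated  words")
run = word_blank_dot.match("Mr. Smith, Esq.").group(0)
python_w = re.findall(r"\w+", "snake_case")
doc_w = documented_word.findall("snake_case")

print(digits, non_space, run, python_w, doc_w)
```
]]></programlisting>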

<note
> <para
>The POSIX notation of classes, <userinput
>[:&lt;class name&gt;:]</userinput
> is currently not supported.</para
> </note>

<sect3>
<title
>Characters with special meanings inside character classes</title>

<para
>The following characters have a special meaning inside the <quote
>[]</quote
> character class construct, and must be escaped to be literally included in a class:</para>

<variablelist>
<varlistentry>
<term
><userinput
>]</userinput
></term>
<listitem
><para
>Ends the character class. Must be escaped unless it is the very first character in the class (may follow an unescaped caret)</para
></listitem>
</varlistentry>
<varlistentry>
<term
><userinput
>^</userinput
> (caret)</term>
<listitem
><para
>Denotes a negative class, if it is the first character. Must be escaped to match literally if it is the first character in the class.</para
></listitem
>
</varlistentry>
<varlistentry>
<term
><userinput
>-</userinput
> (dash)</term>
<listitem
><para
>Denotes a logical range. Must always be escaped within a character class.</para
></listitem>
</varlistentry>
<varlistentry>
<term
><userinput
>\</userinput
> (backslash)</term>
<listitem
><para
>The escape character. Must always be escaped.</para
></listitem>
</varlistentry>

</variablelist>
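<para>A class that contains all four of these characters literally might look like the one below. This is a sketch assuming Python's <literal>re</literal> module as a stand-in; the input strings are hypothetical.</para>

<programlisting><![CDATA[
```python
import re

# ], ^, - and \ are each escaped so that they are taken literally inside [].
specials = re.compile(r"[\]\^\-\\]")

text = "a]b^c-d\\e"   # hypothetical input holding all four characters
found = specials.findall(text)

# Unescaped, the dash forms a range: [a-c] is a, b or c, while [a\-c]
# is a, a literal dash, or c.
range_hits = re.findall(r"[a-c]", "a-b-c")
literal_hits = re.findall(r"[a\-c]", "a-b-c")

print(found, range_hits, literal_hits)
```
]]></programlisting>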

</sect3>

</sect2>

<sect2>

<title
>Alternatives: matching <quote
>one of</quote
></title>

<para
>If you want to match one of a set of alternative patterns, you can separate those with <literal
>|</literal
> (vertical bar character).</para>

<para
>For example, to find either <quote
>John</quote
> or <quote
>Harry</quote
> you would use the expression <userinput
>John|Harry</userinput
>.</para>

</sect2>

<sect2>

<title
>Sub Patterns</title>

<para
><emphasis
>Sub patterns</emphasis
> are patterns enclosed in parentheses, and they have several uses in the world of regular expressions.</para>

<sect3>

<title
>Specifying alternatives</title>

<para
>You may use a sub pattern to group a set of alternatives within a larger pattern. The alternatives are separated by the character <quote
>|</quote
> (vertical bar).</para>

<para
>For example to match either of the words <quote
>int</quote
>, <quote
>float</quote
> or <quote
>double</quote
>, you could use the pattern <userinput
>int|float|double</userinput
>. If you only want to find one of them when it is followed by some whitespace and then some letters, put the alternatives inside a subpattern: <userinput
>(int|float|double)\s+\w+</userinput
>.</para>
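<para>The effect of grouping the alternatives can be seen by comparing the two expressions, plus a third variant without the parentheses. This is a sketch assuming Python's <literal>re</literal> module; the line of source code being searched is hypothetical.</para>

<programlisting><![CDATA[
```python
import re

bare = re.compile(r"int|float|double")
grouped = re.compile(r"(int|float|double)\s+\w+")
# Without parentheses the tail \s+\w+ belongs to "double" only.
ungrouped_tail = re.compile(r"int|float|double\s+\w+")

code = "double ratio; print(x); int count;"   # hypothetical source line

bare_hits = bare.findall(code)
grouped_hits = [m.group(0) for m in grouped.finditer(code)]
ungrouped_hits = ungrouped_tail.findall(code)

print(bare_hits)
print(grouped_hits)
print(ungrouped_hits)
```
]]></programlisting>

<para>In this sketch the bare alternatives also find the <quote>int</quote> inside <quote>print</quote>, while the grouped form only reports a type name that is followed by whitespace and a word.</para>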

</sect3>

<sect3>

<title
>Capturing matching text (back references)</title>

<para
>If you want to use a back reference, use a sub pattern to have the desired part of the pattern remembered.</para>

<para
>For example, if you want to find two occurrences of the same word separated by a comma and possibly some whitespace, you could write <userinput
>(\w+),\s*\1</userinput
>. The sub pattern <literal
>\w+</literal
> would find a chunk of word characters, and the entire expression would match if those were followed by a comma, zero or more whitespace characters and then an equal chunk of word characters. (The string <literal
>\1</literal
> references <emphasis
>the first sub pattern enclosed in parentheses</emphasis
>)</para>
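<para>A minimal sketch of this back reference, assuming Python's <literal>re</literal> module as a stand-in for the editor's engine. The sentences are hypothetical, and the replacement step at the end is an addition for illustration, not something described in this appendix.</para>

<programlisting><![CDATA[
```python
import re

repeated = re.compile(r"(\w+),\s*\1")

hit = repeated.search("well, well, what now")    # hypothetical sentence
miss = repeated.search("well, then, what now")   # no repeated word

# group(0) is the whole match, group(1) the remembered sub-pattern.
whole, word = hit.group(0), hit.group(1)

# The same reference can be used in a replacement to drop the repeat.
collapsed = repeated.sub(r"\1", "well, well, what now")

print(whole, word, miss, collapsed)
```
]]></programlisting>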

<!-- <para
>See also <link linkend="backreferences"
>Back references</link
>.</para
> -->

</sect3>

<sect3 id="lookahead-assertions">
<title
>Lookahead Assertions</title>

<para
>A lookahead assertion is a sub pattern, starting with either <literal
>?=</literal
> or <literal
>?!</literal
>.</para>

<para
>For example to match the literal string <quote
>Bill</quote
> but only if not followed by <quote
> Gates</quote
>, you could use this expression: <userinput
>Bill(?! Gates)</userinput
>. (This would find <quote
>Bill Clinton</quote
> as well as <quote
>Billy the kid</quote
>, but silently ignore the other matches.)</para>

<para
>Sub patterns used for assertions are not captured.</para>
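<para>The negative lookahead from the example can be checked as follows. This is a sketch assuming Python's <literal>re</literal> module, which uses the same <literal>(?!...)</literal> notation.</para>

<programlisting><![CDATA[
```python
import re

not_gates = re.compile(r"Bill(?! Gates)")

names = ["Bill Clinton", "Billy the kid", "Bill Gates"]
results = {name: bool(not_gates.search(name)) for name in names}

# The lookahead is neither captured nor part of the matched text.
m = not_gates.search("Bill Clinton")
matched_text = m.group(0)
captured = m.groups()

print(results, matched_text, captured)
```
]]></programlisting>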

<para
>See also <link linkend="assertions"
>Assertions</link
></para>

</sect3>

</sect2>

<sect2 id="special-characters-in-patterns">
<title
>Characters with a special meaning inside patterns</title>

<para
>The following characters have a special meaning inside a pattern, and must be escaped if you want to literally match them: <variablelist>

<varlistentry>
<term
><userinput
>\</userinput
> (backslash)</term>
<listitem
><para
>The escape character.</para
></listitem>
</varlistentry>

<varlistentry>
<term
><userinput
>^</userinput
> (caret)</term>
<listitem
><para
>Asserts the beginning of the string.</para
></listitem>
</varlistentry>

<varlistentry>
<term
><userinput
>$</userinput
></term>
<listitem
><para
>Asserts the end of string.</para
></listitem>
</varlistentry>

<varlistentry>
<term
><userinput
>()</userinput
> (left and right parentheses)</term>
<listitem
><para
>Denotes sub patterns.</para
></listitem>
</varlistentry>

<varlistentry>
<term
><userinput
>{}</userinput
> (left and right curly braces)</term>
<listitem
><para
>Denotes numeric quantifiers.</para
></listitem>
</varlistentry>

<varlistentry>
<term
><userinput
>[]</userinput
> (left and right square brackets)</term>
<listitem
><para
>Denotes character classes.</para
></listitem>
</varlistentry>

<varlistentry>
<term
><userinput
>|</userinput
> (vertical bar)</term>
<listitem
><para
>Logical OR. Separates alternatives.</para
></listitem>
</varlistentry>

<varlistentry>
<term
><userinput
>+</userinput
> (plus sign)</term>
<listitem
><para
>Quantifier, 1 or more.</para
></listitem>
</varlistentry>

<varlistentry>
<term
><userinput
>*</userinput
> (asterisk)</term>
<listitem
><para
>Quantifier, 0 or more.</para
></listitem>
</varlistentry>

<varlistentry>
<term
><userinput
>?</userinput
> (question mark)</term>
<listitem
><para
>An optional character. Can be interpreted as a quantifier, 0 or 1.</para
></listitem>
</varlistentry>

</variablelist>

</para>

</sect2>

</sect1>

<sect1 id="quantifiers">
<title
>Quantifiers</title>

<para
><emphasis
>Quantifiers</emphasis
> allow a regular expression to match a specified number or range of numbers of either a character, character class or sub pattern.</para>

<para
>Quantifiers are enclosed in curly brackets (<literal
>{</literal
> and <literal
>}</literal
>) and have the general form <literal
>{[minimum-occurrences][,[maximum-occurrences]]}</literal
> </para>

<para
>The usage is best explained by example: <variablelist>

<varlistentry>
<term
><userinput
>{1}</userinput
></term>
<listitem
><para
>Exactly 1 occurrence</para
></listitem>
</varlistentry>

<varlistentry>
<term
><userinput
>{0,1}</userinput
></term>
<listitem
><para
>Zero or 1 occurrences</para
></listitem>
</varlistentry>

<varlistentry>
<term
><userinput
>{,1}</userinput
></term>
<listitem
><para
>The same, with less work;)</para
></listitem>
</varlistentry>

<varlistentry>
<term
><userinput
>{5,10}</userinput
></term>
<listitem
><para
>At least 5 but at most 10 occurrences.</para
></listitem>
</varlistentry>

<varlistentry>
<term
><userinput
>{5,}</userinput
></term>
<listitem
><para
>At least 5 occurrences, no maximum.</para
></listitem>
</varlistentry>

</variablelist>

</para>

<para
>Additionally, there are some abbreviations: <variablelist>

<varlistentry>
<term
><userinput
>*</userinput
> (asterisk)</term>
<listitem
><para
>Equivalent to <literal
>{0,}</literal
>, find any number of occurrences.</para
></listitem>
</varlistentry>

<varlistentry>
<term
><userinput
>+</userinput
> (plus sign)</term>
<listitem
><para
>Equivalent to <literal
>{1,}</literal
>, at least 1 occurrence.</para
></listitem>
</varlistentry>

<varlistentry>
<term
><userinput
>?</userinput
> (question mark)</term>
<listitem
><para
>Equivalent to <literal
>{0,1}</literal
>, zero or 1 occurrence.</para
></listitem>
</varlistentry>

</variablelist>

</para>
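<para>The abbreviations can be compared with their curly-bracket forms directly. This is a sketch assuming Python's <literal>re</literal> module, which accepts the same quantifier notation, including the <literal>{,1}</literal> form; the input strings are hypothetical.</para>

<programlisting><![CDATA[
```python
import re

text = "a ab abbb"          # hypothetical input
spellings = "color colour"  # hypothetical input

# Each abbreviation is paired with its curly-bracket form.
pairs = [
    (r"ab*", r"ab{0,}", text),
    (r"ab+", r"ab{1,}", text),
    (r"colou?r", r"colou{0,1}r", spellings),
    (r"colou?r", r"colou{,1}r", spellings),
]
results = [(re.findall(short, s), re.findall(full, s)) for short, full, s in pairs]

# A bounded range: an "a" followed by 2 to 3 occurrences of "b".
bounded = re.findall(r"\bab{2,3}\b", "ab abb abbb abbbb")

for short_hits, full_hits in results:
    print(short_hits, full_hits)
print(bounded)
```
]]></programlisting>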

<sect2>

<title
>Greed</title>

<para
>When using quantifiers with no maximum, regular expressions default to matching as much of the searched string as possible, commonly known as <emphasis
>greedy</emphasis
> behaviour.</para>

<para
>Modern regular expression software provides the means of <quote
>turning off greediness</quote
>, though in a graphical environment it is up to the interface to provide you with access to this feature. For example, a search dialogue providing a regular expression search could have a check box labelled <quote
>Minimal matching</quote
>, and it ought to indicate whether greediness is the default behaviour.</para>
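<para>The difference between greedy and minimal matching is easiest to see side by side. The sketch below assumes Python's <literal>re</literal> module, where minimal matching is requested by appending <literal>?</literal> to a quantifier. That notation is a property of Python and an assumption here; it is not something this appendix documents for the editor. The input string is hypothetical.</para>

<programlisting><![CDATA[
```python
import re

text = "<b>bold</b> and <i>italic</i>"   # hypothetical input

# Greedy: .+ takes as much as possible, so the match runs to the last ">".
greedy = re.search(r"<.+>", text).group(0)

# Minimal: .+? takes as little as possible, so the match stops at the
# first ">".
minimal = re.search(r"<.+?>", text).group(0)

print(greedy)
print(minimal)
```
]]></programlisting>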

</sect2>

<sect2>
<title
>In context examples</title>

<para
>Here are a few examples of using quantifiers:</para>

<variablelist>

<varlistentry>
<term
><userinput
>^\d{4,5}\s</userinput
></term>
<listitem
><para
>Matches the digits in <quote
>1234 go</quote
> and <quote
>12345 now</quote
>, but neither in <quote
>567 eleven</quote
> nor in <quote
>223459 somewhere</quote
></para
></listitem>
</varlistentry>

<varlistentry>
<term
><userinput
>\s+</userinput
></term>
<listitem
><para
>Matches one or more whitespace characters</para
></listitem>
</varlistentry>

<varlistentry>
<term
><userinput
>(bla){1,}</userinput
></term>
<listitem
><para
>Matches all of <quote
>blablabla</quote
> and the <quote
>bla</quote
> in <quote
>blackbird</quote
> or <quote
>tabla</quote
></para
></listitem>
</varlistentry>

<varlistentry>
<term
><userinput
>/?&gt;</userinput
></term>
<listitem
><para
>Matches <quote
>/&gt;</quote
> in <quote
>&lt;closeditem/&gt;</quote
> as well as <quote
>&gt;</quote
> in <quote
>&lt;openitem&gt;</quote
>.</para
></listitem>
</varlistentry>

</variablelist>
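<para>The examples in the list above can be run against the quoted strings. This is a sketch assuming Python's <literal>re</literal> module; the helper <literal>first_match</literal> is a hypothetical name introduced for the illustration. Note that the first expression also consumes the whitespace character after the digits.</para>

<programlisting><![CDATA[
```python
import re

def first_match(pattern, text):
    """Return the text of the first match, or None when nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

digits = [first_match(r"^\d{4,5}\s", s)
          for s in ("1234 go", "12345 now", "567 eleven", "223459 somewhere")]
bla = [first_match(r"(bla){1,}", s)
       for s in ("blablabla", "blackbird", "tabla")]
tags = [first_match(r"/?>", s)
        for s in ("<closeditem/>", "<openitem>")]

print(digits)
print(bla)
print(tags)
```
]]></programlisting>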

</sect2>

</sect1>

<sect1 id="assertions">
<title
>Assertions</title>

<para
><emphasis
>Assertions</emphasis
> allow a regular expression to match only under certain controlled conditions.</para>

<para
>An assertion does not need a character to match; rather, it investigates the surroundings of a possible match before acknowledging it. For example, the <emphasis
>word boundary</emphasis
> assertion does not try to find a non-word character opposite a word character at its position; instead, it makes sure that there is not a word character. This means that the assertion can match where there is no character, &ie; at the ends of a searched string.</para>

<para
>Some assertions actually do have a pattern to match, but the part of the string matching it will not be part of the result of the match of the full expression.</para>

<para
>Regular Expressions as documented here support the following assertions: <variablelist>

<varlistentry
>
<term
><userinput
>^</userinput
> (caret: beginning of string)</term
>
<listitem
><para
>Matches the beginning of the searched string.</para
> <para
>The expression <userinput
>^Peter</userinput
> will match at <quote
>Peter</quote
> in the string <quote
>Peter, hey!</quote
> but not in <quote
>Hey, Peter!</quote
> </para
> </listitem>
</varlistentry>

<varlistentry>
<term
><userinput
>$</userinput
> (end of string)</term>
<listitem
><para
>Matches the end of the searched string.</para>

<para
>The expression <userinput
>you\?$</userinput
> will match at the last <quote>you</quote> in the string <quote
>You didn't do that, did you?</quote
> but nowhere in <quote
>You didn't do that, right?</quote
></para>

</listitem>
</varlistentry>

<varlistentry>
<term
><userinput
>\b</userinput
> (word boundary)</term>
<listitem
><para
>Matches if there is a word character at one side and not a word character at the other.</para>
<para
>This is useful for finding word ends, for example both ends in order to find a whole word. The expression <userinput
>\bin\b</userinput
> will match at the separate <quote
>in</quote
> in the string <quote
>He came in through the window</quote
>, but not at the <quote
>in</quote
> in <quote
>window</quote
>.</para
></listitem>

</varlistentry>

<varlistentry>
<term
><userinput
>\B</userinput
> (non word boundary)</term>
<listitem
><para
>Matches wherever <quote
>\b</quote
> does not.</para>
<para
>That means that it will match for example within words: The expression <userinput
>\Bin\B</userinput
> will match at the <quote>in</quote> in <quote
>window</quote
> but not in <quote
>integer</quote
> or <quote
>I'm in love</quote
>.</para>
</listitem>
</varlistentry>

<varlistentry>
<term
><userinput
>(?=PATTERN)</userinput
> (Positive lookahead)</term>
<listitem
><para
>A lookahead assertion looks at the part of the string following a possible match. The positive lookahead will prevent the string from matching if the text following the possible match does not match the <emphasis
>PATTERN</emphasis
> of the assertion, but the text matched by that will not be included in the result.</para>
<para
>The expression <userinput
>handy(?=\w)</userinput
> will match at <quote
>handy</quote
> in <quote
>handyman</quote
> but not in <quote
>That came in handy!</quote
></para>
</listitem>
</varlistentry>

<varlistentry>
<term
><userinput
>(?!PATTERN)</userinput
> (Negative lookahead)</term>

<listitem
><para
>The negative lookahead prevents a possible match from being acknowledged if the following part of the searched string does match its <emphasis
>PATTERN</emphasis
>.</para>
<para
>The expression <userinput
>const \w+\b(?!\s*&amp;)</userinput
> will match at <quote
>const char</quote
> in the string <quote
>const char* foo</quote
> while it cannot match <quote
>const QString</quote
> in <quote
>const QString&amp; bar</quote
> because the <quote
>&amp;</quote
> matches the negative lookahead assertion pattern.</para>
</listitem>
</varlistentry>

</variablelist>

</para>
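<para>The assertions listed above can be checked against the example strings from the text. This is a sketch assuming Python's <literal>re</literal> module, which uses the same notation for these assertions; the helper <literal>found</literal> is a hypothetical name introduced for the illustration.</para>

<programlisting><![CDATA[
```python
import re

def found(pattern, text):
    """True when `pattern` matches somewhere in `text`."""
    return re.search(pattern, text) is not None

caret = (found(r"^Peter", "Peter, hey!"), found(r"^Peter", "Hey, Peter!"))
dollar = (found(r"you\?$", "You didn't do that, did you?"),
          found(r"you\?$", "You didn't do that, right?"))

# Only the separate word "in" is found, not the "in" inside "window".
boundary = [m.start()
            for m in re.finditer(r"\bin\b", "He came in through the window")]
non_boundary = [found(r"\Bin\B", s)
                for s in ("window", "integer", "I'm in love")]

positive = (found(r"handy(?=\w)", "handyman"),
            found(r"handy(?=\w)", "That came in handy!"))

negative = re.compile(r"const \w+\b(?!\s*&)")
const_hit = negative.search("const char* foo").group(0)
const_miss = negative.search("const QString& bar")

print(caret, dollar, boundary, non_boundary, positive, const_hit, const_miss)
```
]]></programlisting>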

</sect1>

<!-- TODO sect1 id="backreferences">

<title
>Back References</title>

<para
></para>

</sect1 -->

</appendix>