tdeutils/doc/KRegExpEditor/index.docbook

<?xml version="1.0" ?>
<!DOCTYPE book PUBLIC "-//KDE//DTD DocBook XML V4.2-Based Variant V1.1//EN" "dtd/kdex.dtd" [
 <!ENTITY % English "INCLUDE">
 <!ENTITY % addindex "IGNORE">
]>

<book lang="&language;">

  <bookinfo>
    <title>The Regular Expression Editor Manual</title>

    <authorgroup>
      <author>
        <firstname>Jesper K.</firstname>
        <surname>Pedersen</surname>
        <affiliation><address><email>blackie@kde.org</email></address></affiliation>
      </author>
    </authorgroup>

    <date>2001-07-03</date>
    <releaseinfo>0.1</releaseinfo>

    <legalnotice>&underFDL;</legalnotice>

    <copyright>
      <year>2001</year>
      <holder>Jesper K. Pedersen</holder>
    </copyright>

    <abstract>
      <para>This Handbook describes the Regular Expression Editor widget</para>
    </abstract>

    <keywordset>
      <keyword>KDE</keyword>
      <keyword>regular expression</keyword>
    </keywordset>
  </bookinfo>

  <!-- ====================================================================== -->
  <!--                               Introduction                             -->
  <!-- ====================================================================== -->
  <chapter id="introduction">
    <title>Introduction</title>


    <para>
      The regular expression editor is an editor for editing regular expression
      in a graphical style (in contrast to the <acronym>ASCII</acronym> syntax). Traditionally
      regular expressions have been typed in the <acronym>ASCII</acronym> syntax, which for example
      looks like <literal>^.*kde\b</literal>. The major drawbacks of
      this style are:
      <itemizedlist>
        <listitem><para>It is hard to understand for
            non-programmers.</para></listitem>

        <listitem><para>It requires that you <emphasis>escape</emphasis>
            certain symbols (to match a star for example, you need to type
            <literal>\*</literal>). </para></listitem>

        <listitem><para>It requires that you remember rules for
            <emphasis>precedence</emphasis> (What does <literal>x|y*</literal>
            match? a single <literal>x</literal> or a number of
            <literal>y</literal>, <emphasis>OR</emphasis> a number of
            <literal>x</literal> and <literal>y</literal>'s mixed?)</para></listitem>
      </itemizedlist>
    </para>

    <para>
      The regular expression editor, on the other hand, lets you
      <emphasis>draw</emphasis> your regular expression in an unambiguous
      way. The editor solves at least item two and three above. It might not make
      regular expressions available for the non-programmers, though only tests by
      users can tell that. So, if are you a non programmer, who has gained the
      power of regular expression from this editor, then please
      <ulink url="mailto:blackie@kde.org">let me know</ulink>.
    </para>

  </chapter>

  <!-- ====================================================================== -->
  <!--                       What is a Regular Expression                     -->
  <!-- ====================================================================== -->
  <chapter id="whatIsARegExp">
    <title>What is a Regular Expression</title>

    <para>A regular expression is a way to specify
      <emphasis>conditions</emphasis> to be fulfilled for a situation
      in mind. Normally when you search in a text editor you specify
      the text to search for <emphasis>literally</emphasis>, using a
      regular expression, on the other hand, you tell what a given
      match would look like. Examples of this include <emphasis>I'm
      searching for the word KDE, but only at the beginning of the
      line</emphasis>, or <emphasis>I'm searching for the word
      <literal>the</literal>, but it must stand on its own</emphasis>,
      or <emphasis>I'm searching for files starting with the word
      <literal>test</literal>, followed by a number of digits, for
      example <literal>test12</literal>, <literal>test107</literal>
      and <literal>test007</literal></emphasis></para>

    <para>You build regular expressions from smaller regular
      expressions, just like you build large Lego toys from smaller
      subparts. As in the Lego world, there are a number of basic
      building blocks. In the following I will describe each of these
      basic building blocks using a number of examples.</para>

    <example>
      <title>Searching for normal text.</title>
      <para>If you just want to search for a given text, a then regular
        expression is definitely not a good choice. The reason for this is that
        regular expressions assign special meaning to some characters. This
        includes the following characters: <literal>.*|$</literal>. Thus if you want to
        search for the text <literal>kde.</literal> (i.e. the characters
        <literal>kde</literal> followed by a period), then you would need to
        specify this as <literal>kde\.</literal><footnote><para>The regular
            expression editor solves this problem by taking care of escape rules for
            you.</para></footnote> Writing <literal>\.</literal> rather than just
        <literal>.</literal> is called <emphasis>escaping</emphasis>.
      </para>
    </example>

    <example id="positionregexp">
      <title>Matching URLs</title>
      <para>When you select something looking like a URL in KDE, then the
        program <command>klipper</command> will offer to start
        <command>konqueror</command> with the selected URL.</para>

      <para><command>Klipper</command> does this by matching the selection
        against several different regular expressions, when one of the regular
        expressions matches, the accommodating command will be offered.</para>

      <para>The regular expression for URLs says (among other things), that the
        selection must start with the text <literal>http://</literal>. This is
        described using regular expressions by prefixing the text
        <literal>http://</literal> with a hat (the <literal>^</literal>
        character).</para>

      <para>The above is an example of matching positions using regular
        expressions. Similar, the position <emphasis>end-of-line</emphasis> can
        be matched using the character <literal>$</literal> (i.e. a dollar
        sign).</para>
    </example>

    <example id="boundaryregexp">
      <title>Searching for the word <literal>the</literal>, but not
        <emphasis>the</emphasis><literal>re</literal>,
        <literal>brea</literal><emphasis>the</emphasis> or
        <literal>ano</literal><emphasis>the</emphasis><literal>r</literal></title>
      <para>Two extra position types can be matches in the above way,
        namely <emphasis>the position at a word boundary</emphasis>, and
        <emphasis>the position at a <emphasis>non</emphasis>-word
          boundary</emphasis>. The positions are specified using the text
        <literal>\b</literal> (for word-boundary) and <literal>\B</literal> (for
        non-word boundary)<emphasis></emphasis></para>

      <para>Thus, searching for the word <literal>the</literal> can be done
        using the regular expression <literal>\bthe\b</literal>. This specifies
        that we are searching for <literal>the</literal> with no letters on each
        side of it (i.e. with a word boundary on each side)</para>

      <para>The four position matching regular expressions are inserted in the
        regular expression editor using <link linkend="positiontool">four
          different positions tool</link></para>
    </example>

    <example id="altnregexp">
      <title>Searching for either <literal>this</literal> or <literal>that</literal></title>
      <para>Imagine that you want to run through your document searching for
        either the word <literal>this</literal> or the word
        <literal>that</literal>. With a normal search method you could do this in
        two sweeps, the first time around, you would search for
        <literal>this</literal>, and the second time around you would search for
        <literal>that</literal>.</para>

      <para>Using regular expression searches you would search for both in the
        same sweep. You do this by searching for
        <literal>this|that</literal>. I.e. separating the two words with a
        vertical bar.<footnote><para>Note on each side of the vertical bar is a
            regular expression, so this feature is not only for searching for two
            different pieces of text, but for searching for two different regular
            expressions.</para></footnote></para>

      <para>In the regular expression editor you do not write the vertical bar
        yourself, but instead select the <link linkend="altntool">alternative
          tool</link>, and insert the smaller regular expressions above each other.</para>
    </example>

    <example id="repeatregexp">
      <title>Matching anything</title>
      <para>Regular expressions are often compared to wildcard matching in the
        shell - that is the capability to specify a number of files using the
        asterisk. You will most likely recognize wildcard matching from the
        following examples:
        <itemizedlist>
          <listitem><para><literal>ls *.txt</literal> - here <literal>*.txt</literal> is
              the shell wildcard matching every file ending with the
              <literal>.txt</literal> extension.</para></listitem>
          <listitem><para><literal>cat test??.res</literal> - matching every file starting with
              <literal>test</literal> followed by two arbitrary characters, and finally
              followed by the test <literal>.res</literal></para></listitem>
        </itemizedlist>
      </para>

      <para>In the shell the asterisk matches any character any number of
        times. In other words, the asterisk matches <emphasis>anything</emphasis>.
        This is written like <literal>.*</literal> with regular expression
        syntax. The dot matches any single character, i.e. just
        <emphasis>one</emphasis> character, and the asterisk, says that the
        regular expression prior to it should be matched any number of
        times. Together this says any single character any number of
        times.</para>

      <para>This may seem overly complicated, but when you get the larger
        picture you will see the power. Let me show you another basic regular
        expression: <literal>a</literal>. The letter <literal>a</literal> on its
        own is a regular expression that matches a single letter, namely the
        letter <literal>a</literal>. If we combine this with the asterisk,
        i.e. <literal>a*</literal>, then we have a regular expression matching
        any number of a's.</para>

      <para>We can combine several regular expression after each
        other, for example <literal>ba(na)*</literal>.
        <footnote><para><literal>(na)*</literal> just says that what is inside
            the parenthesis is repeated any number of times.</para></footnote>
        Imagine you had typed this regular expression into the search field in a
        text editor, then you would have found the following words (among
        others): <literal>ba</literal>, <literal>bana</literal>,
        <literal>banana</literal>, <literal>bananananananana</literal>
      </para>

      <para>Given the information above, it hopefully isn't hard for you to write the
        shell wildcard <literal>test??.res</literal> as a regular expression
        Answer: <literal>test..\.res</literal>. The dot on its own is any
        character. To match a single dot you must write
        <literal>\.</literal><footnote><para>This is called escaping</para></footnote>. In
        other word, the regular expression <literal>\.</literal> matches a dot,
        while a dot on its own matches any character. </para>

      <para>In the regular expression editor, a repeated regular expression is
        created using the <link linkend="repeattool">repeat tool</link> </para>
    </example>

    <example id="lookaheadregexp">
      <title>Replacing <literal>&amp;</literal> with
        <literal>&amp;amp;</literal> in a HTML document</title> <para>In
        HTML the special character <literal>&amp;</literal> must be
        written as <literal>&amp;amp;</literal> - this is similar to
        escaping in regular expressions.</para>

      <para>Imagine that you have written an HTML document in a normal editor
        (e.g. XEmacs or Kate), and you totally forgot about this rule. What you
        would do when realized your mistake was to replace every occurrences of
        <literal>&amp;</literal> with <literal>&amp;amp;</literal>.</para>

      <para>This can easily be done using normal search and replace,
        there is, however, one glitch. Imagine that you did remember
        this rule - <emphasis>just a bit</emphasis> - and did it right
        in some places. Replacing unconditionally would result in
        <literal>&amp;amp;</literal> being replaced with
        <literal>&amp;amp;amp;</literal></para>

      <para>What you really want to say is that <literal>&amp;</literal> should
        only be replaced if it is <emphasis>not</emphasis> followed by the letters
        <literal>amp;</literal>. You can do this using regular expressions using
        <emphasis>positive lookahead</emphasis>. </para>

      <para>The regular expression, which only matches an ampersand if it is
        not followed by the letters <literal>amp;</literal> looks as follows:
        <literal>&amp;(?!amp;)</literal>. This is, of course, easier to read using
        the regular expression editor, where you would use the
        <link linkend="lookaheadtools">lookahead tools</link>.</para>
    </example>

  </chapter>

  <!-- ====================================================================== -->
  <!--                    Using the Regular Expression Editor                 -->
  <!-- ====================================================================== -->
  <chapter id="theEditor">
    <title>Using the Regular Expression Editor</title>

    <para>
      This chapter will tell you about how the regular expression editor works.
    </para>

    <!-- ====================================================================== -->
    <!--                   The organization of the screen                       -->
    <!-- ====================================================================== -->
    <sect1 id="screenorganization">
      <title>The organization of the screen</title>

      <mediaobject>
        <imageobject><imagedata format="PNG" fileref="theEditor.png"/></imageobject>
      </mediaobject>

      <para>The most important part of the editor is of course the editing
        area, this is the area where you draw your regular expression. This
        area is the larger gray one in the middle.</para>

      <para>Above the editing area you have two Toolbars, the first one
        contains the <link linkend="editingtools">editing actions</link> -
        much like drawing tools in a drawing program. The second Toolbar
        contains the <emphasis>whats this</emphasis> button, and buttons
        for undo and redo.</para>

      <para>Below the editing area you find the regular expression
        currently build, in the so called ascii syntax. The ascii syntax
        is updated while you edit the regular expression in the graphical
        editor. If you rather want to update the ascii syntax then please
        do, the graphical editor is updated on the fly to reflect your
        changes.</para>

      <para>Finally to the left of the editor area you will find a number
        of pre-built regular expressions. They serve two purposes: (1) When
        you load the editor with a regular expression then this regular
        expression is made <emphasis>nicer</emphasis> or more comprehensive
        by replacing common regular expressions. In the screen dump above,
        you can see how the ascii syntax ".*" have been replaced with a box
        saying "anything". (2) When you insert regular expression you may
        find building blocks for your own regular expression from the set of
        pre build regular expressions. See the section on
        <link linkend="userdefinedregexps">user defined regular
          expressions</link> to learn how to save your own regular expressions.</para>
    </sect1>

    <!-- ====================================================================== -->
    <!--                         Editing Tools                                  -->
    <!-- ====================================================================== -->
    <sect1 id="editingtools">
      <title>Editing Tools</title>
      <para>The text in this section expects that you have read the chapter
        on <link linkend="whatIsARegExp">what a regular expression
          is</link>, or have previous knowledge on this subject.</para>

      <para>All the editing tools are located in the tool bar above
        editing area. Each of them will be described in the following.</para>


      <simplesect id="selecttool">
        <title>Selection Tool</title>
	<mediaobject>
            <imageobject><imagedata format="PNG" fileref="select.png"/>
        </imageobject></mediaobject>
        <para> The selection tool is used to
          mark elements for cut-and-paste and drag-and-drop. This is very
          similar to a selection tool in any drawing program.</para>
      </simplesect>


      <simplesect id="texttool"><title>Text Tool</title>
      <mediaobject>
      <imageobject>
	    <imagedata format="PNG" fileref="text.png"/>
	</imageobject></mediaobject>

        <para><inlinemediaobject><imageobject>
              <imagedata format="PNG" fileref="texttool.png"/>
        </imageobject></inlinemediaobject></para>

        <para>Using this tool you will insert normal text to match. The
        text is matched literally, i.e. you do not have to worry about
        escaping of special characters.  In the example above the following
          regular expression will be build: <literal>abc\*\\\)</literal></para>
      </simplesect>


      <simplesect id="characterstool"><title>Character Tool</title>
      <mediaobject><imageobject>
            <imagedata format="PNG" fileref="characters.png"/>
            </imageobject></mediaobject>
        <para><inlinemediaobject><imageobject>
            <imagedata format="PNG" fileref="charactertool.png"/>
            </imageobject></inlinemediaobject></para>

        <para> Using this tool you insert
          character ranges. Examples includes what in ASCII text says
          <literal>[0-9]</literal>, <literal>[^a-zA-Z,_]</literal>. When
          inserting an item with this tool a  dialog will appear, in which
          you specify the character ranges.</para>

        <para>See description of <link linkend="repeatregexp">repeated
            regular expressions</link>.</para>
      </simplesect>


      <simplesect id="anychartool"><title>Any Character Tool</title>
        <mediaobject><imageobject>
              <imagedata format="PNG" fileref="anychar.png"/>
        </imageobject></mediaobject>
        <para><inlinemediaobject><imageobject><imagedata format="PNG" fileref="anychartool.png"/>
        </imageobject></inlinemediaobject></para>

        <para>This is the regular expression "dot" (.). It matches any
          single character.</para>


        </simplesect>


      <simplesect id="repeattool"><title>Repeat Tool</title>
      <mediaobject><imageobject>
            <imagedata format="PNG" fileref="repeat.png"/>
            </imageobject></mediaobject>
        <para><inlinemediaobject><imageobject>
            <imagedata format="PNG" fileref="repeattool.png"/>
            </imageobject></inlinemediaobject></para>

        <para>This is the repeated
            elements. This includes what in ASCII syntax is represented
            using an asterix (*), a plus (+), a question mark (?), and
            ranges ({3,5}). When you insert an item using this tool, a
            dialog will appear asking for the number of times to
            repeat.</para>

          <para>You specify what to repeat by drawing the repeated content
            inside the box which this tool inserts.</para>

          <para>Repeated elements can both be built from the outside in and
	the inside
            out. That is you can first draw what to be repeated, select it
            and use the repeat tool to repeat it. Alternatively you can
            first insert the repeat element, and draw what is to be repeated
            inside it.</para>

        <para>See description on the <link linkend="repeatregexp">repeated
            regular expressions</link>.</para>
        </simplesect>


      <simplesect id="altntool"><title>Alternative Tool</title>
      <mediaobject><imageobject>
            <imagedata format="PNG" fileref="altn.png"/>
            </imageobject></mediaobject>
        <para><inlinemediaobject><imageobject><imagedata format="PNG" fileref="altntool.png"/>
        </imageobject></inlinemediaobject></para>

        <para>This is the alternative regular expression (|). You specify
          the alternatives by drawing each alternative on top of each other
          inside the box that this tool inserts.</para>

        <para>See description on <link linkend="altnregexp">alternative
            regular expressions</link></para>
      </simplesect>


      <simplesect id="compoundtool"><title>Compound Tool</title>
        <mediaobject><imageobject>
              <imagedata format="PNG" fileref="compound.png"/>
        </imageobject></mediaobject>
        <para><inlinemediaobject><imageobject><imagedata format="PNG" fileref="compoundtool.png"/>
        </imageobject></inlinemediaobject></para>

        <para>The compound tool does not represent any regular
          expressions. It is used to group other sub parts together in a
          box, which easily can be collapsed to only its title. This can be
          seen in the right part of the screen dump above.</para>
      </simplesect>


      <simplesect id="positiontool"><title>Line Start/End Tools</title>
        <mediaobject><imageobject>
            <imagedata format="PNG" fileref="begline.png"/>
        </imageobject></mediaobject>
          <mediaobject><imageobject>
              <imagedata format="PNG" fileref="endline.png"/>
        </imageobject></mediaobject>
        <para><inlinemediaobject><imageobject><imagedata format="PNG" fileref="linestartendtool.png"/>
        </imageobject></inlinemediaobject></para>

        <para>The line start and line end tools matches the start of the
          line, and the end of the line respectively. The regular
          expression in the screen dump above thus matches lines only
          matches spaces.</para>

        <para>See description of <link linkend="positionregexp">position
            regular expressions</link>.</para>
      </simplesect>


      <simplesect><title>Word (Non)Boundary Tools</title>
      <mediaobject><imageobject>
            <imagedata format="PNG" fileref="wordboundary.png"/>
            </imageobject></mediaobject>
          <mediaobject><imageobject><imagedata format="PNG" fileref="nonwordboundary.png"/>
        </imageobject></mediaobject>
        <para><inlinemediaobject><imageobject><imagedata format="PNG" fileref="boundarytools.png"/>
        </imageobject></inlinemediaobject></para>

        <para>The boundary tools matches a word boundary respectively a
          non-word boundary. The regular expression in the screen dump thus
          matches any words starting with <literal>the</literal>. The word
          <literal>the</literal> itself is, however, not matched.</para>

        <para>See description of <link linkend="boundaryregexp">boundary
            regular expressions</link>.</para>
      </simplesect>


      <simplesect id="lookaheadtools"><title>Positive/Negative Lookahead
          Tools</title>
	  <mediaobject><imageobject> <imagedata format="PNG" fileref="poslookahead.png"/>
        </imageobject></mediaobject>
          <mediaobject><imageobject> <imagedata format="PNG" fileref="neglookahead.png"/>
        </imageobject></mediaobject>

        <para><inlinemediaobject><imageobject> <imagedata format="PNG" fileref="lookaheadtools.png"/>
        </imageobject></inlinemediaobject></para>

        <para>The look ahead tools either specify a positive or negative
          regular expression to match. The match is, however, not part of
          the total match.</para>

        <para>Note: You are only allowed to place lookaheads at the end
          of the regular expressions. The Regular Expression Editor widget
          does not enforce this.</para>

        <para>See description of <link linkend="lookaheadregexp">look ahead
          regular expressions</link>.</para>
      </simplesect>
    </sect1>

  <!-- ====================================================================== -->
  <!--                  User Defined Regular Expressions                      -->
  <!-- ====================================================================== -->
    <sect1 id="userdefinedregexps">
      <title>User Defined Regular Expressions</title>
      <para>Located at the left of the editing area is a list box
        containing user defined regular expressions. Some regular
        expressions are pre-installed with your KDE installation, while
        others you can save yourself.</para>

      <para>These regular expression serves two purposes
        (<link linkend="screenorganization">see detailed
          description</link>), namely (1) to offer you a set of building
        block and (2) to make common regular expressions prettier.</para>

      <para>You can save your own regular expressions by right clicking the
        mouse button in the editing area, and choosing <literal>Save Regular
          Expression</literal>.</para>

      <para>If the regular expression you save is within a
        <link linkend="compoundtool">compound container</link> then the
        regular expression will take part in making subsequent regular
        expressions prettier.</para>

      <para>User defined regular expressions can be deleted or renamed by
        pressing the right mouse button on top of the regular expression in
        question in the list box.</para>
    </sect1>
  </chapter>

  <!-- ====================================================================== -->
  <!--                  Reporting a bug and Suggesting Features               -->
  <!-- ====================================================================== -->
  <chapter id="bugreport">
    <title>Reporting bugs and Suggesting Features</title>
    <para>Bug reports and feature requests should be submitted through the
      <ulink url="http://bugs.trinitydesktop.org/">KDE Bug Tracking System</ulink>. <emphasis
                                                                                   role="strong">Before</emphasis> you report a bug or suggest a feature,
      please check that it hasn't already been
      <ulink url="http://bugs.trinitydesktop.org/simple_search.cgi?id=kregexpeditor">reported/suggested.</ulink></para>
  </chapter>

  <!-- ====================================================================== -->
  <!--                                 FAQ                                    -->
  <!-- ====================================================================== -->
  <chapter id="faq">
    <title>Frequently Asked Questions</title>
    <sect1 id="question1">
      <title>Does the regular expression editor support back references?</title>
      <para>No currently this is not supported. It is planned for the next
        version.</para>
    </sect1>

    <sect1 id="question2">
      <title>Does the regular expression editor support showing matches?</title>
      <para>No, hopefully this will be available in the next version.</para>
    </sect1>

    <sect1 id="question3">
      <title>I'm the author of a KDE program, how can I use this widget in
        my application?</title>
      <para>See <ulink
                       url="http://developer.kde.org/documentation/library/cvs-api/classref/interfaces/KRegExpEditorInterface.html">The documentation for the class KRegExpEditorInterface</ulink>.</para>
    </sect1>

    <sect1 id="question4">
      <title>I can't find the <emphasis>Edit Regular expression</emphasis> button in for example
        konqueror on another KDE3 installation, why?</title>
      <para>The regular expression widget is located in the package
        KDE-utils. If you do not have this package installed, then the
        <emphasis>edit regular expressions</emphasis> buttons will not
        appear in the programs.</para>
    </sect1>
  </chapter>

  <!-- ====================================================================== -->
  <!--                           Credits and Licenses                         -->
  <!-- ====================================================================== -->
  <chapter id="credits-and-license">
    <title>Credits and Licenses</title>

    <para>
      Documentation is copyright 2001, Jesper K. Pedersen
      <email>blackie@kde.org</email>
    </para>


    &underGPL;
    &underFDL;

  </chapter>


</book>