The TDEStringMatcher class provides string matching against a list of one
or more match patterns along with associated options. A single pattern with
its associated options will be referred to herein as a "match specification".
Current match specification options include:
* Type of match pattern:
    REGEX: Pattern is a regular expression.
    WILDCARD: Pattern is a wildcard expression like that used
      in POSIX shell file globbing.
    SUBSTRING: Pattern is a simple substring that matches any
      string in which it occurs. Substring characters do not
      have any other meaning that controls matching.
* Alphanumeric character handling in a pattern:
    NONE: Each unescaped alphanumeric character in a pattern
      is distinct and will match only itself.
    CASE INSENSITIVE: Each unescaped letter in a pattern
      will match its lower and upper case variants.
    EQUIVALENCE: Each unescaped variant of an alphanumeric
      character will match all stylistic and accented
      variations of that character.
* Desired outcome of matching:
    TRUE: Match succeeds if a string matches the match pattern.
    FALSE: Match succeeds if a string does NOT match the match pattern.
Applications may set and get match specification lists either directly or
indirectly (using an encoded match specifications string). The matching
functions provided are:
    matchAny(): a string matches if it matches any pattern in the list.
    matchAll(): a string matches only if it matches all patterns in the list.
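
The interaction between the desired outcome (TRUE/FALSE) and the two
matching functions can be sketched as follows. This is a minimal,
self-contained illustration and not the TDEStringMatcher implementation:
MatchSpec, specSucceeds(), matchAnySketch() and matchAllSketch() are
hypothetical names, the pattern test is abstracted into a plain predicate,
and the results for an empty list are an assumption.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for one match specification: a predicate that
// reports whether the pattern matches, plus the desired outcome.
struct MatchSpec {
    std::function<bool(const std::string&)> patternMatches;
    bool wantMatch; // true: TRUE outcome, false: FALSE (inverted) outcome
};

// A specification succeeds when the raw pattern result equals the
// desired outcome.
inline bool specSucceeds(const MatchSpec& spec, const std::string& s) {
    return spec.patternMatches(s) == spec.wantMatch;
}

// Succeeds if at least one specification succeeds.
// Assumption: an empty list matches nothing.
inline bool matchAnySketch(const std::vector<MatchSpec>& specs,
                           const std::string& s) {
    for (const MatchSpec& spec : specs)
        if (specSucceeds(spec, s)) return true;
    return false;
}

// Succeeds only if every specification succeeds.
// Assumption: an empty list matches everything.
inline bool matchAllSketch(const std::vector<MatchSpec>& specs,
                           const std::string& s) {
    for (const MatchSpec& spec : specs)
        if (!specSucceeds(spec, s)) return false;
    return true;
}
```

With one specification "starts with '.'" (outcome TRUE) and another
"contains 'e'" (outcome FALSE), the string ".profile" passes the "any" test
but fails the "all" test, because the inverted specification does not
succeed for it.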

MATCH SPECIFICATIONS STRING
The TDEStringMatcher class provides applications with an encoded match
specifications string intended solely for storing and retrieving match
specifications. These strings are formatted as follows:
OptionString <Tab> PatternString [ <Tab> OptionString <Tab> PatternString ...]
Option strings may contain only the following characters:
'r' - Match pattern is a regular expression [default]
'w' - Match pattern is a wildcard expression
's' - Match pattern is a simple substring
'c' - Letter case variants are distinct (i.e. case-sensitive) [default]
'i' - Letter case variants are equivalent (i.e. case-insensitive)
'e' - All letter & number character variants are equivalent
'=' - Match succeeds if pattern matches [default]
'!' - Match succeeds if pattern does NOT match (inverted match)
Option strings should ideally contain exactly 3 characters indicating match
pattern type, alphanumeric character handling, and desired outcome of matching.
Specifying fewer option characters is possible but may result in unexpected
inferred values. Specifying additional and possibly contradictory option
characters is also possible, with later characters overriding earlier ones.
Pattern strings may not be empty. Invalid pattern strings will cause the
entire match specifications string to be rejected.
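
The format above can be read with a small parser along the following
lines. This is a hedged sketch in standard C++17, not the class's own
code: ParsedSpec, PatternType, CharHandling and parseMatchSpecs() are
hypothetical names. Falling back to the documented defaults for missing
option characters and rejecting unknown option characters are assumptions,
since the text only says that inferred values may be unexpected. Pattern
validity (for example regex syntax) is not checked here.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

enum class PatternType { Regex, Wildcard, Substring };
enum class CharHandling { Distinct, CaseInsensitive, Equivalence };

// Hypothetical in-memory form of one match specification.
struct ParsedSpec {
    PatternType type = PatternType::Regex;          // 'r' [default]
    CharHandling handling = CharHandling::Distinct; // 'c' [default]
    bool wantMatch = true;                          // '=' [default]
    std::string pattern;
};

// Parses "Options<Tab>Pattern[<Tab>Options<Tab>Pattern...]".
// Returns std::nullopt if the whole string has to be rejected.
inline std::optional<std::vector<ParsedSpec>>
parseMatchSpecs(const std::string& encoded) {
    // Split the input on <Tab> into alternating option/pattern fields.
    std::vector<std::string> fields;
    std::size_t start = 0;
    for (;;) {
        const std::size_t tab = encoded.find('\t', start);
        if (tab == std::string::npos) {
            fields.push_back(encoded.substr(start));
            break;
        }
        fields.push_back(encoded.substr(start, tab - start));
        start = tab + 1;
    }
    if (fields.size() % 2 != 0) return std::nullopt; // options without pattern

    std::vector<ParsedSpec> specs;
    for (std::size_t i = 0; i < fields.size(); i += 2) {
        ParsedSpec spec;
        // Later option characters override earlier ones.
        for (const char c : fields[i]) {
            switch (c) {
                case 'r': spec.type = PatternType::Regex; break;
                case 'w': spec.type = PatternType::Wildcard; break;
                case 's': spec.type = PatternType::Substring; break;
                case 'c': spec.handling = CharHandling::Distinct; break;
                case 'i': spec.handling = CharHandling::CaseInsensitive; break;
                case 'e': spec.handling = CharHandling::Equivalence; break;
                case '=': spec.wantMatch = true; break;
                case '!': spec.wantMatch = false; break;
                default: return std::nullopt; // assumption: reject unknowns
            }
        }
        // An empty pattern rejects the entire specifications string.
        if (fields[i + 1].empty()) return std::nullopt;
        spec.pattern = fields[i + 1];
        specs.push_back(spec);
    }
    return specs;
}
```

In this sketch the contradictory option string "ci" yields case-insensitive
handling because 'i' comes last, and a string such as "ri=<Tab>" is rejected
as a whole because its pattern is empty.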
Match specifications strings that are stored in TDE configuration files will
be modified as follows:
'\' characters in original pattern are encoded as '\\'
The <Tab> separator is encoded as '\t'
Using file name matching as an example, the match specifications string:
wc= <Tab> .* <Tab> rc= <Tab> ~$ <Tab> se! <Tab> e <Tab> ri= <Tab> ^a.+\.[0-9]+$
encoded in a TDE configuration file as:
wc=\t.*\trc=\t~$\tse!\te\tri=\t^a.+\\.[0-9]+$
will match file names as follows:
* All "dotfiles" would be matched with wildcard matching.
* All file names ending with '~' (e.g. kwrite backup names) would be
matched with case-sensitive regex matching.
* All file names that do NOT contain an equivalent variant of the letter
'e' (e.g. 'e','ê','Ě') would be matched with substring matching.
* All file names starting with letter 'a' or 'A' and ending with '.'
followed by one or more numeric digits would be matched with case-
insensitive regex matching.
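
The two configuration file substitutions described above ('\' to '\\' and
<Tab> to '\t') can be reproduced with the sketch below. The function names
encodeForConfig() and decodeFromConfig() are hypothetical, and the text
does not say which component of TDE actually performs the substitution, so
this only illustrates the transformation itself. How the decoder treats any
other escape sequence is an assumption.

```cpp
#include <cstddef>
#include <string>

// Applies the two documented substitutions: '\' -> "\\", <Tab> -> "\t".
inline std::string encodeForConfig(const std::string& specs) {
    std::string out;
    for (const char c : specs) {
        if (c == '\\')      out += "\\\\"; // two backslash characters
        else if (c == '\t') out += "\\t";  // backslash followed by 't'
        else                out += c;
    }
    return out;
}

// Reverses encodeForConfig(). Assumption: for any other escaped character
// the backslash is dropped and the character itself is kept.
inline std::string decodeFromConfig(const std::string& stored) {
    std::string out;
    for (std::size_t i = 0; i < stored.size(); ++i) {
        if (stored[i] == '\\' && i + 1 < stored.size()) {
            const char next = stored[++i];
            out += (next == 't') ? '\t' : next;
        } else {
            out += stored[i];
        }
    }
    return out;
}
```

Feeding the file name example through encodeForConfig() gives the stored
form shown earlier, and decodeFromConfig() restores the original string.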

IMPLEMENTATION NOTES:
* Regular expressions are currently supported by TQRegExp and are
thereby subject to its limitations and bugs. This may be changed
in the future (e.g. direct access to PCRE2, porting of Qt 5.x
QRegularExpression).
* Wildcard pattern matching on GLIBC systems is done using the fnmatch
function with GNU extended patterns supported. Consult the fnmatch(3)
and glob(7) manual pages for more information. On non-GLIBC systems,
basic (not extended) wildcard patterns are converted to basic regular
expressions and processed by the underlying regular expression engine.
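
The fallback path for non-GLIBC systems (wildcard converted to a regular
expression) might look roughly like the sketch below. It is an
illustration only: wildcardToRegex() is a hypothetical name, std::regex
with its default ECMAScript grammar stands in for TQRegExp, and bracket
expressions such as [abc] are deliberately not translated and are treated
as literal text. The special handling of a leading '.' that fnmatch can
apply is not modelled either.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Converts a basic wildcard pattern into an anchored regular expression:
//   '*' -> ".*", '?' -> ".", "\x" -> literal x, anything else -> literal.
inline std::string wildcardToRegex(const std::string& glob) {
    // Characters that must be escaped to be literal in the regex.
    static const std::string meta = "\\.^$|()[]{}+*?";
    auto appendLiteral = [](std::string& re, char c) {
        if (meta.find(c) != std::string::npos) re += '\\';
        re += c;
    };

    std::string re = "^";
    for (std::size_t i = 0; i < glob.size(); ++i) {
        const char c = glob[i];
        if (c == '*') {
            re += ".*";
        } else if (c == '?') {
            re += '.';
        } else if (c == '\\' && i + 1 < glob.size()) {
            appendLiteral(re, glob[++i]); // escaped character is literal
        } else {
            appendLiteral(re, c);
        }
    }
    re += '$';
    return re;
}

// Convenience wrapper: does the whole string match the wildcard pattern?
inline bool wildcardMatches(const std::string& glob, const std::string& s) {
    return std::regex_match(s, std::regex(wildcardToRegex(glob)));
}
```

Under these assumptions the "dotfiles" pattern ".*" becomes the regular
expression "^\..*$", which matches ".bashrc" but not "README".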
* Simple substrings are also supported as match patterns. These are
currently processed by the TQString.find() function. In the future,
these may be converted and processed by the underlying regex engine,
depending on the tradeoff between code simplification and efficiency.
* Alphanumeric equivalence is conceptually similar to [=x=] POSIX
equivalence class bracket expressions (which are not supported)
but is intended to apply globally in patterns. The following
are caveats when this option is utilized:
- There is potentially significant overhead due to the fact that
match patterns and match strings must be converted prior to
matching. Conversion requires character-by-character lookup
and replacement using a pre-built table.
- The table contains equivalents for [0-9A-Z] which should work
well for Latin-derived languages. It also contains support for
other numeric and non-Latin letter characters, the efficacy of
which is not as certain.
- Due to the 16-bit size limitation of TQChar, the table does not
contain mappings for codepoints greater than U+FFFF.
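
The conversion step described in these caveats (character-by-character
lookup and replacement using a pre-built table, applied to both pattern
and string) can be illustrated as follows. The table here is a tiny
hand-written sample and every name (equivalenceTable(), foldEquivalents(),
containsEquivalent()) is hypothetical; the real table is described as
covering far more characters. char16_t is used to mirror the 16-bit limit
noted for TQChar, so codepoints above U+FFFF are out of scope here as well.

```cpp
#include <string>
#include <unordered_map>

// Small sample table mapping variants to a canonical character.
inline const std::unordered_map<char16_t, char16_t>& equivalenceTable() {
    static const std::unordered_map<char16_t, char16_t> table = {
        {u'e', u'E'},
        {u'\u00E9', u'E'}, // e with acute
        {u'\u00EA', u'E'}, // e with circumflex
        {u'\u00CA', u'E'}, // E with circumflex
        {u'\u011A', u'E'}, // E with caron
        {u'\u011B', u'E'}, // e with caron
        {u'a', u'A'},
        {u'\u00E1', u'A'}, // a with acute
    };
    return table;
}

// Replaces every character that has a table entry with its canonical
// equivalent; characters without an entry are copied unchanged.
inline std::u16string foldEquivalents(const std::u16string& in) {
    const auto& table = equivalenceTable();
    std::u16string out;
    out.reserve(in.size());
    for (const char16_t ch : in) {
        const auto it = table.find(ch);
        out += (it != table.end()) ? it->second : ch;
    }
    return out;
}

// Substring match with equivalence: both sides are converted first,
// which is where the overhead mentioned above comes from.
inline bool containsEquivalent(const std::u16string& text,
                               const std::u16string& pattern) {
    return foldEquivalents(text).find(foldEquivalents(pattern))
           != std::u16string::npos;
}
```

With this sample table, a pattern of 'e' is found in a string that only
contains 'é', since both are converted to the same canonical character
before comparison.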