The TDEStringMatcher class provides string matching against a list of one
or more match patterns along with associated options. A single pattern with
its associated options will be referred to herein as a "match specification".

Current match specification options include:

  * Type of match pattern:

      REGEX: Pattern is a regular expression.
      WILDCARD: Pattern is a wildcard expression like that used
        in POSIX shell file globbing.
      SUBSTRING: Pattern is a simple substring that matches any
        string in which it occurs. Substring characters do not
        have any other meaning that controls matching.

  * Alphanumeric character handling in a pattern:

      NONE: Each unescaped alphanumeric character in a pattern
        is distinct and will match only itself.
      CASE INSENSITIVE: Each unescaped letter in a pattern
        will match its lower and upper case variants.
      EQUIVALENCE: Each unescaped alphanumeric character in a
        pattern will match all stylistic and accented variations
        of that character.

  * Desired outcome of matching:

      TRUE: Match succeeds if a string matches the match pattern.
      FALSE: Match succeeds if a string does NOT match the match pattern.

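As a rough illustration of the three option groups above, a match
specification could be modelled as a small record. This is only a sketch:
the names PatternType, CharHandling, MatchSpec and specSucceeds are
hypothetical and are not taken from the TDEStringMatcher headers. The
default member values are assumed to be regex, no special character
handling, and succeed-on-match.

```cpp
#include <string>

// Hypothetical names: illustrative only, not the actual TDEStringMatcher types.
enum class PatternType  { Regex, Wildcard, Substring };
enum class CharHandling { None, CaseInsensitive, Equivalence };

struct MatchSpec {
    PatternType  type      = PatternType::Regex;   // assumed default
    CharHandling handling  = CharHandling::None;   // assumed default
    bool         wantMatch = true;   // TRUE: succeed on match, FALSE: succeed on non-match
    std::string  pattern;
};

// Combine the raw result of testing the pattern with the desired outcome.
inline bool specSucceeds(const MatchSpec &spec, bool patternMatched) {
    return patternMatched == spec.wantMatch;
}
```

With wantMatch set to false, a string that does not match the pattern makes
the specification succeed, which is the FALSE outcome described above.
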
Applications may set and get match specification lists either directly or
indirectly (using an encoded match specifications string). The matching
functions provided are:

    matchAny(): A string matches if it matches any pattern in the list.
    matchAll(): A string matches only if it matches all patterns in the list.

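The difference between the two functions can be sketched with standard
algorithms. The sketch below does not use the real class: SpecTest,
matchAnySketch and matchAllSketch are hypothetical stand-ins, and each
specification is reduced to a predicate that reports whether it succeeds
for a string. How the real functions treat an empty list is not stated
here; the sketch simply inherits the behaviour of std::any_of and
std::all_of in that case.

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for one match specification: returns true when the
// specification succeeds for the given string.
using SpecTest = std::function<bool(const std::string &)>;

// Succeeds as soon as one specification succeeds.
bool matchAnySketch(const std::vector<SpecTest> &specs, const std::string &s) {
    return std::any_of(specs.begin(), specs.end(),
                       [&s](const SpecTest &test) { return test(s); });
}

// Succeeds only if every specification succeeds.
bool matchAllSketch(const std::vector<SpecTest> &specs, const std::string &s) {
    return std::all_of(specs.begin(), specs.end(),
                       [&s](const SpecTest &test) { return test(s); });
}
```
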
MATCH SPECIFICATIONS STRING

The TDEStringMatcher class provides applications with an encoded match
specifications string intended solely for storing and retrieving
match specifications. These strings are formatted as follows:

    OptionString <Tab> PatternString [ <Tab> OptionString <Tab> PatternString ...]

Option strings may contain only the following characters:

    'r' - Match pattern is a regular expression [default]
    'w' - Match pattern is a wildcard expression
    's' - Match pattern is a simple substring
    'c' - Letter case variants are distinct (i.e. case-sensitive) [default]
    'i' - Letter case variants are equivalent (i.e. case-insensitive)
    'e' - All letter & number character variants are equivalent
    '=' - Match succeeds if pattern matches [default]
    '!' - Match succeeds if pattern does NOT match (inverted match)

Option strings should ideally contain exactly 3 characters indicating match
pattern type, alphanumeric character handling, and desired outcome of matching.
Specifying fewer option characters is possible but may result in unexpected
inferred values. Specifying additional and possibly contradictory option
characters is also possible, with later characters overriding earlier ones.

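The "later characters override earlier ones" rule can be sketched as a
single pass over the option string. The Options struct and parseOptions()
below are hypothetical names. Two points are assumptions rather than
documented behaviour: missing option groups fall back to the values marked
[default] above (the text only says that inferred values may be
unexpected), and unknown characters cause the option string to be rejected.

```cpp
#include <string>

// Hypothetical container for the three option groups.
struct Options {
    char type     = 'r';  // 'r', 'w' or 's'
    char handling = 'c';  // 'c', 'i' or 'e'
    char outcome  = '=';  // '=' or '!'
};

// Returns false and leaves `out` untouched if an unknown character is found
// (assumed behaviour). Within a group, a later character replaces an earlier one.
bool parseOptions(const std::string &text, Options &out) {
    Options result;
    for (char ch : text) {
        switch (ch) {
            case 'r': case 'w': case 's': result.type = ch;     break;
            case 'c': case 'i': case 'e': result.handling = ch; break;
            case '=': case '!':           result.outcome = ch;  break;
            default: return false;
        }
    }
    out = result;
    return true;
}
```

For example, the contradictory option string "ri!s" ends up as substring,
case-insensitive, inverted, because 's' appears after 'r'.
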
Pattern strings may not be empty. Invalid pattern strings will cause the
entire match specifications string to be rejected.

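A minimal sketch of splitting a match specifications string into
(option string, pattern string) pairs follows, with the whole string
rejected when any pattern is empty. SpecPair and splitSpecString() are
hypothetical names. The sketch checks only for empty or missing patterns,
it does not validate regex or wildcard syntax, and it also rejects an
entirely empty input, which may or may not match the real class.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using SpecPair = std::pair<std::string, std::string>;  // (options, pattern)

// Splits "opts<Tab>pattern[<Tab>opts<Tab>pattern...]" into pairs.
// Returns false and leaves `out` untouched if a pattern is missing or empty.
bool splitSpecString(const std::string &text, std::vector<SpecPair> &out) {
    // Break the input into tab-separated fields, keeping empty fields.
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    while (true) {
        std::string::size_type tab = text.find('\t', start);
        if (tab == std::string::npos) {
            fields.push_back(text.substr(start));
            break;
        }
        fields.push_back(text.substr(start, tab - start));
        start = tab + 1;
    }
    if (fields.size() % 2 != 0) return false;      // option string without a pattern
    std::vector<SpecPair> result;
    for (std::size_t i = 0; i < fields.size(); i += 2) {
        if (fields[i + 1].empty()) return false;   // one empty pattern rejects everything
        result.emplace_back(fields[i], fields[i + 1]);
    }
    out = std::move(result);
    return true;
}
```
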
Match specifications strings that are stored in TDE configuration files will
be modified as follows:

    '\' characters in the original pattern are encoded as '\\'
    The <Tab> separator is encoded as '\t'

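The two substitutions above amount to a simple escape and unescape pair.
The functions below are a hedged sketch with hypothetical names
(encodeForConfig, decodeFromConfig). Whether this step is performed by
TDEStringMatcher itself or by the configuration layer is not stated here,
so the code only illustrates the mapping.

```cpp
#include <string>

// Encode a raw match specifications string for storage:
// a backslash becomes "\\" and a tab becomes the two characters "\t".
std::string encodeForConfig(const std::string &raw) {
    std::string out;
    for (char ch : raw) {
        if (ch == '\\')      out += "\\\\";
        else if (ch == '\t') out += "\\t";
        else                 out += ch;
    }
    return out;
}

// Reverse of encodeForConfig(). Any other backslash sequence is kept verbatim.
std::string decodeFromConfig(const std::string &stored) {
    std::string out;
    for (std::string::size_type i = 0; i < stored.size(); ++i) {
        if (stored[i] == '\\' && i + 1 < stored.size()) {
            char next = stored[i + 1];
            if (next == '\\') { out += '\\'; ++i; continue; }
            if (next == 't')  { out += '\t'; ++i; continue; }
        }
        out += stored[i];
    }
    return out;
}
```
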
Using file name matching as an example, the match specifications string:
    wc=<Tab>.*<Tab>rc=<Tab>~$<Tab>se!<Tab>e<Tab>ri=<Tab>^a.+\.[0-9]+$
encoded in a TDE configuration file as:
    wc=\t.*\trc=\t~$\tse!\te\tri=\t^a.+\\.[0-9]+$
will match file names as follows:

  * All "dotfiles" would be matched with wildcard matching.
  * All file names ending with '~' (e.g. KWrite backup names) would be
    matched with case-sensitive regex matching.
  * All file names that do NOT contain an equivalent variant of the letter
    'e' (e.g. 'e','ê','Ě','E') would be matched with substring matching.
  * All file names starting with the letter 'a' or 'A' and ending with '.'
    followed by one or more numeric digits would be matched with
    case-insensitive regex matching.

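The two regex specifications in the example can be tried out with
std::regex standing in for the actual regex engine. This is an assumption
made purely for illustration: ECMAScript syntax in std::regex is not
guaranteed to behave identically to the engine used by the class, and the
function names isBackupName and isNumberedA are hypothetical. The wildcard
and equivalence specifications are not covered by this sketch.

```cpp
#include <regex>
#include <string>

// 'rc=' with pattern ~$ : case-sensitive regex, succeed on match.
bool isBackupName(const std::string &name) {
    static const std::regex re("~$", std::regex::ECMAScript);
    return std::regex_search(name, re);
}

// 'ri=' with pattern ^a.+\.[0-9]+$ : case-insensitive regex, succeed on match.
bool isNumberedA(const std::string &name) {
    static const std::regex re(R"(^a.+\.[0-9]+$)",
                               std::regex::ECMAScript | std::regex::icase);
    return std::regex_search(name, re);
}
```

Note that the ".+" in the last pattern requires at least one character
between the leading letter and the final '.', so a name such as "a.1" is
not matched by it while "app.7" is.
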
IMPLEMENTATION NOTES:

  * Regular expressions are currently supported by TQRegExp and are
    thereby subject to its limitations and bugs. This may be changed
    in the future (e.g. direct access to PCRE2, porting of Qt 5.x
    QRegularExpression).

  * Wildcard pattern matching on GLIBC systems is done using the fnmatch
    function with GNU extended patterns supported. Consult the fnmatch(3)
    and glob(7) manual pages for more information. On non-GLIBC systems,
    basic (not extended) wildcard patterns are converted to basic regular
    expressions and processed by the underlying regular expression engine.

  * Simple substrings are also supported as match patterns. These are
    currently processed by the TQString::find() function. In the future,
    these may be converted and processed by the underlying regex engine,
    depending on the tradeoff between code simplification and efficiency.

  * Alphanumeric equivalence is conceptually similar to [=x=] POSIX
    equivalence class bracket expressions (which are not supported)
    but is intended to apply globally in patterns. The following
    caveats apply when this option is used:

    - There is potentially significant overhead because match patterns
      and match strings must be converted prior to matching. Conversion
      requires character-by-character lookup and replacement using a
      pre-built table.

    - The table contains equivalents for [0-9A-Z], which should work
      well for Latin-derived languages. It also contains support for
      other numeric and non-Latin letter characters, the efficacy of
      which is not as certain.

    - Due to the 16-bit size limitation of TQChar, the table does not
      contain mappings for codepoints greater than U+FFFF.
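
The table-driven conversion described in the equivalence caveats can be
sketched as follows, using std::u16string in place of TQString to mirror
the 16-bit character limit. Everything here is an assumption made for
illustration: the names kEquivalents, foldEquivalents and
containsEquivalent are hypothetical, the table is a tiny hand-picked
subset, and folding to the upper case base letter is only a guess at what
the real table does. The sketch covers substring patterns only and does
not deal with escaped characters in regex or wildcard patterns.

```cpp
#include <string>
#include <unordered_map>

// Illustrative subset only; keys and values are 16-bit code units, so
// nothing above U+FFFF can be represented in this table.
static const std::unordered_map<char16_t, char16_t> kEquivalents = {
    {u'e', u'E'}, {u'\u00EA', u'E'},   // e, e with circumflex
    {u'\u00E9', u'E'}, {u'\u00C9', u'E'},   // e with acute, E with acute
    {u'\u011A', u'E'},                 // E with caron
    {u'a', u'A'}, {u'\u00E1', u'A'}, {u'\u00C5', u'A'},
};

// Replace every character that has a table entry with its canonical form.
// This per-character lookup is the conversion overhead mentioned above.
std::u16string foldEquivalents(const std::u16string &text) {
    std::u16string out;
    out.reserve(text.size());
    for (char16_t ch : text) {
        auto it = kEquivalents.find(ch);
        out += (it != kEquivalents.end()) ? it->second : ch;
    }
    return out;
}

// Substring test under equivalence: convert both sides, then search.
bool containsEquivalent(const std::u16string &text, const std::u16string &pattern) {
    return foldEquivalents(text).find(foldEquivalents(pattern)) != std::u16string::npos;
}
```

Both the pattern and the string being tested go through foldEquivalents(),
which is why the conversion cost is paid on every match attempt unless the
converted pattern is cached.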