The TDEStringMatcher class provides string matching against a list of one
or more match patterns along with associated options. A single pattern with
its associated options will be referred to herein as a "match specification".

Current match specification options include:

* Type of match pattern:

    REGEX: Pattern is a regular expression whose syntax is
      currently limited to that supported by the TQRegExp class.
    WILDCARD: Pattern is a wildcard expression of the kind used in
      POSIX shell file globbing.
    SUBSTRING: Pattern is a simple substring that matches any
      string in which it occurs. Substring characters do not
      have any other meaning that controls matching.

* Alphanumeric character handling in a pattern:

    NONE: Each unescaped alphanumeric character in a pattern
      is distinct and will match only itself.
    CASE INSENSITIVE: Each unescaped letter in a pattern
      will match its lower and upper case variants.
    EQUIVALENCE: Each unescaped variant of an alphanumeric
      character will match all stylistic and accented
      variations of that character.

* Desired outcome of matching:

    TRUE: Match succeeds if a string matches the match pattern.
    FALSE: Match fails if a string matches the match pattern.

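One way to picture the EQUIVALENCE option is a lookup table that folds every
variant of a character to a single base character before comparing. The sketch
below is only an illustration in standard C++: the table excerpt and the names
kEquivalents, foldEquivalents() and equivalentSubstringMatch() are hypothetical
and are not taken from TDEStringMatcher.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical, tiny excerpt of an equivalence table. Every variant maps to
// one base character in [0-9A-Z]; unlisted characters are left unchanged.
static const std::unordered_map<char16_t, char16_t> kEquivalents = {
    {u'e', u'E'},      {u'\u00C9', u'E'}, {u'\u00E9', u'E'},
    {u'\u00EA', u'E'}, {u'\u011A', u'E'}, {u'\u011B', u'E'},
    {u'\uFF11', u'1'},  // fullwidth digit one
};

// Replace each character by its base character, one lookup per character.
std::u16string foldEquivalents(const std::u16string &text) {
    std::u16string folded;
    folded.reserve(text.size());
    for (char16_t ch : text) {
        auto it = kEquivalents.find(ch);
        folded.push_back(it != kEquivalents.end() ? it->second : ch);
    }
    return folded;
}

// Substring match under equivalence: fold both sides, then compare.
bool equivalentSubstringMatch(const std::u16string &pattern,
                              const std::u16string &subject) {
    return foldEquivalents(subject).find(foldEquivalents(pattern)) !=
           std::u16string::npos;
}
```

With this excerpt, the pattern u"e" would match a subject beginning with
U+011A ('Ě'), because both fold to 'E'.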
A list of match specifications may be codified in a string formatted as
a vertical tab (VT) separated list of substrings as follows:

  OptionString <VT> PatternString [ <VT> OptionString <VT> PatternString ...]

Non-empty option strings may contain only the following characters:

  'r' - Match pattern is a regular expression [default]
  'w' - Match pattern is a wildcard expression
  's' - Match pattern is a simple substring
  'c' - Letter case variants are distinct (i.e. case-sensitive) [default]
  'i' - Letter case variants are equivalent (i.e. case-insensitive)
  'e' - All letter & number character variants are equivalent
  '=' - Match succeeds if pattern matches [default]
  '!' - Match fails if pattern matches (inverted match)

Options set in an option string remain in effect for all subsequent match
specifications until overridden by a later option string.

While option strings may be empty, pattern strings may not. Backslash
characters in pattern strings should be represented by "\\"; all other
characters should be specified literally.

The following is an example of a string representing a match specification list
intended to apply to file names (<VT> marks each vertical tab character):

  w<VT>.*<VT>e<VT>e*<VT>cr<VT>~$<VT><VT>\\.[0-9]+

The corresponding match specification list will match as follows:

* All "dotfiles" would be matched with wildcard matching.
* All file names beginning with an equivalent variant of the letter 'e'
  (e.g. 'e','ê','Ě','E') would be matched with wildcard matching.
* All file names ending with '~' (e.g. kwrite backup names) would be
  matched with case-sensitive regex matching.
* All file names having a numeric digit filename suffix (e.g. wget backup
  files) would be matched with case-sensitive regex matching.

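The string format described above can be decoded with a small parser. The
following is a minimal sketch in standard C++, not the TDEStringMatcher
implementation: the MatchSpec struct, the two enums and parseSpecList() are
hypothetical names, and backslash handling in pattern strings is left out.
It assumes that fields alternate strictly between option string and pattern
string and that options carry over to later specifications.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical in-memory form of one match specification.
enum class PatternType { Regex, Wildcard, Substring };
enum class CharHandling { None, CaseInsensitive, Equivalence };

struct MatchSpec {
    PatternType type = PatternType::Regex;       // 'r' is the default
    CharHandling handling = CharHandling::None;  // 'c' is the default
    bool wantMatch = true;                       // '=' is the default
    std::string pattern;
};

std::vector<MatchSpec> parseSpecList(const std::string &specString) {
    // Split the input on the vertical tab character.
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    for (;;) {
        std::string::size_type end = specString.find('\v', start);
        if (end == std::string::npos) {
            fields.push_back(specString.substr(start));
            break;
        }
        fields.push_back(specString.substr(start, end - start));
        start = end + 1;
    }
    if (fields.size() % 2 != 0)
        throw std::invalid_argument("option string without a pattern string");

    std::vector<MatchSpec> specs;
    MatchSpec current;  // options stay in effect until overridden
    for (std::size_t i = 0; i < fields.size(); i += 2) {
        for (char opt : fields[i]) {
            switch (opt) {
                case 'r': current.type = PatternType::Regex; break;
                case 'w': current.type = PatternType::Wildcard; break;
                case 's': current.type = PatternType::Substring; break;
                case 'c': current.handling = CharHandling::None; break;
                case 'i': current.handling = CharHandling::CaseInsensitive; break;
                case 'e': current.handling = CharHandling::Equivalence; break;
                case '=': current.wantMatch = true; break;
                case '!': current.wantMatch = false; break;
                default: throw std::invalid_argument("unknown option character");
            }
        }
        if (fields[i + 1].empty())
            throw std::invalid_argument("empty pattern string");
        current.pattern = fields[i + 1];
        specs.push_back(current);
    }
    return specs;
}
```

For instance, parseSpecList("w\v.*\ve\ve*\vcr\v~$") would yield three
specifications: a wildcard ".*", a wildcard "e*" with equivalence, and a
case-sensitive regex "~$".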
Applications may set and get match specification lists either directly or
indirectly (as a match specification list string). The matching functions
provided are:

  matchAny(): strings match if they match any pattern in list.
  matchAll(): strings match only if they match all patterns in list.

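The difference between the two matching functions can be sketched with
substring patterns only. This is an illustration in standard C++, not the
class's code: SubstringSpec, specSucceeds(), matchAnySketch() and
matchAllSketch() are hypothetical names, and it is an assumption that an
inverted specification counts as satisfied when its pattern is absent.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, reduced match specification: substring pattern plus the
// desired outcome (true for '=', false for '!').
struct SubstringSpec {
    std::string pattern;
    bool wantMatch;
};

// A specification is satisfied when "pattern found" equals the desired outcome.
bool specSucceeds(const SubstringSpec &spec, const std::string &subject) {
    const bool found = subject.find(spec.pattern) != std::string::npos;
    return found == spec.wantMatch;
}

// True if at least one specification is satisfied.
bool matchAnySketch(const std::vector<SubstringSpec> &specs,
                    const std::string &subject) {
    return std::any_of(specs.begin(), specs.end(),
                       [&](const SubstringSpec &s) { return specSucceeds(s, subject); });
}

// True only if every specification is satisfied.
bool matchAllSketch(const std::vector<SubstringSpec> &specs,
                    const std::string &subject) {
    return std::all_of(specs.begin(), specs.end(),
                       [&](const SubstringSpec &s) { return specSucceeds(s, subject); });
}
```

In this sketch an empty list makes matchAnySketch() return false and
matchAllSketch() return true; how the real class treats an empty list is not
stated here.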
IMPLEMENTATION NOTES:

* Wildcard match patterns are currently limited to POSIX wildcards.
  Extended wildcard-like expressions are not currently supported
  (e.g. Bash globstar, extglob, brace expansion).

* Wildcard match patterns are currently converted to regular
  expressions and processed as such instead of using dedicated
  functions such as fnmatch(3) or glob(3). This may change in
  the future.

* Regular expressions are currently supported by TQRegExp and are
  thereby subject to its limitations and bugs. This may be changed
  in the future (e.g. direct access to pcre2(3), porting of Qt
  QRegularExpression).

* Simple substrings are also supported as match patterns. These are
  currently processed by the TQString::find() function. In the future,
  these may be converted and processed by the underlying regex engine,
  depending on the tradeoff between code simplification and efficiency.

* Alphanumeric equivalence is conceptually similar to [=x=] POSIX
  equivalence class bracket expressions (which are not supported)
  but is intended to apply globally in patterns. The following
  are caveats when this option is utilized:

  - There is potentially significant overhead due to the fact that
    match patterns and match strings must be converted prior to
    matching. Conversion requires character-by-character lookup
    and replacement using a pre-built table.

  - The table contains equivalents for [0-9A-Z] which should work
    well for Latin-derived languages. It also contains support for
    other numeric and non-Latin letter characters, the efficacy of
    which is not as certain.

  - Due to the 16-bit size limitation of TQChar, the table does not
    contain mappings for codepoints greater than U+FFFF.

* The choice of VT as the match specification string separator was
  based on the following considerations:

  - It is a control character that is unlikely to occur in a pattern.
    If it is desired to match the VT character, then that can be done
    in a regular expression by specifying '\v'.

  - Unlike the potentially more readable HT, TDEConfig will not attempt
    to escape it when storing strings containing it.

  - Unlike other control characters, files containing text with that
    character are less likely to be misidentified as binary.

  - Most common text editors (and `less`) represent that character as
    a symbol that can be copied and pasted.

  - Text containing that character displays predictably in a terminal.
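
The note above about converting wildcard patterns to regular expressions can
be illustrated with a reduced translation. This sketch uses std::regex rather
than TQRegExp, covers only '*', '?' and literal characters (bracket
expressions are deliberately left out), and ignores file-name rules such as
special treatment of a leading dot. The names wildcardToRegex() and
wildcardMatch() are hypothetical; the conversion actually used by the class
is not shown here.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Translate a simple wildcard into ECMAScript regex text.
// '*' becomes ".*", '?' becomes ".", and regex metacharacters are escaped.
std::string wildcardToRegex(const std::string &wildcard) {
    static const std::string metachars = "\\.^$|()[]{}+";
    std::string regexText;
    for (char c : wildcard) {
        if (c == '*') {
            regexText += ".*";
        } else if (c == '?') {
            regexText += '.';
        } else {
            if (metachars.find(c) != std::string::npos)
                regexText += '\\';  // make the character literal
            regexText += c;
        }
    }
    return regexText;
}

// A wildcard must cover the whole subject, so regex_match is used
// instead of regex_search.
bool wildcardMatch(const std::string &wildcard, const std::string &subject) {
    return std::regex_match(subject, std::regex(wildcardToRegex(wildcard)));
}
```

Under these assumptions the wildcard ".*" translates to the regex text
"\..*", which matches ".bashrc" but not "notes.txt".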