The TDEStringMatcher class provides string matching against a list of one
or more match patterns along with associated options. A single pattern with
its associated options will be referred to herein as a "match specification".

Current match specification options include:

  * Type of match pattern:
      REGEX: Pattern is a regular expression whose syntax is
        currently limited to that supported by the TQRegExp class.
      WILDCARD: Pattern is a wildcard expression of the kind used
        in POSIX shell file globbing.
      SUBSTRING: Pattern is a simple substring that matches any
        string in which it occurs. Substring characters have no
        special meaning that controls matching.
  * Alphanumeric character handling in a pattern:
      NONE: Each unescaped alphanumeric character in a pattern
        is distinct and will match only itself.
      CASE INSENSITIVE: Each unescaped letter in a pattern
        will match its lower and upper case variants.
      EQUIVALENCE: Each unescaped variant of an alphanumeric
        character will match all stylistic and accented
        variations of that character.
  * Desired outcome of matching:
      TRUE: Match succeeds if a string matches the match pattern.
      FALSE: Match fails if a string matches the match pattern.
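
To make the EQUIVALENCE option more concrete, the following is a minimal
sketch in standard C++ of one way such handling can work: both the pattern
and the string being tested are folded through a lookup table before they
are compared. The names kEquivalents, foldEquivalents and equivalentContains
are hypothetical, the table holds only a handful of entries chosen for
illustration, and none of this is taken from the TDEStringMatcher sources.
The sketch also ignores escaped characters, which the real option treats
differently.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical, deliberately tiny table: each variant maps to one canonical
// character. A real table would cover far more characters.
static const std::unordered_map<char16_t, char16_t> kEquivalents = {
    {u'e', u'E'},
    {u'\u00E9', u'E'}, {u'\u00C9', u'E'},   // e-acute, E-acute
    {u'\u00EA', u'E'}, {u'\u00CA', u'E'},   // e-circumflex, E-circumflex
    {u'\u011B', u'E'}, {u'\u011A', u'E'},   // e-caron, E-caron
    {u'\uFF11', u'1'},                      // fullwidth digit one
};

// Replace every character that has a table entry with its canonical form.
// Characters without an entry pass through unchanged.
std::u16string foldEquivalents(const std::u16string &text)
{
    std::u16string out;
    out.reserve(text.size());
    for (char16_t ch : text) {
        auto it = kEquivalents.find(ch);
        out += (it != kEquivalents.end()) ? it->second : ch;
    }
    return out;
}

// Substring test under equivalence: fold both sides, then search.
bool equivalentContains(const std::u16string &haystack,
                        const std::u16string &needle)
{
    return foldEquivalents(haystack).find(foldEquivalents(needle))
           != std::u16string::npos;
}
```

With this table, equivalentContains(u"caf\u00E9", u"E") is true because
both the accented letter and 'E' fold to the same canonical character.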

A list of match specifications may be codified in a string formatted as
a vertical tab (VT) separated list of substrings as follows:

    OptionString <VT> PatternString [ <VT> OptionString <VT> PatternString ...]

Non-empty option strings may contain only the following characters:

    'r' - Match pattern is a regular expression [default]
    'w' - Match pattern is a wildcard expression
    's' - Match pattern is a simple substring
    'c' - Letter case variants are distinct (i.e. case-sensitive) [default]
    'i' - Letter case variants are equivalent (i.e. case-insensitive)
    'e' - All letter & number character variants are equivalent
    '=' - Match succeeds if pattern matches [default]
    '!' - Match fails if pattern matches (inverted match)

Options set in an option string remain in effect for subsequent patterns
until overridden. While option strings may be empty, pattern strings may
not be empty. Backslash characters in pattern strings should be represented
by "\\"; all other characters should be specified literally.
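
The format rules above can be sketched as a small parser in standard C++.
This is only an illustration under stated assumptions: the types MatchSpec,
PatternType and CharHandling and the functions decodePattern and
parseSpecList are hypothetical names, not the TDEStringMatcher API, and the
reading that "\\" in a stored pattern decodes to a single backslash is an
assumption about how the escaping rule is meant.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

enum class PatternType { Regex, Wildcard, Substring };
enum class CharHandling { None, CaseInsensitive, Equivalence };

struct MatchSpec {
    PatternType type = PatternType::Regex;        // 'r' [default]
    CharHandling handling = CharHandling::None;   // 'c' [default]
    bool wantMatch = true;                        // '=' [default]
    std::string pattern;
};

// Assumption: "\\" in a stored pattern decodes to one backslash; any
// other character (including a lone backslash) is kept as-is.
static std::string decodePattern(const std::string &stored)
{
    std::string out;
    for (std::size_t i = 0; i < stored.size(); ++i) {
        if (stored[i] == '\\' && i + 1 < stored.size() && stored[i + 1] == '\\')
            ++i;  // skip the first backslash of the pair
        out += stored[i];
    }
    return out;
}

// Split on VT, then read the fields as (option string, pattern string) pairs.
std::vector<MatchSpec> parseSpecList(const std::string &text)
{
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    for (;;) {
        const std::string::size_type vt = text.find('\v', start);
        if (vt == std::string::npos) {
            fields.push_back(text.substr(start));
            break;
        }
        fields.push_back(text.substr(start, vt - start));
        start = vt + 1;
    }
    if (fields.size() % 2 != 0)
        throw std::invalid_argument("expected option/pattern pairs");

    std::vector<MatchSpec> specs;
    MatchSpec current;  // options persist from one pair to the next
    for (std::size_t i = 0; i < fields.size(); i += 2) {
        for (char c : fields[i]) {
            switch (c) {
            case 'r': current.type = PatternType::Regex; break;
            case 'w': current.type = PatternType::Wildcard; break;
            case 's': current.type = PatternType::Substring; break;
            case 'c': current.handling = CharHandling::None; break;
            case 'i': current.handling = CharHandling::CaseInsensitive; break;
            case 'e': current.handling = CharHandling::Equivalence; break;
            case '=': current.wantMatch = true; break;
            case '!': current.wantMatch = false; break;
            default: throw std::invalid_argument("unknown option character");
            }
        }
        if (fields[i + 1].empty())
            throw std::invalid_argument("pattern strings may not be empty");
        current.pattern = decodePattern(fields[i + 1]);
        specs.push_back(current);
    }
    return specs;
}
```

Keeping a single MatchSpec that is updated and copied for each pair is what
makes an empty option string inherit the options of the previous pair.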
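
The following is an example of a string representing a match specification
list intended to apply to file names (<VT> denotes the vertical tab
character; the last pattern has an empty option string and therefore
inherits 'cr'):

    w<VT>.*<VT>e<VT>e*<VT>cr<VT>~$<VT><VT>\\.[0-9]+
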
The corresponding match specification list will match as follows:
  * All "dotfiles" would be matched with wildcard matching.
  * All file names beginning with an equivalent variant of the letter 'e'
    (e.g. 'e', 'ê', 'Ě') would be matched with wildcard matching.
  * All file names ending with '~' (e.g. kwrite backup names) would be
    matched with case-sensitive regex matching.
  * All file names having a numeric digit filename suffix (e.g. wget backup
    files) would be matched with case-sensitive regex matching.
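
For applications that build such a string in code, the sketch below joins
(option string, pattern string) pairs with VT and doubles each backslash in
the patterns. The function name encodeSpecList is hypothetical, and the
sketch assumes that the fourth pattern in the example above is stored with
an empty option string so that it inherits 'cr'.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Each pair is (option string, pattern string). Patterns are given as the
// matcher should see them; backslashes are doubled on output.
std::string encodeSpecList(
    const std::vector<std::pair<std::string, std::string>> &specs)
{
    std::string out;
    for (std::size_t i = 0; i < specs.size(); ++i) {
        if (i > 0)
            out += '\v';             // separator between pairs
        out += specs[i].first;       // option string (may be empty)
        out += '\v';                 // separator between options and pattern
        for (char c : specs[i].second) {
            if (c == '\\')
                out += '\\';         // represent "\" as "\\"
            out += c;
        }
    }
    return out;
}
```

Note that a C++ string literal adds its own level of escaping: the regex
\.[0-9]+ is written "\\.[0-9]+" in source code and is stored as
\\.[0-9]+ in the resulting specification string.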

Applications may set and get match specification lists either directly or
indirectly (as a match specification list string). The matching functions
provided are:

    matchAny(): strings match if they match any pattern in list.
    matchAll(): strings match only if they match all patterns in list.

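
The difference between the two functions, and the effect of the inverted
match option, can be sketched with std::regex standing in for the real
pattern engine. The Spec struct, the satisfies helper and the free functions
below are hypothetical and do not reproduce the signatures of the
TDEStringMatcher member functions; the use of a partial (search) match is
likewise an assumption.

```cpp
#include <algorithm>
#include <regex>
#include <string>
#include <vector>

struct Spec {
    std::regex pattern;
    bool wantMatch;   // true for '=', false for '!'
};

// A specification is satisfied when the pattern result equals the
// desired outcome.
bool satisfies(const Spec &spec, const std::string &text)
{
    return std::regex_search(text, spec.pattern) == spec.wantMatch;
}

// True if at least one specification is satisfied.
bool matchAny(const std::vector<Spec> &specs, const std::string &text)
{
    return std::any_of(specs.begin(), specs.end(),
                       [&](const Spec &s) { return satisfies(s, text); });
}

// True only if every specification is satisfied.
bool matchAll(const std::vector<Spec> &specs, const std::string &text)
{
    return std::all_of(specs.begin(), specs.end(),
                       [&](const Spec &s) { return satisfies(s, text); });
}
```

In this sketch an empty list yields false from matchAny and true from
matchAll, which is simply how std::any_of and std::all_of behave; the
README does not say what the real class does in that case.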
IMPLEMENTATION NOTES:
  * Wildcard match patterns are currently limited to POSIX wildcards.
    Extended wildcard-like expressions are not currently supported
    (e.g. Bash globstar, extglob, brace expansion).
  * Wildcard match patterns are currently converted to regular
    expressions and processed as such instead of using dedicated
    functions such as fnmatch(3) or glob(3). This may change in
    the future.
  * Regular expressions are currently supported by TQRegExp and are
    thereby subject to its limitations and bugs. This may be changed
    in the future (e.g. direct access to pcre2(3), porting of Qt
    QRegularExpression).
  * Simple substrings are also supported as match patterns. These are
    currently processed by the TQString::find() function. In the future,
    these may be converted and processed by the underlying regex engine,
    depending on the tradeoff between code simplification and efficiency.
  * Alphanumeric equivalence is conceptually similar to [=x=] POSIX
    equivalence class bracket expressions (which are not supported)
    but is intended to apply globally in patterns. The following
    are caveats when this option is utilized:
      - There is potentially significant overhead due to the fact that
        match patterns and match strings must be converted prior to
        matching. Conversion requires character-by-character lookup
        and replacement using a pre-built table.
      - The table contains equivalents for [0-9A-Z] which should work
        well for Latin-derived languages. It also contains support for
        other numeric and non-Latin letter characters, the efficacy of
        which is not as certain.
      - Due to the 16-bit size limitation of TQChar, the table does not
        contain mappings for codepoints greater than U+FFFF.
  * The choice of VT as the match specification string separator was
    based on the following considerations:
      - It is a control character that is unlikely to occur in a pattern.
        If it is desired to match the VT character, then that can be done
        in a regular expression by specifying '\v'.
      - Unlike the potentially more readable HT, TDEConfig will not attempt
        to escape it when storing strings containing it.
      - Unlike other control characters, files containing text with that
        character are less likely to be misidentified as binary.
      - Most common text editors (and `less`) represent that character as
        a symbol that can be copied and pasted.
      - Text containing that character displays predictably in a terminal.