The CybOX Language supports the definition of observable patterns that leverage regular expressions for criteria to be evaluated against instance data. To support interoperability, the CybOX Language defines an intersection of regular expression character classes, operations, expressions, and other lexical tokens defined by common regular expression syntaxes (Boost, Java, JavaScript, .NET, PCRE, Perl 5, and Python) to be used when crafting regular expressions.
This set of capabilities was identified through a survey of several regular expression libraries in an effort to increase interoperability between pattern authors and evaluators. Pattern authors are encouraged to leverage the capabilities defined below when creating regular expressions, but are not required to do so. Deviation from these regular expression capabilities is discussed later in this section.
Regular expression modifiers, like the “m”, “i”, “s”, and “x” modifiers used by Perl are not supported. Character matching assumes a Unicode character set. Note that no syntax is supplied for specifying code points in hex; actual Unicode characters must be used instead.
The following regular expression capabilities are identified as recommended in the CybOX Language. Regular expression capabilities that are not listed below should be avoided if possible as they are likely to be incompatible or have different meanings between commonly used regular expression libraries.
\ Quote the next metacharacter
^ Match the beginning of the line
. Match any character (except newline)
$ Match the end of the line (or before newline at the end)
| Alternation
() Grouping
[] Character class
* Match 0 or more times
+ Match 1 or more times
? Match 1 or 0 times
{n} Match exactly n times
{n,} Match at least n times
{n,m} Match at least n but not more than m times
*? Match 0 or more times
+? Match 1 or more times
?? Match 0 or 1 time
{n}? Match exactly n times
{n,}? Match at least n times
{n,m}? Match at least n but not more than m times
\t tab (HT, TAB)
\n newline (LF, NL)
\r return (CR)
\f form feed (FF)
\033 octal char (think of a PDP-11)
\x1B hex char
\c[ control char
\w Match a "word" character (alphanumeric plus "_")
\W Match a non-word character
\s Match a whitespace character
\S Match a non-whitespace character
\d Match a digit character
\D Match a non-digit character
\b Match a word boundary
\B Match a non-(word boundary)
(?:regexp) - Group without capture
(?=regexp) - Zero-width positive lookahead assertion
(?!regexp) - Zero-width negative lookahead assertion
[chars] - Match any of the specified characters
[^chars] - Match anything that is not one of the specified characters
[a-b] - Match any character in the range between "a" and "b", inclusive
a|b - Alternation; match either the left side of the "|" or the right side
\n - When 'n' is a single digit: the nth capturing group matched.
The capabilities outlined above are recommended, but are not required for the creation of observable
patterns which leverage regular expressions. If a pattern author requires capabilities that are outside of
the set defined above, the @regex_syntax
attribute can be leveraged to identify the regular expression
syntax, library, or software required to properly evaluate it (e.g., “PCRE v8.32” or “Java 1.6”). By setting
the @regex_syntax
attribute, the author accepts any adverse effects caused by the deviation with
respect to interoperability and compatibility. Omitting this attribute or setting it with an empty value
declares alignment with the CybOX Language Specification.