Regular Expressions

The Wiki of Unify contains information on clients and devices, communications systems and unified communications. - Unify GmbH & Co. KG is a Trademark Licensee of Siemens AG.

Revision as of 09:46, 18 May 2026

Introduction

A regular expression (often called regex or regexp) is a sequence of characters that defines a search pattern. It is used to find and match strings within text. A regular expression may contain literal characters and special characters with a predefined meaning.

Elements of a regular expression

Literal characters

Most characters simply match themselves. The letter a matches the letter "a" in the text.

Anchors (position markers)

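Anchors do not match a character; they match a position in the text.

  • ^ (caret) — matches the start of the text (or the start of each line when the multiline flag is set).
  • $ (dollar sign) — matches the end of the text (or the end of each line when the multiline flag is set).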
  • \b — matches a word boundary (the position between a word character and a non-word character).
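
As a minimal sketch, the anchors can be tried out with Python's re module. Python is an assumption made only for illustration, and the sample text is made up; the engine of a particular product may behave differently.

```python
import re

text = "CP700 and CP700X"

# ^ anchors the match at the start of the text.
starts_with_cp = re.search(r"^CP", text) is not None

# $ anchors the match at the end of the text.
ends_with_x = re.search(r"X$", text) is not None

# \b requires a word boundary, so the CP700 inside CP700X is skipped.
whole_words = re.findall(r"\bCP700\b", text)

print(starts_with_cp, ends_with_x, whole_words)
```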

Character classes

Enclosed in square brackets [ ], a character class matches one character from a defined set.

  • [aeiou] — matches any single vowel.
  • [a-z] — matches any lowercase letter from a to z (range).
  • [0-9a-fA-F] — matches any hexadecimal digit.
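
The following sketch, using Python's re module as an assumed illustration language and made-up sample strings, shows that a character class consumes exactly one character per match.

```python
import re

# [aeiou] matches one vowel at a time.
vowels = re.findall(r"[aeiou]", "regular")

# [0-9a-fA-F] matches one hexadecimal digit at a time;
# "x", "g" and the punctuation are skipped.
hex_digits = re.findall(r"[0-9a-fA-F]", "0x1F, g7")

print(vowels, hex_digits)
```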

Negated character classes

A caret ^ placed immediately inside a character class negates it, matching any character not in the set.

  • [^aeiou] — matches any character that is not a vowel.
  • [^0-9] — matches any character that is not a digit.
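
One common use of a negated class is stripping unwanted characters. This is a small sketch in Python (an assumed language for illustration) with a hypothetical phone number as input.

```python
import re

# Hypothetical input; every character that is not a digit is removed.
raw_number = "+49 (89) 7007-0"
digits_only = re.sub(r"[^0-9]", "", raw_number)

print(digits_only)
```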

Special (shorthand) character classes

Predefined shortcuts for common character sets.

  • . (dot) — matches any single character except a newline.
  • \d — matches any digit (same as [0-9] when only ASCII characters are considered).
  • \D — matches any non-digit.
  • \w — matches any word character: letter, digit, or underscore (same as [a-zA-Z0-9_] when only ASCII characters are considered).
  • \W — matches any non-word character.
  • \s — matches any whitespace character (such as space, tab, or newline).
  • \S — matches any non-whitespace character.
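
A short sketch of the shorthand classes, written in Python as an assumed illustration language with a made-up sample string. Note that in Python 3 the classes \d and \w are Unicode-aware for text patterns unless the re.ASCII flag is passed; other engines may differ.

```python
import re

sample = "Port 8080\tis open"

# \d+ collects runs of digits.
digits = re.findall(r"\d+", sample)

# \w+ collects runs of word characters (letters, digits, underscore).
words = re.findall(r"\w+", sample)

# \s+ splits on any run of whitespace, here a space and a tab.
fields = re.split(r"\s+", sample)

print(digits, words, fields)
```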

Quantifiers (repetition)

Specify how many times the preceding element must occur.

  • * — zero or more times.
  • + — one or more times.
  • ? — zero or one time (makes the element optional).
  • {n} — exactly n times.
  • {n,} — n or more times.
  • {n,m} — between n and m times (inclusive).
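
The quantifiers above can be checked with a small sketch. It assumes Python's re module and uses a hypothetical helper named matches, which tests whether the whole input matches the pattern.

```python
import re

def matches(pattern, text):
    """Return True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

results = {
    "ab* vs a": matches(r"ab*", "a"),                 # zero b's are allowed
    "ab+ vs a": matches(r"ab+", "a"),                 # at least one b is required
    "colou?r vs color": matches(r"colou?r", "color"), # the u is optional
    r"\d{3} vs 12": matches(r"\d{3}", "12"),          # exactly three digits
    r"\d{2,} vs 12345": matches(r"\d{2,}", "12345"),  # two or more digits
    r"\d{1,3} vs 1234": matches(r"\d{1,3}", "1234"),  # at most three digits
}

for label, result in results.items():
    print(label, result)
```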

Greedy vs. lazy quantifiers

By default, quantifiers are greedy — they match as much text as possible. Adding a ? after a quantifier makes it lazy (matches as little as possible).

  • .* — greedy: matches as many characters as possible.
  • .*? — lazy: matches as few characters as possible.
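
The difference is easiest to see on text that contains the closing delimiter more than once. The sketch below assumes Python's re module and a made-up sample string.

```python
import re

html = "<b>one</b> and <b>two</b>"

# Greedy: .* runs to the last </b>, so there is a single long match.
greedy = re.findall(r"<b>.*</b>", html)

# Lazy: .*? stops at the first </b>, so each element matches separately.
lazy = re.findall(r"<b>.*?</b>", html)

print(greedy)
print(lazy)
```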

Groups

Parentheses ( ) group multiple characters or sub-patterns into a single unit. Groups can be quantified, and their matched content can be referenced later.

  • Capturing group: (abc) — matches "abc" and remembers the match for later use (back-reference or replacement).
  • Non-capturing group: (?:abc) — groups the pattern without remembering the match (useful for performance or clarity).
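  • Named group: (?P<name>abc) or (?<name>abc) — a capturing group accessible by name instead of number. Which of the two spellings is accepted depends on the regex engine; Python, for example, uses (?P<name>abc), while Java, .NET and JavaScript use (?<name>abc).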
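
A brief sketch of the three group types, assuming Python's re module (which uses the (?P<name>...) spelling for named groups). The input string and the group name number are illustrative only.

```python
import re

# (CP) is a capturing group, (?:-| ) is non-capturing,
# and (?P<number>...) is a named capturing group.
match = re.fullmatch(r"(CP)(?:-| )?(?P<number>\d+)", "CP-700")

prefix = match.group(1)          # first capturing group
number = match.group("number")   # named group, also available as group 2
captured = match.groups()        # the non-capturing group is not listed

print(prefix, number, captured)
```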

Back-references

Refer back to the content matched by a previous capturing group.

  • \1 — matches the same text that was matched by the first capturing group.
  • \2 — matches the same text as the second group, and so on.
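
A typical use of a back-reference is finding an accidentally doubled word. This sketch assumes Python's re module and a made-up sentence.

```python
import re

sentence = "this is is a test"

# (\w+) captures a word; \1 then requires the same word again.
doubled = re.search(r"\b(\w+) \1\b", sentence)

# In the replacement text, \1 inserts the captured word once.
cleaned = re.sub(r"\b(\w+) \1\b", r"\1", sentence)

print(doubled.group(0))
print(cleaned)
```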

Alternation

The pipe | acts as an "or" operator, matching either the pattern on the left or the pattern on the right.

  • cat|dog — matches "cat" or "dog".
  • (red|blue) car — matches "red car" or "blue car".
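
The sketch below, assuming Python's re module, illustrates the two examples and also why the parentheses matter: without them, the alternation splits the whole expression.

```python
import re

animals = re.findall(r"cat|dog", "cat, dog, cow")

# With a group, only the colour is alternated.
blue_car = re.fullmatch(r"(red|blue) car", "blue car") is not None
green_car = re.fullmatch(r"(red|blue) car", "green car") is not None

# Without a group, the alternatives are "red" and "blue car",
# so "red car" as a whole does not match.
ungrouped_red_car = re.fullmatch(r"red|blue car", "red car") is not None

print(animals, blue_car, green_car, ungrouped_red_car)
```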

Escape character

The backslash \ removes the special meaning of the following character, allowing it to be matched literally.

  • \. — matches a literal dot (instead of "any character").
  • \[ — matches a literal opening bracket.
  • \\ — matches a literal backslash.

It is also used for encoded characters:

  • \n — newline.
  • \t — tab.
  • \r — carriage return.
  • \x20 — the character with hexadecimal code 20 (a space).
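
Forgetting to escape the dot is a frequent mistake when matching IP addresses. The following sketch assumes Python's re module; re.escape is Python's helper for escaping a literal string, and other environments may offer something different or nothing at all.

```python
import re

# Unescaped dots match any character, so this wrongly accepts the input.
loose = re.fullmatch(r"192.168.0.1", "192x168y0z1") is not None

# Escaped dots match only a literal dot.
strict = re.fullmatch(r"192\.168\.0\.1", "192x168y0z1") is not None

# re.escape adds the backslashes automatically.
escaped = re.escape("192.168.0.1")

# \x20 is the space character.
space = re.fullmatch(r"\x20", " ") is not None

print(loose, strict, escaped, space)
```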

Lookahead and lookbehind (assertions)

These check whether a pattern exists before or after the current position, without consuming any characters.

  • (?=abc) — positive lookahead: succeeds if "abc" follows.
  • (?!abc) — negative lookahead: succeeds if "abc" does not follow.
  • (?<=abc) — positive lookbehind: succeeds if "abc" precedes.
  • (?<!abc) — negative lookbehind: succeeds if "abc" does not precede.
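
A sketch of all four assertions, assuming Python's re module and made-up sample strings. In each case the text inside the assertion is checked but is not part of the returned match.

```python
import re

# Positive lookahead: numbers that are followed by " ms".
millis = re.findall(r"\d+(?= ms)", "12 ms, 7 s, 300 ms")

# Negative lookahead: words that do not start with "CP".
not_cp = re.findall(r"\b(?!CP)\w+", "CP700 OS400 CP600")

# Positive lookbehind: digits that are preceded by "CP".
cp_numbers = re.findall(r"(?<=CP)\d+", "CP700 OS400 CP600")

# Negative lookbehind: numbers that are not preceded by a minus sign.
unsigned = re.findall(r"(?<!-)\b\d+", "5 -3 8")

print(millis, not_cp, cp_numbers, unsigned)
```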

Flags (modifiers)

Flags change how the entire expression behaves. They are typically placed after the closing delimiter (e.g. /pattern/gi).

  • i — case-insensitive matching.
  • g — global: find all matches, not just the first.
  • m — multiline: ^ and $ match the start/end of each line, not just the whole string.
  • s — single-line (dotall): . also matches newline characters.
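
Not every environment uses the /pattern/flags notation. As an illustration, the sketch below assumes Python's re module, where flags are passed as arguments (re.IGNORECASE, re.MULTILINE, re.DOTALL) and "global" matching is done by calling re.findall instead of setting a flag.

```python
import re

text = "cp600\nCP700"

# Without flags, ^ and $ refer to the whole text, so nothing matches.
no_flags = re.findall(r"^cp\d+$", text)

# i + m: ignore case and let ^ and $ match at every line.
with_flags = re.findall(r"^cp\d+$", text, re.IGNORECASE | re.MULTILINE)

# s (dotall): the dot is allowed to match the newline between the lines.
dot_default = re.search(r"cp600.CP700", text) is not None
dot_all = re.search(r"cp600.CP700", text, re.DOTALL) is not None

print(no_flags, with_flags, dot_default, dot_all)
```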

Examples

The following example regular expressions can be used within OpenScape Endpoint Management.

IP address range

The following examples can be used for matching IP address ranges:

Regular Expression                   Description
192\.168\.0\.((2[5-9])|(3[0-9]))     Matches 192.168.0.25 through 192.168.0.39
192\.168\.1\.[0-9]{1,3}              Matches all addresses within subnet 192.168.1.0/24 (it also accepts out-of-range last octets such as 999)
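
The two expressions can be checked offline with a short sketch. It assumes Python's re module and whole-value matching (fullmatch); whether the application matches the whole value or searches inside it is not stated here, so the last line also shows what happens with a search.

```python
import re

range_25_39 = re.compile(r"192\.168\.0\.((2[5-9])|(3[0-9]))")
subnet = re.compile(r"192\.168\.1\.[0-9]{1,3}")

# Try the last octets 20..44 and keep the addresses that match completely.
candidates = [f"192.168.0.{n}" for n in range(20, 45)]
in_range = [ip for ip in candidates if range_25_39.fullmatch(ip)]

# The subnet pattern only counts digits, so 999 is accepted as well.
accepts_999 = subnet.fullmatch("192.168.1.999") is not None

# With a search instead of a full match, a longer address also matches.
search_hits_250 = range_25_39.search("192.168.0.250") is not None

print(in_range[0], in_range[-1], len(in_range), accepts_999, search_hits_250)
```

If the application searches inside the value, anchoring the expression with ^ and $ avoids the last effect.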

Device types

The following examples can be used for matching specific device types:

Regular Expression   Description
CP[67].*             Matches CP600, CP700, CP700X and CP710
^CP700$              Matches CP700 but not CP700X
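
The device type expressions can be verified in the same way. This sketch assumes Python's re module and search semantics; the model list, including CP400 as a non-matching entry, is only test input.

```python
import re

models = ["CP600", "CP700", "CP700X", "CP710", "CP400"]

# CP followed by 6 or 7, then anything.
cp6_or_cp7 = [m for m in models if re.search(r"CP[67].*", m)]

# Anchored on both sides, so nothing may follow CP700.
exactly_cp700 = [m for m in models if re.search(r"^CP700$", m)]

print(cp6_or_cp7)
print(exactly_cp700)
```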

Testing your regular expressions

If you want to test your regular expression, there are plenty of websites that allow you to do this online.
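
As an offline alternative, a few lines of script are enough to check an expression against values that should and should not match. The sketch below assumes Python's re module; check_pattern is a hypothetical helper name, and the engine used by a given product may differ in details from Python's.

```python
import re

def check_pattern(pattern, should_match, should_not_match):
    """Return the inputs for which the pattern behaves unexpectedly."""
    compiled = re.compile(pattern)
    missed = [s for s in should_match if not compiled.search(s)]
    unexpected = [s for s in should_not_match if compiled.search(s)]
    return missed, unexpected

# Both lists are empty when the pattern behaves as intended.
missed, unexpected = check_pattern(r"^CP700$", ["CP700"], ["CP700X", "CP600"])
print(missed, unexpected)
```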