332 lines
15 KiB
Markdown
332 lines
15 KiB
Markdown
---
|
|
title: Regular Expressions
|
|
TARGET DECK: Obsidian::STEM
|
|
FILE TAGS: computability::regexp
|
|
tags:
|
|
- computability
|
|
- regexp
|
|
---
|
|
|
|
## Overview
|
|
|
|
TODO
|
|
|
|
## Extensions
|
|
|
|
The following ERE (**E**xtended **R**egular **E**xpression) operators were defined to achieve consistency between programs like `grep`, `sed`, and `awk`. In POSIX, regexps are greedy.
|
|
|
|
%%ANKI
|
|
Cloze
|
|
Regular expressions are either {greedy} or {lazy}.
|
|
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
|
|
<!--ID: 1707231745948-->
|
|
END%%
|
|
|
|
%%ANKI
|
|
Basic
|
|
Are POSIX regexps greedy or lazy?
|
|
Back: Greedy.
|
|
Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. [https://www.gnu.org/software/gawk/manual/gawk.pdf](https://www.gnu.org/software/gawk/manual/gawk.pdf)
|
|
<!--ID: 1707231745951-->
|
|
END%%
|
|
|
|
%%ANKI
|
|
Basic
|
|
What does it mean for a regexp to be greedy?
|
|
Back: The regexp matches as many characters as it can.
|
|
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
|
|
<!--ID: 1707231745952-->
|
|
END%%
|
|
|
|
%%ANKI
|
|
Basic
|
|
What does it mean for a regexp to be lazy?
|
|
Back: The regexp matches as few characters as it can.
|
|
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
|
|
<!--ID: 1707231745954-->
|
|
END%%
|
|
|
|
%%ANKI
|
|
Basic
|
|
What is the POSIX ERE standard?
|
|
Back: The **E**xtended **R**egular **E**xpression standard. A standard based off of regexps accepted by `egrep`.
|
|
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
|
|
<!--ID: 1707050923589-->
|
|
END%%
|
|
|
|
* `.` matches any single character.
|
|
* There exist application-specific exclusions. For instance, newlines and the `NUL` character are often ignored.
|
|
|
|
%%ANKI
|
|
Cloze
|
|
The {`.`} operator matches {any single character}.
|
|
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
|
|
<!--ID: 1707050923593-->
|
|
END%%
|
|
|
|
%%ANKI
|
|
Basic
|
|
What two common exclusions are made with `.`?
|
|
Back: Newlines and the `NUL` character.
|
|
Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. [https://www.gnu.org/software/gawk/manual/gawk.pdf](https://www.gnu.org/software/gawk/manual/gawk.pdf)
|
|
<!--ID: 1707050923596-->
|
|
END%%
|
|
|
|
* `[...]`, the **bracket expression**, matches any enclosed character.
|
|
* An optional `-` can be included to denote a range.
|
|
* `-` is treated literally if its the first or last specified character.
|
|
* `]` is treated literally if its the first specified character.
|
|
* `^` complements the set if its the first specified character.
|
|
|
|
%%ANKI
|
|
Basic
|
|
What name is given to the `[...]` operator?
|
|
Back: The bracket expression.
|
|
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
|
|
<!--ID: 1707050923600-->
|
|
END%%
|
|
|
|
%%ANKI
|
|
Basic
|
|
What three characters are interpreted specially in a bracket expression?
|
|
Back: `^`, `-`, and `]`
|
|
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
|
|
<!--ID: 1707050923605-->
|
|
END%%
|
|
|
|
%%ANKI
|
|
Basic
|
|
When is `-` interpreted literally in a bracket expression?
|
|
Back: When it is the first or last specified character.
|
|
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
|
|
<!--ID: 1707050923610-->
|
|
END%%
|
|
|
|
%%ANKI
|
|
Basic
|
|
When is `^` interpreted literally in a bracket expression?
|
|
Back: When it is not the first specified character.
|
|
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
|
|
<!--ID: 1707050923615-->
|
|
END%%
|
|
|
|
%%ANKI
|
|
Basic
|
|
When is `]` interpreted literally in a bracket expression?
|
|
Back: When it is the first specified character.
|
|
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
|
|
<!--ID: 1707050923621-->
|
|
END%%
|
|
|
|
* `^` is the leading anchor. It matches the starting position of a string.
|
|
* `$` is the trailing anchor. It matches the ending position of a string.
|
|
|
|
%%ANKI
|
|
Cloze
|
|
The {`^`} operator matches {the starting position of a string}.
|
|
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
|
|
<!--ID: 1707050923629-->
|
|
END%%
|
|
|
|
%%ANKI
|
|
Cloze
|
|
The {`$$`} operator matches {the ending position of a string}.
|
|
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
|
|
<!--ID: 1707050923635-->
|
|
END%%
|
|
|
|
%%ANKI
|
|
Basic
|
|
`^` and `$$` belong to what operator category?
|
|
Back: Anchors.
|
|
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
|
|
<!--ID: 1707050923643-->
|
|
END%%
|
|
|
|
* `*` matches the preceding element zero or more times.
|
|
* `+` matches the preceding element one or more times.
|
|
* `?` matches the preceding element zero or one times.
|
|
|
|
%%ANKI
|
|
Basic
|
|
What does the `*` operator do?
|
|
Back: Matches the preceding element zero or more times.
|
|
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
|
|
<!--ID: 1707050923650-->
|
|
END%%
|
|
|
|
%%ANKI
|
|
Basic
|
|
How is the `*` operator written equivalently as an interval expression?
|
|
Back: `{0,}`
|
|
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
|
|
<!--ID: 1707654685031-->
|
|
END%%
|
|
|
|
%%ANKI
|
|
Basic
|
|
What does the `+` operator do?
|
|
Back: Matches the preceding element one or more times.
|
|
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
|
|
<!--ID: 1707050923656-->
|
|
END%%
|
|
|
|
%%ANKI
|
|
Basic
|
|
How is the `+` operator written equivalently as an interval expression?
|
|
Back: `{1,}`
|
|
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
|
|
<!--ID: 1707654685034-->
|
|
END%%
|
|
|
|
%%ANKI
|
|
Basic
|
|
What does the `?` operator do?
|
|
Back: Matches the preceding element zero or one times.
|
|
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
|
|
<!--ID: 1707050923662-->
|
|
END%%
|
|
|
|
%%ANKI
|
|
Basic
|
|
How is the `?` operator written equivalently as an interval expression?
|
|
Back: `{0,1}`
|
|
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
|
|
<!--ID: 1707654685036-->
|
|
END%%
|
|
|
|
* `{n}`, an **interval expression**, matches the preceding element `n` times.
|
|
* `{n,}` matches the preceding element at least `n` times.
|
|
* `{n,m}` matches the preceding element between `n` and `m` times.
|
|
* Interval expressions cannot contain repetition counts `> 255`. Results are otherwise undefined.
|
|
|
|
%%ANKI
|
|
Basic
|
|
What name is given to the e.g. `{n,m}` operator?
|
|
Back: The interval expression.
|
|
Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. [https://www.gnu.org/software/gawk/manual/gawk.pdf](https://www.gnu.org/software/gawk/manual/gawk.pdf)
|
|
<!--ID: 1707050923669-->
|
|
END%%
|
|
|
|
%%ANKI
|
|
Basic
|
|
What does the `{n}` operator do?
|
|
Back: Matches the preceding element `n` times.
|
|
Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. [https://www.gnu.org/software/gawk/manual/gawk.pdf](https://www.gnu.org/software/gawk/manual/gawk.pdf)
|
|
<!--ID: 1707050923676-->
|
|
END%%
|
|
|
|
%%ANKI
|
|
Basic
|
|
What does the `{n,}` operator do?
|
|
Back: Matches the preceding element at least `n` times.
|
|
Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. [https://www.gnu.org/software/gawk/manual/gawk.pdf](https://www.gnu.org/software/gawk/manual/gawk.pdf)
|
|
<!--ID: 1707050923683-->
|
|
END%%
|
|
|
|
%%ANKI
|
|
Basic
|
|
What does the `{n,m}` operator do?
|
|
Back: Matches the preceding element between `n` and `m` times.
|
|
Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. [https://www.gnu.org/software/gawk/manual/gawk.pdf](https://www.gnu.org/software/gawk/manual/gawk.pdf)
|
|
<!--ID: 1707050923689-->
|
|
END%%
|
|
|
|
* `|` is the **alternation operator**. It allows specifying match alternatives.
|
|
|
|
%%ANKI
|
|
Basic
|
|
What name is given to the e.g. `|` operator?
|
|
Back: The alternation operator.
|
|
Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. [https://www.gnu.org/software/gawk/manual/gawk.pdf](https://www.gnu.org/software/gawk/manual/gawk.pdf)
|
|
<!--ID: 1707050923701-->
|
|
END%%
|
|
|
|
%%ANKI
|
|
Basic
|
|
What does the `|` operator do?
|
|
Back: Matches different regexp alternatives.
|
|
Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. [https://www.gnu.org/software/gawk/manual/gawk.pdf](https://www.gnu.org/software/gawk/manual/gawk.pdf)
|
|
<!--ID: 1707050923708-->
|
|
END%%
|
|
|
|
%%ANKI
|
|
Basic
|
|
Which regexp operator has the least precedence?
|
|
Back: `|`
|
|
Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. [https://www.gnu.org/software/gawk/manual/gawk.pdf](https://www.gnu.org/software/gawk/manual/gawk.pdf)
|
|
<!--ID: 1707050923713-->
|
|
END%%
|
|
|
|
### Character Classes
|
|
|
|
Notation for describing a class of characters specific to a given locale/character set.
|
|
|
|
%%ANKI
|
|
Basic
|
|
What portability issue do character classes introduce?
|
|
Back: Matching characters are dependent on locale/character set.
|
|
Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. [https://www.gnu.org/software/gawk/manual/gawk.pdf](https://www.gnu.org/software/gawk/manual/gawk.pdf)
|
|
<!--ID: 1707050923719-->
|
|
END%%
|
|
|
|
%%ANKI
|
|
Basic
|
|
How are character classes denoted?
|
|
Back: `[:class:]`
|
|
Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. [https://www.gnu.org/software/gawk/manual/gawk.pdf](https://www.gnu.org/software/gawk/manual/gawk.pdf)
|
|
<!--ID: 1707050923724-->
|
|
END%%
|
|
|
|
| Class | Similar To | Meaning |
|
|
| ------------ | --------------- | ------------------------------------------------ |
|
|
| `[:alnum:]` | `[A-Za-z0-9]` | Alphanumeric characters |
|
|
| `[:alpha:]` | `[A-Za-z]` | Alphabetic characters |
|
|
| `[:blank:]` | `[ \t]` | `' '` and `TAB` characters |
|
|
| `[:cntrl:]` | | Control characters |
|
|
| `[:digit:]` | `[0-9]` | Numeric characters |
|
|
| `[:graph:]` | `[^ [:cntrl:]]` | Printable and visible characters |
|
|
| `[:lower:]` | `[a-z]` | Lowercase alphabetic characters |
|
|
| `[:print:]` | `[ [:graph:]]` | Printable characters |
|
|
| `[:punct:]` | | All graphic characters except letters and digits |
|
|
| `[:space:]` | `[ \t\n\r\f\v]` | Whitespace characters |
|
|
| `[:upper:]` | `[A-Z]` | Uppercase alphabetic characters |
|
|
| `[:xdigit:]` | `[0-9A-Fa-f]` | [[radices#Hexadecimal\|Hexadecimal]] digits |
|
|
|
|
%%ANKI
|
|
Basic
|
|
Generally speaking, what is a printable character?
|
|
Back: Characters that can be displayed on screen or printed on paper.
|
|
Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. [https://www.gnu.org/software/gawk/manual/gawk.pdf](https://www.gnu.org/software/gawk/manual/gawk.pdf)
|
|
<!--ID: 1707050923728-->
|
|
END%%
|
|
|
|
%%ANKI
|
|
Basic
|
|
Is `'a'` (i.e. the letter *a*) printable and/or visible?
|
|
Back: It is printable and visible.
|
|
Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. [https://www.gnu.org/software/gawk/manual/gawk.pdf](https://www.gnu.org/software/gawk/manual/gawk.pdf)
|
|
<!--ID: 1707050923732-->
|
|
END%%
|
|
|
|
%%ANKI
|
|
Basic
|
|
Is `' '` (i.e. the space character) printable and/or visible?
|
|
Back: It is printable but not visible.
|
|
Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. [https://www.gnu.org/software/gawk/manual/gawk.pdf](https://www.gnu.org/software/gawk/manual/gawk.pdf)
|
|
<!--ID: 1707050923736-->
|
|
END%%
|
|
|
|
%%ANKI
|
|
Basic
|
|
Is `'\t'` (i.e. the tab character) printable and/or visible?
|
|
Back: It is neither printable nor visible.
|
|
Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. [https://www.gnu.org/software/gawk/manual/gawk.pdf](https://www.gnu.org/software/gawk/manual/gawk.pdf)
|
|
<!--ID: 1707050923740-->
|
|
END%%
|
|
|
|
## Bibliography
|
|
|
|
* “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
|
|
* Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. [https://www.gnu.org/software/gawk/manual/gawk.pdf](https://www.gnu.org/software/gawk/manual/gawk.pdf)
|