notebook/notes/computability/regexp.md

332 lines
15 KiB
Markdown
Raw Permalink Normal View History

2024-02-04 12:50:19 +00:00
---
title: Regular Expressions
TARGET DECK: Obsidian::STEM
2024-12-27 02:51:09 +00:00
FILE TAGS: computability::regexp
2024-02-04 12:50:19 +00:00
tags:
2024-12-27 02:51:09 +00:00
- computability
2024-02-04 12:50:19 +00:00
- regexp
---
## Overview
TODO
## Extensions
The following ERE (**E**xtended **R**egular **E**xpression) operators were defined to achieve consistency between programs like `grep`, `sed`, and `awk`. In POSIX, regexps are greedy.
%%ANKI
Cloze
Regular expressions are either {greedy} or {lazy}.
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
<!--ID: 1707231745948-->
END%%
%%ANKI
Basic
Are POSIX regexps greedy or lazy?
Back: Greedy.
Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. [https://www.gnu.org/software/gawk/manual/gawk.pdf](https://www.gnu.org/software/gawk/manual/gawk.pdf)
<!--ID: 1707231745951-->
END%%
%%ANKI
Basic
What does it mean for a regexp to be greedy?
Back: The regexp matches as many characters as it can.
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
<!--ID: 1707231745952-->
END%%
%%ANKI
Basic
What does it mean for a regexp to be lazy?
Back: The regexp matches as few characters as it can.
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
<!--ID: 1707231745954-->
END%%
2024-02-04 12:50:19 +00:00
%%ANKI
Basic
What is the POSIX ERE standard?
Back: The **E**xtended **R**egular **E**xpression standard. A standard based off of regexps accepted by `egrep`.
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
<!--ID: 1707050923589-->
END%%
* `.` matches any single character.
* There exist application-specific exclusions. For instance, newlines and the `NUL` character are often ignored.
%%ANKI
Cloze
The {`.`} operator matches {any single character}.
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
<!--ID: 1707050923593-->
END%%
%%ANKI
Basic
What two common exclusions are made with `.`?
Back: Newlines and the `NUL` character.
Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. [https://www.gnu.org/software/gawk/manual/gawk.pdf](https://www.gnu.org/software/gawk/manual/gawk.pdf)
<!--ID: 1707050923596-->
END%%
* `[...]`, the **bracket expression**, matches any enclosed character.
* An optional `-` can be included to denote a range.
* `-` is treated literally if its the first or last specified character.
* `]` is treated literally if its the first specified character.
* `^` complements the set if its the first specified character.
%%ANKI
Basic
What name is given to the `[...]` operator?
Back: The bracket expression.
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
<!--ID: 1707050923600-->
END%%
%%ANKI
Basic
What three characters are interpreted specially in a bracket expression?
Back: `^`, `-`, and `]`
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
<!--ID: 1707050923605-->
END%%
%%ANKI
Basic
When is `-` interpreted literally in a bracket expression?
Back: When it is the first or last specified character.
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
<!--ID: 1707050923610-->
END%%
%%ANKI
Basic
When is `^` interpreted literally in a bracket expression?
Back: When it is not the first specified character.
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
<!--ID: 1707050923615-->
END%%
%%ANKI
Basic
When is `]` interpreted literally in a bracket expression?
Back: When it is the first specified character.
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
<!--ID: 1707050923621-->
END%%
* `^` is the leading anchor. It matches the starting position of a string.
* `$` is the trailing anchor. It matches the ending position of a string.
%%ANKI
Cloze
The {`^`} operator matches {the starting position of a string}.
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
<!--ID: 1707050923629-->
END%%
%%ANKI
Cloze
The {`$$`} operator matches {the ending position of a string}.
2024-02-04 12:50:19 +00:00
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
<!--ID: 1707050923635-->
END%%
%%ANKI
Basic
`^` and `$$` belong to what operator category?
Back: Anchors.
2024-02-04 12:50:19 +00:00
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
<!--ID: 1707050923643-->
END%%
* `*` matches the preceding element zero or more times.
* `+` matches the preceding element one or more times.
* `?` matches the preceding element zero or one times.
%%ANKI
Basic
What does the `*` operator do?
Back: Matches the preceding element zero or more times.
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
<!--ID: 1707050923650-->
END%%
2024-02-11 12:33:02 +00:00
%%ANKI
Basic
How is the `*` operator written equivalently as an interval expression?
Back: `{0,}`
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
<!--ID: 1707654685031-->
END%%
2024-02-04 12:50:19 +00:00
%%ANKI
Basic
What does the `+` operator do?
Back: Matches the preceding element one or more times.
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
<!--ID: 1707050923656-->
END%%
2024-02-11 12:33:02 +00:00
%%ANKI
Basic
How is the `+` operator written equivalently as an interval expression?
Back: `{1,}`
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
<!--ID: 1707654685034-->
END%%
2024-02-04 12:50:19 +00:00
%%ANKI
Basic
What does the `?` operator do?
Back: Matches the preceding element zero or one times.
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
<!--ID: 1707050923662-->
END%%
2024-02-11 12:33:02 +00:00
%%ANKI
Basic
How is the `?` operator written equivalently as an interval expression?
Back: `{0,1}`
Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
<!--ID: 1707654685036-->
END%%
2024-02-04 12:50:19 +00:00
* `{n}`, an **interval expression**, matches the preceding element `n` times.
* `{n,}` matches the preceding element at least `n` times.
* `{n,m}` matches the preceding element between `n` and `m` times.
* Interval expressions cannot contain repetition counts `> 255`. Results are otherwise undefined.
%%ANKI
Basic
What name is given to the e.g. `{n,m}` operator?
Back: The interval expression.
Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. [https://www.gnu.org/software/gawk/manual/gawk.pdf](https://www.gnu.org/software/gawk/manual/gawk.pdf)
<!--ID: 1707050923669-->
END%%
%%ANKI
Basic
What does the `{n}` operator do?
Back: Matches the preceding element `n` times.
Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. [https://www.gnu.org/software/gawk/manual/gawk.pdf](https://www.gnu.org/software/gawk/manual/gawk.pdf)
<!--ID: 1707050923676-->
END%%
%%ANKI
Basic
What does the `{n,}` operator do?
Back: Matches the preceding element at least `n` times.
Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. [https://www.gnu.org/software/gawk/manual/gawk.pdf](https://www.gnu.org/software/gawk/manual/gawk.pdf)
<!--ID: 1707050923683-->
END%%
%%ANKI
Basic
What does the `{n,m}` operator do?
Back: Matches the preceding element between `n` and `m` times.
Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. [https://www.gnu.org/software/gawk/manual/gawk.pdf](https://www.gnu.org/software/gawk/manual/gawk.pdf)
<!--ID: 1707050923689-->
END%%
* `|` is the **alternation operator**. It allows specifying match alternatives.
%%ANKI
Basic
What name is given to the e.g. `|` operator?
Back: The alternation operator.
Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. [https://www.gnu.org/software/gawk/manual/gawk.pdf](https://www.gnu.org/software/gawk/manual/gawk.pdf)
<!--ID: 1707050923701-->
END%%
%%ANKI
Basic
What does the `|` operator do?
Back: Matches different regexp alternatives.
Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. [https://www.gnu.org/software/gawk/manual/gawk.pdf](https://www.gnu.org/software/gawk/manual/gawk.pdf)
<!--ID: 1707050923708-->
END%%
%%ANKI
Basic
Which regexp operator has the least precedence?
Back: `|`
Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. [https://www.gnu.org/software/gawk/manual/gawk.pdf](https://www.gnu.org/software/gawk/manual/gawk.pdf)
<!--ID: 1707050923713-->
END%%
### Character Classes
2024-02-04 12:50:19 +00:00
Notation for describing a class of characters specific to a given locale/character set.
%%ANKI
Basic
What portability issue do character classes introduce?
2024-02-04 12:50:19 +00:00
Back: Matching characters are dependent on locale/character set.
Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. [https://www.gnu.org/software/gawk/manual/gawk.pdf](https://www.gnu.org/software/gawk/manual/gawk.pdf)
<!--ID: 1707050923719-->
END%%
%%ANKI
Basic
How are character classes denoted?
Back: `[:class:]`
Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. [https://www.gnu.org/software/gawk/manual/gawk.pdf](https://www.gnu.org/software/gawk/manual/gawk.pdf)
<!--ID: 1707050923724-->
END%%
| Class | Similar To | Meaning |
| ------------ | --------------- | ------------------------------------------------ |
| `[:alnum:]` | `[A-Za-z0-9]` | Alphanumeric characters |
| `[:alpha:]` | `[A-Za-z]` | Alphabetic characters |
| `[:blank:]` | `[ \t]` | `' '` and `TAB` characters |
| `[:cntrl:]` | | Control characters |
| `[:digit:]` | `[0-9]` | Numeric characters |
| `[:graph:]` | `[^ [:cntrl:]]` | Printable and visible characters |
| `[:lower:]` | `[a-z]` | Lowercase alphabetic characters |
| `[:print:]` | `[ [:graph:]]` | Printable characters |
| `[:punct:]` | | All graphic characters except letters and digits |
| `[:space:]` | `[ \t\n\r\f\v]` | Whitespace characters |
| `[:upper:]` | `[A-Z]` | Uppercase alphabetic characters |
| `[:xdigit:]` | `[0-9A-Fa-f]` | [[radices#Hexadecimal\|Hexadecimal]] digits |
2024-02-04 12:50:19 +00:00
%%ANKI
Basic
Generally speaking, what is a printable character?
Back: Characters that can be displayed on screen or printed on paper.
Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. [https://www.gnu.org/software/gawk/manual/gawk.pdf](https://www.gnu.org/software/gawk/manual/gawk.pdf)
<!--ID: 1707050923728-->
END%%
%%ANKI
Basic
Is `'a'` (i.e. the letter *a*) printable and/or visible?
Back: It is printable and visible.
Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. [https://www.gnu.org/software/gawk/manual/gawk.pdf](https://www.gnu.org/software/gawk/manual/gawk.pdf)
<!--ID: 1707050923732-->
END%%
%%ANKI
Basic
Is `' '` (i.e. the space character) printable and/or visible?
Back: It is printable but not visible.
Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. [https://www.gnu.org/software/gawk/manual/gawk.pdf](https://www.gnu.org/software/gawk/manual/gawk.pdf)
<!--ID: 1707050923736-->
END%%
%%ANKI
Basic
Is `'\t'` (i.e. the tab character) printable and/or visible?
Back: It is neither printable nor visible.
Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. [https://www.gnu.org/software/gawk/manual/gawk.pdf](https://www.gnu.org/software/gawk/manual/gawk.pdf)
<!--ID: 1707050923740-->
END%%
## Bibliography
2024-02-04 12:50:19 +00:00
* “POSIX Basic Regular Expressions,” accessed February 4, 2024, [https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions](https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions).
* Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. [https://www.gnu.org/software/gawk/manual/gawk.pdf](https://www.gnu.org/software/gawk/manual/gawk.pdf)