notebook/notes/posix/regexp.md

14 KiB

title TARGET DECK FILE TAGS tags
Regular Expressions Obsidian::STEM regexp
regexp

Overview

The following ERE (Extended Regular Expression) operators were defined to achieve consistency between programs like grep, sed, and awk. In POSIX, regexps are greedy.

%%ANKI Cloze Regular expressions are either {greedy} or {lazy}. Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions.

END%%

%%ANKI Basic Are POSIX regexps greedy or lazy? Back: Greedy. Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. https://www.gnu.org/software/gawk/manual/gawk.pdf

END%%

%%ANKI Basic What does it mean for a regexp to be greedy? Back: The regexp matches as many characters as it can. Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions.

END%%

%%ANKI Basic What does it mean for a regexp to be lazy? Back: The regexp matches as few characters as it can. Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions.

END%%

%%ANKI Basic What is the POSIX ERE standard? Back: The Extended Regular Expression standard. A standard based off of regexps accepted by egrep. Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions.

END%%

  • . matches any single character.
    • There exist application-specific exclusions. For instance, newlines and the NUL character are often ignored.

%%ANKI Cloze The {.} operator matches {any single character}. Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions.

END%%

%%ANKI Basic What two common exclusions are made with .? Back: Newlines and the NUL character. Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. https://www.gnu.org/software/gawk/manual/gawk.pdf

END%%

  • [...], the bracket expression, matches any enclosed character.
    • An optional - can be included to denote a range.
    • - is treated literally if its the first or last specified character.
    • ] is treated literally if its the first specified character.
    • ^ complements the set if its the first specified character.

%%ANKI Basic What name is given to the [...] operator? Back: The bracket expression. Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions.

END%%

%%ANKI Basic What three characters are interpreted specially in a bracket expression? Back: ^, -, and ] Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions.

END%%

%%ANKI Basic When is - interpreted literally in a bracket expression? Back: When it is the first or last specified character. Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions.

END%%

%%ANKI Basic When is ^ interpreted literally in a bracket expression? Back: When it is not the first specified character. Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions.

END%%

%%ANKI Basic When is ] interpreted literally in a bracket expression? Back: When it is the first specified character. Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions.

END%%

  • ^ is the leading anchor. It matches the starting position of a string.
  • $ is the trailing anchor. It matches the ending position of a string.

%%ANKI Cloze The {^} operator matches {the starting position of a string}. Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions.

END%%

%%ANKI Cloze The {$$} operator matches {the ending position of a string}. Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions.

END%%

%%ANKI Basic ^ and $ belong to what operator category? Back: Anchors Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions.

END%%

  • * matches the preceding element zero or more times.
  • + matches the preceding element one or more times.
  • ? matches the preceding element zero or one times.

%%ANKI Basic What does the * operator do? Back: Matches the preceding element zero or more times. Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions.

END%%

%%ANKI Basic What does the + operator do? Back: Matches the preceding element one or more times. Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions.

END%%

%%ANKI Basic What does the ? operator do? Back: Matches the preceding element zero or one times. Reference: “POSIX Basic Regular Expressions,” accessed February 4, 2024, https://en.wikibooks.org/wiki/Regular_Expressions/POSIX_Basic_Regular_Expressions.

END%%

  • {n}, an interval expression, matches the preceding element n times.
    • {n,} matches the preceding element at least n times.
    • {n,m} matches the preceding element between n and m times.
    • Interval expressions cannot contain repetition counts > 255. Results are otherwise undefined.

%%ANKI Basic What name is given to the e.g. {n,m} operator? Back: The interval expression. Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. https://www.gnu.org/software/gawk/manual/gawk.pdf

END%%

%%ANKI Basic What does the {n} operator do? Back: Matches the preceding element n times. Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. https://www.gnu.org/software/gawk/manual/gawk.pdf

END%%

%%ANKI Basic What does the {n,} operator do? Back: Matches the preceding element at least n times. Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. https://www.gnu.org/software/gawk/manual/gawk.pdf

END%%

%%ANKI Basic What does the {n,m} operator do? Back: Matches the preceding element between n and m times. Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. https://www.gnu.org/software/gawk/manual/gawk.pdf

END%%

%%ANKI Basic What interval expression repetition counts lead to undefined behavior? Back: Counts greater than 255. Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. https://www.gnu.org/software/gawk/manual/gawk.pdf

END%%

  • | is the alternation operator. It allows specifying match alternatives.

%%ANKI Basic What name is given to the e.g. | operator? Back: The alternation operator. Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. https://www.gnu.org/software/gawk/manual/gawk.pdf

END%%

%%ANKI Basic What does the | operator do? Back: Matches different regexp alternatives. Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. https://www.gnu.org/software/gawk/manual/gawk.pdf

END%%

%%ANKI Basic Which regexp operator has the least precedence? Back: | Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. https://www.gnu.org/software/gawk/manual/gawk.pdf

END%%

Character Classes

Notation for describing a class of characters specific to a given locale/character set.

%%ANKI Basic What inconsistency do character classes introduce? Back: Matching characters are dependent on locale/character set. Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. https://www.gnu.org/software/gawk/manual/gawk.pdf

END%%

%%ANKI Basic How are character classes denoted? Back: [:class:] Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. https://www.gnu.org/software/gawk/manual/gawk.pdf

END%%

Class Similar To Meaning
[:alnum:] [A-Za-z0-9] Alphanumeric characters
[:alpha:] [A-Za-z] Alphabetic characters
[:blank:] [ \t] ' ' and TAB characters
[:cntrl:] Control characters
[:digit:] [0-9] Numeric characters
[:graph:] [^ [:cntrl:]] Printable and visible characters
[:lower:] [a-z] Lowercase alphabetic characters
[:print:] [ [:graph:]] Printable characters
[:punct:] All graphic characters except letters and digits
[:space:] [ \t\n\r\f\v] Whitespace characters
[:upper:] [A-Z] Uppercase alphabetic characters
[:xdigit:] [0-9A-Fa-f] Hexadecimal digits

%%ANKI Basic Generally speaking, what is a printable character? Back: Characters that can be displayed on screen or printed on paper. Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. https://www.gnu.org/software/gawk/manual/gawk.pdf

END%%

%%ANKI Basic Is 'a' (i.e. the letter a) printable and/or visible? Back: It is printable and visible. Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. https://www.gnu.org/software/gawk/manual/gawk.pdf

END%%

%%ANKI Basic Is ' ' (i.e. the space character) printable and/or visible? Back: It is printable but not visible. Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. https://www.gnu.org/software/gawk/manual/gawk.pdf

END%%

%%ANKI Basic Is '\t' (i.e. the tab character) printable and/or visible? Back: It is neither printable nor visible. Reference: Robbins, Arnold D. “GAWK: Effective AWK Programming,” October 2023. https://www.gnu.org/software/gawk/manual/gawk.pdf

END%%

References