Table A-2. Metacharacters in regular expressions

Metacharacter Name Code Point Purpose
. Full Stop U+002E Match any character
\ Backslash U+005C Escape a character
| Vertical Bar U+007C Alternation (or)
^ Circumflex U+005E Beginning of a line anchor
$ Dollar Sign U+0024 End of a line anchor
? Question Mark U+003F Zero or one quantifier
* Asterisk U+002A Zero or more quantifier
+ Plus Sign U+002B One or more quantifier
[ Left Square Bracket U+005B Open character class
] Right Square Bracket U+005D Close character class
{ Left Curly Brace U+007B Open quantifier or block
} Right Curly Brace U+007D Close quantifier or block
( Left Parenthesis U+0028 Open group
) Right Parenthesis U+0029 Close group
- The hyphen is treated specially, as signifying a range, inside of the square brackets of a character class. Otherwise, it’s not special.

Table A-3. Character shorthands

Character Shorthand Description
\a Alert
\b Word boundary
[\b] Backspace character
\B Non-word boundary
\cx Control character
\d Digit character
\D Non-digit character
\dxxx Decimal value for a character
\f Form feed character
\h Horizontal whitespace
\H Not horizontal whitespace
\r Carriage return
\n Newline character
\oxxx Octal value for a character
\s Space character
\S Non-space character
\t Horizontal tab character
\v Vertical tab character (whitespace)
\V Not vertical whitespace
\w Word character
\W Non-word character
\0 Null character
\xxx Hexadecimal value for a character
\uxxxx Unicode value for a character

What Is a Regular Expression?

Regular expressions are specially encoded text strings used as patterns for matching sets of strings.

http://www.regexpal.com /input here/g

Entering 707-827-7019 as the pattern uses a string literal to match the same string in the target text.

[0-9] Match any digit you find in the range 0 through 9

The square brackets are not literally matched because they are treated specially as metacharacters. A metacharacter has special meaning in regular expressions and is reserved. A regular expression in the form [0-9] is called a character class, or sometimes a character set.

\d will match all Arabic digits, just like [0-9].

This kind of regular expression is called a character shorthand. It is also called a character escape.

To match non-digits, use an escaped uppercase D (\D), which matches any character that is not a digit.

The dot or period essentially acts as a wildcard and will match any character (except, in certain situations, a line ending). Some tools offer a dotall option that makes it possible to match a newline with a dot.

You’ll now match just a portion of the phone number using what is known as a capturing group. Then you’ll refer to the content of the group with a backreference. To create a capturing group, enclose a \d in a pair of parentheses to place it in a group, and then follow it with a \1 to backreference what was captured:

(\d)\d\1 can match 707

(\d)0\1\D\d\d\1\D\1\d\d\d can match 707-827-7019
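
The two backreference patterns above can also be tried outside the browser. The following is a minimal sketch that assumes Python and its re module as the host; the notes themselves use Regexpal, and the variable names are only illustrative.

```python
import re

phone = "707-827-7019"

# (\d) captures a digit; \1 then requires that same digit again.
short_match = re.search(r"(\d)\d\1", phone)

# The longer pattern reuses the captured 7 three more times.
full_match = re.fullmatch(r"(\d)0\1\D\d\d\1\D\1\d\d\d", phone)

print(short_match.group(0))    # the matched text
print(short_match.group(1))    # the captured digit
print(full_match is not None)  # whether the whole number matched
```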

\d{3}-?\d{3}-?\d{4}

The numbers in the curly braces tell the regex processor exactly how many occurrences of those digits you want it to look for. The braces with numbers are a kind of quantifier. The braces themselves are considered metacharacters.

The question mark (?) is another kind of quantifier. It means that there can be zero or one occurrence of the hyphen (one or none).

There are other quantifiers, such as the plus sign (+), which means “one or more,” and the asterisk (*), which means “zero or more.”
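
To see how the {3}, {4}, and ? quantifiers interact, here is a small Python sketch (Python is an assumption of this example; the sample strings other than 707-827-7019 are made up for illustration).

```python
import re

# Three digits, optional hyphen, three digits, optional hyphen, four digits.
pattern = re.compile(r"\d{3}-?\d{3}-?\d{4}")

samples = ["707-827-7019", "7078277019", "707-8277019", "707-827-701"]

# fullmatch requires the entire string to fit the pattern.
results = {s: bool(pattern.fullmatch(s)) for s in samples}

for sample, matched in results.items():
    print(sample, matched)
```

The last sample fails because \d{4} demands exactly four digits and only three remain.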

(\d{3,4}[.-]?)+

• ( open a capturing group

• \ start character shorthand (escape the following character)

• d end character shorthand (match any digit in the range 0 through 9 with \d)

• { open quantifier

• 3 minimum quantity to match

• , separate quantities

• 4 maximum quantity to match

• } close quantifier

• [ open character class

• . dot or period (matches literal dot)

• - literal character to match hyphen

• ] close character class

• ? zero or one quantifier

• ) close capturing group

• + one or more quantifier

This all works, but it’s not quite right because it will also match other groups of 3 or 4 digits, whether in the form of a phone number or not. So let’s improve it a little:

(\d{3}[.-]?){2}\d{4}
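
A short Python sketch (an assumed host language, with made-up inputs) shows why the stricter pattern is an improvement: the looser one accepts any run of three or four digits.

```python
import re

loose = re.compile(r"(\d{3,4}[.-]?)+")
strict = re.compile(r"(\d{3}[.-]?){2}\d{4}")

# A bare four-digit number is not a phone number, yet the loose pattern accepts it.
loose_accepts_1234 = loose.fullmatch("1234") is not None
strict_accepts_1234 = strict.fullmatch("1234") is not None

# Both patterns accept properly shaped numbers with dots or hyphens.
both_accept_dots = all(p.fullmatch("707.827.7019") for p in (loose, strict))
both_accept_hyphens = all(p.fullmatch("707-827-7019") for p in (loose, strict))
```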

Finally, here is a regular expression that allows literal parentheses to optionally wrap the first sequence of three digits, and makes the area code optional as well:

^(\(\d{3}\)|^\d{3}[.-]?)?\d{3}[.-]?\d{4}$

• ^ (caret) at the beginning of the regular expression, or following the vertical bar (|), means that the phone number will be at the beginning of a line.

• ( opens a capturing group.

• \( is a literal open parenthesis.

• \d matches a digit.

• {3} is a quantifier that, following \d, matches exactly three digits.

• \) matches a literal close parenthesis.

• | (the vertical bar) indicates alternation, that is, a given choice of alternatives. In other words, this says “match an area code with parentheses or without them.”

• ^ matches the beginning of a line.

• \d matches a digit.

• {3} is a quantifier that matches exactly three digits.

• [.-]? matches an optional dot or hyphen.

• ) close capturing group.

• ? make the group optional, that is, the prefix in the group is not required.

• \d matches a digit.

• {3} matches exactly three digits.

• [.-]? matches another optional dot or hyphen.

• \d matches a digit.

• {4} matches exactly four digits.

• $ matches the end of a line.

The capturing group in the above regular expression is not necessary. The group is necessary, but the capturing part is not. There is a better way to do this: a non-capturing group.
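
As a rough illustration of that point, the sketch below compares the capturing pattern with a non-capturing variant in Python (an assumed host language). The non-capturing version also drops the second ^, which is redundant after the first; that simplification is this example's choice, not the book's.

```python
import re

capturing = re.compile(r"^(\(\d{3}\)|^\d{3}[.-]?)?\d{3}[.-]?\d{4}$")
# (?:...) groups the alternatives without storing what they matched.
non_capturing = re.compile(r"^(?:\(\d{3}\)|\d{3}[.-]?)?\d{3}[.-]?\d{4}$")

good = ["707-827-7019", "(707)827-7019", "707.827.7019", "827-7019"]
bad = ["7-827-7019", "707-827-70199"]

accepted_same = all(
    bool(capturing.search(s)) and bool(non_capturing.search(s)) for s in good
)
rejected_same = not any(
    capturing.search(s) or non_capturing.search(s) for s in bad
)

# .groups reports how many capturing groups a compiled pattern has.
group_counts = (capturing.groups, non_capturing.groups)
```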

Simple Pattern Matching

http://gskinner.com/regexr

http://regexr.com/

2.1 Matching String Literals (By default, string matching is case-sensitive in Regexpal.)

2.2 Matching Digits \d [0-9] [01]

2.3 Matching Non-Digits \D is the same as a negated class [^0-9] or [^\d]

2.4 Matching Word and Non-Word Characters \w \W

The difference between \D and \w is that \D matches whitespace, punctuation, quotation marks, hyphens, forward slashes, square brackets, and other similar characters, while \w does not; it matches letters, digits, and the underscore.

In English, \w matches essentially the same thing as the character class: [_a-zA-Z0-9]

\W matches whitespace, punctuation, and other kinds of characters that aren’t used in words in this example. It is the same thing as the character class: [^_a-zA-Z0-9]

Try [^\w] and [^\W] as well; they behave like \W and \w, respectively.

2.5 Matching Whitespace \s is the same as the character class [ \t\n\r]

In other words, it matches spaces, tabs (\t), line feeds (\n), and carriage returns (\r). Many engines also count form feed (\f) and vertical tab (\v) as whitespace.

\S matches any non-whitespace character, the same as the character class [^ \t\n\r]
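
The shorthand and character-class equivalences in sections 2.2 through 2.5 can be checked with a few lines of Python. This is a sketch under assumptions: Python is not used in the notes, the sample string is invented, and the re.ASCII flag is passed because Python's \w, \d, and \s are Unicode-aware by default.

```python
import re

sample = "Rime_1798:\tIt is an ancyent Marinere,\r\n"

# re.ASCII restricts \w, \W, \d, \D, and \s to their ASCII meanings.
same_word = re.findall(r"\w", sample, re.ASCII) == re.findall(r"[_a-zA-Z0-9]", sample)
same_nonword = re.findall(r"\W", sample, re.ASCII) == re.findall(r"[^_a-zA-Z0-9]", sample)
same_nondigit = re.findall(r"\D", sample, re.ASCII) == re.findall(r"[^0-9]", sample)

# Equal here only because the sample has no form feed or vertical tab,
# which Python's \s would also match.
same_space = re.findall(r"\s", sample, re.ASCII) == re.findall(r"[ \t\n\r]", sample)
```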

Table 2-2. Character shorthands for whitespace characters

Character Shorthand Description
\f Form feed character
\h Horizontal whitespace
\H Not horizontal whitespace
\r Carriage return
\n Newline character
\t Horizontal tab character
\v Vertical tab character (whitespace)
\V Not vertical whitespace

Not all whitespace shorthands work with all regex processors.

2.6 Matching Any Character

The dot matches all characters but line ending characters, except under certain circumstances.

\bA.{5}T\b

• The shorthand \b matches a word boundary, without consuming any characters.

• The characters A and T also bound the sequence of characters.

• .{5} matches any five characters.

• Match another word boundary with \b.

\b\w{7}\b

matches seven character words

.* is described as the same as [^\n]* or [^\n\r]*, but the two do not always seem to behave consistently.

.+

The reason why it does this is because these quantifiers are greedy.

2.7 Marking Up the Text

In RegExr, click the Replace tab and enter the pattern (^T.*$)

then enter the replacement:

<h1>$1</h1>

The replacement regex surrounds the captured group, represented by $1, in an h1 element.

In most implementations, including Perl, you use this style: \1; but RegExr supports only $1, $2, $3 and so forth.
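
For comparison, here is a hedged Python sketch of the same markup step. Python is an assumption of this example; its re.sub writes the backreference as \1 (or \g<1>) rather than $1, and the two-line sample text is a stand-in for the real file.

```python
import re

# Stand-in for the first two lines of the poem file.
text = "THE RIME OF THE ANCYENT MARINERE, IN SEVEN PARTS.\nARGUMENT."

# MULTILINE lets ^ and $ anchor at each line, so only lines starting with T match.
marked = re.sub(r"(^T.*$)", r"<h1>\1</h1>", text, flags=re.MULTILINE)

print(marked)
```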

2.8 Using sed to Mark Up Text

echo Hello | sed s/Hello/Goodbye/

The s (substitute) command of sed then changes the word Hello to Goodbye, and Goodbye is displayed on your screen.

sed -n 's/^/<h1>/;s/$/<\/h1>/p;q' rime.txt

• The line starts by invoking the sed program.

• The -n option suppresses sed’s default behavior of echoing each line of input to the output. This is because you want to see only the line affected by the regex, that is, line 1.

• s/^/<h1>/ places an h1 start-tag at the beginning (^) of the line.

• The semicolon (;) separates commands.

• s/$/<\/h1>/ places an h1 end-tag at the end ($) of the line.

• The p command prints the affected line (line 1). This is in contrast to sed’s default behavior without -n, which echoes every line, regardless.

• Lastly, the q command quits the program so that sed processes only the first line.

• All these operations are performed against the file rime.txt.

Another way of writing this line is with the -e option. The -e option appends the editing commands, one after another.

sed -ne 's/^/<h1>/' -e 's/$/<\/h1>/p' -e 'q' rime.txt

You could also collect these commands in a file, as with h1.sed shown here

#!/usr/bin/sed
s/^/<h1>/
s/$/<\/h1>/
q

To run it, type:

sed -f h1.sed rime.txt

Boundaries

This chapter focuses on assertions. Assertions mark boundaries, but they don’t consume characters; that is, characters will not be returned in a result. They are also known as zero-width assertions. A zero-width assertion doesn’t match a character, per se, but rather a location in a string. Some of these, such as ^ and $, are also called anchors.

3.1 use anchors at the beginning or end of a line with ^ or $

To match the beginning of a line or string, use the caret or circumflex (U+005E): ^

To match the end of a line or string, use the dollar sign: $

Whether ^How.*Country.$ matches depends on whether the multiline option is checked.

The dotall option means that the dot will match newlines in addition to all other characters.
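
The effect of these options can be sketched in Python, where they are spelled re.MULTILINE and re.DOTALL. Both the choice of Python and the two-line subject string are assumptions made for this example.

```python
import re

# Hypothetical two-line subject standing in for the sample text.
text = "How a Ship sailed\nto her own Country."

pattern = r"^How.*Country.$"

# Without DOTALL the dot stops at the newline, so the pattern cannot span both lines.
no_flags = re.search(pattern, text)
with_dotall = re.search(pattern, text, re.DOTALL)

# Without MULTILINE, ^ matches only at the very start of the subject.
starts_default = re.findall(r"^\w+", text)
starts_multiline = re.findall(r"^\w+", text, re.MULTILINE)
```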

3.2 use word boundaries and non-word boundaries

\b marks a word boundary.

Like ^ or $, \b is a zero-width assertion. It may appear to match things like a space or the beginning of a line, but in actuality, what it matches is a zero-width nothing.

\B matches non-word boundaries. For example, \Be\B matches an e that has word characters on both sides.
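
A small Python sketch (Python and the sample line are assumptions of this example) makes the difference between \b and \B concrete.

```python
import re

line = "the other theme bathe"

# \bthe\b matches only the standalone word, not "the" inside longer words.
whole_words = re.findall(r"\bthe\b", line)

# \Be\B matches an e with word characters on both sides:
# the e in "other" and the first e in "theme".
inner_e_positions = [m.start() for m in re.finditer(r"\Be\B", line)]

print(whole_words, inner_e_positions)
```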

Another way of specifying a word boundary is with \< for the beginning of a word and \> for the end of the word.

grep -Eoc '\<(THE|The|the)\>' rime.txt

The -E option indicates that you want to use extended regular expressions (EREs) rather than the basic regular expressions (BREs) which are used by grep by default. The -o option means you want to show in the result only that part of the line that matches the pattern, and the -c option means only return a count of the result.

grep -Eoc '(THE|The|the)' rime.txt

Without \< and \>, the pattern matches not only whole words but also any sequence of characters that contains the word. That is one reason why \< and \> can come in handy.

3.3 match the beginning or end of a subject with \A and \Z (or \z)

pcregrep -c '\A\s*(THE|The|the)' rime.txt

pcregrep -n '(MARINERE|Marinere)(.)?\Z' rime.txt

3.4 quote strings as literals with \Q and \E

.^$*+?|(){}[]-

If you try to match those characters in the upper text box of RegExr, nothing will happen. Why? Because RegExr thinks (if it can think) that you are entering a regular expression, not literal characters.

\Q$\E

and it will match $ because anything between \Q and \E is interpreted as a literal character (see Figure 3-3). (Remember, you can precede a metacharacter with a \ to make it literal.)
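
Not every engine supports \Q and \E. Python's re module, for instance, does not, but its re.escape function serves a similar purpose by backslash-escaping metacharacters. The sketch below assumes Python and an invented price string.

```python
import re

metachars = ".^$*+?|(){}[]-"

# re.escape returns a pattern that matches the given text literally.
escaped = re.escape(metachars)
whole = re.fullmatch(escaped, metachars)

# The same idea applied to a single dollar sign.
dollar = re.search(re.escape("$"), "Price: $5")

print(escaped)
```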

3.5 add tags to a document with sed

The insert (i) command in sed allows you to insert text above or before a location in a document or a string. By the way, the opposite of i in sed is a, which appends text below or after a location. We’ll use the append command later.

The following command inserts the HTML5 doctype and several other tags, beginning at line 1:

sed '1 i\
<!DOCTYPE html>\
<html lang="en">\
<head>\
<title>Rime</title>\
</head>\
<body>
s/^/<h1>/
s/$/<\/h1>/
q' rime.txt

These same sed commands are saved in the file top.sed

sed -f top.sed rime.txt > temp

Alternation, Groups, and Backreferences

Groups surround text with parentheses to help perform some operation, such as the following:

• Performing alternation, a choice between two or more optional patterns

• Creating subpatterns

• Capturing a group to later reference with a backreference

• Applying an operation to a grouped pattern, such as a quantifier

• Using non-capturing groups

• Atomic grouping (advanced)

4.1 Alternation

Alternation gives you a choice of alternate patterns to match.

(the|The|THE)

We can make this group shorter by applying an option. Options let you specify the way you would like to search for a pattern.

For example, the option: (?i) makes your pattern case-insensitive. (?i)the

Options such as (?i) don’t work with grep, so the alternation is spelled out:

grep -Ec "(the|The|THE)" rime.txt

• The -E option means that you want to use extended regular expressions (EREs) rather than basic regular expressions (BREs). This, for example, saves you from having to escape the parentheses and the vertical bar, like \(THE\|The\|the\), as you must with BREs.

• The -c option returns a count of the matched lines (not matched words).

• The parentheses group the choice or alternation of the, The, or THE.

• The vertical bar separates possible choices, which are evaluated left to right.

To get a count of actual words used, this approach will return each occurrence of the word, one per line:

grep -Eo "(the|The|THE)" rime.txt | wc -l

• The -o option means to show only that part of the line that matches the pattern, though this is not apparent due to the pipe (|) to wc.

• The vertical bar, in this context, pipes the output of the grep command to the input of the wc command. wc is a word count command, and -l counts the number of lines of the input.

This matters because -c gives you a count of matching lines, but there can be more than one match on each line. If you use -o with wc -l, then each occurrence of the various forms of the word will appear on a separate line and be counted.
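
The distinction between counting matching lines and counting occurrences can be reproduced in a few lines of Python. This is a sketch only: Python is an assumed host, and the three-line text is invented rather than taken from rime.txt.

```python
import re

# Invented sample; the second line contains no form of the word.
text = "The ship and the sea\nno article on this line\nthe wind, THE sun, the moon\n"

pattern = re.compile(r"(the|The|THE)")
lines = text.splitlines()

# Like grep -c: count lines that contain at least one match.
matching_lines = sum(1 for line in lines if pattern.search(line))

# Like grep -o piped to wc -l: count every occurrence.
occurrences = sum(len(pattern.findall(line)) for line in lines)

print(matching_lines, occurrences)
```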

4.2 Subpatterns

Subpatterns in regular expressions most often refer to a group or groups within groups. A subpattern is a pattern within a pattern. Often, a condition in a subpattern is matchable when a preceding pattern is matched, but not always. Subpatterns can be designed in a variety of ways, but we’re concerned primarily with those defined within parentheses here.

(the|The|THE) has three subpatterns; the second subpattern does not depend on matching the first.

(t|T)h(e|eir) is one where the subpattern(s) depend on the previous pattern.

In this case, the second subpattern (e|eir) is dependent on the first (t|T).

Subpatterns don’t require parentheses. Here is an example of subpatterns done with character classes:

\b[tT]h[ceinry]*\b

4.3 Capturing Groups and Backreferences

When a pattern groups all or part of its content into a pair of parentheses, it captures that content and stores it temporarily in memory. You can reuse that content if you wish by using a backreference, in the form: \1 or $1

where \1 or $1 reference the first captured group, \2 or $2 reference the second captured group, and so on. sed will only accept the \1 form.

sed -En 's/(It is) (an ancyent Marinere)/\2 \1/p' rime.txt

the output will be: an ancyent Marinere It is,

• The -E option once again invokes EREs, so you don’t have to quote the parentheses, for example.

• The -n option suppresses the default behavior of printing every line.

• The substitute command searches for a match for the text “It is an ancyent Marinere,” capturing it into two groups.

• The substitute command also replaces the match by rearranging the captured text in the output, with the backreference \2 first, then \1.

• The p at the end of the substitute command means you want to print the line.
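
The same swap can be sketched with Python's re.sub, which, like sed, writes backreferences in the replacement as \1 and \2. Python is an assumption of this example; the input line is the one the sed command targets.

```python
import re

line = "It is an ancyent Marinere,"

# Two capturing groups, emitted in reverse order in the replacement.
swapped = re.sub(r"(It is) (an ancyent Marinere)", r"\2 \1", line)

print(swapped)
```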

Named groups are captured groups with names. You can access those groups by name later, rather than by integer.

perl -ne 'print if s/(?<one>It is) (?<two>an ancyent Marinere)/\u$+{two} \l$+{one}/' rime.txt

• Adding ?<one> and ?<two> inside the parentheses names the groups one and two, respectively.

• $+{one} references the group named one, and $+{two}, the group named two.

Table 4-3. Named group syntax

Syntax Description
(?<name>…) A named group
(?name…) Another named group
(?P<name>…) A named group in Python
\k{name} Reference by name in .NET
(?P=name) Reference by name in Python
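
The Python rows of the table can be exercised as follows. This is a minimal sketch: the group names one, two, and word are illustrative, and unlike the Perl one-liner it does not change letter case.

```python
import re

line = "It is an ancyent Marinere,"
named = r"(?P<one>It is) (?P<two>an ancyent Marinere)"

# Access a group by name on the match object.
match = re.search(named, line)
first = match.group("one")

# In a replacement string, a named group is written \g<name>.
swapped = re.sub(named, r"\g<two> \g<one>", line)

# Inside the pattern itself, (?P=name) refers back to a named group.
doubled = re.search(r"(?P<word>\b\w+) (?P=word)\b", "the the ship")
```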

4.4 Non-Capturing Groups

Non-capturing groups don’t store their content in memory. Sometimes this is an advantage, especially if you never intend to reference the group.

For (the|The|THE), you don’t need to backreference anything, so you could write a non-capturing group this way:

(?:the|The|THE)

You can add an option to make the pattern case-insensitive: (?i)(?:the) or (?:(?i)the)

A better form is:

(?i:the)

The option letter i can be inserted between the question mark and the colon.
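
A brief Python sketch shows both points: a non-capturing group stores nothing, and an option placed between the question mark and the colon applies only inside that group. Python is an assumed host here, and the scoped (?i:...) form needs Python 3.6 or later.

```python
import re

capturing = re.compile(r"(the|The|THE)")
non_capturing = re.compile(r"(?:the|The|THE)")

# .groups is the number of capturing groups in the compiled pattern.
group_counts = (capturing.groups, non_capturing.groups)

# The i option is limited to the group, so "Marinere" stays case-sensitive.
scoped = re.compile(r"(?i:the) Marinere")
matches_upper_the = scoped.search("THE Marinere") is not None
matches_lower_marinere = scoped.search("the marinere") is not None
```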

Another kind of non-capturing group is the atomic group. If you are using a regex engine that does backtracking, this group will turn backtracking off, not for the entire regular expression but just for that part enclosed in the atomic group.

When would you want to use atomic groups? One of the things that can really slow regex processing is backtracking. The reason why is, as it tries all the possibilities, it takes time and computing resources. Sometimes it can gobble up a lot of time.

Character Classes

Character classes are sometimes called bracketed expressions. Character classes help you match specific characters, or sequences of specific characters.

But you can use character classes to be even more specific than that. In this way, they are more versatile than shorthands.

Now expand your horizon. If you wanted to match even numbers in the range 10 through 19, you could combine two character classes side by side, like this: \b[1][24680]\b

Or you could push things further and look for even numbers in the range 0 through 99 with this:

\b[24680]\b|\b[1-9][24680]\b

If you want to create a character class that matches hexadecimal digits, [a-fA-F0-9]
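
The even-number and hexadecimal classes above can be tried on a handful of values. The following Python sketch uses made-up inputs, and Python itself is an assumption of the example.

```python
import re

# Even numbers from 0 through 99, as whole words.
even = re.compile(r"\b[24680]\b|\b[1-9][24680]\b")
found = even.findall("0 7 12 19 48 99 100 6")

# One or more hexadecimal digits.
hex_digits = re.compile(r"[a-fA-F0-9]+")
is_hex = hex_digits.fullmatch("7fC0") is not None
is_not_hex = hex_digits.fullmatch("7gC0") is None

print(found)
```

Note that 100 is skipped: none of its digits sits between two word boundaries in a way that fits either alternative.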

You can also use shorthands inside of a character class: [\w\s] is the same as [_a-zA-Z0-9 \t\n\r]

5.1 Negated Character Classes [^aeiou]

5.2 Union and Difference

Character classes can act like sets. This functionality is not supported by all implementations.

If you wanted a union of two character sets, you could do it like this: [0-3[6-9]]

To match a difference (in essence, subtraction): [a-z&&[^m-r]]

which matches all the letters from a to z, except m through r
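
Python's re module is one of the implementations without this set syntax. As a workaround sketch (an assumption of this example, not the notation shown above), a union can be written as one flat class and a difference can be emulated with a negative lookahead in front of the class.

```python
import re

# Union of 0-3 and 6-9 written as a single flat class.
union = re.compile(r"[0-36-9]")
union_hits = "".join(union.findall("0123456789"))

# Difference: any a-z letter that is not in m-r.
difference = re.compile(r"(?![m-r])[a-z]")
difference_hits = "".join(difference.findall("abcdefghijklmnopqrstuvwxyz"))

print(union_hits, difference_hits)
```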

5.3 POSIX Character Classes

Character Class Description
[[:alnum:]] Alphanumeric characters (letters and digits)
[[:alpha:]] Alphabetic characters (letters)
[[:ascii:]] ASCII characters (all 128)
[[:blank:]] Blank characters
[[:cntrl:]] Control characters
[[:digit:]] Digits
[[:graph:]] Graphic characters
[[:lower:]] Lowercase letters
[[:print:]] Printable characters
[[:punct:]] Punctuation characters
[[:space:]] Whitespace characters
[[:upper:]] Uppercase letters
[[:word:]] Word characters
[[:xdigit:]] Hexadecimal digits

Matching Unicode and Other Characters

You will have occasion to match characters or ranges of characters that are outside the scope of ASCII. ASCII, or the American Standard Code for Information Interchange, defines an English character set: the letters A through Z in upper- and lowercase, plus control and other characters. The 128-character, Latin-based set was standardized.

But now it is dated, especially in light of the Unicode standard, which currently represents over 100,000 characters.

6.1 Matching a Unicode Character

Match any Unicode character with \uxxxx or \xxx. The \u is followed by a hexadecimal value.

Match any Unicode character inside of Vim using \%xxx, \%Xxx, \%uxxxx, or \%Uxxxx.

6.2 Matching Characters with Octal Numbers

match characters in the range 0–255 using octal format with \000

6.3 Matching Unicode Character Properties

In some implementations, such as Perl, you can match on Unicode character properties. The properties include characteristics like whether the character is a letter, number, or punctuation mark.

Using ack on a command line, you can specify that you want to see all the characters whose property is Letter (L):

ack '\pL' schiller.txt

For lowercase letters, use Ll, surrounded by braces:

ack '\p{Ll}' schiller.txt

For uppercase, it’s Lu:

ack '\p{Lu}' schiller.txt

To specify characters that do not match a property, we use uppercase P:

ack '\PL' schiller.txt
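
Python's built-in re module does not understand \p{…}, but the same general categories are available through the standard unicodedata module. The sketch below is an approximation under that assumption; the helper name chars_with_property and the sample phrase are invented for illustration.

```python
import unicodedata

def chars_with_property(text, prefix):
    """Return the characters whose general category starts with prefix (e.g. 'L', 'Ll')."""
    return [ch for ch in text if unicodedata.category(ch).startswith(prefix)]

sample = "Freude, schöner Götterfunken 9"

letters = chars_with_property(sample, "L")      # roughly \pL
lowercase = chars_with_property(sample, "Ll")   # roughly \p{Ll}
uppercase = chars_with_property(sample, "Lu")   # roughly \p{Lu}

# Roughly \PL: everything that is not a letter.
non_letters = [ch for ch in sample if not unicodedata.category(ch).startswith("L")]
```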

The following table lists character property names for use with \p{property} or \P{property}

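Property Description
C Other
Cc Control
Cf Format
Cn Unassigned
Co Private use
Cs Surrogate
L Letter
Ll Lowercase letter
Lm Modifier letter
Lo Other letter
Lt Titlecase letter
Lu Uppercase letter
L& Ll, Lu, or Lt
M Mark
Mc Spacing mark
Me Enclosing mark
Mn Non-spacing mark
N Number
Nd Decimal number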

6.4 Matching Control Characters

In regular expressions, you can specify a control character like this: \cx

where x is the control character you want to match.

Table 6-3. Matching Unicode and other characters

Code Description
\uxxxx Unicode (four places)
\xxx Unicode (two places)
\x{xxxx} Unicode (four places)
\x{xx} Unicode (two places)
\000 Octal (base 8)
\cx Control character
\0 Null
\a Bell
\e Escape
[\b] Backspace

Quantifiers

7.1 Greedy, Lazy, and Possessive

Quantifiers are, by themselves, greedy. A greedy quantifier first tries to match the whole string. It grabs as much as it can, the whole input, trying to make a match. If the first attempt to match the whole string goes awry, it backs up one character and tries again. This is called backtracking. It keeps backing up one character at a time until it finds a match or runs out of characters to try. It also keeps track of what it is doing, so it puts the most load on resources compared with the next two approaches.

A lazy (sometimes called reluctant) quantifier takes a different tack. It starts at the beginning of the target, trying to find a match. It looks at the string one character at a time, trying to find what it is looking for. At last, it will attempt to match the whole string. To get a quantifier to be lazy, you have to append a question mark (?) to the regular quantifier.

A possessive quantifier grabs the whole target and then tries to find a match, but it makes only one attempt. It does not do any backtracking. A possessive quantifier appends a plus sign (+) to the regular quantifier.

7.2 Matching with *, +, and ?

.* it would match, being greedy, all the characters (digits) in the subject text.

9.* lights up the row of nines and the row of zeros below it. Because Multiline is checked (at the bottom of the application window), the dot will match the newline character between the rows; normally, it would not.

Syntax Description
? Zero or one (optional)
+ One or more
* Zero or more

7.3 Matching a Specific Number of Times

Table 7-2. Summary of range syntax

Syntax Description
{n} Match n times exactly
{n,} Match n or more times
{m,n} Match m to n times
{0,1} Same as ? (zero or one)
{1,} Same as + (one or more)
{0,} Same as * (zero or more)
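
The equivalences between range syntax and the ?, +, and * quantifiers can be spot-checked in Python. This is a sketch with an invented subject string; Python is an assumed host.

```python
import re

subject = "3 33 333 3333"

# Each pair should produce identical matches.
pairs = [(r"3{0,1}", r"3?"), (r"3{1,}", r"3+"), (r"3{0,}", r"3*")]
equivalent = [re.findall(a, subject) == re.findall(b, subject) for a, b in pairs]

# Exact and bounded counts, limited to whole words with \b.
exactly_three = re.findall(r"\b3{3}\b", subject)
two_to_three = re.findall(r"\b3{2,3}\b", subject)
```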

7.4 Lazy Quantifiers

By nature, the lazy match matches as few characters as it can get away with. It’s a slacker.

5*? it won’t match anything either, because you gave it the option to match a minimum of zero times, and that’s what it does.

5{2,5}? Only two 5s are matched, not all five of them, as a greedy match would.

Syntax Description effect
?? Lazy zero or one (optional) zero
+? Lazy one or more one
*? Lazy zero or more zero
{n}? Lazy n n
{n,}? Lazy n or more n
{m,n}? Lazy m,n m
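
A short Python sketch contrasts greedy and lazy matching on a row of five 5s. Python is an assumption of this example; the lazy syntax shown is the same as in the table.

```python
import re

fives = "55555"

greedy = re.search(r"5{2,5}", fives).group(0)    # takes as many as allowed
lazy = re.search(r"5{2,5}?", fives).group(0)     # settles for the minimum of two
lazy_star = re.search(r"5*?", fives).group(0)    # minimum is zero, so nothing
lazy_plus = re.search(r"5+?", fives).group(0)    # minimum is one

print(repr(greedy), repr(lazy), repr(lazy_star), repr(lazy_plus))
```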

7.5 Possessive Quantifiers

A possessive match is like a greedy match: it grabs as much as it can get away with. But unlike a greedy match, it does not backtrack.

Syntax Description
?+ Possessive zero or one (optional)
++ Possessive one or more
*+ Possessive zero or more
{n}+ Possessive n
{n,}+ Possessive n or more
{m,n}+ Possessive m,n

Lookarounds

Lookarounds are non-capturing groups that match patterns based on what they find either in front of or behind a pattern. Lookarounds are also considered zero-width assertions.

Lookarounds include:

• Positive lookaheads

• Negative lookaheads

• Positive lookbehinds

• Negative lookbehinds

8.1 Positive Lookaheads

Suppose you want to find every occurrence of the word ancyent that is followed by marinere. You can use a positive lookahead:

(?i)ancyent (?=marinere)

With ack, use the case-insensitive option (?i) or the -i flag; here the lookahead is shortened to ma:

ack '(?i)ancyent (?=ma)' rime.txt
ack -i 'ancyent (?=ma)' rime.txt

8.2 Negative Lookaheads

The flip side of a positive lookahead is a negative lookahead. This means that as you try to match a pattern, you won’t find a given lookahead pattern. A negative lookahead is formed like this:

(?i)ancyent (?!marinere)
ack -i 'ancyent (?!marinere)' rime.txt

8.3 Positive Lookbehinds

A positive lookbehind looks to the left, in the opposite direction from a lookahead. The syntax is:

(?i)(?<=ancyent) marinere
ack -i '(?<=ancyent) marinere' rime.txt

8.4 Negative Lookbehinds

A negative lookbehind checks that a pattern does not show up behind, that is, to the left in the left-to-right stream of text. Again, it adds a less-than sign (<), reminding you which direction lookbehind is.

(?i)(?<!ancyent) marinere
ack -i '(?<!ancyent) marinere' rime.txt
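
All four lookarounds can be compared side by side on one invented line. The sketch assumes Python, whose re module supports these constructs (with the restriction that a lookbehind must have a fixed width, which "ancyent" satisfies).

```python
import re

# Invented line: one "ancyent Marinere", two other "ancyent" phrases,
# and one "Marinere" that does not follow "ancyent".
text = "ancyent Marinere, ancyent man, ancyent ship, bright Marinere"

pos_ahead = re.findall(r"(?i)ancyent (?=marinere)", text)
neg_ahead = re.findall(r"(?i)ancyent (?!marinere)", text)
pos_behind = re.findall(r"(?i)(?<=ancyent) marinere", text)
neg_behind = re.findall(r"(?i)(?<!ancyent) marinere", text)

# Each lookaround checks its neighbor without including it in the match.
print(pos_ahead, neg_ahead, pos_behind, neg_behind)
```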

Marking Up a Document with HTML

9.1 Matching Tags

match start-tags

<[_a-zA-Z][^>]*>

To match both start- and end-tags,

</?[_a-zA-Z][^>]*>

I’m sticking with start-tags only here. To refine the output, I often pipe in a few other tools to make it prettier:

grep -Eo '<[_a-zA-Z][^>]*>' lorem.dita | sort | uniq | sed 's/^<//;s/ id=\".*\"//;s/>$//'
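
A rough Python equivalent of that pipeline is sketched below. Python and the one-line sample document are assumptions of the example; the two tag patterns are the ones shown above.

```python
import re

# Invented stand-in for the markup file.
doc = '<topic id="lorem"><title>Lorem</title><p id="p1">ipsum</p></topic>'

start_tags = re.findall(r"<[_a-zA-Z][^>]*>", doc)
all_tags = re.findall(r"</?[_a-zA-Z][^>]*>", doc)

# Strip the angle brackets and attributes, then sort the unique element names,
# much as sort, uniq, and sed do in the shell pipeline.
names = sorted({tag[1:-1].split()[0] for tag in start_tags})

print(names)
```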

9.2 Transforming Plain Text with sed

sed insert command (i)

sed '1 i\
<!DOCTYPE html>\
<html lang="en">\
<head>\
<title>The Rime of the Ancyent Marinere (1798)</title>\
\
</head>\
<body>
q' rime.txt

9.2.1 Substitution with sed

sed finds the first line of the file and captures the entire line in a capturing group using escaped parentheses \( and \). sed needs to escape the parentheses used to capture a group unless you use the -E option (more on this in a moment). The beginning of the line is demarcated with ^, and the end of the line with a $. The backreference \1 pulls the captured text into the content of the title element, indented with one space.

sed '1s/^\(.*\)$/ <title>\1<\/title>/;q' rime.txt

 <title>THE RIME OF THE ANCYENT MARINERE, IN SEVEN PARTS.</title>

sed -E '1s/^(.*)$/<!DOCTYPE html>\
<html lang="en">\
<head>\
<title>\1<\/title>\
<\/head>\
<body>\
<h1>\1<\/h1>\
/;q' rime.txt

In the output, the text THE RIME OF THE ANCYENT MARINERE, IN SEVEN PARTS. now appears twice: once in the title element and once in the h1 element.

9.2.2 Handling Roman Numerals with sed

The following line will use sed to capture that heading and those Roman numerals and surround them in h2 tags:

sed -En 's/^(ARGUMENT\.|I{0,3}V?I{0,2}\.)$/<h2>\1<\/h2>/p' rime.txt

• The -E option gives you extended regular expressions, and the -n option suppresses the printing of each line, which is sed’s default behavior.

• The substitute (s) command captures the heading and the seven uppercase Roman numerals, each on separate lines and followed by a period, in the range I through VII.

• The s command then takes each line of captured text and nestles it in an h2 element.

• The p flag at the end of the substitution prints the result to the screen.
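
The heading pattern can be checked against a few sample lines in Python. This is a sketch: Python is an assumed host, and the list of lines is invented apart from the poem's opening words.

```python
import re

heading = re.compile(r"^(ARGUMENT\.|I{0,3}V?I{0,2}\.)$")

lines = ["ARGUMENT.", "I.", "IV.", "VII.", "It is an ancyent Marinere,", "VIII."]

# Like sed -n with the p flag: keep only the lines where a substitution happened.
marked = [heading.sub(r"<h2>\1</h2>", line) for line in lines if heading.search(line)]

for line in marked:
    print(line)
```

The pattern is loose by design: it rejects VIII. but would also accept a lone period, which is harmless only because no such line occurs in the file.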


9.2.3 Handling a Specific Paragraph with sed

The following finds a paragraph on line 5 and wraps it in a p element:

sed -En '5s/^([A-Z].*)$/<p>\1<\/p>/p' rime.txt

9.2.4 Handling the Lines of the Poem with sed

sed -E '9s/^[ ]*(.*)/ <p>\1<br\/>/;10,832s/^([ ]{5,7}.*)/\1<br\/>/;
833s/^(.*)/\1<\/p>/' rime.txt

These sed substitutions depend on line numbers to get their little jobs done. This wouldn’t work with a generalized case, but it works quite well when you know exactly what you are dealing with.

• On line 9, the first line of verse, the s command grabs the line and, after prepending a few spaces, it inserts a p start-tag and appends a br (break) tag at the end of the line.

• Between lines 10 and 832, every line that begins with between 5 and 7 spaces gets a br appended to it.

• On line 833, the last line of the poem, instead of a br, the s appends a p end-tag.

To replace the blank lines with a br:

sed -E 's/^$/<br\/>/' rime.txt

The End of the Beginning

This chapter revisits the phone number regular expression from the first chapter, ^(\(\d{3}\)|^\d{3}[.-]?)?\d{3}[.-]?\d{4}$, whose parts are broken down bullet by bullet there.

Reference

Introducing Regular Expressions