Introduction to Perl 5.10 for Unix System Administrators

(Perl 5.10 without excessive complexity)

by Dr Nikolai Bezroukov

5.2. Overview of Perl regular expressions

The Hello World Example
Two types of regex
Two Binding Operators (=~ and !~)
Success and Failure of Matching
Metacharacters
Examples of simple regex

The Hello World Example

As mentioned before, regular expressions are a language inside the language. Regex should be viewed as a separate language that has no direct connection to Perl: it is used in many other languages (Python, PHP, Java) in almost the same form, just with different syntactic sugar. Still, Perl was among the first languages to introduce "close binding" of regex and the language per se, a feature that was later more or less successfully copied by Python, Tcl and other languages. The level of integration of the regular expression language into the main language is also higher in Perl than in any alternative scripting language. But because regex is a different language, some problems arise. For example, the Perl debugger cannot step through a regular expression.

The Perl regular expression engine evolves gradually. The latest significant changes, introduced in version 5.10, make it more powerful and less prone to errors. Perl 5.10 is the minimal version recommended for any serious text parsing work.

As regular expressions (regex for short) are a new language, using the famous "Hello world" program as the first program seems appropriate. As a remnant of the shell/AWK legacy, a regular expression is lexically a special type of literal (similar to a double-quoted literal).

It is usually (but not necessarily) enclosed in slashes. In the matching operation the source string (where matching occurs) is specified on the left side of the special =~ operator (the binding operator), while the regex is on the right side.

The simplest case is searching for a substring in a string, as the built-in function index does. The following expression is true if the string Hello appears anywhere in the variable $sentence.

$sentence = "Hello world";
if ($sentence =~ /Hello/) { 
   print "Matched\n"
} else {
   print "Not matched\n"
}

The regular expressions are case sensitive, so if we assign to $sentence the same string but in lower case

$sentence = "hello world";

then the above match will fail.

The operator !~ can be used for a non-match. For example, the expression

$sentence !~ /Hello/

is true if the string Hello does not appear in $sentence.
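
The behaviour of both binding operators and of case sensitivity can be checked with a short sketch. The variable names are illustrative only; index is the built-in function mentioned above.

```perl
use strict;
use warnings;

my $sentence = "Hello world";

# =~ is true when the pattern is found, !~ is true when it is not
my $found     = ($sentence =~ /Hello/) ? "yes" : "no";
my $not_found = ($sentence !~ /Hello/) ? "yes" : "no";

# Matching is case sensitive: /hello/ does not match "Hello world"
my $lower     = ($sentence =~ /hello/) ? "yes" : "no";

# For a plain substring, index gives the same answer (-1 means "absent")
my $pos = index($sentence, "Hello");

print "found=$found not_found=$not_found lower=$lower pos=$pos\n";
# prints: found=yes not_found=no lower=no pos=0
```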

Alternatively, you can use m or qr with delimiters other than slashes. That is very useful if your regex contains a lot of slashes:

$url !~ qr(/cygdrive/f/public_html)
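
To see why other delimiters help, here is a minimal sketch that writes the same test three ways. The sample path is made up for the illustration.

```perl
use strict;
use warnings;

my $url = "/cygdrive/f/public_html/index.html";   # sample path, made up for this sketch

# With slashes as delimiters every slash in the pattern must be escaped
my $with_slashes = ($url =~ /\/cygdrive\/f\/public_html/) ? 1 : 0;

# m{...} needs no escaping of slashes
my $with_braces  = ($url =~ m{/cygdrive/f/public_html}) ? 1 : 0;

# qr(...) compiles the pattern so that it can be stored in a variable and reused
my $prefix  = qr(/cygdrive/f/public_html);
my $with_qr = ($url =~ $prefix) ? 1 : 0;

print "$with_slashes $with_braces $with_qr\n";    # prints: 1 1 1
```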

The variable $_ is the default operand for regular expressions. But in most cases the string against which the match or substitution is performed should be specified explicitly with the operator =~ or its negation !~. For example:

$my_string = "The graph has many leaves";
$result = $my_string;   # work on a copy so that the initial string is preserved
if ( $my_string =~ m/graph/ ) {
   print("The source string contains the word 'graph'.\n");
   $result =~ s/graph/tree/;
   print "Replaced with 'tree'\n";
}
print("Initial string: '$my_string'.\nThe result is '$result'\n");

In this example each of the regular expression operators applies to an explicitly named variable ($my_string for the match, its copy $result for the substitution) instead of $_.

Two types of regex

There are two main uses for regular expressions in Perl:

matching: We already saw this form in the examples above. The expression /regex/ or m{regex} (with m you can use so-called alternative delimiters such as {}, () or something else) indicates that the regular expression inside the delimiters (whatever they are) will be matched against the scalar on the left-hand side of =~ or !~. With slashes as delimiters the m prefix is optional.
If there is no string on the left side, matching is performed against the content of the default scalar variable $_. For example
```
/Hello/
```
will search for Hello in $_.
substitution: the form s/regex/substitute_text/ indicates that the text matched by the regular expression is going to be replaced by the string substitute_text. Unlike m, the s prefix cannot be omitted. You can also use alternative delimiters with s as with m, for example s{regex}{substitute_text}. As in the case of simple matching, by default the substitution applies to the special variable $_.
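
Both forms, with an explicit target and with the default $_, can be tried in a small sketch. The for loop is used here only as a convenient way to make $_ an alias of $text; that trick is an addition for the illustration, not something this section relies on.

```perl
use strict;
use warnings;

my $text = "The graph has many leaves";

# Explicit target: s{regex}{replacement} with alternative delimiters,
# applied to a copy so that $text itself is not changed here
(my $copy = $text) =~ s{graph}{tree};

# Implicit target: both operators work on $_ when no =~ is given
my $seen;
for ($text) {            # inside the block $_ is an alias of $text
    $seen = /leaves/ ? "match" : "no match";
    s/many/few/;         # changes $text through the alias
}

print "$copy\n$text\n$seen\n";
```

After the loop $copy holds "The tree has many leaves", while $text itself has become "The graph has few leaves".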

Regular expressions in Perl operate only against strings. No arrays on left hand side of matching statement please.
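
Since a match works on one string at a time, an array is usually tested element by element. One possible way, sketched below, is the built-in grep, which is not otherwise covered in this section; the sample lines are made up.

```perl
use strict;
use warnings;

my @lines = ("The graph has many leaves",
             "Fallen leaves on the ground",
             "A bare tree");

# grep applies the match to each element in turn (each element becomes $_)
my @with_leaves = grep { /leaves/ } @lines;

# In scalar context grep returns how many elements matched
my $count = grep { /leaves/ } @lines;

print "$count of ", scalar(@lines), " lines mention leaves\n";
# prints: 2 of 3 lines mention leaves
```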

Success and Failure of Matching

We can capture the success or failure of the match (but not the number of matches) in a scalar variable. This way we have a way to determine the success or failure of the matching and substitution, respectively:

@test_array=("The graph has many leaves",
             "Fallen leaves, so many leaves on the ground.");
foreach $test (@test_array) {
   $match = ($test =~ m/leaves/);
   print("Result of match of word 'leaves' in string '$test' is $match\n");
}

This program displays the following:

Result of match of word 'leaves' in string 'The graph has many leaves' is 1
Result of match of word 'leaves' in string 'Fallen leaves, so many leaves on the ground.' is 1

The other useful feature of this example is that it shows how to obtain the return values of the regular expression operators. If a subsequent action depends on the value of the changed variables, you should always check whether the expression succeeded or failed, because way too often regular expressions behave differently than their creators expect.

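In scalar context the match operation returns a true value (1) if the match succeeded and a false value (the empty string) if it failed; it does not return the number of matches. The substitution operation, in contrast, returns the number of substitutions made.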
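
What the operators actually return is easy to check directly. The sketch below assumes the g modifier (repeat the match over the whole string), which this section has not introduced yet; it is needed here only to count matches and replacements.

```perl
use strict;
use warnings;

my $test = "Fallen leaves, so many leaves on the ground.";

# Scalar context: 1 on success, the empty string on failure
my $hit  = ($test =~ /leaves/);
my $miss = ($test =~ /roots/);

# List context with the g modifier returns every match, so the
# number of matches is the size of that list
my $how_many = () = $test =~ /leaves/g;

# A substitution returns the number of replacements it made
my $replaced = ($test =~ s/leaves/needles/g);

print "hit='$hit' miss='$miss' matches=$how_many replaced=$replaced\n";
# prints: hit='1' miss='' matches=2 replaced=2
```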

We can use a conditional to check whether the match was successful or not:

$sentence = "Disneyworld in Orlando";
if ($sentence =~ /world/){
   print "there is a substring 'world' somewhere in the sentence: $sentence\n";
}

Sometimes it's easier to test the special variable $_, especially if you need to test each input string in the input loop. In this case you can write something like:

while (<>) { # get "Hello world" from the input stream
   if (/world/) {
      print "There is a word 'world' in the sentence '$_'\n";
   }
}

As we already have seen the $_ variable is the default for many Perl built-in functions (tr, split, etc).

Regular Expressions Metacharacters

The problem with regex metacharacters is that there are plenty of them. They provide a lot of power for a sophisticated user and at the same time make regular expressions appear very complicated, at least at the very beginning.

It's best to build up your skills slowly: creation of a complex regex can be considered a kind of art form (like solving a puzzle or a chess problem). Please pay special attention to non-greedy (lazy) quantifiers as they are simpler to use and less prone to errors.

It makes a lot of sense to debug a complex regular expression first in a special test script, feeding it with sample strings and observing the output.
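
Such a test script can be very small. Below is one possible skeleton; the regex under test and the sample strings are placeholders chosen for the illustration and should be replaced with your own.

```perl
use strict;
use warnings;

# Regex under test and sample strings are placeholders; replace them
my $regex   = qr/^\d+\.\d+\.\d+\.\d+$/;
my @samples = ("192.168.1.1", "10.0.0", "host.example.com", "1.2.3.4 ");

my @report;
foreach my $sample (@samples) {
    my $verdict = ($sample =~ $regex) ? "match" : "no match";
    push @report, "'$sample' => $verdict";   # quotes make stray blanks visible
}
print "$_\n" for @report;
```

Only the first sample matches here; the last one fails because of the trailing blank, which is exactly the kind of surprise such a script is meant to reveal.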

There are three types of metacharacters:

Regular metacharacters. Each of them represents a class of symbols. When they are matched they consume some characters from the string.
Anchors. Those signify a special position in the string, but matching them does not consume any characters.
Quantifiers. Those specify how many times the preceding character or group may be repeated.

As they are used as metacharacters, the characters $, |, [, ], {, }, (, ), \, ^, ., *, + and ?, as well as the delimiter / itself, should be preceded by a backslash when they need to be matched literally in a regular expression.

For example:

$ip_addr=~/\d+\.\d+\.\d+\.\d+/; # dot character should be escaped
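
The effect of a forgotten backslash is easy to demonstrate. The sketch below also uses the built-in quotemeta and the \Q...\E escape, which are not discussed in this section; they are shown only as one possible way to escape a whole string at once.

```perl
use strict;
use warnings;

my $ip_addr = "10x0y0z1";          # not an IP address: no dots at all

# An unescaped dot matches any character, so this wrongly succeeds
my $loose  = ($ip_addr =~ /\d+.\d+.\d+.\d+/)     ? 1 : 0;

# An escaped dot matches only a literal "."
my $strict = ($ip_addr =~ /\d+\.\d+\.\d+\.\d+/) ? 1 : 0;

# quotemeta (or \Q...\E inside a pattern) escapes a whole string at once
my $literal = "a.b|c";
my $escaped = quotemeta($literal);               # a\.b\|c
my $exact   = ("xa.b|cx" =~ /\Q$literal\E/)      ? 1 : 0;

print "$loose $strict $escaped $exact\n";        # prints: 1 0 a\.b\|c 1
```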

Regular metacharacters

Regular metacharacters are special characters that represent some class of symbols. Each consumes one character from the string when it is matched (with quantifiers it can be fewer or more). In other words, a metacharacter "eats" characters of the class it represents. A good example of a metacharacter that consumes characters is . (dot), which matches any character. Among the most common regular metacharacters are:

. -- matches any single character except a newline (length one). The s modifier forces . to match a newline too
\d -- matches a digit. Equivalent to [0-9]
\w -- matches a word character (underscore is counted as a word character here). Equivalent to [a-zA-Z_0-9]
\s -- matches a 'space' character (space, tab, newline and so on). Equivalent to [ \t\n\r\f]
Classes. Classes can be called "definable metacharacters". They are sets or ranges of characters put inside square brackets; a - (minus) indicates "between" and a ^ right after [ means "not". For example:
- [AP] -- matches either letter A or letter P
- [0-9] -- matches a digit from 0 to 9
- [0123456789ABCDEF] -- matches any (upper case) hexadecimal digit
- [A-Z] -- matches capital letters
- [a-z] -- matches lower case letters
- [A-Za-z0-9_] -- equivalent to \w (note that the symbol "_" is included)
- [abcde] # Either a or b or c or d or e
- [a-e] # same thing ("-" denotes a range here)
- [a-fx-z] # Anything from a to f inclusive and from x to z inclusive
- [^a-z] # Any character that is not a lower case letter
- /[a-zA-Z]/ # Any letter
- /[a-z]+/ # Any non-empty sequence of lower case letters
- /[01]/ # Either "0" or "1"
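
Character classes are easiest to understand by trying them on a few strings. The following sketch checks whether a string consists only of upper-case hexadecimal digits; the sample words are made up, and the anchors ^ and $ used here are explained further below.

```perl
use strict;
use warnings;

my @candidates = ("1F", "7G", "cafe", "C0DE");

my %is_hex;
my @report;
foreach my $word (@candidates) {
    # Every character from start to end must be an upper-case hex digit
    $is_hex{$word} = ($word =~ /^[0-9A-F]+$/) ? 1 : 0;
    push @report, "$word=$is_hex{$word}";
}

# A negated class: true if the string contains anything but lower-case letters
my $has_other = ("abc1" =~ /[^a-z]/) ? 1 : 0;

print join(", ", @report), "\n";     # prints: 1F=1, 7G=0, cafe=0, C0DE=1
print "has_other=$has_other\n";      # prints: has_other=1
```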

If you use a capital letter instead of a lower case letter, the meaning of the metacharacter is reversed:

\D -- matches a non-digit (character grouping [^0-9])
\W -- matches a non-word character (character grouping [^a-zA-Z0-9_])
\S -- matches a 'non-space' character (character grouping [^ \t\n\r\f])
\B -- anchor that matches where there is no word boundary (the opposite of \b).
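
The reversed metacharacters are handy for throwing away what you do not need. A minimal sketch, assuming the g modifier (apply to all occurrences) and capturing parentheses, both of which are only touched on in this section; the sample strings are made up.

```perl
use strict;
use warnings;

my $phone = "(555) 010-9999";          # made-up sample

# \D is any non-digit, so removing every \D leaves only the digits
(my $digits = $phone) =~ s/\D//g;

# \S+ is a run of non-space characters, i.e. a whitespace-delimited field
my @fields = ("alpha  beta\tgamma" =~ /\S+/g);

# \W is any non-word character: here it finds the first punctuation mark
my $punct = ("Hello, world" =~ /(\W)/) ? $1 : "";

print "$digits\n@fields\n[$punct]\n";
```

This prints 5550109999, then alpha beta gamma, then [,].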

Anchors

Anchors are metacharacters that serve as markers and never consume characters from the string. They do not require any character to be present; only some logical condition at this place of the string needs to be true. In other words, anchors don't match a character, they match a condition (a position). The two most common anchors are ^ and $:

^ -- anchor which matches the 'beginning of line' if placed at the beginning of a regular expression. So the regex /^Hello/ will match only if the word Hello is the first in the string and there are no blanks before it. Create a simple test and see this behaviour yourself.
$ -- same as ^, but signifies the end of the line. It is somewhat strange, as in the US the $ sign is usually used as a prefix for dollar amounts, as in $15, but this probably originated in Canada :-)
\b -- matches a word boundary. \B reverses the meaning of this anchor and means "anything but a word boundary".
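
A simple test of the anchors might look like the sketch below. The hash %result and its keys are illustrative names only.

```perl
use strict;
use warnings;

my %result;

$result{start}    = ("Hello world"  =~ /^Hello/)  ? 1 : 0;   # at the very start
$result{indented} = (" Hello world" =~ /^Hello/)  ? 1 : 0;   # leading blank
$result{end}      = ("Hello world"  =~ /world$/)  ? 1 : 0;   # at the very end
$result{word}     = ("a cat sat"    =~ /\bcat\b/) ? 1 : 0;   # cat as a whole word
$result{inside}   = ("concatenate"  =~ /\bcat\b/) ? 1 : 0;   # cat inside a word
$result{B}        = ("concatenate"  =~ /\Bcat\B/) ? 1 : 0;   # no boundary on either side

print "$_=$result{$_}\n" for sort keys %result;
```

Only the indented and inside cases fail: the leading blank defeats ^, and cat inside concatenate has no word boundary around it.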

Quantifiers

Quantifiers are also metacharacters, but they affect the interpretation of the preceding character or group. The most important quantifiers fall into three groups with two members in each, one greedy and the other non-greedy (lazy):

One or more of the preceding character or group (length one or more)
- + -- greedy. Matches one or more of the preceding characters, but tries to grab as many characters as possible
- +? -- non-greedy. Matches one or more of the preceding characters, but tries to grab the minimum possible number of characters. Usually used with . (dot) as .+? to search for the next occurrence of a string, for example:
  /(.+?)the/
Zero or more of the preceding character or group (length zero or more)
- * -- greedy. Matches zero or more of the preceding characters, but tries to grab as many characters as possible
- *? -- non-greedy. Matches zero or more of the preceding characters, but tries to grab the minimum possible number of characters
Zero or one of the preceding character or group (length zero or one)
- ? -- greedy. Matches zero or one character, preferring one
- ?? -- non-greedy. Matches zero or one character, preferring zero. Rarely needed

Non-greedy quantifiers are newer but easier to understand, as they correspond to a search for the first (nearest) occurrence of the substring that follows, while greedy quantifiers correspond to a search for its last occurrence. That's the key difference. We will discuss non-greedy quantifiers in the next section: More Complex Perl Regular Expressions

For example:

$sentence="Hello world"; 
if ($sentence =~ /^\w+/) { # true if the sentence starts with a word like "Hello"  
   print "The string $sentence starts with a word\n";
} else {
    print "The string $sentence does not start with a word\n";
}
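
The key difference between greedy and non-greedy quantifiers described above can be seen by running the same pattern both ways. This is a minimal sketch; the sample sentence is made up.

```perl
use strict;
use warnings;

my $text = "over the hills and the valleys";

# Greedy: .+ grabs as much as it can, so the match ends at the LAST "the"
my ($greedy) = $text =~ /(.+)the/;

# Non-greedy: .+? grabs as little as it can, so it ends at the FIRST "the"
my ($lazy)   = $text =~ /(.+?)the/;

print "greedy: '$greedy'\n";   # prints: greedy: 'over the hills and '
print "lazy:   '$lazy'\n";     # prints: lazy:   'over '
```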

Full list includes 12 quantifiers:

Maximal (greedy)	Minimal (lazy)	Allowed Range
`{n,m}`	`{n,m}?`	Must occur at least n times but no more than m times
`{n,}`	`{n,}?`	Must occur at least n times
`{n}`	`{n}?`	Must match exactly n times
`*`	`*?`	0 or more times (same as `{0,}`)
`+`	`+?`	1 or more times (same as `{1,}`)
`?`	`??`	0 or 1 time (same as `{0,1}`)

We will discuss additional quantifiers later
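
The counted forms from the table can be tried on a few strings. In this sketch the date-like samples are made up, and the pattern checks only the shape of the string, not whether it is a valid date.

```perl
use strict;
use warnings;

my @stamps = ("2019-03-12", "2019-3-12", "19-03-12", "2019-03-123");

my %ok;
foreach my $stamp (@stamps) {
    # exactly 4 digits, then exactly 2, then exactly 2
    $ok{$stamp} = ($stamp =~ /^\d{4}-\d{2}-\d{2}$/) ? 1 : 0;
}

# {n,m}: between one and three digits; {n,}: two or more digits
my $short = ("1234" =~ /^\d{1,3}$/) ? 1 : 0;   # four digits are too many
my $long  = ("1234" =~ /^\d{2,}$/)  ? 1 : 0;   # no upper limit

print "$_ => $ok{$_}\n" for @stamps;
print "short=$short long=$long\n";             # prints: short=0 long=1
```

Only the first stamp has the required 4-2-2 shape, so it is the only one reported with 1.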

Examples

It's probably best to build up your use of regular expressions slowly, from the simplest cases to more complex ones. You are always better off starting with simple expressions, making sure that they work and then adding more complex elements one by one. Unless you have a couple of years of experience with regex, do not even try to construct a complex regex in one giant step.

Here are a few examples:

$a = '404 - - ';
$a =~ /40\d/; # matches 400, 401, 403, 404 etc.

Here we took a fragment of a record from an HTTP log and tried to match the return code. Note that you can match any part of the integer, not only the whole integer. A similar idea works for real numbers, but generally real numbers have a much more complex syntax:

$target='simple real number: 22.33';
$target=~/\d+\.\d*/;

Note: the regex /\d+\.\d*/ isn't general enough to match all the real numbers permissible in Perl or any other programming language. This is actually a pretty difficult problem, given all of the formats that programming languages usually support, and here regular expressions are of limited use: a lexical analyzer is a better tool.

Now let's try to match words. The simplest regular expression that matches a single word is \w+. Here are a couple of examples:

$target='hello world'; 
$target =~ m{(\w+)\s+(\w+)}; # detecting two words separated by white space

$target='A = b';
$target =~ /(\w+)\s*=\s*(\w+)/; # another way to ignore white space in matching
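
The parentheses in the two patterns above capture the matched words. A minimal sketch of how the captured text can be retrieved follows; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $target = 'A = b';

# In list context a successful match returns the captured groups
my ($name, $value) = $target =~ /(\w+)\s*=\s*(\w+)/;

# The same groups are also available as $1 and $2 after a successful match
my ($first, $second);
if ('hello world' =~ m{(\w+)\s+(\w+)}) {
    ($first, $second) = ($1, $2);
}

print "$name -> $value\n";      # prints: A -> b
print "$first / $second\n";     # prints: hello / world
```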

Here are more examples of simple regular expressions that might be reused in other contexts:

t.t		 # t followed by any character followed by t
^131		 # 131 at the beginning of a line
0$		 # 0 at the end of a line
\.txt$		 # .txt at the end of a line
/^newfile\.\w*$/ # newfile. followed by zero or more word characters
                 # This will match newfile.txt, newfile.pl, newfile., etc.
/^.*marker/      # head of the string up to and including the word "marker"
/marker.*$/	 # tail of the string starting from the word "marker" till the end (up to newline)
/^$/		 # An empty line

Several additional examples:

0		     # zero: "0"
0*		     # zero or more zeros
0+		     # one or more zeros
0*0		     # same as above
\d		     # any digit but only one
\d+                  # any unsigned integer
\d+\.\d*             # a subset of real numbers. Please note that 0. is a real number
\d+\.\d+\.\d+\.\d+   # IP addresses (no control of the number of digits, so 1000.1000.1000.1000 would match this regex)
/\d+\.\d+\.\d+\.255/ # IP addresses ending with 255
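
The IP address pattern can be tightened a little. The sketch below is one possible approach, not the only one: the regex checks the shape with counted quantifiers and ordinary Perl code checks the range of each octet. The helper name looks_like_ipv4 is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the string looks like a dotted IPv4 address
sub looks_like_ipv4 {
    my ($candidate) = @_;

    # Shape check: four groups of one to three digits separated by dots
    my @octets = $candidate =~ /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/
        or return 0;

    # The range check is easier in plain Perl than inside the regex
    foreach my $octet (@octets) {
        return 0 if $octet > 255;
    }
    return 1;
}

print "$_: ", looks_like_ipv4($_), "\n"
    for ("192.168.1.255", "1000.1000.1000.1000", "256.1.1.1", "10.0.0");
```

Only the first address is accepted; the second fails the shape check, the third the range check, and the last has too few groups.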

Tips:

If you need to match a word whose length is unknown, you probably should not use an * or *? because a zero length word makes no sense.
^$ matches the empty line.
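
Both tips can be verified with a few lines of Perl. This is a minimal sketch with made-up input.

```perl
use strict;
use warnings;

my $line = "   ";     # blanks only, no word at all

# \w* may match zero characters, so it "succeeds" even without any word
my $star = ($line =~ /\w*/) ? 1 : 0;

# \w+ needs at least one word character and correctly fails here
my $plus = ($line =~ /\w+/) ? 1 : 0;

# /^$/ matches a line with nothing in it
my $empty = ("" =~ /^$/) ? 1 : 0;

print "star=$star plus=$plus empty=$empty\n";   # prints: star=1 plus=0 empty=1
```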

Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

Last modified: March 12, 2019