Softpanorama

May the source be with you, but remember the KISS principle ;-)
(slightly skeptical) Educational society promoting "Back to basics" movement against IT overcomplexity and  bastardization of classic Unix

Introduction to Perl 5.10 for Unix System Administrators

(Perl 5.10 without excessive complexity)

by Dr Nikolai Bezroukov

5.1 Introduction

Regular expressions are a mini programming language for parsing text strings. This chapter provides an overview of regular expressions from the point of view of a Unix administrator. Perl was the second language after AWK to incorporate regular expressions as first-class elements of the programming language, a new type of expression. The level of integration of regular expressions into Perl was a truly innovative feature of the language and to this day remains unmatched by any other language. Perl provided several innovative extensions of regular expressions that have since become common. Most of them go beyond what POSIX specifies. As a testament to Perl's tremendous influence on the field, let me remind you that most other scripting languages use "Perl 5 compatible" regex engines, supporting such features as lazy quantifiers, non-capturing parentheses, inline mode modifiers, lookahead, and a readability mode, all introduced by Perl 5. As this is a very important feature of the language, the introduction will be long. Bear with us.

First of all, any Unix administrator knows a considerable subset of Perl regular expressions from day one. This is because Perl regular expressions are an extension of the so-called extended regular expressions defined by POSIX and used in such utilities as AWK and grep. Perl goes further than either AWK or grep in utilizing the power of regular expressions, and the level of integration of this feature into the language is one of Perl's important advantages over other scripting languages.

As we stated before, to make a scripting language attractive to system administrators you need two features: good integration with the Unix API and a good debugger. Perl goes further than most other scripting languages on this road and as such is an excellent tool for Unix system administrators. It is installed by default on all major flavors of Unix and as such belongs to the set of standard Unix tools.

As the topic is complex, do not expect the author to cover all the intricacies of Perl regex. This chapter, like all the others, reflects the author's own experience with the language, and due to the complexity of the language this experience always represents only a subset of its total capabilities.

Good command of regular expressions is a valuable skill beyond Perl. You can use them in editors (for example vim; never use vi, vim is much better), and they are also shared by many other Unix utilities (such as egrep). Some Unix utilities, like GNU grep, have a Perl-compatible mode for regular expressions (option -P in GNU grep).

Regex is one of the most useful features of Perl (but, contrary to the common advocacy line, definitely not the most useful one). Thanks to a myriad of enhancements, regex has become a pretty powerful (and pretty obscure) non-procedural notation for parsing strings. Regular expressions are more flexible than the string functions that we already studied, and they complement the procedural string processing facilities of Perl. Generally, everything that is achievable via regular expressions can be programmed using string functions, but in certain cases regular expressions provide a much more compact solution.

Regular expressions look like random line noise to the uninitiated. The number of arbitrary symbols that represent features seems to be infinite. At the same time, experience with AWK and grep helps you understand that $ means "anchor the regex to the end of the line" and that ^ means "anchor the regex to the start of the line." I would like to stress that Perl regular expressions are a superset of the extended regular expressions defined by POSIX and used by other Unix utilities, including grep, find, vi, awk and sed.
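As a minimal sketch of the two anchors mentioned above, the following fragment filters a few lines the way grep would. The sample lines are hypothetical and only loosely modeled on /etc/passwd entries.

```perl
use strict;
use warnings;

# Hypothetical sample lines, loosely modeled on /etc/passwd entries.
my @lines = (
    'root:x:0:0:root:/root:/bin/bash',
    'daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin',
    'operator:x:11:0:operator:/root:/sbin/nologin',
);

# ^ anchors the pattern to the start of the string.
my @starts_with_root = grep { /^root:/ } @lines;

# $ anchors the pattern to the end of the string.
my @nologin = grep { /nologin$/ } @lines;

print "Starts with root: $_\n"  for @starts_with_root;
print "Ends with nologin: $_\n" for @nologin;
```

Only the first line starts with "root:", while the second and third end with "nologin". Without the anchors, the pattern root would also match the third line, because /root occurs in the middle of it.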

It is important to understand that regular expressions in Perl are a language within the language, and as soon as you are inside a regular expression the normal Perl rules no longer apply. You should forget about Perl lexical and syntactic rules inside a regular expression: it's a different animal.

History

Regular expressions are a mini programming language for parsing text strings that originated in Unix. Perl inherited most of the concepts from AWK, which was the first language to integrate regular expressions for text processing. As Perl is a more powerful and versatile language than AWK, it popularized the power of regular expressions to a wider audience and influenced the form in which they were introduced in other scripting languages such as PHP, Python, JavaScript and Ruby. The origin of this non-procedural (functional) notation is Unix editors, shell and utilities, so for Unix users regular expressions are quite a natural extension of the existing procedural string manipulation facilities that we already discussed. Everybody else is in a much less fortunate position. If you have never used Unix, then the closest relatives of regular expressions are the so-called masks in DOS/Windows (*.*, *.tx?, etc.) and formats in Fortran, PL/1 and C. For example, the mask *.*, which in DOS and Windows denotes all files in the current directory, is actually a primitive regular expression. All decent text editors (and the best HTML editors) support searching using regular expressions too. As we already mentioned, regular expression functionality is traditionally a strong point of Unix command line shells, and many Unix utilities accept regular expressions.

Regular expressions were enhanced in Perl 5 and became the de facto standard on which other languages rely. Most other scripting languages now provide what are called "Perl compatible regular expressions". Even some classic Unix utilities were retrofitted to provide this compatibility. One example is the -P option in GNU grep.

Regular expressions are still evolving and have come a long way from a simple mechanism toward a powerful non-procedural notation. As a result, studying regular expressions now represents a certain challenge for newcomers. The flip side of regular expressions is that they can notoriously misbehave if you don't have enough experience with them. So it's very important to make a practice of testing complex regular expressions separately on as many examples as possible to ensure that they behave as expected.

Perl has two basic regular expression operators: m (for matching) and s (for substitution). Both are applicable only to scalars (strings).
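The two operators can be illustrated with a short sketch. The log line and the variable names are made up for illustration; only the m// and s/// operators themselves come from the text.

```perl
use strict;
use warnings;

my $line = 'eth0: link is up, speed 1000 Mb/s';   # hypothetical log line

# m// tests whether the pattern matches; parentheses capture into $1.
my $speed;
if ( $line =~ m/speed (\d+)/ ) {
    $speed = $1;
    print "Speed: $speed\n";
}

# s/// replaces the first match in place and returns
# the number of substitutions made.
my $changed = ( $line =~ s/up/down/ );
print "$line\n";
```

Both operators are bound to a scalar with =~. Without an explicit binding they operate on the default variable $_.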

One important advance in Perl was the introduction of the so-called readability form of regex (the /x modifier), which helps to debug complex regex. It should be used as the only notation for all more or less complex regex.
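A minimal sketch of the readability form follows, assuming a hypothetical syslog-style line. With the x modifier, whitespace inside the pattern is ignored and # starts a comment, so each piece of the regex can be documented on its own line.

```perl
use strict;
use warnings;

# Hypothetical syslog-style line used only for illustration.
my $entry = 'Mar 12 04:05:06 host1 sshd[1234]: Failed password';

my ( $time, $host, $program, $pid );
if ( $entry =~ m{
        ^\w{3} \s+ \d+ \s+      # month and day, e.g. "Mar 12"
        (\d\d:\d\d:\d\d) \s+    # capture 1: time of day
        (\S+) \s+               # capture 2: host name
        (\w+) \[ (\d+) \] :     # capture 3: program name, capture 4: process id
    }x )
{
    ( $time, $host, $program, $pid ) = ( $1, $2, $3, $4 );
    print "$time $host $program $pid\n";
}
```

One caveat: comments inside the pattern should avoid the delimiter characters and the $ and @ sigils, because Perl locates the closing delimiter and interpolates variables before the regex engine ever sees the comments.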

Overview of literature

There are several tutorials of varying quality available on the Web. Some of them can help fill the gaps left in this book, as we can't and will not try to cover everything. Advanced coverage of regular expressions actually deserves a book of its own, and such books exist.

At the same time, I would like to warn you that many Perl authors belong to the church of overcomplexity and promote obscure ways of using Perl features, which Perl provides in abundance. You should resist this and generally ignore overly complex idioms. If an idiom is difficult to understand, it may well be difficult to debug and may have undesirable limitations and side effects. This applies in full measure to regular expressions.

As for introductory tutorials, there are plenty of them.

The best book that covers advanced topics in regular expressions is Mastering Regular Expressions by Jeffrey Friedl, published by O'Reilly. Paradoxically, Friedl's book is also a good example of what not to do with regular expressions and convincingly demonstrates that attempts to replace lex and yacc with regular expressions are doomed to failure. That again shows that most Perl authors preach at the church of overcomplexity, and that is what we would like to avoid.

You need to know where to stop constructing more and more complex regex and find other solutions to the problem. This is as important as the knowledge of regular expressions itself. Understanding the limits of applicability of regular expressions is as important as understanding their power.

Some of the examples provided in Jeffrey Friedl's book, including the double word problem and matching comments in C, are perfect examples of what not to do with regular expressions. For example, in the case of double words, converting the text into an array of words with a pipe and then checking the stream for two identical adjacent words is a better and much cleaner solution than using regex.
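The word-array approach to double words could look roughly like the sketch below. It is only an illustration under stated assumptions: the input string is invented, words are treated as whitespace-separated tokens, case is ignored, and punctuation is not handled.

```perl
use strict;
use warnings;

# Hypothetical input text.
my $text = "This is is a test of the the double word check";

# Split on whitespace; ' ' is the special awk-like split mode in Perl.
my @words = split ' ', lc $text;

# Compare each word with its predecessor.
my @doubled;
for my $i ( 1 .. $#words ) {
    push @doubled, $words[$i] if $words[$i] eq $words[ $i - 1 ];
}

print "Doubled: @doubled\n";
```

For this input the loop reports "is" and "the". The comparison is a plain string equality test, so the logic can be traced in the debugger one word at a time.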

In the case of comments, one should construct a lexical analyzer (possibly with an automated lexical analyzer generator such as flex) or use procedural string functions. Actually, in both cases a regex-based solution can be more complex than a solution using string functions.

So along with the knowledge of when to use regular expressions, you need to acquire the knowledge of when not to use them and when the procedural way of dealing with strings is simpler, clearer and more efficient. Functions like index and substr in many cases provide a good substitute for regular expressions, and it makes sense to use them in all such cases, as a regular expression is a more powerful construct than the task requires. You should never imitate a general who sends a motorized division to capture a village with a few unarmed natives.
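As a small sketch of the procedural alternative, the fragment below splits a "name=value" line using only index and substr. The sample line and variable names are hypothetical.

```perl
use strict;
use warnings;

my $line = 'PATH=/usr/local/bin:/usr/bin:/bin';   # hypothetical "name=value" line

# index returns the position of the first '=' or -1 if there is none.
my $pos = index( $line, '=' );

my ( $name, $value );
if ( $pos >= 0 ) {
    $name  = substr( $line, 0, $pos );     # everything before '='
    $value = substr( $line, $pos + 1 );    # everything after '='
    print "$name -> $value\n";
}
```

The same split could be written as a regex with two captures, but here no pattern matching is needed: the delimiter is a fixed character and the two substr calls state exactly what is extracted.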


Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.


Last modified: March 12, 2019