Perl.com: Apocalypse 5. An interesting discussion of the shortcomings of regex culture.
Regex culture has gone wrong in a variety of ways, but it's not my intent to assign blame--there's plenty of blame to go around, and plenty of things that have gone wrong that are nobody's fault in particular. For example, it's nobody's fault that you can't realistically complement a character set anymore. It's just an accident of the way Unicode defines combining characters. The whole notion of character classes is mutating, and that will have some bearing on the future of regular expression syntax.
Given all this, I need to warn you that this Apocalypse is going to be somewhat radical. We'll be proposing changes to certain "sacred" features of regex culture, and this is guaranteed to result in future shock for some of our more conservative citizens. Do not be alarmed. We will provide ways for you to continue programming in old-fashioned regular expressions if you desire. But I hope that once you've thought about it a little and worked through some examples, you'll like most of the changes we're proposing here.
So although the RFCs did contribute greatly to my thinking for this Apocalypse, I'm going to present my own vision first for where regex culture should go, and then analyze the RFCs with respect to that vision.
First, let me enumerate some of the things that are wrong with current regex culture. I'm sure there are other problems, but the ones discussed below will do for starters. Let's look at each of them in more detail.
Most of the other problems stem from trying to deal with a rich history. Now there's nothing wrong with history per se, but those of us who are doomed to repeat it find that many parts of history are suboptimal and contradictory. Perl has always tried to err on the side of incorporating as much history as possible, and sometimes Perl has succeeded in that endeavor.
Cultural continuity has much to be said for it, but what can you do when the culture you're trying to be continuous with is itself discontinuous? As it says in Ecclesiastes, there's a time to build up, and a time to tear down. The first five versions of Perl mostly built up without tearing down, so now we're trying to redress that omission.
Regular expressions were invented by computational linguists who love to write examples like /aa*b*(cd)*ee/. While these are conducive to reasoning about pattern matching in the abstract, they aren't so good for pattern matching in the concrete. In real life, most atoms are longer than "a" or "b". In real life, tokens are more recognizable if they are separated by whitespace. In the abstract, /a+/ is reducible to /aa*/. In real life, nobody wants to repeat a 15-character token merely to satisfy somebody's idea of theoretical purity. So we have shortcuts like the + quantifier to say "one or more".
Now, you may rightly point out that + is something we already have, and we already introduced /x to allow whitespace, so why is this bullet point here? Well, there's a lot of inertia in culture, and the problem with /x is that it's not the default, so people don't think to turn it on when it would probably do a lot of good. The culture is biased in the wrong direction. Whitespace around tokens should be the norm, not the exception. It should be acceptable to use whitespace to separate tokens that could be confused. It should not be considered acceptable to define new constructs that contain a plethora of punctuation, but we've become accustomed to constructs like (?<=...) and (??{...}) and [\r\n\ck\p{Zl}\p{Zp}], so we don't complain. We're frogs who are getting boiled in a pot full of single-character morphemes, and we don't notice.
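To make the bias concrete, here is a minimal Perl 5 sketch (the date pattern and variable name are invented for illustration) of the same match written densely and then with /x:

    # Without /x: one dense run of punctuation.
    $date =~ /^(\d{4})-(\d\d)-(\d\d)$/;

    # With /x: whitespace and comments separate the tokens.
    $date =~ /
        ^ (\d{4})   # year
        - (\d\d)    # month
        - (\d\d)    # day
        $
    /x;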
Huffman invented a method of data compaction in which common characters are represented by a small number of bits, and rarer characters are represented by more bits. The principle is more general, however, and language designers would do well to pay attention to the "other" Perl slogan: Easy things should be easy, and hard things should be possible. However, we haven't always taken our own advice. Consider those two regex constructs we just saw:
(?<=...) (??{...})
Which one do you think is likely to be the most common in everyday use? Guess which one is longer...
There are many examples of poor Huffman coding in current regexes. Consider these:
(...) (?:...)
Is it really the case that grouping is rarer than capturing? And by two gobbledygooky characters' worth? Likewise, there are many constructs that are the same length but shouldn't be:
(?:...) (?#...)
Grouping is much more important than the ability to embed a comment. Yet they're the same length currently.
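A quick Perl 5 illustration of both complaints (the strings are invented):

    # Plain grouping costs two more characters than capturing:
    "foobarbar" =~ /foo(bar)+/;     # capturing group: $1 gets "bar"
    "foobarbar" =~ /foo(?:bar)+/;   # non-capturing group: longer, though more common

    # Meanwhile a group and an embedded comment cost exactly the same:
    "foobar" =~ /foo(?:bar)/;
    "foobar" =~ /foo(?#bar)/;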
A lot of our Huffman troubles came about because we were trying to shoehorn new capabilities into an old syntax without breaking anything. The (?...) construct succeeded at that goal, but it was new wine in old wineskins, as they say. More successful was the *? minimal matching hack, but it's still symptomatic of the problem that we only had three characters to choose from that would have worked at that point in the grammar.
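The *? hack, for all its notational awkwardness, does solve a real problem; a standard Perl 5 illustration (with an invented string):

    my $html = "<b>bold</b>";
    my ($greedy)  = $html =~ /<(.*)>/;    # $greedy is "b>bold</b" -- .* grabs too much
    my ($minimal) = $html =~ /<(.*?)>/;   # $minimal is "b" -- *? stops at the first >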
We've pretty nearly exhausted the available backslash sequences.
The waterbed theory of linguistic complexity says that if you push down one place, it goes up somewhere else. If you arbitrarily limit yourself to too few metacharacters, the complexity comes out somewhere else. So it seems obvious to me that the way out of this mess is to grab a few more metacharacters. And the metacharacters I want to grab are...well, we'll see in a moment.
Consider these constructs:
(??{...}) (?{...}) (?#...) (?:...) (?i:...) (?=...) (?!...) (?<=...) (?<!...) (?>...) (?(...)...|...)
These all look quite similar, but some of them do radically different things. In particular, the (?<...) does not mean the opposite of the (?>...). The underlying visual problem is the overuse of parentheses, as in Lisp. Programs are more readable if different things look different.
In linguistics, the notion of end-weight is the idea that people tend to prefer sentences where the short things come first and the long things come last. That minimizes the amount of stuff you have to remember while you're reading or listening. Perl violates this with regex modifiers. It's okay when you say something short like this:
s/foo/bar/g
But when you say something like we find in RFC 360:
while ($text =~ /name:\s*(.*?)\n\s*
                 children:\s*(?:(?@\S+)[, ]*)*\n\s*
                 favorite\ colors:\s*(?:(?@\S+)[, ]*)*\n/sigx) {...}
it's not until you read the /sigx at the end that you know how to read the regex. This actually causes problems for the Perl 5 parser, which has to defer parsing the regular expression till it sees the /x, because that changes how whitespace and comments work.
The /s modifier in the previous example changes the meaning of the . metacharacter. We could, in fact, do away with the /s modifier entirely if we only had two different representations for "any character", one of which matched a newline, and one which didn't. A similar argument applies to the /m modifier. The whole notion of something outside the regex changing the meaning of the regex is just a bit bogus, not because we're afraid of context sensitivity, but because we need to have better control within the regex of what we mean, and in this case the context supplied outside the regex is not precise enough. (Perl 5 has a way to control the inner contexts, but it uses the self-obfuscating (?...) notation.)
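For instance (a minimal sketch of current Perl 5 behavior):

    "foo\nbar" =~ /foo.bar/;       # fails: . does not match "\n" by default
    "foo\nbar" =~ /foo.bar/s;      # matches: /s lets . match "\n"
    "foo\nbar" =~ /^bar/;          # fails: ^ anchors only at the start of the string
    "foo\nbar" =~ /^bar/m;         # matches: /m re-anchors ^ after each newline

    # The "inner context" control Perl 5 does offer uses the (?...) notation:
    "foo\nbar" =~ /(?s:foo.bar)/;  # matches without a trailing /s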
Modifiers that control how the regex is used as a whole do make some sense outside the regex. But they still have the end-weight problem.
Without knowing the context, you cannot know what the pattern // will do. It might match a null string, or it might match the previously successful match.
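A short Perl 5 demonstration (strings invented):

    "perl" =~ /er/;                   # succeeds, and /er/ becomes the
                                      # last successful pattern
    print "match" if "herd" =~ //;    # prints: // reuses /er/, not the empty pattern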
The local operator behaves differently inside regular expressions than it does outside: changes made by local within a (?{...}) block are unwound whenever the engine backtracks over that block.
It's too easy to write a null pattern accidentally. For instance, the following will never match anything but the null string:
/ | foo | bar | baz /x
Even when it's intentional, it may not look intentional:
(a|b|c|)
That's hard to read because it's difficult to make the absence of something visible.
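Both failure modes are easy to reproduce in Perl 5:

    "quux" =~ / | foo | bar | baz /x;   # "matches": the leading | adds an empty
                                        # alternative that succeeds at offset 0
    "quux" =~ /(a|b|c|)/;               # also "matches", with $1 set to ""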
It's too easy to confuse the multiple meanings of dot. Or the multiple meanings of ^ and $. And the opposite of \A is frequently not \Z, but \z. Tell me again, when do I say \1, and when do I say $1? Why are they different?
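For the record, the current Perl 5 rules being alluded to (a minimal sketch):

    "foo\n" =~ /\Afoo\Z/;   # matches: \Z forgives one trailing newline
    "foo\n" =~ /\Afoo\z/;   # fails:   \z means the absolute end of string

    "abab"  =~ /(ab)\1/;    # \1 is the backreference *inside* the pattern
    print $1;               # $1 is the capture variable *outside*, after the match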
Speaking of \1, backreferences have a number of shortcomings. The first is actually getting ahold of the right backreference. Since captures are numbered from the beginning, you have to count, and you can easily count wrong. For many purposes it would be better if you could ask for the last capture, or the one before that. Or perhaps if there were a way to restart the numbering part way through...
Another major problem with backreferences is that you can't easily modify one to search for a variant. Suppose you match an opening parenthesis, bracket, or curly. You'd like to search for everything up to the corresponding closing parenthesis, bracket, or curly, but there's no way to transmogrify the opening version to the closing version, because the backref search is hardwired independently of ordinary variable matching. And that's because Perl doesn't instantiate $1 soon enough. And that's because Perl relies on variable interpolation to get subexpressions into regexes. Which leads us to...
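Here is a sketch of the contortion Perl 5 forces instead, using the experimental (??{...}) construct and an invented %close lookup table:

    # A backreference can only look for the same character again:
    "[stuff]" =~ /([([{]).*?\1/;    # hunts for another "[", not for "]"

    # Workaround: compute the closing bracket in a code subpattern.
    my %close = ( '(' => ')', '[' => ']', '{' => '}' );
    "[stuff]" =~ / ( [([{] ) .*? (??{ quotemeta $close{$1} }) /x;   # matches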
Since regexes undergo an interpolation pass before they're compiled, anything you interpolate is forced to be treated as a regular expression. Often that's not what you want, so we have the klunky \Q$string\E mechanism to hide regex metacharacters. And that's because...
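The classic demonstration in Perl 5 (string invented):

    my $str = "1+1";
    "1+1=2" =~ /$str/;        # fails! "+" acts as a quantifier, so this looks for "11"
    "1+1=2" =~ /\Q$str\E/;    # matches the literal "1+1"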
The problem with \Q$string\E arises because of the fundamental mistake of using interpolation to build regexes instead of letting the regex control how it treats the variables it references. Regexes aren't strings, they're programs. Or, rather, they're strings only in the sense that any piece of program is a string. Just as you have to work to eval a string as a program, you should have to work to eval a string as a regular expression. Most people tend to expect a variable in a regular expression to match its contents literally. Perl violates that expectation. And because it violates that expectation, we can't make $1 synonymous with \1. And interpolated parentheses throw off the capture count, so you can't easily use interpolation to call subrules, so we invented (??{$var}) to get around that. But then you can't actually get at the parentheses captured by the subrule. The ramifications go on and on.
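The capture-count problem in miniature (a sketch with an invented $atom variable):

    my $atom = '(\w+)';                  # a "subrule" with one hidden capture
    "ab cd" =~ / $atom \s+ (\w+) /x;     # the visible (\w+) is now $2, not $1
    print $2;                            # "cd"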
Historically, regular expressions were considered a very low-level language, a kind of glorified assembly language for the regex engine. When you're only dealing with ASCII, there is little need for abstraction, since the shortest way to say [a-z] is just that. With the advent of the eighth bit, we started getting into a little bit of trouble, and POSIX started thinking about names like [:alpha:] to deal with locale difficulties. But as with the problem of conciseness, the culture was still biased away from naming abstractly anything that could be expressed concretely.
However, it's almost impossible to write a parser without naming things, because you have to be able to name the separate grammar rules so that the various rules can refer to each other.
It's difficult to deal with any subset of Unicode without naming it. These days, if you see [a-z] in a program, it's probably an outright bug. It's much better to use a named character property so that your program will work right in areas that don't just use ASCII.
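For example, in Perl 5 (assuming a reasonably recent perl with Unicode support and a UTF-8 source file):

    use utf8;
    "café" =~ /^[a-z]+$/;        # fails: "é" is not in a-z
    "café" =~ /^\p{IsLower}+$/;  # matches: the named property covers "é"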
Even where we do allow names, it tends to be awkward because of the cultural bias against it. To call a subrule by name in Perl 5 you have to say this:
(??{$rule})
That has 4 or 5 more characters than it ought to. Dearth of abstraction produces bad Huffman coding.
Make that "no support" in Perl, unless you include assignment
to a list. This is just a part of the bias against naming things. Instead we
are forced to number our capturing parens and count. That works okay for the
top-level regular expression, when we can do list assignment or assign
$1
to $foo
. But it breaks down as soon as you start trying
to use nested regexes. It also breaks down when the capturing parentheses match
more than once. Perl handles this currently by returning only the last match.
This is slightly better than useless, but not by much.
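Both breakdowns in a few lines of Perl 5 (strings invented):

    # A quantified capture keeps only its last iteration:
    "red green blue" =~ /^(?:(\w+)\s*)+$/;
    print $1;                              # "blue"; "red" and "green" are lost

    # List assignment is as close to naming as Perl 5 gets:
    my ($y, $m, $d) = "2002-06-04" =~ /(\d+)-(\d+)-(\d+)/;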
For many of the reasons we've mentioned, it's difficult to make regexes refer to each other, and even if you do, it's almost impossible to get the nested information back out of them. And there are entire classes of parsing problems that are not solvable without recursive definitions.
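Perl 5 can just barely express a recursive rule today, by closing a qr// over itself with the experimental (??{...}) construct; this sketch is adapted from the perlre documentation:

    my $bal;                  # must be declared before the pattern can refer to it
    $bal = qr/
        \(
        (?:
            (?> [^()]+ )      # non-parens, grabbed without backtracking
          |
            (??{ $bal })      # or a nested balanced group, matched recursively
        )*
        \)
    /x;
    print "balanced\n" if "(a(b)c)" =~ $bal;

It matches, but notice that nothing about the nesting structure is captured, which is exactly the complaint.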
Even if it were easier for regexes to refer to other regexes, we'd still have the problem that those other regexes aren't organized in any meaningful way. They might be off in variables that come and go at the whim of the surrounding context.
When we have an organized system of parsing rules, we call it a grammar. One advantage of having a grammar is that you can optimize based on the assumption that the rules maintain their relationship to each other. For instance, if you think of grammar rules as a funny kind of subroutine, you can write an optimizer to inline some of the subrules--but only if you know the subrule is fixed in the grammar.
Without support for grammar classes, there's no decent way to think of deriving one grammar from another. And if you can't derive one grammar from another, you can't easily evolve your language to handle new kinds of problems.
If we want to have variant grammars for Perl dialects, then what about regex dialects? Can regexes be extended either at compile time or at run time? Perl 5 has some rudimentary overloading magic for rewriting regex strings, but that's got the same problems as source filters for Perl code; namely that you just get the raw regex source text and have to parse it yourself. Once again the fundamental assumption is that a regex is a funny kind of string, existing only at the behest of the surrounding program.
Do we think of regexes as a real, living language?
Let's face it, in the culture of computing, regex languages are mostly considered second-class citizens, or worse. "Real" languages like C and C++ will exploit regexes, but only through a strict policy of apartheid. Regular expressions are our servants or slaves; we tell them what to do, they go and do it, and then they come back to say whether they succeeded or not.
At the other extreme, we have languages like Prolog or Snobol where the pattern matching is built into the very control structure of the language. These languages don't succeed in the long run because thinking about that kind of control structure is rather difficult in actual fact, and one gets tired of doing it constantly. The path to freedom is not to make everyone a slave.
However, I would like to think that there is some happy medium between those two extremes. Coming from a C background, Perl has historically treated regexes as servants. True, Perl has treated them as trusted servants, letting them move about in Perl society better than any other C-like language to date. Nevertheless, if we emancipate regexes to serve as co-equal control structures, and if we can rid ourselves of the regexist attitudes that many of us secretly harbor, we'll have a much more productive society than we currently do. We need to empower regexes with a sense of control (structure). It needs to be just as easy for a regex to call Perl code as it is for Perl code to call a regex.
Perl 5 started to give regexes more control of their own destiny with the "grab" construct, (?>...), which tells the regex engine that when it fails to match the rest of the pattern, it should not backtrack into the innards of the grab, but skip back to before it. That's a useful notion, but there are problems. First, the notation sucks, but you knew that already. Second, it doesn't go far enough. There's no way to backtrack out of just the current grouping. There's no way to backtrack out of just the current rule. Both of these are crucial for giving first-class status to the control flow of regexes.
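The useful half first (a minimal Perl 5 sketch):

    "aaab" =~ /a+ab/;       # matches: a+ backtracks from "aaa" down to "aa"
    "aaab" =~ /(?>a+)ab/;   # fails: the grab keeps all three a's and
                            # refuses to give one back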
Notionally, a regex is an organization of assertions that either succeed or fail. Some assertions are easily expressed in traditional regex language, while others are more easily expressed in a procedural language like Perl.
The natural (but wrong) solution is to try to reinvent Perl expressions within regex language. So, for instance, I'm rejecting those RFCs that propose special assertion syntax for numerics or booleans. The better solution is to make it easier to embed Perl assertions within regexes.
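Perl 5 can already do this after a fashion, through a code block used as the condition of a conditional group; a sketch (the 255 bound is invented for illustration):

    # Match an integer only if it is numerically, not textually, in range:
    "245" =~ /^(\d+)(?(?{ $1 <= 255 })|(?!))$/;   # matches
    "256" =~ /^(\d+)(?(?{ $1 <= 255 })|(?!))$/;   # fails: (?!) forces failure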