POSIX Regular Expressions
POSIX Bracket Expressions
POSIX regular expressions are the standardization of the capabilities of the regular expression engines used in grep and AWK. Two engines are standardized:
   - The POSIX Basic Regular Expression (BRE) engine. This is the default for traditional command-line tools, including grep when it is invoked without any option (a bad idea). GNU grep with the -P option implements Perl-compatible regular expressions, which are a step up from ERE, to say nothing of BRE, and arguably should be the new default. Meanwhile, you can create an alias to avoid using "classic" grep.

   - The POSIX Extended Regular Expression (ERE) engine. This is a slight generalization of the regular expression engine used in AWK. GNU AWK can be viewed as the reference implementation. grep with the -E option (or invoked as egrep) uses this engine. Again, it does not make sense to learn its idiosyncrasies; use grep -P instead.

POSIX introduced "bracket expressions", which are a special kind of character class. A POSIX bracket expression matches one character out of a set of characters, just like a regular character class, and uses the same syntax with square brackets. A hyphen creates a range, and a caret at the start negates the bracket expression.
One key syntactic difference is that the backslash is NOT a metacharacter in a POSIX bracket expression. Unlike in Perl-compatible regular expressions (PCRE), the POSIX regular expression [\d] matches a \ or a d. You need to use [[:digit:]] to match a digit, which is frustrating. To match a ], put it as the first character after the opening [ or the negating ^. To match a -, put it right before the closing ]. To match a ^, put it before the final literal - or the closing ]. Put together, []\d^-] matches ], \, d, ^ or -.
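For contrast, here is a minimal Python sketch of the opposite convention. Python's re module is a Perl-style engine, not a POSIX one, and is used here only as an illustration; the names perl_style_digit, posix_set_in_python and matches_one_char are made up for this example. Inside a class the backslash is an escape, so [\d] means "a digit", and the set that POSIX writes as []\d^-] has to be spelled with escapes.

```python
import re

# Perl-style engines treat the backslash as an escape inside a class,
# so [\d] matches a digit rather than a backslash or the letter d.
perl_style_digit = re.compile(r"[\d]")

# The same set of characters as the POSIX bracket expression []\d^-],
# written in Python syntax: the closing bracket and the backslash
# must both be escaped.
posix_set_in_python = re.compile(r"[\]\\d^-]")


def matches_one_char(pattern, text):
    """Return True if pattern matches the whole of text."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    for ch in ["5", "d", "\\"]:
        print(repr(ch), matches_one_char(perl_style_digit, ch))
    for ch in ["]", "\\", "d", "^", "-", "x"]:
        print(repr(ch), matches_one_char(posix_set_in_python, ch))
```

A pattern copied from a grep command line into a Perl-style engine may therefore need its bracket expressions rewritten rather than pasted as-is.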
The main purpose of bracket expressions is that they adapt to the user's or application's locale. A locale is a collection of rules and settings that describe language and cultural conventions, like sort order, date format, etc. The POSIX standard also defines these locales.
POSIX-compliant regular expression engines should implement POSIX bracket expressions. Some non-POSIX regex engines also support POSIX character classes, but usually don't support collating sequences and character equivalents. Regular expression engines that support Unicode use Unicode properties and scripts to provide functionality similar to POSIX bracket expressions. In Unicode regex engines, shorthand character classes like \w normally match all relevant Unicode characters, alleviating the need to use locales.
Character Classes
Don't confuse the POSIX term "character class" with what is normally called a regular expression character class. [x-z0-9] is an example of what we call a "character class" and POSIX calls a "bracket expression". [:digit:] is a POSIX character class, used inside a bracket expression like [x-z[:digit:]]. These two regular expressions match exactly the same thing: a single character that is either x, y, z or a digit. The class names must be written all lowercase.
POSIX bracket expressions can be negated. [^x-z[:digit:]] matches a single character that is not x, y, z or a digit. A major difference between POSIX bracket expressions and the character classes in other regex flavors is that POSIX bracket expressions treat the backslash as a literal character. This means you can't use backslashes to escape the closing bracket (]), the caret (^) and the hyphen (-).
To include a caret, place it anywhere except right after the opening bracket. [x^] matches an x or a caret. You can put the closing bracket right after the opening bracket, or the negating caret. []x] matches a closing bracket or an x. [^]x] matches any character that is not a closing bracket or an x. The hyphen can be included right after the opening bracket, or right before the closing bracket, or right after the negating caret. Both [-x] and [x-] match an x or a hyphen.
Exactly which POSIX character classes are available depends on the POSIX locale. The following are usually supported, often also by regex engines that don't support POSIX itself. I've also indicated equivalent character classes that you can use in ASCII and Unicode regular expressions if the POSIX classes are unavailable. Some classes also have Perl-style shorthand equivalents.
Java does not support POSIX bracket expressions, but does support POSIX character classes using the \p operator. Though the \p syntax is borrowed from the syntax for Unicode properties, the POSIX classes in Java only match ASCII characters as indicated below. The class names are case sensitive. Unlike the POSIX syntax which can only be used inside a bracket expression, Java's \p can be used inside and outside bracket expressions.
	
| POSIX | Description | ASCII | Unicode | Shorthand | Java |
| [:alnum:] | Alphanumeric characters | [a-zA-Z0-9] | [\p{L&}\p{Nd}] | | \p{Alnum} |
| [:alpha:] | Alphabetic characters | [a-zA-Z] | \p{L&} | | \p{Alpha} |
| [:ascii:] | ASCII characters | [\x00-\x7F] | \p{InBasicLatin} | | \p{ASCII} |
| [:blank:] | Space and tab | [ \t] | [\p{Zs}\t] | | \p{Blank} |
| [:cntrl:] | Control characters | [\x00-\x1F\x7F] | \p{Cc} | | \p{Cntrl} |
| [:digit:] | Digits | [0-9] | \p{Nd} | \d | \p{Digit} |
| [:graph:] | Visible characters (i.e. anything except spaces, control characters, etc.) | [\x21-\x7E] | [^\p{Z}\p{C}] | | \p{Graph} |
| [:lower:] | Lowercase letters | [a-z] | \p{Ll} | | \p{Lower} |
| [:print:] | Visible characters and spaces (i.e. anything except control characters, etc.) | [\x20-\x7E] | \P{C} | | \p{Print} |
| [:punct:] | Punctuation and symbols | [!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~] | [\p{P}\p{S}] | | \p{Punct} |
| [:space:] | All whitespace characters, including line breaks | [ \t\r\n\v\f] | [\p{Z}\t\r\n\v\f] | \s | \p{Space} |
| [:upper:] | Uppercase letters | [A-Z] | \p{Lu} | | \p{Upper} |
| [:word:] | Word characters (letters, numbers and underscores) | [A-Za-z0-9_] | [\p{L}\p{N}\p{Pc}] | \w | |
| [:xdigit:] | Hexadecimal digits | [A-Fa-f0-9] | [A-Fa-f0-9] | | \p{XDigit} |
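When an engine lacks the POSIX classes, the ASCII column of the table can be applied mechanically. The following is a minimal Python sketch under stated assumptions: expand_posix_classes and ASCII_EQUIVALENTS are hypothetical names, only a subset of the classes is covered, and every [:name:] token is assumed to sit inside a bracket expression already. Python's re module does not itself understand [:digit:] and friends, which is why the translation is done on the pattern text.

```python
import re

# ASCII equivalents copied from the table (a subset only).
ASCII_EQUIVALENTS = {
    "alnum": "a-zA-Z0-9",
    "alpha": "a-zA-Z",
    "blank": r" \t",
    "digit": "0-9",
    "lower": "a-z",
    "space": r" \t\r\n\v\f",
    "upper": "A-Z",
    "xdigit": "A-Fa-f0-9",
}

# Matches a POSIX class token such as [:digit:].
_CLASS_TOKEN = re.compile(r"\[:([a-z]+):\]")


def expand_posix_classes(pattern):
    """Replace each [:name:] token with its ASCII range text.

    Hypothetical helper: raises KeyError for class names that are
    not in ASCII_EQUIVALENTS.
    """
    return _CLASS_TOKEN.sub(lambda m: ASCII_EQUIVALENTS[m.group(1)], pattern)


if __name__ == "__main__":
    translated = expand_posix_classes("[x-z[:digit:]]")
    print(translated)
    print(re.fullmatch(translated, "7") is not None)
```

The result is ASCII-only by construction, so it loses the locale awareness that is the main point of the POSIX classes.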
	
Collating Sequences
A POSIX locale can have collating sequences to describe how certain characters or groups of characters 
should be ordered. E.g. in Spanish, ll like in tortilla is treated as one character, 
and is ordered between l and m in the alphabet. You can use the collating sequence 
element [.span-ll.] inside a bracket expression to match ll. E.g. the regex torti[[.span-ll.]]a 
matches tortilla. Notice the double square brackets. One pair for the bracket expression, and 
one pair for the collating sequence.
I do not know of any regular expression engine that supports collating sequences, other than POSIX-compliant engines that are part of a POSIX-compliant system.
Note that a fully POSIX-compliant regex engine will treat ll as a single character when 
the locale is set to Spanish. This means that torti[^x]a also matches tortilla.
[^x] matches a single character that is not an x, which includes ll in the 
Spanish POSIX locale.
In any other regular expression engine, or in a POSIX engine not using the Spanish locale, torti[^x]a 
will match the misspelled word tortila but will not match tortilla, as [^x] 
cannot match the two characters ll.
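The non-locale behaviour is easy to check in an engine without collation support. A small sketch using Python's re module (chosen only as an example of such an engine; whole_word_matches is an illustrative name):

```python
import re

# Without locale-aware collation, [^x] matches exactly one character.
pattern = re.compile(r"torti[^x]a")


def whole_word_matches(word):
    """Return True if the pattern matches the entire word."""
    return pattern.fullmatch(word) is not None


if __name__ == "__main__":
    for word in ["tortila", "tortilla"]:
        print(word, whole_word_matches(word))
```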
Finally, note that not all regex engines claiming to implement POSIX regular expressions actually have full support for collating sequences. Sometimes, these engines use the regular expression syntax defined by POSIX, but don't have full locale support. You may want to try the above matches to see if the engine you're using does. E.g. Tcl's regexp command supports collating sequences, but Tcl only supports the Unicode locale, which does not define any collating sequences. The result is that in Tcl, a collating sequence specifying a single character will match just that character, and all other collating sequences will result in an error.
Character Equivalents
A POSIX locale can define character equivalents that indicate that certain characters should be considered as identical for sorting. E.g. in French, accents are ignored when ordering words. élève comes before être which comes before événement. é, è and ê are all treated the same as e, but l comes before t which comes before v. With the locale set to French, a POSIX-compliant regular expression engine will match e, é, è and ê when you use the character equivalent [=e=] in the bracket expression [[=e=]].
If a character does not have any equivalents, the character equivalence token simply reverts to the character itself. E.g. [[=x=][=z=]] is the same as [xz] in the French locale.
Like collating sequences, POSIX character equivalents are not available in any regex engine that I know of, other than those following the POSIX standard. And those that do may not have the necessary POSIX locale support.
Here too, Tcl's regexp command supports character equivalents, but the Unicode locale, the only one Tcl supports, does not define any character equivalents. This effectively means that [[=x=]] and [x] are exactly the same in Tcl, and will only match x, for any character you may try instead of "x".
Basic Regular Expressions
The Basic Regular Expressions or BRE flavor is essentially the same as that used by the traditional grep command. This is pretty much the oldest regular expression flavor still in use today. One thing that sets this flavor apart is that most metacharacters require a backslash to give the metacharacter its special meaning. Most other flavors, including POSIX ERE, use a backslash to suppress the meaning of metacharacters. Using a backslash to escape a character that is never a metacharacter is an error.
A BRE supports POSIX bracket expressions, which are similar to character classes in other regex flavors, with a few special features. Shorthands are not supported. Other features using the usual metacharacters are the dot to match any character except a line break, the caret and dollar to match the start and end of the string, and the star to repeat the token zero or more times. To match any of these characters literally, escape them with a backslash.
The other BRE metacharacters require a backslash to give them their special meaning. The reason is that the oldest versions of UNIX grep did not support these. The developers of grep wanted to keep it compatible with existing regular expressions, which may use these characters as literal characters. The BRE a{1,2} matches a{1,2} literally, while a\{1,2\} matches a or aa. Some implementations support \? and \+ as an alternative syntax to \{0,1\} and \{1,\}, but \? and \+ are not part of the POSIX standard. Tokens can be grouped with \( and \). Backreferences are the usual \1 through \9. Only up to 9 groups are permitted. E.g. \(ab\)\1 matches abab, while (ab)\1 is invalid since there's no capturing group corresponding to the backreference \1. Use \\1 to match \1 literally.
POSIX BRE does not support any other features. Even alternation is not supported.
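The reversed role of the backslash is easiest to see side by side. Python's re module cannot run BRE patterns, so the sketch below only shows the "other flavors" half of the comparison, with the corresponding BRE spelling noted in comments; the variable names are illustrative.

```python
import re

# In Python, as in POSIX ERE, unescaped braces and parentheses are
# metacharacters, and a backslash turns them into literals.
interval = re.compile(r"a{1,2}")          # BRE spelling: a\{1,2\}
literal_braces = re.compile(r"a\{1,2\}")  # BRE spelling: a{1,2}
backref = re.compile(r"(ab)\1")           # BRE spelling: \(ab\)\1

if __name__ == "__main__":
    print(interval.fullmatch("aa") is not None)
    print(literal_braces.fullmatch("a{1,2}") is not None)
    print(backref.fullmatch("abab") is not None)
```

Converting a BRE to one of these flavors is therefore mostly a matter of swapping which of \{, \}, \( and \) carry a backslash, and the bracket expressions need separate attention.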
A regular expression pattern uses special characters to stand for one or more characters in the data being searched. There are plenty of places in Linux where you can use a wildcard character to represent data you don't know in advance, for example when listing files and directories with the ls command. The filename patterns used there are expanded by the shell rather than by ls itself, and they form an even more limited pattern language in which ? is used instead of the dot to match any single character.
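The correspondence between the wildcard ? and the regular expression dot can be sketched with Python's standard fnmatch module, which implements shell-style wildcards. The file names below are made up for the example.

```python
import fnmatch
import re

names = ["file1.txt", "file22.txt", "fileA.txt", "notes.txt"]

# Wildcard pattern: ? stands for exactly one character.
by_glob = [n for n in names if fnmatch.fnmatchcase(n, "file?.txt")]

# Regular expression with the same meaning: the dot plays the role of ?,
# and the literal dot before "txt" has to be escaped.
by_regex = [n for n in names if re.fullmatch(r"file.\.txt", n)]

if __name__ == "__main__":
    print(by_glob)
    print(by_regex)
```

Note the trap in the other direction: an unescaped dot in a wildcard pattern is just a dot, while in a regular expression it matches any character.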
Extended Regular Expressions
Extended regular expressions are mainly used in egrep (although the use of grep -P is now preferable), sed (with the -E option) and AWK.
From Wikipedia
   
      
         
            
   A regular expression, often called a pattern, is an expression used to specify a set of strings required for a particular purpose. A simple way to specify a finite set of strings is to list its elements or members. However, there are often more concise ways to specify the desired set of strings. For example, the set containing the three strings "Handel", "Händel", and "Haendel" can be specified by the pattern H(ä|ae?)ndel; we say that this pattern matches each of the three strings.
   In most formalisms, if there exists at least one regular expression that matches a particular set then there exists an infinite number of other regular expressions that also match it (the specification is not unique). Most formalisms provide the following operations to construct regular expressions.
   - Boolean "or": A vertical bar separates alternatives. For example, gray|grey can match "gray" or "grey".
   - Grouping: Parentheses are used to define the scope and precedence of the operators (among other uses). For example, gray|grey and gr(a|e)y are equivalent patterns which both describe the set of "gray" or "grey".
   - Quantification: A quantifier after a token (such as a character) or group specifies how often that preceding element is allowed to occur. The most common quantifiers are the question mark ?, the asterisk * (derived from the Kleene star), and the plus sign + (Kleene plus).
   | ? | The question mark indicates zero or one occurrences of the preceding element. For example, colou?r matches both "color" and "colour". |
   | * | The asterisk indicates zero or more occurrences of the preceding element. For example, ab*c matches "ac", "abc", "abbc", "abbbc", and so on. |
   | + | The plus sign indicates one or more occurrences of the preceding element. For example, ab+c matches "abc", "abbc", "abbbc", and so on, but not "ac". |
   | {n} | The preceding item is matched exactly n times. |
   | {min,} | The preceding item is matched min or more times. |
   | {min,max} | The preceding item is matched at least min times, but not more than max times. |
   These constructions can be combined to form arbitrarily complex expressions, much like one can construct arithmetical expressions from numbers and the operations +, −, ×, and ÷. For example, H(ae?|ä)ndel and H(a|ae|ä)ndel are both valid patterns which match the same strings as the earlier example, H(ä|ae?)ndel.
   The precise syntax for regular expressions varies among tools and with context; more detail is given in the Syntax section.
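The examples in the quoted passage can be tried directly. Alternation, grouping and the ?, * and + quantifiers are written the same way in POSIX ERE and in Python's re module, so a short Python sketch is enough to check them; the helper name matching and the extra non-matching strings are additions for illustration.

```python
import re


def matching(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]


names = ["Handel", "Händel", "Haendel", "Hndel"]
handel_variants = [r"H(ä|ae?)ndel", r"H(ae?|ä)ndel", r"H(a|ae|ä)ndel"]

if __name__ == "__main__":
    # All three patterns select the same three spellings.
    for pattern in handel_variants:
        print(pattern, matching(pattern, names))
    print(matching(r"gr(a|e)y", ["gray", "grey", "groy"]))
    print(matching(r"colou?r", ["color", "colour", "colouur"]))
    print(matching(r"ab*c", ["ac", "abc", "abbc"]))
    print(matching(r"ab+c", ["ac", "abc", "abbc"]))
```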
            
          
       
    
 
 
Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org 
was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) 
without any remuneration. This document is an industrial compilation designed and created exclusively 
for educational use and is distributed under the Softpanorama Content License. 
Original materials copyright belong 
to respective owners. Quotes are made for educational purposes only 
in compliance with the fair use doctrine.  
Last modified: March 12, 2019