(slightly skeptical) Educational society promoting "Back to basics" movement against IT overcomplexity and bastardization of classic Unix
Regular expressions are a language inside the language. Regex should be viewed as a separate language that has no direct connection to Perl. It is used with many other languages (Python, PHP, Java) in almost the same form as in Perl, just with different syntactic sugar. Still, Perl was the first language to introduce "close binding" of regex and the language per se, a feature that was later more or less successfully copied by Python, Tcl and other languages. The level of integration of the regular expression language into the main language is also higher in Perl than in any alternative scripting language. But because regex is a different language, some problems arise. For example, the Perl debugger can't debug regular expressions.
The Perl regular expression parser evolves gradually. The latest significant changes, introduced in version 5.10, made it more powerful and less prone to errors. This version of Perl is the minimal version recommended for any serious text parsing work.
As regular expressions are a new language, using the famous "Hello world" program as the first program seems appropriate. As a remnant of the shell/AWK legacy, a regular expression is lexically a special type of literal (similar to a double-quoted literal). It is usually enclosed in slashes, and the source string (where matching occurs) is specified on the left side of the special =~ operator (the matching operator). The simplest case is to search for a substring in a string, as the built-in function index does. The following expression is true if the string Hello appears anywhere in the variable $sentence.
$sentence = "Hello world"; if ($sentence =~ /Hello/) {...} # expression is true if the string Hello appears in variable $sentence.
Regular expressions (also called regex or RE) are case sensitive, so if we assign to $sentence the same string but in lower case
$sentence = "hello world";
then the above match will fail. The operator !~ can be used for spotting a non-match. In the above example
$sentence !~ /Hello/
is true if the string Hello does not appear in $sentence.
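There are two main uses for regular expressions in Perl: matching (m/regex/, or just /regex/) and substitution (s/regex/replacement/).
If no target string is specified with =~, both operators work on the default variable $_. For example, /Hello/ will search for Hello in $_.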
Regular expressions in Perl operate against strings. No arrays on left hand side of matching statement please.
    $my_string = "The graph has many leaves";
    if ( $my_string =~ m/graph/ ) {
        print("The source string contains the word 'graph'.\n");
    }
    $result = $my_string;    # work on a copy so that the original string is preserved
    if ( $result =~ s/graph/tree/ ) {
        print "Replaced with 'tree'\n";
    }
    print("Initial string: '$my_string'\nThe result is '$result'\n");
In this example each of the regular expression operators applies to an explicitly named variable ($my_string or its copy $result) instead of $_.
We can capture the success or failure of the operation in a scalar variable. This gives us a way to determine whether the match or the substitution succeeded:
    @test_array = ("The graph has many leaves", "Fallen leaves, so many leaves on the ground.");
    foreach $test (@test_array) {
        $match = ($test =~ m/leaves/);
        print("Result of match of word 'leaves' in string '$test' is $match\n");
    }
This program displays the following:
    Result of match of word 'leaves' in string 'The graph has many leaves' is 1
    Result of match of word 'leaves' in string 'Fallen leaves, so many leaves on the ground.' is 1
The other useful feature of this example is that it shows how to obtain the return values of the regular expression operators. If a subsequent action depends on the values of changed variables, you should always check whether the expression succeeded or failed, because way too often regular expressions behave differently than their creators expect.
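In scalar context the match operation returns a true value (1) if the match succeeded and a false value (the empty string) if it failed. It does not count the matches, which is why the second string above, with two occurrences of 'leaves', still produces 1.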
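The return values can be checked with a minimal sketch like the one below. It reuses one of the test strings from the example above; the variable names are illustrative, and the list-context forms with the g modifier are shown only for contrast.

```perl
use strict;
use warnings;

my $text = "Fallen leaves, so many leaves on the ground.";

# Scalar context: 1 on success, the empty string on failure.
my $found   = ($text =~ /leaves/);
my $missing = ($text =~ /graph/);

# List context with /g: every match is returned.
my @all = ($text =~ /leaves/g);

# Forcing list context and then counting gives the number of matches.
my $count = () = $text =~ /leaves/g;

print "found='$found' missing='$missing' count=$count\n";
```

So a plain scalar match answers "did it match?", and counting occurrences needs the g modifier in list context.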
We can use a conditional to check whether the match was successful:
    $sentence = "Disneyworld in Orlando";
    if ($sentence =~ /world/) {
        print "there is a substring 'world' somewhere in the sentence: $sentence\n";
    }
Sometimes it's easier to test the special variable $_, especially if you need to test each input string in the input loop. In this case you can write something like:
    while (<>) {    # get "Hello world" from the input stream
        if (/world/) {
            print "There is a word 'world' in the sentence '$_'\n";
        }
    }
As we already have seen the $_ variable is the default for many Perl built-in functions (tr, split, etc).
The problem with regex metacharacters is that there are plenty of them. They provide a lot of power to a sophisticated user, but at the same time they make regular expressions appear very complicated, at least at the very beginning.
It's best to build up your skills slowly: creating a complex regex can be considered a kind of art form (like solving a puzzle or a chess problem). Please pay special attention to non-greedy (lazy) quantifiers, as they are simpler to use and less prone to errors.
It makes a lot of sense to debug a complex regular expression first in a special test script, feeding it sample strings and observing the output.
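The test-script advice can be sketched as follows. This is only an illustration: the regex under test and the sample strings are made up, and a real script would use whatever expression you are debugging.

```perl
use strict;
use warnings;

# Regex under test and sample strings (both illustrative).
my $regex   = qr/^\d+\.\d+$/;
my @samples = ('22.33', '22.', 'abc', '1.5e3');

my %result;
for my $s (@samples) {
    $result{$s} = ($s =~ $regex) ? 'match' : 'no match';
    printf "%-8s %s\n", $s, $result{$s};
}
```

Keeping the samples in one array makes it easy to add a new string each time the regex surprises you.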
Because they are used as metacharacters, the characters $, |, [, ], {, }, (, ), \, /, ^, * and several others should be preceded by a backslash when you want to match them literally in a regular expression, for example:
    \|   # Vertical bar
    \[   # An open square bracket
    \)   # A closing parenthesis
    \*   # An asterisk
    \^   # A caret symbol
    \/   # A slash
    \\   # A backslash
For example:
$ip_addr=~/\d+\.\d+\.\d+\.\d+/; # dot character should be escaped
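When the text to be matched literally sits in a variable, escaping by hand is impractical. A minimal sketch, assuming an illustrative string: Perl's built-in quotemeta function, or the equivalent \Q...\E escape inside the regex, adds the backslashes for you.

```perl
use strict;
use warnings;

my $literal = 'a.b*c';                 # illustrative string full of metacharacters
my $escaped = quotemeta($literal);     # backslashes added automatically

my $hit_raw    = ('aXbbbc' =~ /^$literal$/)     ? 1 : 0;  # . and * act as metacharacters
my $hit_quoted = ('aXbbbc' =~ /^\Q$literal\E$/) ? 1 : 0;  # everything is taken literally
my $hit_exact  = ('a.b*c'  =~ /^\Q$literal\E$/) ? 1 : 0;  # only the exact text matches

print "$escaped $hit_raw $hit_quoted $hit_exact\n";
```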
Regular metacharacters are special characters that represent some class of symbols. They consume one character from the string when they match (with quantifiers it can be fewer or more). In other words, they 'eat' characters of the class they represent. A good example of a metacharacter that consumes characters is . (dot), which matches any character except a newline. Among the most common regular metacharacters are:
\d matches a digit, same as [0-9]
\w matches a word character, same as [a-zA-Z_0-9]
\s matches a whitespace character, same as [ \t\n\r\f]
If you use a capital letter instead of a lower case letter, the meaning of the metacharacter is reversed: \D matches any non-digit, \W any non-word character, and \S any non-whitespace character.
Anchors are metacharacters that serve as markers and never consume characters from the string. An anchor does not match a character; it matches a condition: no particular character needs to be present, only some logical condition must be true at this place in the string. The two most common anchors are ^, which matches at the beginning of the string, and $, which matches at the end of the string (or just before a trailing newline).
For example, to match the word "Perl" at the beginning of the line we can use the following regex:
/^Perl/;
Quantifiers are also metacharacters, but they affect the interpretation of the preceding character or group. The most important ones form three pairs, each with a greedy and a non-greedy (lazy) member. The three basic quantifiers in Perl are * (zero or more times), + (one or more times) and ? (zero or one time); appending ? to any of them (*?, +?, ??) produces the lazy version.
Non-greedy quantifiers are newer but easier to understand, as they correspond to a search for the first occurrence of the substring that follows them, while greedy quantifiers correspond to a search for its last occurrence. That's the key difference. We will discuss non-greedy quantifiers in the next section: More Complex Perl Regular Expressions
For example:
    $sentence = "Hello world";
    if ($sentence =~ /^\w+/) {    # true if the sentence starts with a word like "Hello"
        print "The string $sentence starts with a word\n";
    }
The full list includes 12 quantifiers:
| Maximal (greedy) | Minimal (lazy) | Allowed range |
|---|---|---|
| {n,m} | {n,m}? | Must occur at least n times but no more than m times |
| {n,} | {n,}? | Must occur at least n times |
| {n} | {n}? | Must match exactly n times |
| * | *? | 0 or more times (same as {0,}) |
| + | +? | 1 or more times (same as {1,}) |
| ? | ?? | 0 or 1 time (same as {0,1}) |
We will discuss additional quantifiers later
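The difference between the two columns of the table is easiest to see on a string that contains the closing text twice. The input below is illustrative.

```perl
use strict;
use warnings;

my $html = '<b>one</b> and <b>two</b>';    # illustrative input

my ($greedy) = $html =~ /<b>(.*)<\/b>/;    # runs to the LAST </b>
my ($lazy)   = $html =~ /<b>(.*?)<\/b>/;   # stops at the FIRST </b>

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

The greedy version captures everything between the first opening tag and the last closing tag; the lazy version captures only the first element.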
It's probably best to build up your use of regular expressions slowly, from the simplest cases to more complex ones. You are always better off starting with simple expressions, making sure that they work, and then adding more complex elements one by one. Unless you have a couple of years of experience with regex, do not even try to construct a complex one in one giant step.
Here are a few examples:
$a = '404 - - '; $a =~ /40\d/; # matches 400, 401, 403, etc.
Here we took a fragment of a record from the http log and tried to match the return code. Note that you can match any part of the integer, not only the whole integer. A similar idea works for real numbers, but generally real numbers have a much more complex syntax:
$target='simple real number: 22.33'; $target=~/\d+\.\d*/;
Note: the regex /\d+\.\d*/ isn't general enough to match all the real numbers permissible in Perl or any other programming language. This is actually a pretty difficult problem, given all of the formats that programming languages usually support, and here regular expressions are of limited use: a lexical analyzer is a better tool.
Now let's try to match words. The simplest regular expression that matches a single word is \w+. Here are a couple of examples:
$target='hello world'; $target =~ m{(\w+)\s+(\w+)}; # detecting two words separated by white space
$target='A = b'; $target =~ /(\w+)\s*=\s*(\w+)/; # another way to ignore white space in matching
Here are more examples of simple regular expressions that might be reused in other contexts:
    /t.t/              # t followed by any character followed by t
    ^131               # 131 at the beginning of a line
    0$                 # 0 at the end of a line
    \.txt$             # .txt at the end of a line
    /^newfile\.\w*$/   # newfile. followed by zero or more word characters
                       # This will match newfile.txt, but not new_prg or newscript
    /^.*marker/        # head of the string up to and including the word "marker"
    /marker.*$/        # tail of the string starting from the word "marker" and till the end (up to newline)
    /^$/               # An empty line
Several additional examples:
    /0/                     # zero: "0"
    /0*/                    # zero or more zeros
    /0+/                    # one or more zeros
    /0*0/                   # same as above
    /\d/                    # any digit but only one
    /\d+/                   # any integer
    /\d+\.\d*/              # a subset of real numbers. Please note that 0. is a real number
    /\d+\.\d+\.\d+\.\d+/    # IP addresses (no control of the number of digits, so 1000.1000.1000.1000 would match this regex)
    /\d+\.\d+\.\d+\.255/    # IP addresses ending with 255
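The IP address regex above accepts 1000.1000.1000.1000. One possible way to tighten it, sketched under the assumption that a numeric range check after the match is acceptable, is shown below; is_ipv4 is a hypothetical helper name, not a Perl built-in.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $ip looks like a dotted-quad IPv4 address.
sub is_ipv4 {
    my ($ip) = @_;
    # The regex limits each part to 1-3 digits; a failed match returns 0.
    my @octets = $ip =~ /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/
        or return 0;
    # The numeric range is checked in plain Perl, not in the regex.
    return (grep { $_ > 255 } @octets) ? 0 : 1;
}

for my $candidate ('10.192.10.1', '1000.1000.1000.1000', '256.1.1.1') {
    printf "%-20s %s\n", $candidate, is_ipv4($candidate) ? 'valid' : 'invalid';
}
```

Splitting the work this way keeps the regex readable: the pattern checks the shape, ordinary code checks the values.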
Tips:
Complex regex are constructed from simple regular expressions using the following metacharacters:
Perl provides several ways to specify how many times a given component must be present for the match to succeed. You can specify both the minimum and the maximum number of repetitions.
One can see that old quantifiers that we already know (*, + and ?) can be expressed via this one:
m/^\s*\w+/;
Be careful when using the * quantifier because it can match an empty string, which might not be your intention. The regex /b*/ will match any string, even one without any b characters.
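The warning about the * quantifier can be verified with a short sketch; the test strings are illustrative.

```perl
use strict;
use warnings;

my $s = 'xyz';                                # contains no "b" at all

my $star = ($s =~ /b*/) ? 1 : 0;              # succeeds: zero b's at position 0
my $plus = ($s =~ /b+/) ? 1 : 0;              # fails: at least one b is required
my $len  = ($s =~ /(b*)/) ? length($1) : -1;  # length of what b* actually matched

# The brace form expresses the classic quantifiers: {1,} behaves like +.
my $same = (('aaa' =~ /^a{1,}$/) && ('aaa' =~ /^a+$/)) ? 1 : 0;

print "star=$star plus=$plus len=$len same=$same\n";
```

The match with * succeeds, but the matched text has length zero, so a successful match alone does not prove that any b was found.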
At times, you may need to match an exact number of components. The following match statement will be true only if at least three whitespace-separated fields are present in the $_ variable:
    $_ = '131.1.1.1 - joejerk [21/Jan/2000:09:50:50 -0500] "GET http://216.1.1.1/xxxgirls/bigbreast.gif HTTP/1.0" 200 51500';
    m/^(\S+\s+){2}(\S+)/;    # get the user name of the offender
In this example, we skip exactly two fields and capture the third one, which corresponds to the user id in HTTP logs. After a successful match $2 contains this id ($1 holds only the last repetition of the quantified group).
The same ideas can be used for processing date and time in the HTTP logs.
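As a sketch of that idea, the timestamp of the log record shown above can be split into its parts with one regex. The variable names are illustrative, and the pattern assumes the common dd/Mon/yyyy:hh:mm:ss zone layout seen in that record.

```perl
use strict;
use warnings;

# Beginning of the log record from the example above.
my $record = '131.1.1.1 - joejerk [21/Jan/2000:09:50:50 -0500]';

my ($day, $mon, $year, $hh, $mm, $ss, $zone);
if ($record =~ m{\[(\d{2})/(\w{3})/(\d{4}):(\d{2}):(\d{2}):(\d{2})\s+([+-]\d{4})\]}) {
    # Copy the capture variables right away, while the match is known to be good.
    ($day, $mon, $year, $hh, $mm, $ss, $zone) = ($1, $2, $3, $4, $5, $6, $7);
    print "date: $day $mon $year, time: $hh:$mm:$ss, zone: $zone\n";
}
else {
    print "no timestamp found\n";
}
```

The m{...} delimiters are used here so that the slashes in the date do not need escaping.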
The character class [0123456789] or, shorter, [0-9] defines the class of decimal digits, and [0-9a-fA-F] defines the class of hexadecimal digits. You should use a dash to define a range of consecutive characters. You can use metacharacters inside character classes ( but not as endpoints of a range). For example:
$test = "A\t12"; if ( $test =~ m/[XYZ\s]/ ) { print "Variable test matched the regex\n"; }
which will display
Variable test matched the regex
because the value of $test includes the tab character which matched metacharacter \s in the character class [XYZ\s].
The metacharacter . and the quantifiers ?, *, + that appear inside the square brackets that define a character class are used in their literal sense. They lose their meta-meaning.
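A quick check of this rule, using made-up test strings:

```perl
use strict;
use warnings;

my $plus_literal = ('a+b' =~ /^a[.+*?]b$/) ? 1 : 0;  # + inside [] is a literal plus sign
my $dot_literal  = ('aab' =~ /^a[.+*?]b$/) ? 1 : 0;  # . inside [] is not "any character"
my $dot_outside  = ('aab' =~ /^a.b$/)      ? 1 : 0;  # outside [] the dot matches anything

print "$plus_literal $dot_literal $dot_outside\n";
```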
Alternation allows you to provide several alternative regexes; the match succeeds if any one of them matches. In other words, the regular expression:
/^foreach|^for|^while/
means "look for a line beginning with the string 'foreach' OR the string 'for' OR the string 'while'."
The ( | ) syntax splits the regular expression into sections, and each section is tried independently. Alternation always tries to match the first item first. If it doesn't match, the second pattern is tried, and so on.
This is called leftmost matching, and it is very similar to the short-circuit operator ||. Misunderstanding the fact that evaluation of alternatives stops at the first success can lead to subtle bugs if one string in the alternation is a substring of another:
    $line = 'foreach $i (@n) { $sum+=$i;}';
    if ($line =~ /^for|^while/ ) {
        print "Regular loop\n";
    } elsif ( $line =~ /^foreach/ ) {
        print "Foreach loop\n";
    }
In this case the string foreach will never be matched, as the string for will match before it. This is a common mistake, and it is prudent always to put the longest string first in such cases. This tip is also helpful when you don't know whether or not a word will be followed by a delimiter or an end of line character, or whether or not a word is plural:
    for $line ('words', 'word') {
        if ($line =~ /\bwords\b/ ) {
            print "Plural\n";
        } elsif ($line =~ /\bword\b/ ) {
            print "Singular\n";
        }
    }
In general, the longest alternative should come first.
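The effect of the order of alternatives can be seen by capturing what was actually matched. This is a small sketch with an illustrative input line; the word boundary \b is shown as one more way to force the longer alternative.

```perl
use strict;
use warnings;

my $line = 'foreach $i (@n) { $sum += $i; }';

my ($short_first) = $line =~ /^(for|foreach)/;    # "for" wins, "foreach" is never tried
my ($long_first)  = $line =~ /^(foreach|for)/;    # longest alternative first
my ($bounded)     = $line =~ /^(for|foreach)\b/;  # \b rejects "for" followed by "each"

print "$short_first $long_first $bounded\n";
```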
A useful modifier for matching is the modifier i (ignore case). For example
$line =~ /word(s?)/i
will match "word" or "words" independent of case.
If some part of a regex is enclosed in parentheses it is considered a group, and the substring matched by this group is assigned to the special variables $1, $2, ... For example:
$ip='10.192.10.1'; $ip=~/(\d+)\.(\d+)\.(\d+)\.(\d+)/; print "Ip address components are $1, $2, $3 and $4\n";
NOTE
(?:<regexp>) is grouping without creating a backreference. This extension lets you add parentheses to your regular expression without causing a regex memory position to be used.
You can use alternation within the group, for example
(red|blue)
If the regular expression matched the string, the substrings that matched each group are assigned to so-called "capture variables": $1, $2, $3, ... In other words, each group captures what its content matched and assigns it to the corresponding capture variable.
One important feature of capture variables is that only a successful match affects them. If a match is unsuccessful, the previous values are preserved, whatever they may be. That leads to difficult-to-find errors if you are not careful. You should never use capture variables without checking whether the match was successful.
This feature of Perl is rarely discussed in textbooks and is very error prone. Errors are difficult to pinpoint, as they depend on whether the match was successful or not. What is worse, you forget about this "feature" from time to time and make the same mistake again and again, and then spend a day debugging your now semi-forgotten script when you accidentally discover that it misbehaves in certain cases. I think that this is a design blunder of Perl: it should set all capture variables to undefined in case of an unsuccessful match. Please check your scripts for usage of capture variables and manually verify in each case that an if statement guards the match.
For example the regex /\w+/ will let you determine if $_ contains a word, but does not let you know what the word is. In order to accomplish that, you need to enclose the matching components with parentheses. For example:
if ( m/(\w+)/ ) { $word=$1; }
By doing this, you force Perl to store the matched string into the $1 variable. The $1 variable can be considered as pattern memory or backreference.
We will discuss backreferences in more details later.
As well as identifying substrings that match regular expressions, Perl can make substitutions based on those matches. The way to do this is to use the s operator, which mimics the way substitution is done in the vi text editor. If the target string is omitted, the substitution is assumed to take place on the $_ variable.
To replace an occurrence of the regular expression h.*?o by the string Privyet in the string $sentence we use the expression
$sentence =~ s/h.*?o/Privyet/;
and to do the same thing with the $_ variable just write the right side of the previous operator:
s/h.*?o/Privyet/;
The first part of this expression is called matching pattern and the second part is called substitution string. The result of a substitution operator in the scalar context is the number of substitutions made, so it is either 0 (false) or 1 (true) in this case.
This example only replaces the first occurrence of the string, and it may be that there will be more than one such string we want to replace. To make a global substitution the last slash is followed by a g modifier as follows:
s/h.*?o/Privyet/g
Here the target is $_ variable. The expression returns the number of substitutions made ( 0 is none).
We can also make the replacement case insensitive with the modifier i (for "ignore case"). The expression
s/h.*?o/Privyet/gi
will force the regex engine to ignore case. Note that case is ignored only in matching: the substitution string will be inserted exactly as you specified.
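The return value of a global, case-insensitive substitution can be captured like this. The input string and the second replacement word are illustrative.

```perl
use strict;
use warnings;

my $text  = 'Hello, hello, HELLO';          # illustrative input
my $count = ($text =~ s/hello/Privyet/gi);  # number of replacements made

my $none  = ($text =~ s/goodbye/Poka/g);    # nothing to replace: a false value is returned

print "count=$count text=$text\n";
```

All three spellings are replaced by the substitution string exactly as written, and the count makes it easy to check whether anything was changed at all.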
The substitution operator can also be used to delete any substring. In this case the replacement string should be omitted. For example to remove the substring "Nick" from the $_ variable, you could write: s/Nick//;
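There is an additional modifier that applies only to the replacement string: the /e modifier, which makes Perl evaluate the replacement as a Perl expression instead of treating it as a double-quoted string. Delimiters matter too: if single quotes are used as the delimiters (s'regex'replacement'), variable interpolation is turned off in both parts.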
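A minimal sketch of the e modifier, with an illustrative input string: the same substitution is run once without it and once with it.

```perl
use strict;
use warnings;

my $prices = 'apple 10, pear 25';           # illustrative input

# Without /e the replacement is an interpolated string.
(my $plain = $prices) =~ s/(\d+)/$1*2/g;

# With /e the replacement is evaluated as Perl code.
(my $evald = $prices) =~ s/(\d+)/$1*2/ge;

print "$plain\n$evald\n";
```

The first result still contains the text "*2"; only the second one holds the doubled numbers.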
As with the index function, you can use variables in both the matching pattern and the substitution string. For instance:
    # let's assume that $_ = "Nick Bezroukov";
    $regex = "Nick";
    $replacement_string = "Nickolas";
    $result = s/$regex/$replacement_string/;
This program changes the $_ variable by performing the replacement, and the $result variable will be equal to 1, the number of substitutions made.
Here is a slightly more complex example of replacement (Snort rules):
    #alert udp $site_dhcp 63 -> any any (msg:"policy tftp to dchp segment"; classtype:attempted-admin; sid:235; rev:60803;)
    $class = shift @ARGV;    # remove the argument so that <> does not treat it as a file name
    $new = "classtype:$class;";
    while (<>) {
        $line = $_;
        $line =~ s[classtype\:.*?\;][$new];
        print $line;
    }
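For a single substitution of a fixed string, a similar capability is available with the built-in function substr (the four-argument form replaces the selected substring and returns the old one):
$result = substr($_, index($_, 'Nick'), length('Nick'), 'Nickolas'); # assumes 'Nick' is present; index returns -1 otherwise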
It would be nice to be able to match on an array too, but this is not possible. If you try something such as:
@buffer =~ m/yahoo/; # Wrong way to search for a string in the array
In the example above the array @buffer will be converted to scalar (number of its elements) and if we assume that the array has 10 elements that means that you will be doing something like:
'10' =~ m/yahoo/;
The right way to solve this problem is to use the grep function, as in the example below:
grep(m/variable/, @buffer);
In scalar context the number of matching elements will be returned. In list context the list of elements that matched will be returned.
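Both contexts can be tried on a small array. The data below is made up for illustration.

```perl
use strict;
use warnings;

my @buffer = ('www.yahoo.com', 'www.perl.org', 'mail.yahoo.com');  # illustrative data

my @hits  = grep { /yahoo/ } @buffer;   # list context: the matching elements
my $count = grep { /yahoo/ } @buffer;   # scalar context: how many elements matched

print "$count matches: @hits\n";
```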
Each matched group in the matching pattern can be referenced with so-called backreferences. Backreferences are numbered consecutively: \1, \2, \3, ... The \1 form is used inside the matching pattern itself; in the replacement string the same groups are available as $1, $2, $3, ...
With the "standard" notation you need to use a backslash to escape special character you want to match. That means that you still need to use the backslash character to escape any of the meta-characters including slash, which is pretty common in Unix pathnames. For example:
$path =~ m/usr\/local\/bin/;
This tries to match usr/local/bin in $path. As we saw, regular expressions are usually delimited by slashes, and if the regex that you want to match contains a lot of slashes the whole regular expression becomes unreadable.
m/Hello/;
To rectify this problem Perl allows the use of alternative regex delimiters (the delimiters mark the beginning and end of a given regular expression) if you use the initial m for matching:
m{/usr/local/bin/}
Actually { } is probably the most readable alternative, as it permits easy finding of the opening and closing brackets in any decent editor (including Emacs, vi, vim, Slickedit, MultiEdit). But if you wish you can use other symbols, for example:
m"/usr/local/bin" # here double quote serves as a regex delimiter
Note that if a left bracket is used as the starting delimiter, then the ending delimiter must be the right bracket. Both the match and substitution operators let you use variable interpolation.
In case your regex contains a lot of special symbols you can first assign it to a single-quoted string and then use the variable in the matching operator. The regex inside slashes is treated like a double-quoted string, so you can interpolate a variable into it. For example:
$profile = '/root/home/.profile'; m/$profile/;
The same trick works for substitution too, for example:
s{USA}{Canada}sg;
This capability to find matching bracket can be useful when we deal with multiple line regular expressions using extension syntax described below.
If the match regex evaluates to the empty string, the last successfully matched regex is used instead. So, if you see a statement like
if (//) {print;}
in a Perl program, look for the previous regular expression operator to see what the regex really is. The substitution operator also uses this interpretation of the empty regex (but never for the substitution part which is a string, not a regular expression).
Extended mode which is activated by using modifier x provides capability to write comments within the regex as well as use whitespace freely for readability. For example, instead of regular expression:
    # Match an assignment like a=b;. $1 will be the name of the variable and the
    # first word. $2 will be the second word.
    m/^\s+(\w+)\W+(\w+)\s+$/;
We can write
    m/^\s+      (?# leading spaces)
      (\w+)     (?# get first word)
      \s*=\s*   (?# match = with white space before and after ignored )
      (.*)      (?# right part )
      \;        (?# final semicolon)
     /x
Here we moved the groups to separate lines: it improves readability and gives us the opportunity to add comments using the asymmetrical brackets (?# and )
But you can go too far and "kill with kindness". Here is an example of an over-commented regular expression that is more difficult to read than the one line version:
    m/
      (?# This regex will match any Unix style assignments in configuration file delimited with semicolon)
      (?# results are put into $1 and $2 if the match is successful.)
      ^      (?# Anchor this match to the beginning of the string)
      \s*    (?# skip over any whitespace characters)
             (?# we use the * because there may be none)
      (\w+)  (?# Match the first word, put in the first variable)
             (?# the first word because of the anchor)
      \W+    (?# Match at least one non-word)
             (?# character, there may be more than one)
      (\w+)  (?# Match another word, put into the second variable)
      \s*    (?# skip over any whitespace characters)
             (?# use the * because there may be none)
      $      (?# Anchor this match to the end of the)
             (?# string. Because both ^ and $ anchors)
             (?# are present, the entire string will)
             (?# need to match the regex. A sub-string will not match.)
     /x;
Please note that the really important trick of using \W+ to match any combination of delimiters like "=", " = " or " =" remains unexplained. In a way those comments make the regex more difficult to understand, not easier. In general, avoid such excesses. In commenting, the first rule is: not too much zeal ;-).
Along with the ability to add comments, the extended (?...) syntax also provides additional matching capabilities:
The most useful of the extensions listed above is grouping without creating a backreference.
You can also specify regex modifiers inside the regex itself
(?sxi)
This extension lets you specify an embedded modifier in the regex rather than adding it after the last delimiter. This is useful if you are storing regexes in variables and using variable interpolation to do the matching.
Extensions also let you change the order of evaluation without assigning the value of matched group to special variables ($1, $2,...). For example,
m/(?:Operator|Unix)+/;
matches the strings Operator and Unix in any order. No special regex variables ($1, $2, $3,...) will be assigned.
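The effect on capture numbering can be seen by running the same pattern with and without ?: in the first group. The input line and the alternative Admin are illustrative.

```perl
use strict;
use warnings;

my $line = 'Unix Operator: nick';   # illustrative input

# Capturing group: the alternation occupies $1, the name lands in $2.
my @with    = $line =~ /(Operator|Admin):\s*(\w+)/;

# Non-capturing group: the alternation uses no slot, the name is $1.
my @without = $line =~ /(?:Operator|Admin):\s*(\w+)/;

print scalar(@with), " captures: @with\n";
print scalar(@without), " capture: @without\n";
```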
At times, you might like to include a regex component in your regex without including it in the $& variable that holds the matched string. The technical term for this is a zero-width positive look-ahead assertion. You can use this to ensure that the string following the matched component is correct without affecting the matched value. For example, if you have some data that looks like this:
A:B:C
and you want to find all operators in /etc/passwd file and store the value of the first column, you can use a look-ahead assertion. This will do both tasks in one step. For example:
    while (<>) {
        push(@array, $&) if m/^\w+(?=\s+Operator\s+)/;
    }
    print("@array\n");
Let's look at the regex with comments added using the extended mode. In this case, it doesn't make sense to add comments directly to the regex because the regex is part of the if statement modifier. Adding comments in that location would make the comments hard to format.
So we can use a different tactic and put the regex in variable
    $regex = '^\w+          (?# Match the first word in the string)
              (?=\s+        (?# Use a look-ahead assertion to match)
                            (?# one or more whitespace characters)
                 Operator   (?# text to match but not to include)
                 \s+        (?# one or more whitespace characters)
              )             (?# end of the look-ahead assertion)';
    while (<>) {
        push(@array, $&) if m/$regex/xo;
    }
    print("@array\n");
Here we used a variable to hold the regex and then used variable interpolation in the regex with the match operator. To speed things up we use the o modifier, which tells Perl to interpolate and compile the regular expression only once.
The last extension that we'll discuss is the zero-width negative assertion. This type of component is used to specify values that shouldn't follow the matched string. For example, using the same data as in the previous example, you can look for everyone who is not an operator. Your first inclination might be to simply replace the (?=...) with the (?!...) in the previous example.
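Simply swapping (?=...) for (?!...) has a catch: the regex engine can backtrack to a shorter word for which the negative assertion holds. A minimal sketch with made-up records shows the problem and one possible fix, a word boundary \b before the assertion.

```perl
use strict;
use warnings;

my $rec = 'nick Operator console';   # illustrative record

# Naive version: \w+ gives back one character, and "nic" is not followed by " Operator ".
my ($naive)  = $rec =~ /^(\w+)(?!\s+Operator\s+)/;

# \b forces \w+ to take the whole word, so the operator is correctly rejected.
my ($strict) = $rec =~ /^(\w+)\b(?!\s+Operator\s+)/;

# A record of a non-operator still matches.
my ($other)  = 'joe User tty1' =~ /^(\w+)\b(?!\s+Operator\s+)/;

print "naive:  $naive\n";
print "strict: ", (defined $strict ? $strict : 'no match'), "\n";
print "other:  $other\n";
```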
There are many ways of matching any value.
If the first method you try doesn't work, try breaking the value into smaller components and match each boundary. If all else fails, you can always ask for help on the comp.lang.perl.misc newsgroup.
One of the more common uses of regex is to find a substring in a string, but remember that in simple cases the index function is simpler and better. Remember also that regular expression matching is greedy, so you will get the longest match possible at the leftmost position where a match can start:
$regex = "a*a"; $_ = "abracadabra"; if (m/$regex/) { print "Found $regex in $_\n"; }
When matching lines in a file you can print matched strings along with their line number using the special variable $. (the current line number):
    $target = "yahoo";
    open(INPUT, "< visited_sites.dat") or die "Cannot open visited_sites.dat: $!";
    while (<INPUT>) {
        if (/$target/o) {
            print "Site $target was visited: $. $_";
        }
    }
    close(INPUT);
The $. special variable keeps track of the record number. Every time the diamond operator reads a line, this variable is incremented.
Please note that this example would be better programmed using the index function.
So the question arises: what additional capabilities make regexes superior to string functions in complex situations? The answer is that regexes have so-called matching memory or regex memory: a set of special variables that are assigned values during the matching operation, for the regex as a whole and for each component of the regex enclosed in parentheses. Regex memory is often called backreferences. This memory persists after the execution of a particular match statement. You can think of backreferences as a special kind of assignment statement.
Each time you use parentheses '()' in a regex, Perl assumes that you want to assign the result of matching to a special variable with a name like $1, $2, $3, ... Naturally, $1 refers to the first, $2 to the second, and $3 to the third matched sub-pattern. These variables can then be accessed directly, by name, or indirectly by assigning the matching expression to an array.
You saw a simple example of this earlier right after the component descriptions. That example looked for the first word in a string and stored it into the first buffer, $1. The following small program
    $_ = "a=5";
    m/(\w+)\s*=\s*(\d+)/;
    print("$1, $2\n");
will display
    a, 5
This is a simplified example of how one can process Unix-style configuration files. You can use as many buffers as you need. Each time you add a set of parentheses, another buffer is used. If you want to find all the words in the string, you need to use the /g match modifier. In order to find all the words, you can use a loop statement that loops until the match operator returns false.
    $_ = "word1 word2 word3";
    while (/(\w+)/g) {
        print("$1\n");
    }
Naturally, this program will display
    word1
    word2
    word3
because on each iteration exactly one new match is printed. As you can see, the regex engine has internal memory, and with the modifier g in a loop it continues to extract parts of the initial string one by one. But a much more interesting approach to a similar problem is to use an array on the left side of the assignment statement:
    $_ = "word1 word2 word3";
    @matches = /(\w+)/g;
    print("@matches\n");
The program will display:
    word1 word2 word3
To help you know what matched and what did not, Perl has several auxiliary built-in variables with really horrible names: $& (the matched string), $` (the part of the string before the match) and $' (the part of the string after the match).
For example:
$text = "this matches 'THIS' not 'THAT'"; $text =~ m"('TH..')"; print "$` $'\n";
Here the regex matches 'THIS' together with its quotes. Perl saves the substring "this matches " (everything before the match) in $` and the substring " not 'THAT'" (everything after the match) in $', and the print statement displays both. The matched characters 'THIS' themselves are saved in $1 (and in $&). Note that a regular expression matches the leftmost occurrence in the string: 'THIS' was matched because it came first. With the default regexp behavior, 'THIS' will always be the first string to be matched (you can change this default with modifiers, see below).
If you need to save the values of the matched strings stored in the regex memory, make sure to assign them to other variables. Regex memory is local to the enclosing block and lasts only until the next successful match.
Backreferences are available in the matching regex itself. In other words, if you put parentheses around a group of characters, you can use this group of characters later in the regular expression or in the substitution string. But there is an important syntactic difference: if you want to use the backreferences in the matching regex you need to use the syntax \1, \2, etc. If you want to use the backreferences in the substitution string you use the regular $1, $2, etc.
Perl is Perl and there are some irregularities in Perl regular expressions ;-) Here are some examples:
$line = 'Hello world'; $line =~s/(\w+) (\w+)/$2 $1/; # This makes string 'world Hello'.
We can also use a backreference in the matching regex itself:
if (/(A)(B)\2\1/) { print "Hello ABBA";}
The example is pretty artificial, but it well illustrates the key concept. There are 4 steps to the match of this string:
Note: The replacement string is interpolated like a double-quoted string, so variables are expanded in it, but none of the regex meta-characters has a special meaning there.
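A less artificial use of the same two notations is finding and removing a doubled word. This is a small sketch; the sample sentence and the variable names are assumptions made for the illustration.

```perl
use strict;
use warnings;

my $line = 'Paris in the the spring';
my $doubled;

# \1 inside the pattern refers to whatever group 1 has just captured.
if ($line =~ /\b(\w+)\s+\1\b/) {
    $doubled = $1;
    print "doubled word: $doubled\n";
}

# $1 in the replacement string refers to the same group after the match.
(my $fixed = $line) =~ s/\b(\w+)\s+\1\b/$1/;
print "$fixed\n";
```

The \b anchors keep the pattern from matching inside longer words.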
Here are some more examples:
$text = 'word1 word2 word3'; ($word1, $word3) = ($text =~ m"(\w+).*(\w+)");
Notice, however, that the assignment yields useful values only when the text string matches. When it does not match, the list assignment receives an empty list, so $word1 and $word3 are left undefined. Try the example above with the string "1999 2000 2001" to see the result. The special variables behave differently: if your regular expression does not match at all, nothing is assigned to them and they preserve their values (so the values from the previous successful match, if any, would be used).
Backreferences are not set if a regular expression fails
This is a frequent Perl 'gotcha'. Built-in variables like $1 do not change if the regular expression fails. Some people think this is a bug, others consider it a feature. Nonetheless, the point becomes painfully obvious when you consider the following code.
$_ = 'Perl bugs bite';
/\w+ (\w+) \w+/; # sets $1 to be "bugs".
$_ = 'Another match another bug';
/(^a.*\s)/; # ^a.*\s will not match any substring in the string
print $1; # Surprise! "bugs" will be printed!
In this case, $1 is the string 'bugs', since the second match expression failed! This Perl behavior can cause hours of searching for a bug. So, consider yourself warned. Or, more to the point, always check that a match was successful before using the variables it sets. You can use one of the following three checks to avoid this type of error:
if (/(^a.*\s)/) { $matched = $1; } else { print "matching failed"; }
($scalarName =~ m"(regular expression)") && ($match = $1);
($match1, $match2) = /(\w+).*(\w+)/; if (!defined $match1 || !defined $match2) { print "match failed\n"; }
Although the first method is the cleanest, any one will do the job. In any case your regex matching code should protect against errors caused by built-in regex variables that were not assigned.
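One way to make the check hard to forget is to wrap the match in a small function that returns undef on failure. The sketch below assumes a hypothetical helper named second_word; it is not taken from the original text.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the middle word of a three-word line,
# or undef when the line does not have that shape.
sub second_word {
    my ($s) = @_;
    return $s =~ /^\w+ (\w+) \w+$/ ? $1 : undef;
}

my $first  = second_word('Perl bugs bite');
my $second = second_word('no match here at all');

print defined $first  ? "first: $first\n"   : "first: no match\n";
print defined $second ? "second: $second\n" : "second: no match\n";
```

Because $1 is read only in the branch taken after a successful match, a stale value from an earlier match can never leak out.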
There are several cases:
For example:
($variable, $equals, $value) = ($line =~ m"(\w+)\s*(=)\s*(\w+)");
This takes the first reference (\w+) and makes it $variable, the second reference (=) and makes it $equals, and the third reference (\w+) and makes it $value.
Another interesting case is matching in array context with the 'g' modifier. This takes the regular expression, applies it as many times as it can be applied, and then stuffs the results into an array that consists of all possible matches. For example:
$line = '1.2 3.4 beta 5.66'; @matches = ($line =~ m"(\d*\.\d+)"g);
will make '@matches' equal to '(1.2, 3.4, 5.66)'. The 'g' modifier does the iteration, matching 1.2 first, 3.4 second, and 5.66 third. Likewise:
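use FileHandle; undef $/; my $FD = FileHandle->new("file"); @comments = (<$FD> =~ m"/\*(.*?)\*/"sg);
will make an array of all the comments in the file read through '$FD'. The 'g' modifier is needed to collect every comment rather than just the first one, and the 's' modifier lets a comment span several lines.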
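The two list-context forms combine nicely: with several groups and the g modifier, the match returns the groups of every match in order, which can be assigned straight to a hash. A minimal sketch, with an invented sample line and variable names:

```perl
use strict;
use warnings;

my $line = 'host=alpha port=8080 mode=debug';

# Each match contributes two list elements (key, value), so the flat
# list returned in list context fills the hash pairwise.
my %setting = $line =~ /(\w+)=(\w+)/g;

# Assigning to an empty list and then to a scalar counts the matches.
my $count = () = $line =~ /\w+=\w+/g;

print "$_ => $setting{$_}\n" for sort keys %setting;
print "pairs found: $count\n";
```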
Finally, if you use the matching operator in scalar context, you get a behavior that is entirely different from anything else (in the regular expression world, and even the Perl world). This is that 'iterator' behavior we talked about. If you say:
$line = "BEGIN <data> BEGIN <data2> BEGIN <data3>"; while ($line =~ m"BEGIN(.*?)(?=BEGIN|$)"sg){ push(@blocks, $1); }
This then matches the following text, one BEGIN block per iteration, and stuffs the captured part into @blocks on successive iterations of while:
BEGIN <data>(%)BEGIN <data2> BEGIN <data3>
BEGIN <data> BEGIN <data2>(%)BEGIN <data3>
BEGIN <data> BEGIN <data2> BEGIN <data3>
We have indicated via a '(%)' where each of the iterations start their matching. Note the use of (?=) in this example too! It is essential to matching the correct way, since if you don't use it, the 'matcher' will get set in the wrong place.
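The position of the iterator can be observed directly with the built-in pos function. The following sketch uses a shortened sample string (an assumption, chosen so that the offsets are easy to count) and records where each iteration stopped.

```perl
use strict;
use warnings;

my $line = "BEGIN a BEGIN bb BEGIN ccc";
my (@blocks, @ends);

while ($line =~ /BEGIN(.*?)(?=BEGIN|$)/sg) {
    push @blocks, $1;
    # pos() reports where this match ended, which is where the next one starts.
    push @ends, pos($line);
}

print "[$_]\n" for @blocks;
print "iterations ended at offsets: @ends\n";
```

Because the look-ahead does not consume the next BEGIN, each iteration stops just in front of it and the following iteration can match it.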
As backreferences are implicit assignments they can be nested. Let's discuss parsing of date format in HTTP logs.
m{\([(\d)*\])};
Here, the outermost (( )) parentheses captures the whole thing: 'softly slowly surely subtly'. The innermost (()) parentheses captures a combination of strings beginning with an s and ending with a "ly" followed by spaces. Hence, it first captures 'softly', throws it away then captures 'slowly', throws it away then captures 'surely', then captures 'subtly'.
The first two examples are fairly straightforward. '[0-9]' matches the digit '1' in 'this has a digit (1) in it'. '[A-Z]' matches the capital 'A' in 'this has a capital letter (A) in it'. The last example is a little bit trickier. Since there is only one 'an' in the regex, the only characters that can possibly match are the last four 'an A'.
However, by asking for the regex 'an [^A]' we have distinctly told the regular expression to match 'a', then 'n', then a space, and finally a character that is NOT an 'A'. Hence, this does not match. If the string was 'match an A not an e', then this would match, since the first 'an' would be skipped, and the second matched! Likewise:
$scalarName = "This has a tab( )or a newline in it so it matches"; $scalarName =~ m"[\t\n]" # Matches either a tab or a newline. # matches since the tab is present
This example illustrates some of the fun things that can be done with matching and wildcarding. One, the same characters that you can have interpolated in a " " string also get interpolated both in a regular expression and inside a character class denoted by brackets ([\t\n]). Here, "\t" becomes the matching of a tab, and "\n" becomes the matching of a newline.
m/a|b+/
it's hard to tell if the regex should be
m/(a|b)+/ # match any sequence of "a" and "b" characters in any order.
or
m/a|(b+)/ # match either the "a" character or the "b" character repeated one or more times.
The order of precedence is shown in the table below. By looking at the table, you can see that quantifiers have a higher precedence than alternation. Therefore, the second interpretation is correct.
Precedence Level | Component |
---|---|
1 | Parentheses |
2 | Quantifiers |
3 | Sequences and Anchors |
4 | Alternation |
You can use parentheses to affect the order in which components are evaluated because they have the highest precedence. However, if you want to group without capturing you need to use the extended syntax (?:...), or you will be affecting the regex memory.
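The difference between the two readings, and the effect of the non-capturing form, can be checked with a short sketch. The sample string and variable names are assumptions made for the illustration.

```perl
use strict;
use warnings;

my $s = 'abba';

# Alternation binds last: this is "a" OR "b+", so only the first "a" matches.
my ($plain) = $s =~ /^(a|b+)/;

# (?:...) groups the alternation for the quantifier without creating
# a capture of its own; the outer parentheses remain group 1.
my ($grouped) = $s =~ /^((?:a|b)+)/;

print "plain:   $plain\n";
print "grouped: $grouped\n";
```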
Now let's introduce one new thing: both the matching and the substitution operators perform variable interpolation both in the regex and substitution strings, for example:
$variable =~ m"$scalar";
then $scalar will be interpolated, turned into the value for scalar. There is a caveat here. Any special characters will be acted upon by the regular expression engine, and may cause syntax errors. Hence if scalar is:
$scalar = "({";
Then saying something like:
$variable =~ m"$scalar";
is equivalent to saying: $variable =~ m"({"; which is a runtime syntax error. If you say:
$scalar = quotemeta('({');
instead, this will make $scalar become '\(\{' for you, and the match becomes:
$variable =~ m"\(\{";
Then, you will match the string '({' as you would like.
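The same protection is available inline through the \Q and \E escapes, which apply quotemeta to the interpolated text. A minimal sketch; the haystack string and variable names are invented for the example.

```perl
use strict;
use warnings;

my $needle   = '({';
my $haystack = 'call foo({ a => 1 })';

# quotemeta puts a backslash in front of every non-word character.
my $quoted = quotemeta($needle);
print "quoted pattern: $quoted\n";

my $found_q  = $haystack =~ /$quoted/       ? 1 : 0;
# \Q...\E does the same quoting at interpolation time.
my $found_qe = $haystack =~ /\Q$needle\E/   ? 1 : 0;

print "found via quotemeta: $found_q, via \\Q...\\E: $found_qe\n";
```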
You can use array in regex (it will be converted to the string with elements separated by spaces like in print statement), but this is tricky and rarely used:
$variable =~ m/@arrayName/; # this equals m/elem1 elem2/;
Here, this is equal to m/elem1 elem2/. If the special variable $" was set to '|', this would be equal to m/elem1|elem2/, which, as we shall see, matches either 'elem1' or 'elem2' in a string. This works for special characters too:
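The array interpolation described above can be sketched as follows. The array contents are the ones used in the text; localizing $" in a bare block, wrapping the alternatives in a non-capturing group and storing the result with qr// are choices made for this illustration.

```perl
use strict;
use warnings;

my @names = ('elem1', 'elem2');
my $re;
{
    # $" is the separator Perl inserts between array elements when an
    # array is interpolated; local restores the old value at block exit.
    local $" = '|';
    $re = qr/\b(?:@names)\b/;
}

for my $candidate ('use elem2 here', 'element') {
    my $verdict = $candidate =~ $re ? 'matches' : 'does not match';
    print "'$candidate' $verdict\n";
}
```

If the array elements may contain regex meta-characters, they should be passed through quotemeta first.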
For example:
$_ = "AAA BBB AAA"; print "Found bbb\n" if m/bbb/i;
This program finds a match even though the regex uses lowercase and the string uses uppercase because the /i modifier was used, telling Perl to ignore the case. The result from a global regex match (modifier g) can be assigned to an array variable or used inside a loop.
As we already know, the substitution operator has all the modifiers used in the matching operator plus several more. One interesting modifier provides the capability to evaluate the replacement string as an expression instead of a string. You could use this capability to find all numbers in a file and multiply them by a given percentage. Or you could repeat matched strings by using the string repetition operator.
In Perl 4, if back quotes were used as delimiters, the replacement string was executed as a DOS or UNIX command and the output of the command was used as the replacement text. Perl 5 treats back quotes as ordinary delimiters and does not execute the replacement text; to use command output in a replacement, run the command from code evaluated with the e modifier.
e -- permits interpretation of the replacement part of the regular expression as a script
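Both uses mentioned above (scaling numbers and repeating matched strings) can be sketched in a few lines. The price list, the percentage and the variable names are invented for the example.

```perl
use strict;
use warnings;

my $prices  = 'tea 200, coffee 350';
my $percent = 110;    # hypothetical: raise every number to 110% of its value

# With /e the replacement is Perl code; its value replaces the match.
(my $raised = $prices) =~ s/(\d+)/$1 * $percent \/ 100/ge;

# The repetition operator x doubles every word.
(my $echo = 'ab-cd') =~ s/(\w+)/$1 x 2/ge;

print "$raised\n";
print "$echo\n";
```

The division sign is escaped because / is also the delimiter here; choosing another delimiter such as s{...}{...}ge avoids the backslash.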
Without modifiers, a dot ('.') matches anything but a newline. Sometimes this is helpful. Sometimes it is very frustrating, especially if you have data that spans multiple lines. Consider the following case:
$line = "BLOCK: Some text\nEND BLOCK\nBLOCK: Another text\nEND BLOCK";
Now suppose you want to match the text between the keyword BLOCK and "END BLOCK":
$line =~ m{
    BLOCK: (.*?)
    END\ BLOCK   # Note backslash. Space will be ignored otherwise
}x;
This does not work. Since the wildcard ('.') matches every character EXCEPT a newline, the regular expression hits a dead end when it gets to the first newline.
Sometimes, as in this case, it is helpful to have the wildcard ('.') match EVERYTHING, including the newline. This is what the modifier 's' does. (The class \s matches newlines as well as tabs and spaces with or without this modifier.)
In other words, it forces Perl to not assume that the string you are working on is one line long. The above then does work with an s on the end of the regular expression:
$line =~ m{
    BLOCK: (.*?)
    END\ BLOCK
}xs;
With the modifier s this now works as expected.
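The effect of the s modifier is easy to verify side by side. This is a small sketch with its own two-line sample string; the variable names are assumptions.

```perl
use strict;
use warnings;

my $line = "BLOCK: Some text\nspanning lines END BLOCK";

# Without /s the dot stops at the newline, so the match fails.
my $without_s = $line =~ /BLOCK:(.*?)END BLOCK/ ? 1 : 0;

# With /s the dot also matches the newline.
my ($body) = $line =~ /BLOCK:(.*?)END BLOCK/s;

print "without /s matched: $without_s\n";
print "with /s captured: [$body]\n";
```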
Modifier m is in a sense the opposite of the s modifier. It says 'treat the string as multiple lines rather than one line'. With it, ^ and $ match not only at the beginning and end of the string (respectively): ^ also matches immediately after any newline, and $ also matches immediately before any newline. For example,
$line = "a\nb\nc"; $line =~ m"^(.*)$"m;
the m modifier will make the backreference $1 become 'a'. Without m the match fails, because '.' does not match a newline and $ matches only at the end of the string (or just before a trailing newline).
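A short sketch makes the difference visible by collecting every line that consists of a single word character, first without and then with the m modifier. The variable names are invented for the illustration.

```perl
use strict;
use warnings;

my $text = "a\nb\nc";

# Without /m, ^ and $ anchor only to the whole string: nothing matches.
my @no_m = $text =~ /^(\w)$/g;

# With /m, ^ and $ anchor to every line.
my @with_m = $text =~ /^(\w)$/mg;

print "without /m: ", scalar(@no_m), " matches\n";
print "with /m: @with_m\n";
```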
Modifier e provides the possibility to evaluate the second part of the s/// as a complete 'mini-Perl program' rather than as a string. This dramatically increases the power of substitution operator in Perl. For example let's assume that you want to substitute all of the letters in the following string with their corresponding ASCII number:
$string = 'hello';
$string =~ s{(\w)}               # we save the $1.
            {ord($1) . " ";}egx;
print "$string\n";
This example will convert each letter into its numeric code (via the ord function) and will print
'104 101 108 108 111 '.
Each character was taken in turn here and run through the 'ord' function, which turned it into a number. This is pretty powerful functionality, but at the same time it is difficult to read and understand. In other words, it risks being incomprehensible even for the original programmer when in a month or a year he returns to make some modifications in the program.
We suggest you use such a construct only if you go to some length documenting why you are using it and why it is in this case better than a more explicit and cleaner way of programming the same functionality. For example:
$string = turnToAscii($string); sub turnToAscii{ my ($string) = @_; my ($return, @letters); @letters = split(//, $string); foreach $letter (@letters) { $letter = ord($letter) . " " if ($letter =~ m"\w"); } $return = join('', @letters); $return; }
This latter example is longer but more easily maintainable. However, it is not only longer, it is also slower, so if this construct needs to process long strings the initial "obscure" construct has advantages.
Modifier g in substitution meant that every single instance of a regular expression was replaced. However, this is meaningless in the context of matching. In matching Perl remembers where that match occurs and starts the next matching from this place, not from the beginning of the string. When Perl hits the end of the string, the iterator is reset:
$line = "hello stranger hello friend hello sam"; while ($line =~ m"hello (\w+)"sg){ print "$1\n"; }
This outputs
stranger friend sam
and then quits, because the inherent iterator comes to the end of the string.
There is one caveat here. With modifier g any modification to the variable being matched via assignment causes this internal iterator to be reset to the beginning for the string.
$word = "hello"; $text=<>; $i=0; while ($text =~ m"($word)"sg) { print "instance $i of the word $word was found\n"; $text="$text\n Word '$word' was found with offset ".length($`)."\n"; $i++; }
As the variable $text is changed inside the loop, the iterator will be reset to the beginning of the string, creating an infinite loop!
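One way around the problem is to record what the loop finds and modify the string only after the loop has finished. The sketch below replaces the input read from <> with a fixed sample string, which is an assumption made so that the example is self-contained.

```perl
use strict;
use warnings;

my $word = 'hello';
my $text = "hello there, hello again";
my @offsets;

while ($text =~ /\Q$word\E/g) {
    # Only record the offset here; $text itself is left alone, so the
    # iterator keeps its position between iterations.
    push @offsets, $-[0];
}

# Safe: the loop is over, so resetting the iterator no longer matters.
$text .= "\nWord '$word' was found at offsets @offsets\n";
print $text;
```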
Modifier o is helpful when you have a complex regex that is inside a nested loop, so that the time consumed by matching greatly influences the total running time of the program.
foreach $filesystem (@fstab) { foreach $file (@files) { foreach $line (@text) { $line =~ m"<complex regular expression>"; } } }
A regex that contains no variable interpolation is compiled only once, when the program itself is compiled, so it needs no special treatment. A regex that does contain interpolated variables is re-examined each time Perl hits it, and is recompiled whenever the interpolated value has changed. This takes time, and if the variables never change it is an unnecessary operation that can be and should be blocked; this is what modifier o does.
Using modifier o with a regex that contains variable interpolation requires care and is often considered bad style. But Perl allows this.
It assumes that you make a promise that after the first evaluation the variable that represents the regex will never change. If it does, Perl will not notice your change.
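A commonly used alternative that avoids the silent promise is the qr// operator, which compiles a pattern once into a value that can be stored and reused. This is a sketch of that swapped-in technique, not of modifier o itself; the word list and variable names are invented.

```perl
use strict;
use warnings;

my $word = 'error';

# The interpolation and compilation happen once, here.
my $re = qr/\b\Q$word\E\b/i;

my @lines = ('Error: disk full', 'no problems',
             'another ERROR here', 'errors everywhere');
my $hits = 0;
for my $line (@lines) {
    # The precompiled pattern is reused on every iteration.
    $hits++ if $line =~ $re;
}
print "lines mentioning '$word': $hits\n";
```

If $word changes later, a new qr// value has to be built explicitly, which makes the dependency visible in the code.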
Aug 26, 2020 | stackoverflow.com
never_had_a_name ,
How are non-capturing groups, i.e. (?:), used in regular expressions and what are they good for?
aliteralmind ,
This question has been added to the Stack Overflow Regular Expression FAQ, under "Groups". – aliteralmind Apr 10 '14 at 0:25
Ricardo Nolde ,
Let me try to explain this with an example.
Consider the following text:
http://stackoverflow.com/
https://stackoverflow.com/questions/tagged/regex
Now, if I apply the regex below over it...
(https?|ftp)://([^/\r\n]+)(/[^\r\n]*)?
... I would get the following result:
Match "http://stackoverflow.com/"
  Group 1: "http"
  Group 2: "stackoverflow.com"
  Group 3: "/"
Match "https://stackoverflow.com/questions/tagged/regex"
  Group 1: "https"
  Group 2: "stackoverflow.com"
  Group 3: "/questions/tagged/regex"
But I don't care about the protocol -- I just want the host and path of the URL. So, I change the regex to include the non-capturing group (?:).
(?:https?|ftp)://([^/\r\n]+)(/[^\r\n]*)?
Now, my result looks like this:
Match "http://stackoverflow.com/"
  Group 1: "stackoverflow.com"
  Group 2: "/"
Match "https://stackoverflow.com/questions/tagged/regex"
  Group 1: "stackoverflow.com"
  Group 2: "/questions/tagged/regex"
See? The first group has not been captured. The parser uses it to match the text, but ignores it later, in the final result.
EDIT:As requested, let me try to explain groups too.
Well, groups serve many purposes. They can help you to extract exact information from a bigger match (which can also be named), they let you rematch a previous matched group, and can be used for substitutions. Let's try some examples, shall we?
Imagine you have some kind of XML or HTML (be aware that regex may not be the best tool for the job, but it is nice as an example). You want to parse the tags, so you could do something like this (I have added spaces to make it easier to understand):
\<(?<TAG>.+?)\> [^<]*? \</\k<TAG>\>
or
\<(.+?)\> [^<]*? \</\1\>
The first regex has a named group (TAG), while the second one uses a common group. Both regexes do the same thing: they use the value from the first group (the name of the tag) to match the closing tag. The difference is that the first one uses the name to match the value, and the second one uses the group index (which starts at 1).
Let's try some substitutions now. Consider the following text:
Lorem ipsum dolor sit amet consectetuer feugiat fames malesuada pretium egestas.
Now, let's use this dumb regex over it:
\b(\S)(\S)(\S)(\S*)\b
This regex matches words with at least 3 characters, and uses groups to separate the first three letters. The result is this:
Match "Lorem"
  Group 1: "L"
  Group 2: "o"
  Group 3: "r"
  Group 4: "em"
Match "ipsum"
  Group 1: "i"
  Group 2: "p"
  Group 3: "s"
  Group 4: "um"
...
Match "consectetuer"
  Group 1: "c"
  Group 2: "o"
  Group 3: "n"
  Group 4: "sectetuer"
...
So, if we apply the substitution string:
$1_$3$2_$4
... over it, we are trying to use the first group, add an underscore, use the third group, then the second group, add another underscore, and then the fourth group. The resulting string would be like the one below.
L_ro_em i_sp_um d_lo_or s_ti_ a_em_t c_no_sectetuer f_ue_giat f_ma_es m_la_esuada p_er_tium e_eg_stas.
You can use named groups for substitutions too, using ${name}.
To play around with regexes, I recommend http://regex101.com/, which offers a good amount of details on how the regex works; it also offers a few regex engines to choose from.
Ricardo Nolde ,
@ajsie: Traditional (capturing) groups are most useful if you're performing a replacement operation on the results. Here's an example where I'm grabbing comma-separated last & first names and then reversing their order (thanks to named groups)... regexhero.net/tester/?id=16892996-64d4-4f10-860a-24f28dad7e30 – Steve Wortham Aug 19 '10 at 15:43
You can use capturing groups to organize and parse an expression. A non-capturing group has the first benefit, but doesn't have the overhead of the second. You can still say a non-capturing group is optional, for example.
Say you want to match numeric text, but some numbers could be written as 1st, 2nd, 3rd, 4th,... If you want to capture the numeric part, but not the (optional) suffix you can use a non-capturing group.
([0-9]+)(?:st|nd|rd|th)?
That will match numbers in the form 1, 2, 3... or in the form 1st, 2nd, 3rd,... but it will only capture the numeric part.
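In Perl the ordinal example from the answer above can be tried as follows. The sample sentence, the added \b anchors and the variable name are assumptions made for this sketch.

```perl
use strict;
use warnings;

my $text = 'the 1st, 22nd and 3 items';

# Only the digits are captured; the optional suffix is matched by the
# non-capturing group and therefore never shows up in the result list.
my @numbers = $text =~ /\b([0-9]+)(?:st|nd|rd|th)?\b/g;

print "numbers: @numbers\n";
```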
Nov 11, 2019 | docstore.mik.ua
Problem
You have a pattern with a greedy quantifier like *, +, ?, or {}, and you want to stop it from being greedy.
A classic case of this is the naïve substitution to remove tags from HTML. Although it looks appealing, s#<TT>.*</TT>##gsi actually deletes everything from the first open TT tag through the last closing one. This would turn "Even <TT>vi</TT> can edit <TT>troff</TT> effectively." into "Even effectively", completely changing the meaning of the sentence!
Solution
Replace the offending greedy quantifier with the corresponding non-greedy version. That is, change *, +, ?, and {} into *?, +?, ??, and {}?, respectively.
Discussion
Perl has two sets of quantifiers: the maximal ones *, +, ?, and {} (sometimes called greedy) and the minimal ones *?, +?, ??, and {}? (sometimes called stingy). For instance, given the string "Perl is a Swiss Army Chainsaw!", the pattern /(r.*s)/ matches "rl is a Swiss Army Chains" whereas /(r.*?s)/ matches "rl is".
With maximal quantifiers, when you ask to match a variable number of times, such as zero or more times for * or one or more times for +, the matching engine prefers the "or more" portion of that description. Thus /foo.*bar/ matches from the first "foo" up to the last "bar" in the string, rather than merely the next "bar", as some might expect. To make any of the regular expression repetition operators prefer stingy matching over greedy matching, add an extra ?. So *? matches zero or more times, but rather than match as much as it possibly can the way * would, it matches as little as possible.
# greedy pattern
s/<.*>//gs; # try to remove tags, very badly
# non-greedy pattern
s/<.*?>//gs; # try to remove tags, still rather badly
This approach doesn't remove tags from all possible HTML correctly, because a single regular expression is not an acceptable replacement for a real parser. See Recipe 20.6 for the right way to do this.
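The TT example from the Problem statement can be run directly to compare the two kinds of quantifier. This is a minimal sketch; the variable names are made up.

```perl
use strict;
use warnings;

my $html = 'Even <TT>vi</TT> can edit <TT>troff</TT> effectively.';

# Greedy: .* runs from the first <TT> to the last </TT>.
(my $greedy = $html) =~ s#<TT>.*</TT>##gsi;

# Stingy: .*? stops at the nearest </TT>, so each pair is removed separately.
(my $stingy = $html) =~ s#<TT>.*?</TT>##gsi;

print "greedy: [$greedy]\n";
print "stingy: [$stingy]\n";
```

The spaces that surrounded the removed tags stay in place, so both results contain doubled spaces.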
Minimal matching isn't all it's cracked up to be. Don't fall into the trap of thinking that including the partial pattern BEGIN.*?END in a pattern amidst other elements will always match the shortest amount of text between occurrences of BEGIN and END. Imagine the pattern /BEGIN(.*?)END/. If matched against the string "BEGIN and BEGIN and END", $1 would contain "and BEGIN and". This is probably not what you want.
Imagine if we were trying to pull out everything between bold-italic pairs:
<b><i>this</i> and <i>that</i> are important</b> Oh, <b><i>me too!</i></b>
A pattern to find only text between bold-italic HTML pairs, that is, text that doesn't include them, might appear to be this one:
m{ <b><i>(.*?)</i></b> }sx
You might be surprised to learn that the pattern doesn't do that. Many people incorrectly understand this as matching a "<b><i>" sequence, then something that's not "<b><i>", and then "</i></b>", leaving the intervening text in $1. While often it works out that way due to the input data, that's not really what it says. It just matches the shortest leftmost substring that satisfies the entire pattern. In this case, that's the entire string. If the intention were to extract only stuff between "<b><i>" and its corresponding "</i></b>", with no other bold-italic tags in between, it would be incorrect.
If the string in question is just one character, a negated class is remarkably more efficient than a minimal match, as in /X([^X]*)X/. But the general way to say "match BEGIN, then not BEGIN, then END" for any arbitrary values of BEGIN and END is as follows (this also stores the intervening part in $1):
/BEGIN((?:(?!BEGIN).)*)END/
Applying this to the HTML-matching code, we end up with something like:
m{ <b><i>( (?: (?!</b>|</i>). )* ) </i></b> }sx
or perhaps:
m{ <b><i>( (?: (?!</[ib]>). )* ) </i></b> }sx
Jeffrey Friedl points out that this quick-and-dirty method isn't particularly efficient. He suggests crafting a more elaborate pattern when speed really matters, such as:
m{
    <b><i>
    [^<]*                # stuff not possibly bad, and not possibly the end.
    (?:                  # at this point, we can have '<' if not part of something bad
        (?! </?[ib]> )   # what we can't have
        <                # okay, so match the '<'
        [^<]*            # and continue with more safe stuff
    ) *
    </i></b>
}sx
This is a variation on Jeffrey's unrolling-the-loop technique, described in Chapter 5 of Mastering Regular Expressions.
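The BEGIN/END case discussed in this recipe can be checked with a short sketch that applies both the minimal pattern and the general "BEGIN, then not BEGIN, then END" pattern to the same string. The variable names are assumptions.

```perl
use strict;
use warnings;

my $s = 'BEGIN and BEGIN and END';

# Minimal matching still starts at the leftmost BEGIN.
my ($lazy) = $s =~ /BEGIN(.*?)END/;

# Each character is accepted only if no BEGIN starts at it, so the
# match has to begin at the last BEGIN before END.
my ($tempered) = $s =~ /BEGIN((?:(?!BEGIN).)*)END/;

print "lazy:     [$lazy]\n";
print "tempered: [$tempered]\n";
```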
Sep 21, 2019 | perlmaven.com
... ... ... Character Classes
Regex Character Classes and Special Character classes .
[bgh.]   One of the characters listed in the character class: b, g, h or . in this case.
[b-h]    The same as [bcdefgh].
[a-z]    Lower case Latin letters.
[bc-]    The characters b, c or - (dash).
[^bx]    Complementary character class. Anything except b or x.
\w       Word characters: [a-zA-Z0-9_].
\d       Digits: [0-9]
\s       [\f\t\n\r ] form-feed, tab, newline, carriage return and SPACE
\W       The complementary of \w: [^\w]
\D       [^\d]
\S       [^\s]
[:class:]   POSIX character classes (alpha, alnum...)
\p{...}  Unicode definitions (IsAlpha, IsLower, IsHebrew, ...)
\P{...}  Complementary Unicode character classes.
TODO: add examples of \w and \d matching Unicode letters and numbers.
Quantifiers
a?       0-1 'a' characters
a+       1-infinite 'a' characters
a*       0-infinite 'a' characters
a{n,m}   n-m 'a' characters
a{n,}    n-infinite 'a' characters
a{n}     n 'a' characters
"Quantifier-modifier" aka. Minimal Matching
a+? a*? a{n,m}? a{n,}? a?? a{n}?
Other
|        Alternation
Grouping and capturing
(...)    Grouping and capturing
\1, \2, \3, \4 ...   Capture buffers during regex matching
$1, $2, $3, $4 ...   Capture variables after successful matching
(?:...)  Group without capturing (don't set \1 nor $1)
Anchors
^        Beginning of string (or beginning of line if /m enabled)
$        End of string (or end of line if /m enabled)
\A       Beginning of string
\Z       End of string (or before new-line)
\z       End of string
\b       Word boundary (start-of-word or end-of-word)
\G       Match only at pos(): at the end-of-match position of prior m//g
Modifiers
/m       Change ^ and $ to match beginning and end of line respectively
/s       Change . to match new-line as well
/i       Case insensitive pattern matching
/x       Extended pattern (disregard white-space, allow comments starting with #)
Extended
(?#text)             Embedded comment
(?adlupimsx-imsx)    One or more embedded pattern-match modifiers, to be turned on or off.
(?:pattern)          Non-capturing group.
(?|pattern)          Branch reset.
(?=pattern)          A zero-width positive look-ahead assertion.
(?!pattern)          A zero-width negative look-ahead assertion.
(?<=pattern)         A zero-width positive look-behind assertion.
(?<!pattern)         A zero-width negative look-behind assertion.
(?'NAME'pattern) (?<NAME>pattern)   A named capture group.
\k<NAME> \k'NAME'    Named backreference.
(?{ code })          Zero-width assertion with code execution.
(??{ code })         A "postponed" regular subexpression with code execution.
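Several entries of the cheat sheet (named capture groups, \A and \z, counted quantifiers) can be seen working together in a short sketch. The date string and the group names are invented for the illustration; named captures require Perl 5.10 or later.

```perl
use strict;
use warnings;

my $date = '2019-09-21';
my %parts;

if ($date =~ /\A(?<year>\d{4})-(?<month>\d\d)-(?<day>\d\d)\z/) {
    # %+ holds the named captures of the last successful match;
    # copy it, because the next match will change it.
    %parts = %+;
}

print "$_ = $parts{$_}\n" for sort keys %parts;
```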
Sep 17, 2019 | stackoverflow.com
dlw ,Aug 16, 2009 at 3:52
I need some Perl regular expression help. The following snippet of code:
use strict;
use warnings;
my $str = "In this example, A plus B equals C, D plus E plus F equals G and H plus I plus J plus K equals L";
my $word = "plus";
my @results = ();
1 while $str =~ s/(.{2}\b$word\b.{2})/push(@results,"$1\n")/e;
print @results;
Produces the following output:
A plus B
D plus E
2 plus F
H plus I
4 plus J
5 plus K
What I want to see is this, where a character already matched can appear in a new match in a different context:
A plus B
D plus E
E plus F
H plus I
I plus J
J plus K
How do I change the regular expression to get this result? Thanks --- Dan
Michael Carman ,Aug 16, 2009 at 4:11
General advice: Don't use s/// when you want m//. Be specific in what you match.
The answer is pos:
#!/usr/bin/perl -l
use strict;
use warnings;
my $str = 'In this example, ' . 'A plus B equals C, ' . 'D plus E plus F equals G ' . 'and H plus I plus J plus K equals L';
my $word = "plus";
my @results;
while ( $str =~ /([A-Z] $word [A-Z])/g ) {
    push @results, $1;
    pos($str) -= 1;
}
print "'$_'" for @results;
Output:
C:\Temp> b
'A plus B'
'D plus E'
'E plus F'
'H plus I'
'I plus J'
'J plus K'
Michael Carman ,Aug 16, 2009 at 2:56
You can use a m//g instead of s/// and assign to the pos function to rewind the match location before the second term:
use strict;
use warnings;
my $str = 'In this example, A plus B equals C, D plus E plus F equals G and H plus I plus J plus K equals L';
my $word = 'plus';
my @results;
while ($str =~ /(.{2}\b$word\b(.{2}))/g) {
    push @results, "$1\n";
    pos $str -= length $2;
}
print @results;
dlw ,Aug 18, 2009 at 13:00
Another option is to use a lookahead:
use strict;
use warnings;
my $str = "In this example, A plus B equals C, D plus E " . "plus F equals G and H plus I plus J plus K equals L";
my $word = "plus";
my $chars = 2;
my @results = ();
push @results, $1 while $str =~ /(?=((.{0,$chars}?\b$word\b).{0,$chars}))\2/g;
print "'$_'\n" for @results;
Within the lookahead, capturing group 1 matches the word along with a variable number of leading and trailing context characters, up to whatever maximum you've set. When the lookahead finishes, the backreference \2 matches "for real" whatever was captured by group 2, which is the same as group 1 except that it stops at the end of the word. That sets pos where you want it, without requiring you to calculate how many characters you actually matched after the word.
ysth ,Aug 16, 2009 at 9:01
Given the "Full Disclosure" comment (but assuming .{0,35}, not .{35}), I'd do
use List::Util qw/max min/;
my $context = 35;
while ( $str =~ /\b$word\b/g ) {
    my $pre = substr( $str, max(0, $-[0] - $context), min( $-[0], $context ) );
    my $post = substr( $str, $+[0], $context );
    my $match = substr( $str, $-[0], $+[0] - $-[0] );
    $pre =~ s/.*\n//s;
    $post =~ s/\n.*//s;
    push @results, "$pre$match$post";
}
print for @results;
You'd skip the substitutions if you really meant (?s:.{0,35}).
Greg Hewgill ,Aug 16, 2009 at 2:29
Here's one way to do it:
use strict;
use warnings;
my $str = "In this example, A plus B equals C, D plus E plus F equals G and H plus I plus J plus K equals L";
my $word = "plus";
my @results = ();
my $i = 0;
while (substr($str, $i) =~ /(.{2}\b$word\b.{2})/) {
    push @results, "$1\n";
    $i += $-[0] + 1;
}
print @results;
It's not terribly Perl-ish, but it works and it doesn't use too many obscure regular expression tricks. However, you might have to look up the function of the special variable @- in perlvar.
ghostdog74 ,Aug 16, 2009 at 3:44
don't have to use regex. basically, just split up the string, use a loop to go over each items, check for "plus", then get the word from before and after.
my $str = "In this example, A plus B equals C, D plus E plus F equals G and H plus I plus J plus K equals L";
@s = split /\s+/,$str;
for($i=0;$i<=scalar @s;$i++){
    if ( "$s[$i]" eq "plus" ){
        print "$s[$i-1] plus $s[$i+1]\n";
    }
}
Sep 16, 2019 | www.perlmonks.org
Using Look-ahead and Look-behind by Roy Johnson (Monsignor)
on Dec 21, 2005 at 21:57 UTC (#518444, perltutorial)
If you are familiar with Perl's regular expressions, you are probably already familiar with zero-width assertions: the ^ indicating the beginning of string and the \b indicating a word boundary are examples. They do not match any characters, but "look around" to see what comes before and/or after the current position.
With the look-ahead and look-behind constructs documented in perlre.html#Extended-Patterns , you can "roll your own" zero-width assertions to fit your needs. You can look forward or backward in the string being processed, and you can require that a pattern match succeed (positive assertion) or fail (negative assertion) there.
Syntax
Every extended pattern is written as a parenthetical group with a question mark as the first character. The notation for the look-arounds is fairly mnemonic, but there are some other, experimental patterns that are similar, so it is important to get all the characters in the right order.
- (?= pattern ) is a positive look-ahead assertion
- (?! pattern ) is a negative look-ahead assertion
- (?<= pattern ) is a positive look-behind assertion
- (?<! pattern ) is a negative look-behind assertion
Notice that the = or ! is always last. The directional indicator is only present in the look-behind, and comes before the positive-or-negative indicator.
Common tasks
Finding the last occurrenceThere are actually a number of ways to get the last occurrence that don't involve look-around, but if you think of "the last foo" as "foo that isn't followed by a string containing foo", you can express that notion like this: /foo(?!.*foo)/ [download] The regular expression engine will do its best to match .*foo , starting at the end of the string "foo". If it is able to match that, then the negative look-ahead will fail, which will force the engine to progress through the string to try the next foo.Substituting before, after, or between charactersMany substitutions match a chunk of text and then replace part or all of it. You can often avoid that by using look-arounds. For example, if you want to put a comma after every foo: s/(?<=foo)/,/g; # Without lookbehind: s/foo/foo,/g or s/(foo)/$1,/g [download] or to put the hyphen in look-ahead: s/(?<=look)(?=ahead)/-/g; [download] This kind of thing is likely to be the bulk of what you use look-arounds for. It is important to remember that look-behind expressions cannot be of variable length . That means you cannot use quantifiers ( ?, *, +, or {1,5} ) or alternation of different-length items inside them.Matching a pattern that doesn't include another patternYou might want to capture everything between foo and bar that doesn't include baz. The technique is to have the regex engine look-ahead at every character to ensure that it isn't the beginning of the undesired pattern: /foo # Match starting at foo ( # Capture (?: # Complex expression: (?!baz) # make sure we're not at the beginning of baz . # accept any character )* # any number of times ) # End capture bar # and ending at bar /x; [download] Nesting You can put look-arounds inside of other look-arounds. This has been known to induce a flight response in certain readers (me, for example, the first time I saw it), but it's really not such a hard concept. 
A look-around sub-expression inherits a starting position from the enclosing expression, and can walk all around relative to that position without affecting the position of the enclosing expression. They all have independent (though initially inherited) bookkeeping for where they are in the string.
The concept is pretty simple, but the notation becomes hairy very quickly, so commented regular expressions are recommended. Let's look at a real example, from the thread "Regex to add space after punctuation sign". The poster wants to put a space after any comma (punctuation, actually, but for simplicity, let's say comma) that is not nestled between two digits. Building up the s/// expression:
    s/(?<=,          # after a comma,
       (?!           # but not matching
         (?<=\d,)    #   digit-comma before, AND
         (?=\d)      #   digit afterward
       )
      )/ /gx;        # substitute a space
Note that multiple lookarounds can be used to enforce multiple conditions at the same place, like an AND condition that complements the alternation (vertical bar)'s OR. In fact, you can use Boolean algebra ( NOT (a AND b) === (NOT a OR NOT b) ) to convert the expression to use OR:
    s/(?<=,          # after a comma, but either
       (?:
         (?<!\d,)    #   not matching digit-comma before
         |           #   OR
         (?!\d)      #   not matching digit afterward
       )
      )/ /gx;        # substitute a space
Capturing
It is sometimes useful to use capturing parentheses within a look-around. You might think that you wouldn't be able to do that, since you're just browsing, but you can. But remember: the capturing parentheses must be within the look-around expression; from the enclosing expression's point of view, no actual matching was done by the zero-width look-around.
This is most useful for finding overlapping matches in a global pattern match. You can capture substrings without consuming them, so they are available for further matching later. Probably the simplest example is to get all right-substrings of a string:
    print "$1\n" while /(?=(.*))/g;
Note that the pattern technically consumes no characters at all, but Perl knows to advance a character on an empty match, to prevent infinite looping.
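The idioms from the article can be tried together in one short script. This is only a sketch: the sample strings and variable names are made up for illustration, and the right-substring example uses (.+) rather than (.*), the variant that the discussion further down reports as working on Perl 5.10.

```perl
use strict;
use warnings;

# Last occurrence: a "foo" with no further "foo" after it.
my $text = "foo1 foo2 foo3";
my $last_foo_at = $text =~ /foo(?!.*foo)/ ? $-[0] : -1;

# Space after a comma, unless the comma sits between two digits.
(my $spaced = "1,000 apples,pears,2 kiwis") =~ s/(?<=,(?!(?<=\d,)(?=\d)))/ /g;

# Capture between foo and bar, refusing any span that contains baz.
my $between = qr/foo((?:(?!baz).)*)bar/;
my ($clean) = "foo-abc-bar"     =~ $between;   # captures "-abc-"
my ($dirty) = "foo-baz-abc-bar" =~ $between;   # no match, stays undef

# Overlapping matches: capture inside a look-ahead, consuming nothing.
my @suffixes;
my $word = "Hello";
push @suffixes, $1 while $word =~ /(?=(.+))/g;

print "last foo starts at offset $last_foo_at\n";
print "$spaced\n";
print defined $clean ? "captured '$clean'\n" : "no match\n";
print defined $dirty ? "captured '$dirty'\n" : "no match\n";
print "@suffixes\n";
```

Each pattern is the one-line form of a commented /x pattern shown above, so the comments there apply unchanged.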
jds17 (Pilgrim) on May 07, 2009 at 16:13 UTC
Re: Using Look-ahead and Look-behind
Thank you for your very nice article, I certainly learned some new tricks!
Just one little comment: The code in the last paragraph did not work because by default regular expressions are greedy. (Did this change with the Perl versions in between?) The only right-substring that comes out is the full string:
    $_ = "Hello";
    print "$1\n" while /(?=(.*))/g;
Output:
    Hello
Making the "(.*)" part non-greedy fixes it (in Perl 5.10):
    $_ = "Hello";
    print "$1\n" while /(?=(.*)?)/g;
Output:
    Hello
    ello
    llo
    lo
    o
Roy Johnson (Monsignor) on May 08, 2009 at 14:39 UTC
Regex bug in 5.10 (was: Using Look-ahead and Look-behind)
I think you have found a bug in 5.10's regex handling. The lookahead's greediness or non-greediness should not matter, because it does not consume any characters. When used in a global match, patterns that do not consume characters should advance one character on each match. At least that's how I read the documentation.
The really interesting thing about your version is that you didn't make the capture non-greedy, you made it optional. You probably meant (.*?), which (in pre-5.10) will output empty strings every time. I haven't installed 5.10 myself, so I can't play with it right now.
Caution: Contents may have been coded under pressure.
almut (Canon) on May 08, 2009 at 15:06 UTC
Re: Regex bug in 5.10 (was: Using Look-ahead and Look-behind)
"...which (in pre-5.10) will output empty strings every time."
With 5.10.0, /(?=(.*?))/g; outputs one empty string. And I can confirm the behavior reported by jds17 with /(?=(.*))/g.
jds17 (Pilgrim) on May 08, 2009 at 19:47 UTC
Re: Regex bug in 5.10 (was: Using Look-ahead and Look-behind)
You are right, my change did not affect greediness. The bad thing is: now I don't understand why my proposed solution worked at all. Maybe someone can explain? I don't think the question is too important, but I like to use regular expressions and it bugs me a little if I cannot understand one (especially such a tiny one).
I have read the documentation you have cited and it helped, so I played around some more and tried out the following, which only exchanges "+" for "*" in your original expression, really works as one would think and therefore would be my preferred solution, at least for Perl 5.10:
    $_ = "Hello";
    print "$1\n" while /(?=(.+))/g;
Output:
    Hello
    ello
    llo
    lo
    o
Anonymous Monk on Jun 25, 2011 at 07:49 UTC
Re: Using Look-ahead and Look-behind
The following is just not working. Basically, I want to match a value that has "equity", but NOT "private equity". The result must be items 1, 2, 4, 5. Please check this out:
    my %hash = (
        1 => 'equity, private equity',
        2 => 'equity',
        3 => 'private equity',
        4 => 'private equity,equity',
        5 => 'private equity, equity',
        6 => 'equity,private equity',
        7 => 'private equity',
        8 => 'mutual funds',
        9 => 'cds'
    );
    while (my ($k, $v) = each %hash) {
        next unless $v =~ m/(?!private\s+)equity/;
        printf("%d -> %s\n", $k, $v);
    }
Anonymous Monk on Jun 25, 2011 at 08:41 UTC
Re^2: Using Look-ahead and Look-behind
Hi, new questions go in Seekers Of Perl Wisdom because Roy Johnson, whom you asked a question, hasn't been here in 6 weeks.
You used code tags and put your code in between, that is awesome :)
Welcome, see How do I post a question effectively?, Where should I post X?
The regex which is not working for you contains a zero-width negative look-ahead assertion, and as perlre #(?!pattern) says:
    A zero-width negative look-ahead assertion. For example /foo(?!bar)/ matches any occurrence of "foo" that isn't followed by "bar". Note however that look-ahead and look-behind are NOT the same thing. You cannot use this for look-behind.
    If you are looking for a "bar" that isn't preceded by a "foo", /(?!foo)bar/ will not do what you want. That's because the (?!foo) is just saying that the next thing cannot be "foo"--and it's not, it's a "bar", so "foobar" will match. Use look-behind instead (see below).
So, use a look-behind.
But that probably won't work either, because you can't have variable-length lookbehind, so you need to use a fixed-width lookbehind.
    #!/usr/bin/perl --
    use strict;
    use warnings;
    use Test::More qw' no_plan ';
    Main(@ARGV);
    exit(0);
    sub Main {
        my @yesWant = (
            'equity, private equity',
            'equity',
            'private equity,equity',
            'private equity, equity',
            'equity,private equity',
        );
        my @notWant = (
            'private equity',
            'private equity',
            'mutual funds',
            'cds',
        );
        for my $not ( @notWant ){
            ok( (not TestEquity($not)), "not '$not'" );
        }
        for my $yes ( @yesWant ){
            ok( TestEquity($yes), "yes '$yes'" );
        }
    }
    sub TestEquity {
        return 1 if $_[0] =~ m/(?<!private\s)equity/;
        return 0;
    }
    __END__
    $ prove -v pm911357.lookbehind.pl
    pm911357.lookbehind.pl ..
    ok 1 - not 'private equity'
    ok 2 - not 'private equity'
    ok 3 - not 'mutual funds'
    ok 4 - not 'cds'
    ok 5 - yes 'equity, private equity'
    ok 6 - yes 'equity'
    ok 7 - yes 'private equity,equity'
    ok 8 - yes 'private equity, equity'
    ok 9 - yes 'equity,private equity'
    1..9
    ok
    All tests successful.
    Files=1, Tests=9, 0 wallclock secs ( 0.06 usr + 0.01 sys = 0.08 CPU)
    Result: PASS
If fixed width lookbehind doesn't work for you, simply do TWO tests
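The post does not show what the two tests would look like. One possible reading, sketched here under that assumption, is to first strip every "private equity" phrase from a copy of the string and then test what remains for "equity". The helper name has_plain_equity is hypothetical and does not come from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread). Step 1 removes every
# "private equity" phrase from a copy; step 2 looks for a leftover "equity".
sub has_plain_equity {
    my ($text) = @_;
    (my $rest = $text) =~ s/private\s+equity//g;
    return $rest =~ /equity/ ? 1 : 0;
}

for my $v ('equity, private equity', 'private equity',
           'private equity,equity', 'mutual funds') {
    printf "%-26s %s\n", "'$v'", has_plain_equity($v) ? 'yes' : 'no';
}
```

Because the first step is an ordinary substitution, it can use \s+ freely, which is exactly what a fixed-width look-behind cannot do.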
AnomalousMonk (Bishop) on Jun 25, 2011 at 19:51 UTC
Re^3: Using Look-ahead and Look-behind
Here's a solution that exactly matches the phrases specified in AnonyMonk's Re: Using Look-ahead and Look-behind post (which the code of Re^2: Using Look-ahead and Look-behind does not quite do), and also shows how to use the newfangled backtracking control verbs of 5.10 to emulate variable-width negative look-behind. Variable-width positive look-behind is emulated by 5.10's \K assertion.
    >perl -wMstrict -le
    "use Test::More 'no_plan';
     ;;
     for my $ar_vector (
       [ YES => 'equity, private equity', ],
       [ YES => 'equity',                 ],
       [ no  => 'private equity',         ],
       [ YES => 'private equity,equity',  ],
       [ YES => 'private equity, equity', ],
       [ no  => 'equity,private equity',  ],
       [ no  => 'private equity',         ],
       [ no  => 'mutual funds',           ],
       [ no  => 'cds'                     ],
       ) {
       my ($expected, $string) = @$ar_vector;
       is match($string), $expected, qq{'$string'};
       }
     ;;
     sub match {
       my ($string) = @_;
     ;;
       my $char_not_comma_or_space = qr{ [^,\s] }xms;
       my $private = qr{ private \s+ }xms;
       return 'YES' if $string =~ m{
           (?: $char_not_comma_or_space | $private) equity (*SKIP)(*FAIL)
           |
           equity (?! , \S)
           }xms;
       return 'no';
       }
    "
    ok 1 - 'equity, private equity'
    ok 2 - 'equity'
    ok 3 - 'private equity'
    ok 4 - 'private equity,equity'
    ok 5 - 'private equity, equity'
    ok 6 - 'equity,private equity'
    ok 7 - 'private equity'
    ok 8 - 'mutual funds'
    ok 9 - 'cds'
    1..9
Explanation:
- Any 'equity' that is preceded by
    - either a character that is not a comma or whitespace, or
    - by the 'private' phrase
  FAILS and is skipped over (this test has first precedence);
- Otherwise, any 'equity' that is not followed by a comma that is then followed by any non-whitespace SUCCEEDS.
Anonymous Monk on Jun 25, 2011 at 10:31 UTC
Re^3: Using Look-ahead and Look-behind
Nice. Very nice! You nailed it. It's working. Thanks a bunch!
heyjoec (Initiate) on Jun 19, 2014 at 11:18 UTC
Re^3: Using Look-ahead and Look-behind
I changed the sub TestEquity to allow for any text between Private and Equity, but I can't get it to work. What have I done wrong?
    sub TestEquity {
        return 1 if $_[0] =~ m/(?<!private).*equity/;
        return 0;
    }
Anonymous Monk on Apr 11, 2007 at 07:25 UTC
Re: Using Look-ahead and Look-behind
Great! This is exactly what I was looking for. Thank you very much!
narainhere (Monk) on Oct 17, 2007 at 12:51 UTC
Re: Using Look-ahead and Look-behind
Thanks a lot DUDEEEEEEE........ Solved my problems ++ Roy Johnson
The world is so big for any individual to conquer
joewong (Initiate) on Nov 12, 2007 at 03:34 UTC
Re: Using Look-ahead and Look-behind
Referring to the paragraph "Matching a pattern that doesn't include another pattern", I wonder why ?: is necessary. It seems to be working for me even without ?:. Please explain. thanks.
Roy Johnson (Monsignor) on Nov 12, 2007 at 17:54 UTC
Re^2: Using Look-ahead and Look-behind
Generally, the decision about using ?: is not about whether it's necessary, but that capturing whatever you're grouping isn't necessary. The ?: modifier makes parentheses not capture, which is somewhat more efficient and might make the task of counting left parentheses less onerous.
By the way, it's not a lookaround feature.
Caution: Contents may have been coded under pressure.
greengaroo (Hermit) on Feb 05, 2013 at 15:08 UTC
Re: Using Look-ahead and Look-behind
Thank you!
Testing never proves the absence of faults, it only shows their presence.
-- greengaroo
aaaone (Initiate) on Jul 18, 2008 at 13:12 UTC
Re: Using Look-ahead and Look-behind
Great article :) Thank you!
Sep 17, 2008 | stackoverflow.com
Michael Carman ,Sep 17, 2008 at 20:58
I need to write a function that receives a string and a regex. I need to check if there is a match and return the start and end location of a match. (The regex was already compiled by qr//.)
The function might also receive a "global" flag and then I need to return the (start,end) pairs of all the matches.
I cannot change the regex, not even add () around it as the user might use () and \1. Maybe I can use (?:).
Example: given "ababab" and the regex qr/ab/, in the global case I need to get back 3 pairs of (start, end).
Nick T ,Sep 8, 2015 at 19:58
The built-in variables @- and @+ hold the start and end positions, respectively, of the last successful match. $-[0] and $+[0] correspond to the entire pattern, while $-[N] and $+[N] correspond to the $N ($1, $2, etc.) submatches.
szabgab ,Sep 17, 2008 at 23:51
Forget my previous post, I've got a better idea.
    sub match_positions {
        my ($regex, $string) = @_;
        return if not $string =~ /$regex/;
        return ($-[0], $+[0]);
    }
    sub match_all_positions {
        my ($regex, $string) = @_;
        my @ret;
        while ($string =~ /$regex/g) {
            push @ret, [ $-[0], $+[0] ];
        }
        return @ret;
    }
This technique doesn't change the regex in any way.
Edited to add: to quote from perlvar on $1..$9. "These variables are all read-only and dynamically scoped to the current BLOCK." In other words, if you want to use $1..$9, you cannot use a subroutine to do the matching.
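As a quick check of the @- / @+ approach against the example from the question, the sketch below repeats match_all_positions so that it runs on its own. The second call, with a user pattern containing its own group and \1, is an added illustration rather than something from the thread.

```perl
use strict;
use warnings;

# Same approach as the answer above: read offsets from @- and @+.
sub match_all_positions {
    my ($regex, $string) = @_;
    my @ret;
    while ($string =~ /$regex/g) {
        push @ret, [ $-[0], $+[0] ];   # start offset, offset just past the end
    }
    return @ret;
}

# The question's example: three (start, end) pairs for qr/ab/ in "ababab".
my @pairs = match_all_positions(qr/ab/, "ababab");
print join(", ", map { "($_->[0],$_->[1])" } @pairs), "\n";

# The user's own groups and backreferences are untouched,
# because nothing is wrapped around the pattern.
my @doubled = match_all_positions(qr/(a)\1/, "xaay");
print join(", ", map { "($_->[0],$_->[1])" } @doubled), "\n";
```

The end offset is exclusive, so substr($string, $start, $end - $start) recovers the matched text.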
Aftershock ,Dec 23, 2012 at 12:13
The pos function gives you the position where the last /g match ended. If you put your regex in parentheses you can get the length (and thus the start) using length $1. Like this:
    sub match_positions {
        my ($regex, $string) = @_;
        # /g in scalar context, so that pos() is set by the match
        return if not $string =~ /($regex)/g;
        my $end = pos($string);
        return ($end - length $1, $end);
    }
    sub all_match_positions {
        my ($regex, $string) = @_;
        my @ret;
        while ($string =~ /($regex)/g) {
            my $end = pos($string);
            push @ret, [$end - length $1, $end];
        }
        return @ret;
    }
zigdon ,Sep 17, 2008 at 20:43
You can also use the deprecated $` variable, if you're willing to have all the REs in your program execute slower. From perlvar:
    $`  The string preceding whatever was matched by the last successful pattern match (not counting any matches hidden within a BLOCK or eval enclosed by the current BLOCK). (Mnemonic: "`" often precedes a quoted string.) This variable is read-only. The use of this variable anywhere in a program imposes a considerable performance penalty on all regular expression matches. See "BUGS".
Shicheng Guo ,Jan 22, 2016 at 0:16
    #!/usr/bin/perl
    # search the positions for the CpGs in human genome
    sub match_positions {
        my ($regex, $string) = @_;
        # /g in scalar context, so that pos() is set by the match
        return if not $string =~ /($regex)/g;
        return (pos($string) - length $1, pos($string) - 1);
    }
    sub all_match_positions {
        my ($regex, $string) = @_;
        my @ret;
        while ($string =~ /($regex)/g) {
            # start offset and inclusive end offset of each match
            push @ret, [(pos($string) - length $1), pos($string) - 1];
        }
        return @ret;
    }
    my $regex  = 'CG';
    my $string = "ACGACGCGCGCG";
    my $cgap   = 3;
    my @pos = all_match_positions($regex, $string);
    my @hgcg;
    foreach my $pos (@pos) {
        push @hgcg, $pos->[1];
    }
    # length of each window spanning $cgap consecutive CG matches
    foreach my $i (0 .. ($#hgcg - $cgap + 1)) {
        my $len = $hgcg[$i + $cgap - 1] - $hgcg[$i] + 2;
        print "$len\n";
    }
Sep 16, 2019 | perldoc.perl.org
Capture groups
The grouping construct ( ... ) creates capture groups (also referred to as capture buffers). To refer to the current contents of a group later on, within the same pattern, use \g1 (or \g{1}) for the first, \g2 (or \g{2}) for the second, and so on. This is called a backreference. There is no limit to the number of captured substrings that you may use. Groups are numbered with the leftmost open parenthesis being number 1, etc. If a group did not match, the associated backreference won't match either. (This can happen if the group is optional, or in a different branch of an alternation.) You can omit the "g", and write "\1", etc., but there are some issues with this form, described below.
You can also refer to capture groups relatively, by using a negative number, so that \g-1 and \g{-1} both refer to the immediately preceding capture group, and \g-2 and \g{-2} both refer to the group before it. For example:
    /
     (Y)            # group 1
     (              # group 2
        (X)         # group 3
        \g{-1}      # backref to group 3
        \g{-3}      # backref to group 1
     )
    /x
would match the same as /(Y) ( (X) \g3 \g1 )/x. This allows you to interpolate regexes into larger regexes and not have to worry about the capture groups being renumbered.
You can dispense with numbers altogether and create named capture groups. The notation is (?<name>...) to declare and \g{name} to reference. (To be compatible with .Net regular expressions, \g{name} may also be written as \k{name}, \k<name> or \k'name'.) name must not begin with a number, nor contain hyphens. When different groups within the same pattern have the same name, any reference to that name assumes the leftmost defined group. Named groups count in absolute and relative numbering, and so can also be referred to by those numbers. (It's possible to do things with named capture groups that would otherwise require (??{}).)
Capture group contents are dynamically scoped and available to you outside the pattern until the end of the enclosing block or until the next successful match, whichever comes first. (See Compound Statements in perlsyn.) You can refer to them by absolute number (using "$1" instead of "\g1", etc.); or by name via the %+ hash, using "$+{name}".
Braces are required in referring to named capture groups, but are optional for absolute or relative numbered ones. Braces are safer when creating a regex by concatenating smaller strings. For example if you have qr/$a$b/, and $a contained "\g1", and $b contained "37", you would get /\g137/ which is probably not what you intended.
The \g and \k notations were introduced in Perl 5.10.0. Prior to that there were no named nor relative numbered capture groups. Absolute numbered groups were referred to using \1, \2, etc., and this notation is still accepted (and likely always will be). But it leads to some ambiguities if there are more than 9 capture groups, as \10 could mean either the tenth capture group, or the character whose ordinal in octal is 010 (a backspace in ASCII). Perl resolves this ambiguity by interpreting \10 as a backreference only if at least 10 left parentheses have opened before it. Likewise \11 is a backreference only if at least 11 left parentheses have opened before it. And so on. \1 through \9 are always interpreted as backreferences. There are several examples below that illustrate these perils. You can avoid the ambiguity by always using \g{} or \g if you mean capturing groups; and for octal constants always using \o{}, or for \077 and below, using 3 digits padded with leading zeros, since a leading zero implies an octal constant.
The \digit notation also works in certain circumstances outside the pattern. See Warning on \1 Instead of $1 below for details.
Examples:
    s/^([^ ]*) *([^ ]*)/$2 $1/;     # swap first two words
    /(.)\g1/                        # find first doubled char
         and print "'$1' is the first doubled character\n";
    /(?<char>.)\k<char>/            # ... a different way
         and print "'$+{char}' is the first doubled character\n";
    /(?'char'.)\g1/                 # ... mix and match
         and print "'$1' is the first doubled character\n";
    if (/Time: (..):(..):(..)/) {   # parse out values
        $hours   = $1;
        $minutes = $2;
        $seconds = $3;
    }
    /(.)(.)(.)(.)(.)(.)(.)(.)(.)\g10/    # \g10 is a backreference
    /(.)(.)(.)(.)(.)(.)(.)(.)(.)\10/     # \10 is octal
    /((.)(.)(.)(.)(.)(.)(.)(.)(.))\10/   # \10 is a backreference
    /((.)(.)(.)(.)(.)(.)(.)(.)(.))\010/  # \010 is octal
    $a = '(.)\1';        # Creates problems when concatenated.
    $b = '(.)\g{1}';     # Avoids the problems.
    "aa" =~ /${a}/;      # True
    "aa" =~ /${b}/;      # True
    "aa0" =~ /${a}0/;    # False!
    "aa0" =~ /${b}0/;    # True
    "aa\x08" =~ /${a}0/; # True!
    "aa\x08" =~ /${b}0/; # False
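The points about named groups, relative backreferences and braces can be checked with a small script. This is a sketch only: the sample strings and the variable names ($pair_rel, $pair_abs, $plain, $braced) are assumptions chosen for illustration, and it needs Perl 5.10 or later for \g and \k.

```perl
use strict;
use warnings;

# Named group: find the first doubled character.
my $doubled = "bookkeeper" =~ /(?<char>.)\k<char>/ ? $+{char} : undef;

# A fragment with a relative backreference keeps working after it is
# interpolated behind another capture group ...
my $pair_rel = qr/(\w)\g{-1}/;
my $rel_ok   = "x-aa" =~ /(x)-$pair_rel/ ? 1 : 0;

# ... whereas an absolute \g1 now refers to the outer (x) group.
my $pair_abs = qr/(\w)\g1/;
my $abs_ok   = "x-aa" =~ /(x)-$pair_abs/ ? 1 : 0;

# Braces protect a backreference from digits appended by concatenation.
my $plain  = '(.)\1';
my $braced = '(.)\g{1}';
my $plain_hit  = "aa0" =~ /${plain}0/  ? 1 : 0;   # \10 is read as octal 010
my $braced_hit = "aa0" =~ /${braced}0/ ? 1 : 0;

print "first doubled character: $doubled\n";
print "relative backref after interpolation: $rel_ok\n";
print "absolute backref after interpolation: $abs_ok\n";
print "unbraced + '0': $plain_hit, braced + '0': $braced_hit\n";
```

The relative form is what makes a qr// fragment safe to reuse inside larger patterns, which is the renumbering problem the text describes.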
Several special variables also refer back to portions of the previous match. $+ returns whatever the last bracket match matched. $& returns the entire matched string. (At one point $0 did also, but now it returns the name of the program.) $` returns everything before the matched string. $' returns everything after the matched string. And $^N contains whatever was matched by the most-recently closed group (submatch). $^N can be used in extended patterns (see below), for example to assign a submatch to a variable.
These special variables, like the %+ hash and the numbered match variables ($1, $2, $3, etc.) are dynamically scoped until the end of the enclosing block or until the next successful match, whichever comes first. (See Compound Statements in perlsyn.)
NOTE: Failed matches in Perl do not reset the match variables, which makes it easier to write code that tests for a series of more specific cases and remembers the best match.
WARNING: If your code is to run on Perl 5.16 or earlier, beware that once Perl sees that you need one of $&, $`, or $' anywhere in the program, it has to provide them for every pattern match. This may substantially slow your program.
Perl uses the same mechanism to produce $1, $2, etc., so you also pay a price for each pattern that contains capturing parentheses. (To avoid this cost while retaining the grouping behaviour, use the extended regular expression (?: ... ) instead.) But if you never use $&, $` or $', then patterns without capturing parentheses will not be penalized. So avoid $&, $', and $` if you can, but if you can't (and some algorithms really appreciate them), once you've used them once, use them at will, because you've already paid the price.
Perl 5.16 introduced a slightly more efficient mechanism that notes separately whether each of $`, $&, and $' have been seen, and thus may only need to copy part of the string. Perl 5.20 introduced a much more efficient copy-on-write mechanism which eliminates any slowdown.
As another workaround for this problem, Perl 5.10.0 introduced ${^PREMATCH}, ${^MATCH} and ${^POSTMATCH}, which are equivalent to $`, $& and $', except that they are only guaranteed to be defined after a successful match that was executed with the /p (preserve) modifier. The use of these variables incurs no global performance penalty, unlike their punctuation character equivalents, however at the trade-off that you have to tell perl when you want to use them. As of Perl 5.20, these three variables are equivalent to $`, $& and $', and /p is ignored.
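A minimal sketch of the /p variables described above, using a made-up "key=value" string. It assumes Perl 5.10 or later; on 5.20 and later the /p flag is accepted but no longer needed.

```perl
use strict;
use warnings;

my ($before, $matched, $after);

# /p asks perl to preserve the pieces around the match.
if ("key=value" =~ /=/p) {
    ($before, $matched, $after) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});
}

print "before='$before' match='$matched' after='$after'\n";
```

Copying the three values into lexicals straight away avoids surprises from the dynamic scoping mentioned earlier, since a later successful match would overwrite them.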
Sep 02, 2019 | stackoverflow.com
mirod ,Jun 15, 2011 at 17:21
I have this regex:
    if($string =~ m/^(Clinton|[^Bush]|Reagan)/i) {print "$string\n"};
I want to match with Clinton and Reagan, but not Bush.
It's not working.
Calvin Taylor ,Jul 14, 2017 at 21:03
Sample text:
    Clinton said
    Bush used crayons
    Reagan forgot
Just omitting a Bush match:
    $ perl -ne 'print if /^(Clinton|Reagan)/' textfile
    Clinton said
    Reagan forgot
Or if you really want to specify:
    $ perl -ne 'print if /^(?!Bush)(Clinton|Reagan)/' textfile
    Clinton said
    Reagan forgot
GuruM ,Oct 27, 2012 at 12:54
Your regex does not work because [] defines a character class, but what you want is a lookahead:
    (?=)    - Positive look ahead assertion   foo(?=bar) matches foo when followed by bar
    (?!)    - Negative look ahead assertion   foo(?!bar) matches foo when not followed by bar
    (?<=)   - Positive look behind assertion  (?<=foo)bar matches bar when preceded by foo
    (?<!)   - Negative look behind assertion  (?<!foo)bar matches bar when NOT preceded by foo
    (?>)    - Once-only subpatterns           (?>\d+)bar Performance enhancing when bar not present
    (?(x))  - Conditional subpatterns         (?(3)foo|fu)bar - Matches foo if 3rd subpattern has matched, fu if not
    (?#)    - Comment                         (?# Pattern does x y or z)
So try: (?!bush)
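The difference between the character class and the look-ahead shows up once the sample text contains names other than the three in the question. In the sketch below, the lines "Carter waved" and "Hoover spoke" are added assumptions, not part of the original sample.

```perl
use strict;
use warnings;

my @lines = ("Clinton said", "Bush used crayons", "Reagan forgot",
             "Carter waved", "Hoover spoke");

# [^Bush] is a character class: ONE character that is not B, u, s or h
# (in either case, because of /i). It says nothing about the word "Bush".
my @by_class = grep { /^(Clinton|[^Bush]|Reagan)/i } @lines;

# Explicit alternation: only the names that are wanted.
my @by_names = grep { /^(Clinton|Reagan)/ } @lines;

# Negative look-ahead: anything that does not start with "Bush".
my @not_bush = grep { /^(?!Bush)/ } @lines;

print "class:      @by_class\n";
print "names:      @by_names\n";
print "look-ahead: @not_bush\n";
```

With these inputs the class version lets "Carter waved" through and rejects "Hoover spoke" only because it begins with an h, which is why the look-ahead or the plain alternation is the safer choice.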
Nov 30, 2017 | stackoverflow.com
Wakan Tanka ,Mar 20, 2015 at 13:17
I've come across the following materials:
- Mastering Perl by brian d foy, chapter: Debugging Regular Expressions.
- Debugging regular expressions, which mentions the re::debug module for perl
I've also tried various other techniques, but still did not get the point of how to read their output. I've also found other modules used for debugging regular expressions here, but I have not tried them yet. Can you please explain how to read the output of use re 'debug' or another command used for debugging regular expressions in perl?
EDIT in reply to Borodin:
1st example:
    perl -Mre=debug -e' "foobar"=~/(.)\1/'
    Compiling REx "(.)\1"
    Final program:
       1: OPEN1 (3)
       3:   REG_ANY (4)
       4: CLOSE1 (6)
       6: REF1 (8)
       8: END (0)
    minlen 1
    Matching REx "(.)\1" against "foobar"
       0 <> <foobar>     |  1:OPEN1(3)
       0 <> <foobar>     |  3:REG_ANY(4)
       1 <f> <oobar>     |  4:CLOSE1(6)
       1 <f> <oobar>     |  6:REF1(8)
                              failed...
       1 <f> <oobar>     |  1:OPEN1(3)
       1 <f> <oobar>     |  3:REG_ANY(4)
       2 <fo> <obar>     |  4:CLOSE1(6)
       2 <fo> <obar>     |  6:REF1(8)
       3 <foo> <bar>     |  8:END(0)
    Match successful!
    Freeing REx: "(.)\1"
- What does OPEN1, REG_ANY, CLOSE1 ... mean ?
- What numbers like 1 3 4 6 8 mean?
- What does number in braces OPEN1(3) mean?
- Which output should I look at, Compiling REx or Matching REx?
2nd example:
    perl -Mre=debugcolor -e' "foobar"=~/(.*)\1/'
    Compiling REx "(.*)\1"
    Final program:
       1: OPEN1 (3)
       3:   STAR (5)
       4:     REG_ANY (0)
       5: CLOSE1 (7)
       7: REF1 (9)
       9: END (0)
    minlen 0
    Matching REx "(.*)\1" against "foobar"
       0 <foobar>|  1:OPEN1(3)
       0 <foobar>|  3:STAR(5)
                      REG_ANY can match 6 times out of 2147483647...
       6 <foobar>|  5:  CLOSE1(7)
       6 <foobar>|  7:  REF1(9)
                        failed...
       5 <foobar>|  5:  CLOSE1(7)
       5 <foobar>|  7:  REF1(9)
                        failed...
       4 <foobar>|  5:  CLOSE1(7)
       4 <foobar>|  7:  REF1(9)
                        failed...
       3 <foobar>|  5:  CLOSE1(7)
       3 <foobar>|  7:  REF1(9)
                        failed...
       2 <foobar>|  5:  CLOSE1(7)
       2 <foobar>|  7:  REF1(9)
                        failed...
       1 <foobar>|  5:  CLOSE1(7)
       1 <foobar>|  7:  REF1(9)
                        failed...
       0 <foobar>|  5:  CLOSE1(7)
       0 <foobar>|  7:  REF1(9)
       0 <foobar>|  9:  END(0)
    Match successful!
    Freeing REx: "(.*)\1"
- Why are numbers descending 6 5 4 3 ... in this example?
- What does failed keyword mean?
Borodin ,Mar 20, 2015 at 14:45
Asking how to use a debugger is very broad. Can you show us the pattern that you are trying to debug, and explain what you don't understand? – Borodin Mar 20 '15 at 14:45
ThisSuitIsBlackNot ,Mar 20, 2015 at 15:38
When you run perl -Mre=debug, you're using the re module; you can see the documentation by running perldoc re. The section on "debug mode" is a bit sparse, but ends with "See 'Debugging regular expressions' in perldebug for additional info." perldoc perldebug is similarly short on details, but ends with "These matters are explored in some detail in 'Debugging regular expressions' in perldebguts." And now we have your answer. – ThisSuitIsBlackNot Mar 20 '15 at 15:38
axblount ,Mar 20, 2015 at 20:25
Regular expressions define finite state machines [1]. The debugger is more or less showing you how the state machine is progressing as the string is consumed character by character.
"Compiling REx" is the listing of instructions for that regular expression. The number in parentheses after each instruction is where to go once the step succeeds. In /(.*)\1/:
    1: OPEN1 (3)
    3:   STAR (5)
    4:     REG_ANY (0)
    5: CLOSE1 (7)
STAR (5) means compute STAR and, once you succeed, go to instruction 5, CLOSE1.
"Matching REx" is the step-by-step execution of those instructions. The number on the left is the total number of characters that have been consumed so far. This number can go down if the matcher has to go backwards because something it tried didn't work.
To understand these instructions, it's important to understand how regular expressions "work." Finite state machines are usually visualized as a kind of flow chart. I have produced a crude one below for /(.)\1/. Because of the back reference to a capture group, I don't believe this regex is a strict finite state machine. The chart is useful nonetheless.
                     Match
    +-------+       Anything        +----------+
    | Start +-----------------------+ State 1  |
    +---^---+                       +--+---+---+
        |                              |   |
        |                              |   | Matched same
        +------------------------------+   | character
             matched different             |
             character                +----+------+
                                      |  Success  |
                                      +-----------+
We start on Start. It's easy to advance to the first state, we just consume any one character (REG_ANY). The only other thing that could happen is end of input. I haven't drawn that here. The REG_ANY instruction is wrapped in the capture group instructions. OPEN1 starts recording all matched characters into the first capture group. CLOSE1 stops recording characters to the first capture group.
Once we consume a character, we sit on State 1 and consume the next char. If it matches the previous char we move to success! REF1 is the instruction that attempts to match capture group #1. Otherwise, we failed and need to move back to the Start to try again. Whenever the matcher says "failed..." it's telling you that something didn't work, so it's returning to an earlier state (that may or may not include 'unconsuming' characters).
The example with * is more complicated. * (which corresponds to STAR) tries to match the given pattern zero or more times, and it is greedy. That means it tries to match as many characters as it possibly can. Starting at the beginning of the string, it says "I can match up to 6 characters!" So, it matches all 6 characters ("foobar"), closes the capture group, and tries to match "foobar" again. That doesn't work! It tries again with 5, that doesn't work. And so on, until it tries matching zero characters. That means the capture group is empty, and matching the empty string always succeeds. So the match succeeds with \1 = "".
I realize I've spent more time explaining regular expressions than I have Perl's regex debugger. But I think its output will become much more clear once you understand how regexes operate.
Here is a finite state machine simulator . You can enter a regex and see it executed. Unfortunately, it doesn't support back references.
1: I believe some of Perl's regular expression features push it beyond this definition but it's still useful to think about them this way.
The debug information contains a description of the bytecode. Numbers denote the node indices in the op tree. Numbers in round brackets tell the engine to jump to a specific node upon match. The EXACT operator tells the regex engine to look for a literal string. REG_ANY means the . symbol. PLUS means the +. Code 0 is for the 'end' node. OPEN1 is a '(' symbol. CLOSE1 means ')'. STAR is a '*'. When the matcher reaches the end node, it returns a success code back to Perl, indicating that the entire regex has matched.
See more details at http://perldoc.perl.org/perldebguts.html#Debugging-Regular-Expressions and a more conceptual http://perl.plover.com/Rx/paper/
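The end state of the two traces in the question can be confirmed without reading the debug output, by inspecting $1 and the match offsets afterwards. This is a small sketch using the same two patterns and the same "foobar" input; the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $input = "foobar";

# /(.)\1/ fails at offset 0 ("f" is not followed by "f"), then succeeds
# at offset 1, where "o" is followed by a second "o".
my ($char, $char_start, $char_end);
if ($input =~ /(.)\1/) {
    ($char, $char_start, $char_end) = ($1, $-[0], $+[0]);
}

# /(.*)\1/ backtracks from 6 captured characters down to 0 and then
# succeeds at offset 0 with an empty capture.
my ($run, $run_start, $run_end);
if ($input =~ /(.*)\1/) {
    ($run, $run_start, $run_end) = ($1, $-[0], $+[0]);
}

print "(.)\\1  captured '$char' at $char_start..$char_end\n";
print "(.*)\\1 captured '$run' at $run_start..$run_end\n";
```

The offsets line up with the left-hand column of the traces: the first pattern ends with 3 characters consumed, the second with 0.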
Nov 30, 2017 | stackoverflow.com
Wakan Tanka ,Mar 20, 2015 at 13:17
I've come across following materials:
- Mastering Perl by brian d foy , chapter: Debugging Regular Expressions.
- Debugging regular expressions which mentions
re::debug
module for perlI've also try to use various another techniques:
but still did not get the point how to read their output. I've also found another modules used for debugging regular expressions here but I did not tried them yet, can you please explain how to read output of
use re 'debug'
or another command used for debugging regular expressions in perl?EDIT in reply to Borodin:
1st example:
perl -Mre=debug -e' "foobar"=~/(.)\1/' Compiling REx "(.)\1" Final program: 1: OPEN1 (3) 3: REG_ANY (4) 4: CLOSE1 (6) 6: REF1 (8) 8: END (0) minlen 1 Matching REx "(.)\1" against "foobar" 0 <> <foobar> | 1:OPEN1(3) 0 <> <foobar> | 3:REG_ANY(4) 1 <f> <oobar> | 4:CLOSE1(6) 1 <f> <oobar> | 6:REF1(8) failed... 1 <f> <oobar> | 1:OPEN1(3) 1 <f> <oobar> | 3:REG_ANY(4) 2 <fo> <obar> | 4:CLOSE1(6) 2 <fo> <obar> | 6:REF1(8) 3 <foo> <bar> | 8:END(0) Match successful! Freeing REx: "(.)\1"
- What does OPEN1, REG_ANY, CLOSE1 ... mean ?
- What numbers like 1 3 4 6 8 mean?
- What does number in braces OPEN1(3) mean?
- Which output should I look at, Compiling REx or Matching REx?
2nd example:
perl -Mre=debugcolor -e' "foobar"=~/(.*)\1/' Compiling REx "(.*)\1" Final program: 1: OPEN1 (3) 3: STAR (5) 4: REG_ANY (0) 5: CLOSE1 (7) 7: REF1 (9) 9: END (0) minlen 0 Matching REx "(.*)\1" against "foobar" 0 <foobar>| 1:OPEN1(3) 0 <foobar>| 3:STAR(5) REG_ANY can match 6 times out of 2147483647... 6 <foobar>| 5: CLOSE1(7) 6 <foobar>| 7: REF1(9) failed... 5 <foobar>| 5: CLOSE1(7) 5 <foobar>| 7: REF1(9) failed... 4 <foobar>| 5: CLOSE1(7) 4 <foobar>| 7: REF1(9) failed... 3 <foobar>| 5: CLOSE1(7) 3 <foobar>| 7: REF1(9) failed... 2 <foobar>| 5: CLOSE1(7) 2 <foobar>| 7: REF1(9) failed... 1 <foobar>| 5: CLOSE1(7) 1 <foobar>| 7: REF1(9) failed... 0 <foobar>| 5: CLOSE1(7) 0 <foobar>| 7: REF1(9) 0 <foobar>| 9: END(0) Match successful! Freeing REx: "(.*)\1"
- Why are numbers descending 6 5 4 3 ... in this example?
- What does failed keyword mean?
Borodin, Mar 20, 2015 at 14:45
Asking how to use a debugger is very broad. Can you show us the pattern that you are trying to debug, and explain what you don't understand?
ThisSuitIsBlackNot, Mar 20, 2015 at 15:38
When you run perl -Mre=debug, you're using the re module; you can see the documentation by running perldoc re. The section on "debug mode" is a bit sparse, but ends with "See 'Debugging regular expressions' in perldebug for additional info." perldoc perldebug is similarly short on details, but ends with "These matters are explored in some detail in 'Debugging regular expressions' in perldebguts." And now we have your answer.
axblount, Mar 20, 2015 at 20:25
Regular expressions define finite state machines [1]. The debugger is more or less showing you how the state machine is progressing as the string is consumed character by character.
"Compiling REx" is the listing of instructions for that regular expression. The number in parentheses after each instruction is where to go once the step succeeds. In /(.*)\1/:
   1: OPEN1 (3)
   3: STAR (5)
   4: REG_ANY (0)
   5: CLOSE1 (7)
STAR (5) means compute STAR and, once you succeed, go to instruction 5, CLOSE1.
"Matching REx" is the step-by-step execution of those instructions. The number on the left is the total number of characters that have been consumed so far. This number can go down if the matcher has to go backwards because something it tried didn't work.
To understand these instructions, it's important to understand how regular expressions "work." Finite state machines are usually visualized as a kind of flow chart. I have produced a crude one below for /(.)\1/. Because of the back reference to a capture group, I don't believe this regex is a strict finite state machine. The chart is useful nonetheless.
            Match
+-------+  Anything  +----------+
| Start +------------+ State 1  |
+---^---+            +--+---+---+
    |                   |   |
    |                   |   | Matched same
    +-------------------+   | character
      matched different     |
      character        +----+------+
                       |  Success  |
                       +-----------+
We start on Start. It's easy to advance to the first state: we just consume any one character (REG_ANY). The only other thing that could happen is end of input. I haven't drawn that here. The REG_ANY instruction is wrapped in the capture group instructions. OPEN1 starts recording all matched characters into the first capture group. CLOSE1 stops recording characters to the first capture group.
Once we consume a character, we sit on State 1 and consume the next char. If it matches the previous char, we move to success! REF1 is the instruction that attempts to match capture group #1. Otherwise, we failed and need to move back to Start to try again. Whenever the matcher says "failed..." it's telling you that something didn't work, so it's returning to an earlier state (that may or may not include 'unconsuming' characters).
The example with * is more complicated. * (which corresponds to STAR) tries to match the given pattern zero or more times, and it is greedy. That means it tries to match as many characters as it possibly can. Starting at the beginning of the string, it says "I can match up to 6 characters!" So, it matches all 6 characters ("foobar"), closes the capture group, and tries to match "foobar" again. That doesn't work! It tries again with 5, and that doesn't work either. And so on, until it tries matching zero characters. That means the capture group is empty, and matching the empty string always succeeds. So the match succeeds with \1 = "".
I realize I've spent more time explaining regular expressions than I have Perl's regex debugger. But I think its output will become much more clear once you understand how regexes operate.
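The end results of the two traces can be checked without reading the debug output. The following is a minimal sketch, assuming a hypothetical helper named match_info (it is not part of the original answer) that returns the first capture together with the start and end offsets taken from Perl's built-in @- and @+ arrays:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): run a pattern against
# a string and report capture group 1 plus the start/end offsets of the match.
sub match_info {
    my ($string, $regex) = @_;
    return unless $string =~ $regex;
    # $-[0] and $+[0] are the start and end offsets of the whole match
    return ($1, $-[0], $+[0]);
}

# /(.)\1/ fails at offset 0 ("f" is not followed by "f"), then succeeds
# at offset 1, where "o" is followed by another "o".
my @single = match_info("foobar", qr/(.)\1/);
print "capture='$single[0]' start=$single[1] end=$single[2]\n";

# /(.*)\1/ backtracks from 6 characters down to 0, so it succeeds at
# offset 0 with an empty capture.
my @star = match_info("foobar", qr/(.*)\1/);
print "capture='$star[0]' start=$star[1] end=$star[2]\n";
```

The offsets agree with the left-hand column of the "Matching REx" listings: the first pattern ends with 3 characters consumed, the second with 0.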
Here is a finite state machine simulator . You can enter a regex and see it executed. Unfortunately, it doesn't support back references.
1: I believe some of Perl's regular expression features push it beyond this definition but it's still useful to think about them this way.
The debug information contains a description of the bytecode. Numbers denote the node indices in the op tree. Numbers in round brackets tell the engine which node to jump to upon a match. The EXACT operator tells the regex engine to look for a literal string. REG_ANY means the . symbol. PLUS means +. Code 0 is for the 'end' node. OPEN1 is a '(' symbol. CLOSE1 means ')'. STAR is a '*'. When the matcher reaches the end node, it returns a success code back to Perl, indicating that the entire regex has matched.
See more details at http://perldoc.perl.org/perldebguts.html#Debugging-Regular-Expressions and a more conceptual treatment at http://perl.plover.com/Rx/paper/
Nov 17, 2017 | www.amazon.com
Regex Modifiers
Several modifiers change the behavior of the regular expression operators. These modifiers appear at the end of the match, substitution, and qr// operators. For example, here's how to enable case-insensitive matching:
my $pet = 'ELLie';
like $pet, qr/Ellie/, 'Nice puppy!';
like $pet, qr/Ellie/i, 'shift key br0ken';
The first like() will fail because the strings contain different letters. The second like() will pass, because the /i modifier causes the regex to ignore case distinctions. L and l are effectively equivalent in the second regex due to the modifier.
You may also embed regex modifiers within a pattern:
my $find_a_cat = qr/(?<feline>(?i)cat)/;
The (?i) syntax enables case-insensitive matching only for its enclosing group -- in this case, the named capture. You may use multiple modifiers with this form. Disable specific modifiers by preceding them with the minus character (-):
my $find_a_rational = qr/(?<number>(?-i)Rat)/;
... ... ...
The /e modifier lets you write arbitrary code on the right side of a substitution operation. If the match succeeds, the regex engine will use the return value of that code as the substitution value. The earlier global substitution example could be simpler with code like the following:
# appease the Mitchell estate
# (the parentheses stop . from binding more tightly than ?:)
$sequel =~ s{Scarlett( O'Hara)?}
            { 'Mauve' . (defined $1 ? ' Midway' : '') }ge;
Each additional occurrence of the /e modifier will cause another evaluation of the result of the expression, though only Perl golfers use anything beyond /ee.
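The three modifier forms described above can be tried side by side. This is a small sketch rather than code from the book; the sample strings 'my CAT' and 'a1 b22 c333' are made up for illustration:

```perl
use strict;
use warnings;

# /i on the whole pattern versus no modifier at all
my $pet    = 'ELLie';
my $exact  = $pet =~ /Ellie/  ? 1 : 0;   # case-sensitive: no match
my $nocase = $pet =~ /Ellie/i ? 1 : 0;   # case-insensitive: match

# (?i) embedded inside a named capture; %+ holds the named captures
my $find_a_cat = qr/(?<feline>(?i)cat)/;
my $feline = 'my CAT' =~ $find_a_cat ? $+{feline} : undef;

# /e evaluates the replacement as Perl code: here, double every number
(my $doubled = 'a1 b22 c333') =~ s/(\d+)/$1 * 2/ge;

print "exact=$exact nocase=$nocase feline=$feline doubled=$doubled\n";
```

Because the replacement under /e is ordinary Perl code, normal operator precedence applies inside it, so parenthesize any ternary that is combined with the concatenation operator.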
Nov 16, 2017 | stackoverflow.com
The match operator in scalar context evaluates to a boolean that indicates whether the match succeeded or not.
my $success = $user =~ /(\d+)/;
The match operator in list context returns the captured strings (or 1 if there are no captures) on success and an empty list on error.
my ($num) = $user =~ /(\d+)/;
You used the former, but you want the latter. That gives you the following (after a few other small fixes):
sub next_level {
    my ($user) = @_;
    my ($num) = $user =~ /(\d+)\z/;
    $user =~ s/\d+\z//g;
    $user .= ++$num;
    return $user;
}
But that approach is complicated and inefficient. Simpler solution:
sub next_level {
    my ($user) = @_;
    $user =~ s/(\d+)\z/ $1 + 1 /e;
    return $user;
}
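To see the difference between the two contexts, the sketch below runs both forms on the same input and then calls the simpler next_level. The value 'level41' is a hypothetical sample, not taken from the original question:

```perl
use strict;
use warnings;

# Copied from the simpler solution above so the sketch is self-contained
sub next_level {
    my ($user) = @_;
    $user =~ s/(\d+)\z/ $1 + 1 /e;
    return $user;
}

my $user = 'level41';                    # hypothetical sample value

my $success = $user =~ /(\d+)/;          # scalar context: true or false
my ($num)   = $user =~ /(\d+)/;          # list context: the captured text

print "success=$success num=$num next=", next_level($user), "\n";
```

The parentheses around $num are what impose list context on the match; without them $num would receive the success flag instead of the digits.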
Nov 16, 2017 | stackoverflow.com
sampath, yesterday
I am trying to remove the old files in a dir if the count is more than 3 over SSH. Kindly suggest how to resolve the issue.
Please refer to the code snippet:
#!/usr/bin/perl
use strict;
use warnings;
my $HOME="/opt/app/latest";
my $LIBS="${HOME}/libs";
my $LIBS_BACKUP_DIR="${HOME}/libs_backups";
my $a;
my $b;
my $c;
my $d;
my $command =qq(sudo /bin/su - jenkins -c "ssh username\@server 'my $a=ls ${LIBS_BACKUP_DIR} | wc -l;my $b=`$a`;if ($b > 3); { print " Found More than 3 back up files , removing older files..";my $c=ls -tr ${LIBS_BACKUP_DIR} | head -1;my $d=`$c`;print "Old file name $d";}else { print "No of back up files are less then 3 .";} '");
print "$command\n";
system($command);
Output:
sudo /bin/su - jenkins -c "ssh username@server 'my ; =ls /opt/app/latest/libs_backups | wc -l;my ; =``;if ( > 3); { print " Found More than 3 back up files , removing older files..";my ; =ls -tr /opt/app/latest/libs_backups | head -1;my ; =``;print "Old file name ";}else { print "No of back up files are less then 3 .";} '"
Found: -c: line 0: unexpected EOF while looking for matching `''
Found: -c: line 1: syntax error: unexpected end of file
janh, yesterday
Are you trying to execute parts of your local perl script in an ssh session on a remote server? That will not work.
simbabque, yesterday
Look into Object::Remote. Here is a good talk by the author from the German Perl Workshop 2014. It will essentially let you write Perl code locally, and execute it completely on a remote machine. It doesn't even matter what Perl version you have there.
simbabque, yesterday
You should also not use $a and $b. They are reserved global variables for sort.
Chris Turner, yesterday
Why are you sudoing when your command is running on an entirely different server?
shawnhcorey, yesterday
Never put sudo or su in a script. This is a security breach. Instead, run the script as sudo or su.
If you have three levels of escaping, you're bound to get it wrong if you do it manually. Use String::ShellQuote's shell_quote instead.
Furthermore, avoid generating code. You're bound to get it wrong! Pass the necessary information using arguments, the environment or some other channel of communication instead.
There were numerous errors in the interior Perl script on top of the fact that you tried to execute a Perl script without actually invoking perl!
#!/usr/bin/perl
use strict;
use warnings;
use String::ShellQuote qw( shell_quote );
my $HOME = "/opt/app/latest";
my $LIBS = "$HOME/libs";
my $LIBS_BACKUP_DIR = "$HOME/libs_backups";
# The remote script is kept in a single-quoted heredoc so nothing is interpolated locally.
my $perl_script = <<'__EOI__';
use strict;
use warnings;
use String::ShellQuote qw( shell_quote );
my ($LIBS_BACKUP_DIR) = @ARGV;
my $cmd = shell_quote("ls", "-tr", "--", $LIBS_BACKUP_DIR);
chomp( my @files = `$cmd` );
if (@files > 3) {
    print "Found more than 3 back up files. Removing older files...\n";
    print "$_\n" for @files;
} else {
    print "Found three or fewer backup files.\n";
}
__EOI__
# Each layer (remote perl, ssh, local sudo) is quoted separately.
my $remote_cmd = shell_quote("perl", "-e", $perl_script, "--", $LIBS_BACKUP_DIR);
my $ssh_cmd = shell_quote("ssh", 'username@server', "--", $remote_cmd);
my $local_cmd = shell_quote("sudo", "su", "-c", $ssh_cmd);
system($local_cmd);
Sep 27, 2017 | www.perlmonks.org
Hello perl-diddler ,
If you look at the documentation for qr//, you'll see that the /g modifier is not supported:
qr/STRING/msixpodualn
-- perlop#Regexp-Quote-Like-Operators
Which makes sense: qr turns STRING into a regular expression, which may then be used in any number of m{...} and s{...}{...} constructs. The appropriate place to add a /g modifier is at the point of use:
use strict;
use warnings;
use P;
my $re = qr{ (\w+) }x;
my $dat = "Just another cats meow";
my @matches = $dat =~ /$re/g;
P "#matches=%s, matches=%s", scalar(@matches), \@matches;
exit scalar(@matches);
Output:
12:53 >perl 1645_SoPW.pl
#matches=4, matches=["Just", "another", "cats", "meow"]
12:54 >
Update:
P.S. - I also just noticed that in addition to stripping out the 'g' option, the 'x' option doesn't seem to work in the regex's parens, i.e. - (?x).
I don't understand what you're saying here. Can you give some example code?
Hope that helps,
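The same idea can be sketched with core Perl only, without the P module used in the reply. The variable names $word and $upper are illustrative assumptions:

```perl
use strict;
use warnings;

# Compile the pattern once; /x is allowed on qr//, /g is not.
my $word = qr{ (\w+) }x;

my $dat = "Just another cats meow";

# /g belongs on the match operator that uses the compiled pattern.
my @matches = $dat =~ /$word/g;
print scalar(@matches), " matches: @matches\n";

# The same compiled pattern can be reused in a substitution.
(my $upper = $dat) =~ s/$word/\U$1/g;
print "$upper\n";
```

The compiled pattern carries its own /x flag when it is interpolated, so only the iteration behavior (/g) has to be supplied at each point of use.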
May 16, 2017 | perldoc.perl.org
function. For example,
$x = "cat dog house"; # 3 words
while ($x =~ /(\w+)/g) {
    print "Word is $1, ends at position ", pos $x, "\n";
}
prints
Word is cat, ends at position 3
Word is dog, ends at position 7
Word is house, ends at position 13
A failed match or changing the target string resets the position. If you don't want the position reset after failure to match, add the //c, as in /regex/gc.
.....
I have to find all the positions of matching strings within a larger string using a while loop, and as a second method using a foreach loop. I have figured out the while loop method, but I am stuck on a foreach method. Here is the 'while' method:
my $sequence = 'AACAAATTGAAACAATAAACAGAAACAAAAATGGATGCGATCAAGAAAAAGATGC' .
               'AGGCGATGAAAATCGAGAAGGATAACGCTCTCGATCGAGCCGATGCCGCGGAAGA' .
               'AAAAGTACGTCAAATGACGGAAAAGTTGGAACGAATCGAGGAAGAACTACGTGAT' .
               'ACCCAGAAAAAGATGATGCNAACTGAAAATGATTTAGATAAAGCACAGGAAGATT' .
               'TATCTGTTGCAAATACCAACTTGGAAGATAAGGAAAAGAAAGTTCAAGAGGCGGA' .
               'GGCTGAGGTAGCANCCCTGAATCGTCGTATGACACTTCTGGAAGAGGAATTGGAA' .
               'CGAGCTGAGGAACGTTTGAAGATTGCAACGGATAAATTGGAAGAAGCAACACATA' .
               'CAGCTGATGAATCTGAACGTGTTCGCNAGGTTATGGAAA';
my $string = <STDIN>;
chomp $string;
while ( $sequence =~ /$string/gi ) {
    printf "Sequence found at position: %d\n", pos($sequence) - length($string);
}
Here is my foreach method:
foreach ( $sequence =~ /$string/gi ) {
    printf "Sequence found at position: %d\n", pos($sequence) - length($string);
}
Could someone please give me a clue on why it doesn't work the same way? Thanks!
My output if I input "aaca":
perl foreach
Part 1 using a while loop
Sequence found at position: 0
Sequence found at position: 10
Sequence found at position: 17
Sequence found at position: 23
Sequence found at position: 377
Part 2 using a foreach loop
Sequence found at position: -4
Sequence found at position: -4
Sequence found at position: -4
Sequence found at position: -4
Sequence found at position: -4
asked Jan 31 '11 at 21:38 by user83598
Using raw input $string in a regexp will act weird if somebody types in special characters (accidentally or maliciously). Consider using /\Q$string/gi to avoid treating $string as a regexp. – aschepler, Jan 31 '11 at 22:15
Accepted answer: Your problem here is context. In the while loop, the condition is in scalar context. In scalar context, the match operator in g mode will sequentially match along the string. Thus checking pos within the loop does what you want.
In the foreach loop, the condition is in list context. In list context, the match operator in g mode will return a list of all matches (and it will calculate all of the matches before the loop body is ever entered). foreach is then loading the matches one by one into $_ for you, but you are never using the variable. pos in the body of the loop is not useful as it contains the result after the matches have ended.
The takeaway here is that if you want pos to work, and you are using the g modifier, you should use the while loop which imposes scalar context and makes the regex iterate across the matches in the string.
Sinan inspired me to write a few foreach examples:
- This one is fairly succinct using split in separator retention mode:
my $pos = 0;
foreach ( split /($string)/i => $sequence ) {
    print "Sequence found at position: $pos\n" if lc eq lc $string;
    $pos += length;
}
- A regex equivalent of the split solution:
my $pos = 0;
foreach ( $sequence =~ /(\Q$string\E|(?:(?!\Q$string\E).)+)/gi ) {
    print "Sequence found at position: $pos\n" if lc eq lc $string;
    $pos += length;
}
- But this is clearly the best solution for your problem:
{
    package Dumb::Homework;
    sub TIEARRAY {
        bless {
            haystack => $_[1],
            needle   => $_[2],
            size     => 2**31 - 1,
            pos      => [],
        }
    }
    sub FETCH {
        my ( $self, $index ) = @_;
        my ( $pos, $needle ) = @$self{qw(pos needle)};
        return $$pos[$index] if $index < @$pos;
        while ( $index + 1 >= @$pos ) {
            unless ( $$self{haystack} =~ /\Q$needle/gi ) {
                $$self{size} = @$pos;
                last
            }
            push @$pos, pos( $$self{haystack} ) - length $needle;
        }
        $$pos[$index]
    }
    sub FETCHSIZE { $_[0]{size} }
}
tie my @pos, 'Dumb::Homework' => $sequence, $string;
print "Sequence found at position: $_\n" foreach @pos; # look how clean it is
The reason it's the best is because the other two solutions have to process the entire global match first, before you ever see a result. For large inputs (like DNA) that could be a problem. The Dumb::Homework package implements an array that will lazily find the next position each time the foreach iterator asks for it. It will even store the positions so you can get to them again without reprocessing. (In truth it looks one match past the requested match; this allows it to end properly in the foreach, but it is still much better than processing the whole list.)
- Actually, the best solution is still to not use foreach as it is not the correct tool for the job.
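Putting the advice from this thread together (a while loop for scalar context, plus \Q to quote the user's input), a minimal sketch might look like the following. The helper name match_positions is hypothetical, and the haystack is just the first 22 characters of the sequence from the question:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original thread): return the start
# offset of every case-insensitive occurrence of $needle in $haystack.
sub match_positions {
    my ($haystack, $needle) = @_;
    my @positions;
    # Scalar context plus /g steps through the matches one at a time;
    # \Q...\E makes the needle match literally.
    while ( $haystack =~ /\Q$needle\E/gi ) {
        push @positions, pos($haystack) - length($needle);
    }
    return @positions;
}

my @found = match_positions('AACAAATTGAAACAATAAACAG', 'aaca');
print "Sequence found at position: $_\n" for @found;
```

Collecting the offsets inside the while loop and returning them as a list lets the caller use foreach afterwards without losing the position information.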
May 16, 2017 | stackoverflow.com
I think that's exactly what the pos function is for.
NOTE: pos only works if you use the /g flag.
my $x = 'abcdefghijklmnopqrstuvwxyz';
my $end = 0;
if ( $x =~ /$ARGV[0]/g ) {
    $end = pos($x);
}
print "End of match is: $end\n";
Gives the following output
[@centos5 ~]$ perl x.pl
End of match is: 0
[@centos5 ~]$ perl x.pl def
End of match is: 6
[@centos5 ~]$ perl x.pl xyz
End of match is: 26
[@centos5 ~]$ perl x.pl aaa
End of match is: 0
[@centos5 ~]$ perl x.pl ghi
End of match is: 9
No, it only works when a match was successful. – tripleee, Oct 10 '11 at 15:24
Sorry, I misread the question. The actual question is very tricky, especially if the regex is more complicated than just /gho/, especially if it contains [ or (. Should I delete my irrelevant answer? – Sodved, Oct 10 '11 at 15:27
I liked the possibility to see an example of how pos works, as I didn't know about it before - so now I can understand why it also doesn't apply to the question; so thanks for this answer! :) – sdaau, Jun 8 '12 at 18:26
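The point that pos only holds a value after a successful match, and how the /c flag changes that, can be sketched as follows. The patterns /def/ and /123/ are arbitrary illustrations:

```perl
use strict;
use warnings;

my $x = 'abcdefghijklmnopqrstuvwxyz';

# Successful /g match: pos() is the offset just past the match.
$x =~ /def/g;
my $after_match = pos($x);

# A failed /g match resets the position, so pos() becomes undef.
$x =~ /123/g;
my $after_fail = pos($x);

# With /gc the position survives a failed match.
$x =~ /def/gc;
$x =~ /123/gc;                        # fails, but pos($x) is kept
my $kept = pos($x);

printf "after match: %d, after failure: %s, with /c: %d\n",
    $after_match, defined $after_fail ? $after_fail : 'undef', $kept;
```

Testing pos with defined, rather than comparing it to 0, separates "no match" from "matched at the very start", which the answer's $end = 0 default cannot do.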
May 16, 2017 | stackoverflow.com
Perl - positions of regex match in string
if ( my @matches = $input_string =~ /$metadata[$_]{"pattern"}/g ) {
    print $-[1] . "\n"; # this gives me error uninitialized ...
}
print scalar @matches; gives me 4, that is ok, but if I use $-[1] to get the start of the first match, it gives me an error. Where is the problem?
EDIT1: How can I get the positions of each match in a string? If I have the string "ahoj ahoj ahoj" and the regexp /ahoj/g, how can I get the positions of the start and end of each "ahoj" in the string?
asked Feb 22 '13 at 20:27 by Krab, edited Feb 22 '13 at 20:40
What error does it give you? – user554546, Feb 22 '13 at 20:29
$-[1] is the position of the 1st subpattern (something in parentheses within the regular expression). You're probably looking for $-[0], the position of the whole pattern? – Scott Lamb, Feb 22 '13 at 20:32
scott lamb: no, I was thinking that if I have the string "ahoj ahoj ahoj", then I can get positions 0, 5, 10 etc. inside $-[n], if the regex is /ahoj/g – Krab, Feb 22 '13 at 20:34
Accepted answer: The array @- contains the offset of the start of the last successful match (in $-[0]) and the offset of any captures there may have been in that match (in $-[1], $-[2] etc.).
There are no captures in your pattern, so only $-[0] is valid, and (in your case) the last successful match is the fourth one, so it will contain the offset of the fourth instance of the pattern.
The way to get the offsets of individual matches is to write
my @matches;
while ( "ahoj ahoj ahoj" =~ /(ahoj)/g ) {
    push @matches, $1;
    print $-[0], "\n";
}
output
0
5
10
Or if you don't want the individual matched strings, then
my @matches;
push @matches, $-[0] while "ahoj ahoj ahoj" =~ /ahoj/g;
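The question also asked for the end of each match. A minimal sketch, assuming a hypothetical helper named match_spans, pairs $-[0] with its counterpart $+[0] from the built-in @+ array:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): collect the
# [start, end] offsets of every match of $regex in $string.
sub match_spans {
    my ($string, $regex) = @_;
    my @spans;
    while ( $string =~ /$regex/g ) {
        # $-[0] is the start of the current match, $+[0] is one past its end
        push @spans, [ $-[0], $+[0] ];
    }
    return @spans;
}

for my $span ( match_spans("ahoj ahoj ahoj", qr/ahoj/) ) {
    print "start=$span->[0] end=$span->[1]\n";
}
```

Here the end offset is exclusive, so substr($string, $start, $end - $start) gives back the matched text.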
perlrequick - perldoc.perl.org
Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.
FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.
This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contain some broken links as it develops like a living tree...
Disclaimer:
The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the Softpanorama society. We do not warrant the correctness of the information provided or its fitness for any purpose. The site uses AdSense so you need to be aware of Google privacy policy. You you do not want to be tracked by Google please disable Javascript for this site. This site is perfectly usable without Javascript.
Last modified: December 30, 2020