Help with a regular expression using /s



jackster
12-06-2006, 04:20 PM
I'm trying to write a regular expression to go from one line to the next using the /s modifier. Here is the text that I'm parsing:

<th class="tabletitle">Company Name</th>
<th>Value</th>
</tr>
<tr>
<td class="TDtext">BJ Services Co</td>

This regexp works for the first line up to and including </th>

$body =~ m/^\s+\<th\sclass\=\"tabletitle\"\>(Company\sName)\<\/th>\s+/s

But I can't get to the second line. I tried adding .*\<th at the end, hoping the regexp would continue through the whitespace to the <th on the second line, but that's where it dies.

Any hints on how to include that second line into the expression?
thanks

allyn
12-06-2006, 05:14 PM
You didn't say, but I'm guessing this is Perl.

You need to use \n, \s, or . (the dot only matches a newline when the /s modifier is present) to match the newline character.

example:
$body = "1\n2";
$body =~ s/1\n2/3\n4/s;
print $body, "\n";
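
To make the point above concrete, here is a minimal sketch that checks which of the three forms matches the newline in the same "1\n2" string. The variable names are only for illustration.

```perl
use strict;
use warnings;

my $body = "1\n2";

# \n and \s match a newline with or without the /s modifier.
my $with_n = $body =~ /1\n2/ ? 1 : 0;
my $with_s = $body =~ /1\s2/ ? 1 : 0;

# A dot matches a newline only when the /s modifier is present.
my $dot_plain = $body =~ /1.2/  ? 1 : 0;
my $dot_s     = $body =~ /1.2/s ? 1 : 0;

print "\\n: $with_n, \\s: $with_s, dot: $dot_plain, dot with /s: $dot_s\n";
```

Only the plain dot without /s should report 0 here.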

jackster
12-06-2006, 06:29 PM
Thanks Allyn,

So are you saying that in my example:

$body =~ m/^\s+\<th\sclass\=\"tabletitle\"\>(Company\sName)\<\/th>\s+/s

I would need to add a \n to get to the beginning of the second line that starts with <th>, like this?

$body =~ m/^\s+\<th\sclass\=\"tabletitle\"\>(Company\sName)\<\/th>\n\<th\>/s

jalal
12-07-2006, 02:39 AM
You are using the 's' modifier at the end of the pattern, which should work. You could also try the 'm' modifier. The 'm' modifier lets ^ and $ match at the start and end of each line inside the string, that is, next to an embedded \n. The 's' modifier does not change ^ and $, but it lets . match \n.
(The 'm' at the start of the pattern is the match operator; the modifiers go at the end.)
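
A small sketch of the difference between the two modifiers, using a made-up two-line string rather than the real page:

```perl
use strict;
use warnings;

# Hypothetical two-line string, used only to show the modifiers.
my $text = "first\nsecond";

# Without /m, ^ matches only at the very start of the string.
my $plain = $text =~ /^second/ ? 1 : 0;

# With /m, ^ also matches right after an embedded newline.
my $multi = $text =~ /^second/m ? 1 : 0;

# With /s, the dot may cross the newline; ^ and $ keep their usual meaning.
my $dot_with_s    = $text =~ /^first.second$/s ? 1 : 0;
my $dot_without_s = $text =~ /^first.second$/  ? 1 : 0;

print "plain: $plain, /m: $multi, /s: $dot_with_s, no /s: $dot_without_s\n";
```

The two modifiers are independent, so /sm can be used when both behaviours are wanted.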

I'm not sure what you are trying to do otherwise I might get more specific. Do you have a table of company names and values?

jackster
12-07-2006, 06:47 AM
Thanks so much for your help Jalal.

I am using a Perl script to grab the contents of a web page and pull out a list of company names. The problem is, I can easily find the string "Company Name", which is a constant, but to get the actual company names from there I have to parse through several HTML lines that are all indented and start with <th. This is where I'm having the trouble. I have tried every possible combination of /s, /m and /sm in hopes of matching the beginning of the second line, with no luck. My original post has the exact HTML code that I'm trying to sort through.

jackster

jalal
12-07-2006, 11:35 AM
Wouldn't it be simpler to go through the page line by line?

Get the page.
Search through the lines until you get to 'Company Name' (or 'class="tabletitle"')
Then run a regexp on each line looking for 'class="TDtext"'
Until you get to the end of the table.

Just an idea.
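
The steps above could be sketched roughly like this. The sample text is the snippet from the first post; the second row ("Example Corp"), the closing </table> tag, and the variable names are assumptions added for illustration, since the real page was not posted in full.

```perl
use strict;
use warnings;

# Snippet from the first post, plus a hypothetical second row and
# a hypothetical closing </table> tag.
my $page = <<'EOF';
    <th class="tabletitle">Company Name</th>
    <th>Value</th>
  </tr>
  <tr>
    <td class="TDtext">BJ Services Co</td>
  </tr>
  <tr>
    <td class="TDtext">Example Corp</td>
  </tr>
</table>
EOF

my @names;
my $in_table = 0;

for my $line (split /\n/, $page) {
    if (!$in_table) {
        # Skip lines until the "Company Name" header shows up.
        $in_table = 1 if $line =~ /class="tabletitle">Company Name</;
        next;
    }
    # Stop at the end of the table.
    last if $line =~ m{</table>};
    # Collect the text of every TDtext cell after the header.
    push @names, $1 if $line =~ m{class="TDtext">([^<]*)</td>};
}

print "$_\n" for @names;
```

Because each line is tested on its own, no /s or /m modifier is needed with this approach.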

jackster
12-07-2006, 07:41 PM
That's pretty much what I'm doing by matching on Company Name and then using a regular expression from there to get the next several lines. Any chance you can help me get from the first line of my original post to the second?

thanks

blender
12-08-2006, 10:08 AM
You are on the right track here, but your regular expression needs something extra. In your original post, your regex is:

m/^\s+\<th\sclass\=\"tabletitle\"\>(Company\sName)\<\/th>\s+/s

You are giving it the 's' flag so that the match can span multiple lines (with 's', the . also matches newlines), but the regex itself only asks for the contents of the first line plus one or more whitespace characters.

If you change the regex to ask for more than what is found on just the first line, you will see that it spans multiple lines as intended:

m/^\<th\sclass\=\"tabletitle\"\>(Company\sName)\<\/th>.*TDtext\"\>(.*)\<\/td\>/s

with the results being:

$1 = 'Company Name' and $2 = 'BJ Services Co'

For reference, here is the code I used to test this:

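#!/usr/bin/perl

use strict;
use warnings;

# Test input: the HTML snippet from the original post.
my $string =<<"EOF";
<th class="tabletitle">Company Name</th>
<th>Value</th>
</tr>
<tr>
<td class="TDtext">BJ Services Co</td>
EOF

# With /s the .* can cross newlines, so the match reaches the TDtext cell.
if($string =~ m/^\<th\sclass\=\"tabletitle\"\>(Company\sName)\<\/th>.*TDtext\"\>(.*)\<\/td\>/s) {
    print "MATCH = '$1' '$2'\n";
}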

Hope that helps.

-blender
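
One caveat worth sketching: with /s, a greedy .* runs as far as it can, so on a page with several rows the pattern above would capture the last TDtext cell rather than the first. The example below assumes a hypothetical second row ("Example Corp") and shows a non-greedy variant, plus a /g match in list context that collects every cell. It is a sketch under those assumptions, not a tested recipe for the real page.

```perl
use strict;
use warnings;

# Snippet from the thread plus a hypothetical second row.
my $string = <<'EOF';
<th class="tabletitle">Company Name</th>
<th>Value</th>
</tr>
<tr>
<td class="TDtext">BJ Services Co</td>
</tr>
<tr>
<td class="TDtext">Example Corp</td>
EOF

# Greedy: .* backtracks from the end, so the LAST cell is captured.
my ($greedy) = $string =~ m/Company\sName<\/th>.*TDtext">(.*)<\/td>/s;

# Non-greedy: .*? stops at the first opportunity, so the FIRST cell is captured.
my ($lazy) = $string =~ m/Company\sName<\/th>.*?TDtext">(.*?)<\/td>/s;

# /g in list context returns the capture from every match.
my @all = $string =~ m/TDtext">(.*?)<\/td>/gs;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
print "all:    @all\n";
```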

jackster
12-10-2006, 06:29 PM
Thanks a lot, Blender, for taking the time to look at that regexp.
I was able to come up with a workaround in the meantime.

I downloaded and installed HTML::Parse from CPAN and used it to grab the page I want and remove all tags. When I did that, all the text was reduced to one huge line instead of many small individual lines. I was able to easily parse through it and grab what I needed.
I will try your suggestion on my test server when I get back to work.

thanks again!

Jackster