nixCraft Linux Forum

Perl / Shell Regex to clean and remove all HTML tags

#1, 27-01-2009, 12:43 AM
cosminnci

Hello,

I have the text below and need to extract all domain names ending in .ro.

I tried this Perl script:

Code:
#!/usr/bin/perl -wnl

$_=~ s/\Q>\E/ /g
and
/.+ *\Q.ro\E/i
and
print $&;
but it still matches characters other than the ones I need.

Code:
<td ALIGN=CENTER style="padding:.75pt;height:1">
<center>
<p><font size="2">92</font></p>
</center>
</td>

<td ALIGN=CENTER style="padding:.75pt;height:1">
<center>
<p><font size="2">domain1.ro<br>
domain2.com.ro<br>
domain3.ro<br>
domain4.ro</font></p>
</center>
</td>
Thanks,
Cosmin
#2, 27-01-2009, 01:28 AM
nixcraft

How about a sed one-liner?
Code:
sed -e :a -e 's/<[^>]*>//g;/</N;//ba' file.html
or
Code:
grep '\.ro' file.html | sed -e :a -e 's/<[^>]*>//g;/</N;//ba'
__________________
Vivek Gite
Linux Evangelist
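
The sed one-liner above strips the tags but still prints every remaining line of text (the "92" cell, for example). A minimal sketch of one way to keep only the .ro names follows. It is not from the thread: the function name extract_ro_domains is hypothetical, the sed expression is changed to replace each tag with a space instead of nothing, and it assumes each domain is delimited by tags or whitespace.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original thread): read HTML on
# standard input and print each token that ends in ".ro", one per line.
extract_ro_domains() {
    # Step 1: replace every tag with a space; while an unclosed "<" remains,
    #         append the next line (N) and retry, so multi-line tags are handled.
    # Step 2: put each whitespace-separated token on its own line.
    # Step 3: keep only tokens made of letters, digits, dots and hyphens
    #         that end in a literal ".ro" (case-insensitive).
    sed -e :a -e 's/<[^>]*>/ /g;/</N;//ba' |
        tr -s ' \t' '\n\n' |
        grep -iE '^[a-z0-9][a-z0-9.-]*\.ro$'
}
```

Usage would be extract_ro_domains < file.html. With the sample from the first post this should print domain1.ro, domain2.com.ro, domain3.ro and domain4.ro, one per line. Text such as "see domain.ro." with trailing punctuation would not match under the whitespace-delimited assumption.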