13th February 2008, 08:55 PM
manishkochar

Quote:
Originally Posted by agn
Code:
for tag in hostAddress username password instancecount
do
    grep  $tag in.xml | tr -d '\t' | sed 's/^<.*>\([^<].*\)<.*>$/\1/'
done
Something like the above might help. I don't use bash, so don't know how arrays are populated.
The sed expression was the most complex part; stuffing things into an array is easy.

Code:
#!/bin/bash

for tag in hostAddress userName password instanceCount
do
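OUT=$(grep "$tag" in.xml | tr -d '\t' | sed 's/^<.*>\([^<].*\)<.*>$/\1/')

# This is what I call the eval_trick: eval builds and runs the assignment
# "hostAddress=$OUT", "userName=$OUT", ... so each tag gets its own variable.
# Escaping the $ delays expansion, so odd characters in a password are not executed.
eval "${tag}=\$OUT"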
done

# So let's stuff the obtained results into 4 different Arrays

H_ARRAY=( $hostAddress )
U_ARRAY=( $userName )
P_ARRAY=( $password )
I_ARRAY=( $instanceCount )

# OK, time to announce success: let's print out each of the arrays

echo "${H_ARRAY[@]}"
echo "${U_ARRAY[@]}"
echo "${P_ARRAY[@]}"
echo "${I_ARRAY[@]}"

# For the benefit of agn - 
# We can now refer to each unique element of the array like this -

echo "${H_ARRAY[0]}"

# The above prints the first item in array H_ARRAY
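
The loop above assigns to a variable whose name is held in $tag. For anyone wary of eval, the same thing can probably be done with printf -v and indirect expansion. This is only a sketch: it assumes bash 3.1 or later, and the two addresses are made-up stand-ins for whatever grep and sed would return.

```shell
#!/bin/bash
# Hypothetical values standing in for what the grep/sed pipeline extracts.
tag=hostAddress
OUT=$'10.0.0.1\n10.0.0.2'

# Assign to the variable whose name is stored in $tag, without eval.
printf -v "$tag" '%s' "$OUT"

# Read it back through indirect expansion; the unquoted expansion
# splits on whitespace, giving one array element per value.
H_ARRAY=( ${!tag} )
echo "${#H_ARRAY[@]} entries, first is ${H_ARRAY[0]}"
```

With the sample values this prints "2 entries, first is 10.0.0.1".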
I chanced upon this thread because I am trying to do a similar project.
The specs look rather challenging for my poor knowledge of sed.
So let's see if agn can crack this one too!

I want to create a list of web-sites that definitely contain pornographic or adult content that is not suitable for kids at school.
I can see that dmoz offers its data in XML format.
I also noticed that the xml file contains descriptive information about each web-site.

Now this is what I want to do:
A shell script in which I specify the look_up_string (via PCRE, of course).
Based on the look_up_string, I want to collect the names of matching web-sites in a file. I don't want the whole URL; just the hostname is enough.
Later I will add each hostname to my hosts file, to ensure effective blocking of these sites.
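
To make the idea concrete, here is a rough sketch of the shape I have in mind. It rests on assumptions I have not verified against the real dump: that each site is an <ExternalPage about="URL"> record with a <d:Description> line inside it, and that the tag names are exactly those. list_hosts is just a name I made up, and the pattern is an awk extended regex rather than PCRE, since awk makes it easy to remember the URL of the current record.

```shell
#!/bin/bash
# Sketch only. Assumed record layout (not checked against the real file):
#   <ExternalPage about="http://host/path">
#     <d:Description>some text</d:Description>
#   </ExternalPage>
# list_hosts is a hypothetical helper: list_hosts PATTERN FILE
list_hosts() {
    local pattern=$1 dump=$2
    awk -v pat="$pattern" '
        # Remember the URL of the record we are currently inside.
        /<ExternalPage about="/ {
            url = $0
            sub(/.*<ExternalPage about="/, "", url)
            sub(/".*/, "", url)
        }
        # Descriptions are lower-cased first, so write the pattern in lower case.
        /<d:Description>/ && tolower($0) ~ pat {
            host = url
            sub("^[a-z]+://", "", host)   # drop the scheme
            sub("[/:?#].*", "", host)     # drop path, port and query
            if (host != "") print host
        }
    ' "$dump" | sort -u
}
```

It would be called as list_hosts 'adult' dump.xml > blocked_hosts.txt, where dump.xml stands for the downloaded file. Each line of the output could then be prefixed with 127.0.0.1 before it goes into the hosts file.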

Could anybody help on this?