nixCraft Linux Forum

ISO8859-1 to UTF-8 script wanted



  #1 (permalink)  
Old 06-02-2007, 06:12 PM
Junior Member
User
 
Join Date: Feb 2007
Posts: 19
Rep Power: 0
meowing
Default ISO8859-1 to UTF-8 script wanted

It seems I can use iconv for this. When I do
# iconv --from-code=ISO-8859-1 --to-code=UTF-8 ./oldfile.htm > ./newfile.htm
it gives the right result for my webserver's config.

I have huge folders with htm and txt files to do this for, so I would like to not have to rename them by hand.
Plus I would like the script to check for certain characters in the files, in order to decide if a file should be converted or not.

I found this script somewhere:
Code:
#!/bin/bash

  for i in $* ; do
    echo "Converting $i ..."
    mv -i $i $i.bak
    iconv -f ISO8859-1 -t UTF-8 $i.bak >$i
  done
But it doesn't do anything when I run it. Also, I notice the dash in ISO-8859-1 is missing; does that matter?

How do I add a character-search to it?

Only if my documents (only *.txt and *.ht*) contain
"é" or "à" or "è" or "ï" or "ë" or "ó"
they should be converted to UTF-8, if not they should be left alone.

I have done this before, but unfortunately lost the script and couldn't find much about it online. For example, how do I determine the character's presence? I have a separate backup of all the files, so a total replace script would do.
Any help or examples very much appreciated.
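
On the missing dash: iconv implementations usually accept several spellings of the same charset, but the list of aliases depends on the system. A minimal sketch for checking what the local iconv accepts, by converting empty input under each name (charset_known is a hypothetical helper name, not part of iconv):

```shell
#!/bin/bash
# Sketch: ask the local iconv whether it accepts a given charset name.
# charset_known is a hypothetical helper name.
charset_known() {
    # Converting empty input succeeds only if iconv recognises the name.
    iconv -f "$1" -t UTF-8 < /dev/null > /dev/null 2>&1
}

for name in ISO-8859-1 ISO8859-1; do
    if charset_known "$name"; then
        echo "$name: accepted"
    else
        echo "$name: not recognised"
    fi
done
```

If both spellings are reported as accepted, the dash makes no difference on that machine.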

Last edited by meowing; 06-02-2007 at 06:39 PM..
  #2 (permalink)  
Old 06-03-2007, 12:00 AM
Member
User
 
Join Date: Jun 2005
Posts: 78
Rep Power: 0
jerry
Default

Corrected code:
Code:
#!/bin/bash
mypath="$1"
  for i in "$mypath"
  do
    echo "Converting $i ..."
    mv -i $i $i.bak
    iconv -f ISO8859-1 -t UTF-8 $i.bak >$i
  done
Set the execute permission on the script and convert all HTML files:
Code:
./script *.html
Quote:
"é" or "à" or "è" or "ï" or "ë" or "ó"
You can use the egrep command and test its exit status:
Code:
egrep -q "é|à|è|ï|ë|ó" "$i"
if [ $? -eq 0 ]; then # found
  echo "$i needs converting" # do something or call above script
fi
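
The check above can also be sketched as a small function. This version assumes the files really are ISO-8859-1, where each of the six letters is a single byte, and matches those bytes by value so that the encoding of the terminal or of the script file does not matter. has_accents is a hypothetical name.

```shell
#!/bin/bash
# Sketch: report whether a file contains any of the six accented letters,
# assuming the file is ISO-8859-1 (one byte per letter).
# has_accents is a hypothetical name.
has_accents() {
    # Byte values in octal: é=351 à=340 è=350 ï=357 ë=353 ó=363
    local pattern=$'[\351\340\350\357\353\363]'
    # LC_ALL=C makes grep compare raw bytes; -q prints nothing and only
    # sets the exit status (0 = at least one match).
    LC_ALL=C grep -q -- "$pattern" "$1"
}
```

Inside the loop it could be used as `if has_accents "$i"; then ... fi`, which tests the exit status directly instead of going through `$?`. Note that the same byte values can also occur inside multi-byte UTF-8 sequences, so a match is a hint that conversion is needed, not proof of the encoding.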

Last edited by jerry; 06-03-2007 at 12:00 AM.. Reason: code typo fixed
  #3 (permalink)  
Old 06-03-2007, 03:16 AM
Junior Member
User
 
Join Date: Feb 2007
Posts: 19
Rep Power: 0
meowing
Default

Quote:
Originally Posted by jerry View Post
Correct code..
Code:
#!/bin/bash
mypath="$1"
  for i in "$mypath"
  do
    echo "Converting $i ..."
    mv -i $i $i.bak
    iconv -f ISO8859-1 -t UTF-8 $i.bak >$i
  done
Set the execute permission on the script and convert all HTML files:
Code:
./script *.html
It only converts just 1 file each time I run it. The wildcard (*) does not seem to work. How do I get it to convert ALL files in a directory?

Also, I think egrep does not see the correct characters in the files. I should probably use codes instead of the characters as is.
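
On the single-file behaviour described above: the shell expands `*.html` into one argument per file before the script starts, and the quoted script copies only `$1`, the first of them, into mypath. A minimal sketch of the same loop running over every argument instead (convert_files is a hypothetical name):

```shell
#!/bin/bash
# Sketch: same conversion as the quoted script, but looping over every
# argument. convert_files is a hypothetical name.
convert_files() {
    local i
    # "$@" expands to all arguments, each kept as one word, so names
    # containing spaces survive as well.
    for i in "$@"; do
        echo "Converting $i ..."
        # Only run iconv if the backup copy was actually made.
        mv -i -- "$i" "$i.bak" &&
            iconv -f ISO-8859-1 -t UTF-8 "$i.bak" > "$i"
    done
}
```

Saved as a script with `convert_files "$@"` as its last line, `./script *.html` would hand every matching file to the loop. The .bak copies are left in place, as in the original.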
  #4 (permalink)  
Old 07-23-2007, 09:51 PM
Junior Member
User
 
Join Date: Feb 2007
Posts: 19
Rep Power: 0
meowing
Default

Your script, for some reason, did not work.
This one does:
Code:
#!/bin/bash
FROM=iso-8859-1
TO=UTF-8
ICONV="iconv -f $FROM -t $TO"
# Convert
find /some/folder/ -type f -name "*" | while read fn; do
cp ${fn} ${fn}.bak
$ICONV < ${fn}.bak > ${fn}
rm ${fn}.bak
done
Here /some/folder/ is the folder that holds all the files that need conversion.
Save the script as something.sh and run it from the command line; that will do the job.
You can change the FROM and TO variables to your liking.
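
One thing to watch with a script like this: it converts every file unconditionally, so running it a second time over the same folder re-encodes files that are already UTF-8 (an é ends up as Ã©). A hedged sketch of a guard, assuming that a file which already passes as valid UTF-8 should be left alone; looks_like_utf8 and convert_once are hypothetical names:

```shell
#!/bin/bash
# Sketch: guard against converting a file twice. looks_like_utf8 and
# convert_once are hypothetical names.
looks_like_utf8() {
    # iconv exits non-zero when the input is not valid UTF-8.
    iconv -f UTF-8 -t UTF-8 < "$1" > /dev/null 2>&1
}

convert_once() {
    local fn=$1
    if looks_like_utf8 "$fn"; then
        echo "Leaving alone (already valid UTF-8 or plain ASCII): $fn"
    else
        # Same copy / convert / remove sequence as the script above,
        # with the file name quoted.
        cp -- "$fn" "$fn.bak" &&
            iconv -f ISO-8859-1 -t UTF-8 < "$fn.bak" > "$fn" &&
            rm -- "$fn.bak"
    fi
}
```

Plain ASCII files count as valid UTF-8 here, which is harmless because converting them would not change a byte. ISO-8859-1 text that happens to also be valid UTF-8 is rare but possible, so this is a heuristic, not a guarantee.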
  #5 (permalink)  
Old 09-20-2007, 02:24 AM
Junior Member
User
 
Join Date: Sep 2007
My distro: Debian
Posts: 2
Rep Power: 0
gary_johnson_53 is on a distinguished road
Default Files with spaces in the names

I have files with spaces in the names

Example: à blanc.txt

How do I modify this script to handle this case?


cp: target `blanc.txt.bak' is not a directory
testiconv: line 8: CB2/TermsDictionaryGlosary/ZZAcented/à blanc.txt.bak: No such file or directory
  #6 (permalink)  
Old 09-22-2007, 10:47 PM
Junior Member
User
 
Join Date: Sep 2007
My distro: Debian
Posts: 2
Rep Power: 0
gary_johnson_53 is on a distinguished road
Default I used Bash's Internal Field Separator

# set Bash's Internal Field Separator (IFS) to just a newline instead of space, tab and newline
IFS="
"
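
An alternative that leaves IFS alone is to quote every expansion and pass the file names NUL-terminated. The following is a minimal sketch, not the script from this thread: convert_tree is a hypothetical name, the `*.txt` and `*.ht*` filter is taken from the first post, and the temporary file from mktemp replaces the .bak copy.

```shell
#!/bin/bash
# Sketch: convert files under a directory from ISO-8859-1 to UTF-8,
# coping with spaces (and other odd characters) in file names.
# convert_tree is a hypothetical name.
convert_tree() {
    local dir=$1 fn tmp
    # Scratch file outside the tree, so find never sees it.
    tmp=$(mktemp) || return 1
    # -print0 and read -d '' pass names NUL-terminated, so nothing in a
    # file name can be mistaken for a separator.
    find "$dir" -type f \( -name '*.txt' -o -name '*.ht*' \) -print0 |
    while IFS= read -r -d '' fn; do
        # Convert into the scratch file first; overwrite the original
        # only if iconv succeeded.
        if iconv -f ISO-8859-1 -t UTF-8 < "$fn" > "$tmp"; then
            cat "$tmp" > "$fn"
        else
            echo "Skipped (conversion failed): $fn" >&2
        fi
    done
    rm -f -- "$tmp"
}
```

Called as `convert_tree /some/folder/`, this should handle a name like `à blanc.txt` without the cp error shown earlier, because `"$fn"` is never split into separate words.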