Sanitizing Wordpress UTF-8 – or Howto get rid of mixed Latin1 and UTF8 mysql exports
Actually my attention comes to some weird characters in my wordpress blog. Such as ü or ö as represantants of ä and ö. So i had a look into my mysql-db and saw that it was still on latin1. On my way to the clearance i got over that explanation. But still all howtos doesn’t work out for me. There are others allready had a look into it like fischerlander or Haidong’s Blog.
The Problem is more complex as it may look in the first place. Wordpress or PHPBB put the post in the database as they receive it from the Browser. While you can set your Browser on a special Characterset (iso-8859-1) – rather than auto-recognition – and ignore so the servers needs the result are this awkward entries. And furthermore i had the problem that the users had used Windows-1250, UTF-16 and iso-8859-1 mixed.
Startin’ to solve this problem i wrote first this little perl script to generate the awkward chars:
#!/usr/bin/perl use strict; use Encode qw/encode decode/; my @list = ( 0x00 .. 0xFFFF ); # my $letter = "ä"; open CSV, ">lat_new.csv"; foreach my $letter (@list){ $letter = chr($letter); print CSV $letter.";;"; my $utf8 = encode("UTF-8", $letter); print CSV $utf8.";;"; my $utf16 = encode("UTF-16", $letter); print CSV $utf16."\n"; } close CSV;
As this results in a big list and 99% of it is useless i just tried to identify the chars from the several codings that leads to my personal abuse. This results in the following little list of latin-code-combination and UTF-8 representants. It is very likely, if you have chinese or russian characters, you will have a slightly other list
…
ö;ö;Ã;ß;ü;ü;ö;ö;ü;ü;–;–;ö;ö;ü;ü; ; ;¡;¡;¢
;¢;£;£;¤;¤;Â¥;¥;¦;¦;§;§;¨;¨;©;©;ª;ª;«;«;¬;¬;Â;;
®;®;¯;¯;°;°;±;±;²;²;³;³;´;´;µ;µ;¶;¶;·;·;¸;¸;¹;¹;º;º;»;»;
¼;¼;½;½;¾;¾;¿;¿;À;À;Ã;Á;Â;Â;Ã;Ã;Ä;Ä;Ã…;Å;Æ;
Æ;Ç;Ç;È;È;É;
É;Ê;Ê;Ë;Ë;ÃŒ;Ì;Ã;Í;ÃŽ;Î;Ã;Ï;Ã;Ð;Ñ;Ñ;Ã’;Ò;Ó;Ó;Ô;Ô;Õ;Õ;Ö;Ö;×;×;Ø;Ø;Ù;Ù;Ú;Ú;Û;Û;Ü;Ü;Ã;Ý;Þ;Þ;ß;ß;à ;à;á;á;â;â;
ã;ã;ä;ä;Ã¥;å;æ;æ;ç;ç;è;è;é;é;ê;ê;ë;ë;ì;ì;Ã;í;î;î;ï;ï;ð;
ð;ñ;ñ;ò;ò;ó;ó;ô;ô;õ;õ;ö;ö;÷;÷;ø;ø;ù;ù;ú;ú;û;û;
ü;ü;ý;ý;þ;þ;ÿ;ÿ;
I used a little perl-script to sanitize my mysql-export and to dump it into a new file. For security purposes i created a new database in utf8.
#!/usr/bin/perl use strict; open CSV, "<my_csv.csv"; #List from above $li =~ s/\n//; close CSV; # use strict; # use Encode qw/encode decode/; # my @list = ( 0x00 .. 0xFFFE ); # my $letter = "ä"; # my %lat_utf; #open CSV, ">lat_new.csv"; # foreach my $letter (@list){ # $letter = chr($letter); # print CSV $letter.";;"; # my $utf8 = encode("UTF-8", $letter); # $lat_utf{$utf8} = $letter; # print CSV $utf8.";;"; # my $utf16 = encode("UTF-16", $letter); # print CSV $utf16."\n"; # } # close CSV; my %lat_utf = split(/;/,$li); #print %lat_utf; open FILE, "<mydatabase.csv"; my @lines = <FILE>; close FILE; foreach my $key (keys %lat_utf){ # print $key. "\n"; #@lines = map{ s/$key/$lat_utf{$key}/g } (@lines); my @nel; foreach my $lin (@lines){ # print $lin."\n"; if ( $lin =~ /$key/g ) { print $lin; } $lin =~ s/$key/$lat_utf{$key}/g; # print $lin."\n"; push @nel, $lin; } @lines = @nel; } open FILE, ">wdrede_kjr-UTF-8_fixed_new.sql" or die $!; foreach my $lin (@lines){ print FILE $lin; } close FILE;
This results in my special case to a more or less clean wordpress-blog. BUT i realy would think that wordpress and phpbb should check the browser settings and in what codepage the user had put in the comment or content. Because – otherwise it wouldn’t help much to have this converted over a long time…
Am 20. Januar 2010 um 20:42 Uhr
ßääö+ü
Am 20. Januar 2010 um 20:45 Uhr
The trick of keeping it clean is described here:
http://stackoverflow.com/questions/708915/detecting-the-character-encoding-of-an-http-post-request
At least every Form-Tag has to be enhanced by <form accept-charset=”UTF-8″>
urgh…
Am 21. Januar 2010 um 10:45 Uhr
ßöäü€
Am 21. Januar 2010 um 10:45 Uhr
Am 3. Februar 2010 um 13:38 Uhr
Problem: Fucking Browser does not all interpret the accept-charset correctly… still getting worse entries in the comments.
Am 3. Februar 2010 um 13:39 Uhr
ßöäü + Test ß:ss ä:ae ö:oe Ö:oe
Am 11. Februar 2010 um 10:07 Uhr
Espacially Internet Explorer until version 8 are accountable for the worse entries. This Browser doen’t accept the form accept-charset entries and do whatever they want.
I am looking now for some Javascript to give at least the comment form the knowledge of the browser codepage.