Sanitizing WordPress UTF-8 – or Howto get rid of mixed Latin1 and UTF8 mysql exports

Actually my attention comes to some weird characters in my wordpress blog. Such as ü or ö as represantants of ä and ö. So i had a look into my mysql-db and saw that it was still on latin1. On my way to the clearance i got over that explanation. But still all howtos doesn’t work out for me. There are others allready had a look into it like fischerlander or Haidong’s Blog.

The Problem is more complex as it may look in the first place. WordPress or PHPBB put the post in the database as they receive it from the Browser. While you can set your Browser on a special Characterset (iso-8859-1) – rather than auto-recognition – and ignore so the servers needs the result are this awkward entries. And furthermore i had the problem that the users had used Windows-1250, UTF-16 and iso-8859-1 mixed.

Startin‘ to solve this problem i wrote first this little perl script to generate the awkward chars:

#!/usr/bin/perl

  use strict;
  use Encode qw/encode decode/;
  my @list = ( 0x00 .. 0xFFFF );
#  my $letter = "ä";

open CSV, ">lat_new.csv";
  foreach my $letter (@list){
  $letter = chr($letter);
  print CSV $letter.";;";
  my $utf8 = encode("UTF-8", $letter);
  print CSV $utf8.";;";
  my $utf16 = encode("UTF-16", $letter);
  print CSV $utf16."\n";
  }
  close CSV;

As this results in a big list and 99% of it is useless i just tried to identify the chars from the several codings that leads to my personal abuse. This results in the following little list of latin-code-combination and UTF-8 representants. It is very likely, if you have chinese or russian characters, you will have a slightly other list :-)…

ö;ö;ß;ß;ü;ü;ö;ö;ü;ü;–;–;ö;ö;ü;ü; ; ;¡;¡;¢
;¢;£;£;¤;¤;Â¥;¥;¦;¦;§;§;¨;¨;©;©;ª;ª;«;«;¬;¬;­;­;
®;®;¯;¯;°;°;±;±;²;²;³;³;´;´;µ;µ;¶;¶;·;·;¸;¸;¹;¹;º;º;»;»;
¼;¼;½;½;¾;¾;¿;¿;À;À;Á;Á;Â;Â;Ã;Ã;Ä;Ä;Ã…;Å;Æ;
Æ;Ç;Ç;È;È;É;
É;Ê;Ê;Ë;Ë;ÃŒ;Ì;Í;Í;ÃŽ;Î;Ï;Ï;Ð;Ð;Ñ;Ñ;Ã’;Ò;Ó;Ó;Ô;Ô;Õ;Õ;Ö;Ö;×;×;Ø;Ø;Ù;Ù;Ú;Ú;Û;Û;Ãœ;Ü;Ý;Ý;Þ;Þ;ß;ß;à ;à;á;á;â;â;
ã;ã;ä;ä;Ã¥;å;æ;æ;ç;ç;è;è;é;é;ê;ê;ë;ë;ì;ì;í;í;î;î;ï;ï;ð;
ð;ñ;ñ;ò;ò;ó;ó;ô;ô;õ;õ;ö;ö;÷;÷;ø;ø;ù;ù;ú;ú;û;û;
ü;ü;ý;ý;þ;þ;ÿ;ÿ;

I used a little perl-script to sanitize my mysql-export and to dump it into a new file. For security purposes i created a new database in utf8.

#!/usr/bin/perl

use strict;

open CSV, "<my_csv.csv"; #List from above
$li =~ s/\n//;
close CSV;

#  use strict;
#  use Encode qw/encode decode/;
#  my @list = ( 0x00 .. 0xFFFE );
#  my $letter = "ä";
#  my %lat_utf;
#open CSV, ">lat_new.csv";
#  foreach my $letter (@list){
#  $letter = chr($letter);
#  print CSV $letter.";;";
#  my $utf8 = encode("UTF-8", $letter);
#  $lat_utf{$utf8} = $letter;
#  print CSV $utf8.";;";
#  my $utf16 = encode("UTF-16", $letter);
#  print CSV $utf16."\n";
#  }
#  close CSV;

my %lat_utf = split(/;/,$li);
#print %lat_utf;
open FILE, "<mydatabase.csv";
my @lines = <FILE>;
close FILE;

foreach my $key (keys %lat_utf){
#   print $key. "\n";
#@lines = map{ s/$key/$lat_utf{$key}/g } (@lines);
    my @nel;
    foreach my $lin (@lines){
#       print $lin."\n";
        if ( $lin =~ /$key/g ) {
            print $lin;
        }
        $lin =~ s/$key/$lat_utf{$key}/g;
#       print $lin."\n";
        push @nel, $lin;
    }
    @lines = @nel;
}

open FILE, ">wdrede_kjr-UTF-8_fixed_new.sql" or die $!;

foreach my $lin (@lines){
  print FILE $lin;
}

close FILE;

This results in my special case to a more or less clean wordpress-blog. BUT i realy would think that wordpress and phpbb should check the browser settings and in what codepage the user had put in the comment or content. Because – otherwise it wouldn’t help much to have this converted over a long time…

10 Antworten zu “Sanitizing WordPress UTF-8 – or Howto get rid of mixed Latin1 and UTF8 mysql exports”

  1. yash sagt:

    dear friend i am new to php i am

    when i submit my blog link to faceboook it looks like

    नवग्रहों में राहु एवं केतु की शान्ति क…

    what to do friend ?

  2. Istvan Ozsvar sagt:

    You say:
    > …It is very likely, if you have chinese or
    > russian characters, you will have a
    > slightly other list…

    From where can I get other combinations? (e.g.: russian)

  3. Joe sagt:

    Espacially Internet Explorer until version 8 are accountable for the worse entries. This Browser doen’t accept the form accept-charset entries and do whatever they want.

    I am looking now for some Javascript to give at least the comment form the knowledge of the browser codepage.

  4. Joe sagt:

    ßöäü + Test ß:ss ä:ae ö:oe Ö:oe

  5. Joe sagt:

    Problem: Fucking Browser does not all interpret the accept-charset correctly… still getting worse entries in the comments.

  6. Joe sagt:

    The trick of keeping it clean is described here:
    http://stackoverflow.com/questions/708915/detecting-the-character-encoding-of-an-http-post-request

    At least every Form-Tag has to be enhanced by <form accept-charset=“UTF-8″>
    urgh…

Schreibe einen Kommentar

Deine E-Mail-Adresse wird nicht veröffentlicht. Erforderliche Felder sind markiert *

11 − eins =