First things first: Description of the problem
The database was originally an old Xoops 2 database, which was recently imported / converted to vBulletin Suite 4. The old database was in the "latin1_swedish_ci" character set (which was actually the default character set for MySQL back in those days) yet it contained Greek characters (the forum is Greek) along with Latin characters. During the conversion to vBulletin 4, we kept the same encoding to make sure things went smoothly and no information would be lost. But eventually I knew it would be best to move to Unicode sooner rather than later. Besides, it was a good opportunity to expand my knowledge in this field. ;-)
So I had a Latin1 database of roughly 110 MBytes in size in my hands which I wanted to convert to UTF-8. Sounds simple enough, doesn't it? Well, it isn't.
Preparation, research and first tests.
After searching on the web a bit, then searching a bit more, then a lot more, I found out many people were facing problems and having headaches with similar conversions. I decided to give it a try anyway. Most people claimed that if convert the tables to another collation and character set, you may just get what you need and be done with it.
Naturally, it very rarely is that easy (one of Murphy's Laws I guess).
When I tried viewing the table entries on the live server, all Greek characters look "scrambled". For example like this line:
EÃá ðïëéôéóìéêü óïê ôï åðáèá , Ãïìéæá ïôé ìðçêá óôï amiga.org!The above line should look like this:
Eνα πολιτισμικό σοκ το επαθα , νομιζα οτι μπηκα στο amiga.org!So it's not only that I wanted to move things to Unicode, it was important to have the proper characters showing up first. Note that the text was displaying correctly on the website of course, but that's because the site's header was set to specify the encoding as "ISO-8859-7" (Greek). Greek and Latin1 characters would show normally, but anything else would not (for example something in Swedish like "skål" would display garbage).
When trying to convert a table to another collation using the MySQL command "CONVERT", I got more scrambled text as a result. It didn't work for me and I needed to know why. But where do I start?
To keep it safe, I decided to install a local MySQL server and grab a fresh dump of the database from our live server. (BTW, I highly recommend the Heidi SQL tool if you're using Windows). Then I got a sample of the data to investigate further. I figured that if I find out what sequence of encoding/converting would result in the characters displaying correctly (i.e. in Greek), then I can use the same steps on the whole database itself.
And so I started testing with that sample of text, converting it until I got what was going on with it. A few hours later, the mystery was solved...
It appears that the text was actually UTF-8 encoded Latin1 characters, since our new database server had UTF-8 set by default for all new databases, even if the tables themselves were created as "Latin1_swedish_ci" collations. A simple test proved that, by entering the text in a Unicode converter and changing it from UTF-8 to Latin1, I got the following results:
Eíá ðïëéôéóìéêü óïê ôï åðáèá , íïìéæá ïôé ìðçêá óôï amiga.org!
Now, we're not quite there yet (the text above is not Greek) but at least we are making progress!
I recognized the above text as Latin1 containing Greek characters so to make sure, I placed that text in another converter and changed it to ISO-8859-7 (Greek). Here are the results:
Eνα πολιτισμικό σοκ το επαθα , νομιζα οτι μπηκα στο amiga.org!
Voila! We have the proper Greek showing up!
Making the changes in the database
Now that I had a specific step-by-step plan to get the Greek characters showing, I needed to adjust those steps in the database itself. I needed to get the whole database converted from UTF-8 to Latin1, then from Latin1 to Greek and finally from Greek back to UTF-8 again (so we'd get the Greek characters showing up properly in the tables).
Here are the steps I took to accomplish that:
- Closed the Forum on the live server.
- Got a fresh dump of the database from the live server.
- Restored the dump on my local MySQL server to perform the conversions locally (never work on a live server itself!).
- Exported the database from my local server to forced Latin1 encoding, using the following MySQL command:
mysqldump -h HOSTNAME -u USERNAME -pPASSWORD --opt --quote-names --skip-set-charset --default-character-set=latin1 AMIGAHELLAS_VBULLETIN4 > dump_latin1.sql(needless to say, you need to change the above variables like "Hostname", "Username", "Password" and filenames according to your needs).
- Re-imported the "dump_latin1.sql" in my local server with forced UTF-8 encoding, using the following command:
mysql -h HOSTNAME -u USERNAME -pPASSWORD --default-character-set=utf8 AMIGAHELLAS_VBULLETIN4 < dump_latin1.sqlThis step would mean that the characters in the file would be treated as Unicode and will be automatically converted to such while importing.
- Next, I exported a new dump (once again) from the local server using the same command as above (mysqldump ...), again in Latin1. After this step, we should be seeing the "Latin1" characters in the file instead of the Unicode ones, as I did during my sample test after changing the encoding from UTF-8 to Latin1.
- Now we need to convert this dump file to Greek and then back to UTF-8 again. To do that, I used the highly recommended Notepad++ application, but you can do this step in any other way you choose...
Open the latest dump file in Notepad++ and change Character Set to "Greek -> ISO-8859-7" (menu Encoding->Character Sets->Greek->ISO-8859-7).
- Convert the document to UTF-8 (menu Encoding->Convert to UTF-8).
- Next up, we need to replace any instances of "latin1" with "utf8" and "latin1_bin" with "utf8_general_ci" in the document. I did a search and replace with Notepad++ to do that. The reason for this step is to have the database loaded as UTF-8 when it's restored on the server, since that's what we've converted it to now!
- Finally, save the dump file with a new filename. I chose to name it "dump_utf8.sql".
- Test restore the file on the local server. After it's done, the table contents should be displaying proper Greek when browsing the database.
- Assuming the above step was successful, restore the "dump_utf8.sql" file on the live server.
Congratulations, your database is now in Unicode having proper Greek in it!
I wasn't completely done however. vBulletin needed a few changes to accommodate for the difference in the database encoding. So I needed to take a few extra steps to have the text appearing correctly:
- In the vBulletin Admin panel, go to Languages and for each Language change the encoding to "UTF-8" instead of the previous "ISO-8859-7". That will change the header in the HTML code pushed to the browser.
- Edit the "/includes/config.php" file on the vBulletin installation, find the "mysqli" line and uncomment it. This will enable utf-8 as default encoding for vBulletin itself.
- Finally, I needed to do another upgrade of vBulletin (using their built-in upgrade system) to update all the tables and have the text appearing correctly. I'm not sure why this step is necessary, but it seems to be otherwise I got blanks and missing text on the pages, although the text was normal in the database itself.
And that's it! The whole process took me 2 days (mostly researching and testing) but the actual conversion process was about 2 hours worth of work. Once you know what you need to do, it's much easier. Hopefully this will serve someone else as a guide as well and save someone a bit of headaches and wasted hours...
2 comments:
OMG...
I just admire you...
iconv -f ISO_8859-7 -t UTF-8 [file]
Post a Comment