Android MediaScanner Cannot Display Chinese Characters

Problem Description

On an Android Dev Phone 1 (ADP1) running the HTC-provided Android 1.5 system image, most Chinese characters cannot be displayed in the music player's library. This happens because the media scanner in OpenCore fails to resolve the proper character encoding for Chinese characters in the ID3 tags of MP3 files. The problem is also described at

In the parseMP3() function of “external/opencore/android/mediascanner.cpp”, PVID3ParCom from “external/opencore/fileformats/id3parcom/src/pv_id3_parcom.cpp” is used to parse the ID3 tag of an MP3 file and extract its frame information. A key string is then read from each frame. The key string may look like “artist;valtype=char*;char-encoding=UTF8” if the frame contains the name of the artist and the characters are encoded as UTF8. It may also look like “album;valtype=char*” if the frame contains the name of the album and the characters are encoded in one of the native charsets, such as ISO 8859-1 (a standard character encoding for the Latin alphabet), GBK for simplified Chinese, Big5 for traditional Chinese, etc.
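
To make the key-string format concrete, here is a minimal sketch of how such a string could be split into a frame name and its attributes. The FrameKey struct and the functions parseFrameKey(), isNativeCharset() and demoFrameKeys() are hypothetical names made up for illustration; they are not part of OpenCore, and the real parsing code may work quite differently.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical helper type; none of these names come from OpenCore.
struct FrameKey {
    std::string name;                          // e.g. "artist"
    std::map<std::string, std::string> attrs;  // e.g. "valtype" -> "char*"
};

// Split "name;attr=value;attr=value" on ';', then each attribute on its first '='.
FrameKey parseFrameKey(const std::string& key) {
    FrameKey result;
    std::size_t start = 0;
    bool first = true;
    while (start <= key.size()) {
        std::size_t end = key.find(';', start);
        if (end == std::string::npos) end = key.size();
        const std::string part = key.substr(start, end - start);
        if (first) {
            result.name = part;  // the first field is the frame name
            first = false;
        } else if (!part.empty()) {
            const std::size_t eq = part.find('=');
            if (eq == std::string::npos) {
                result.attrs[part] = "";
            } else {
                result.attrs[part.substr(0, eq)] = part.substr(eq + 1);
            }
        }
        start = end + 1;
    }
    return result;
}

// Following the description above, a key without a char-encoding attribute
// is assumed to hold text in some unspecified native charset.
bool isNativeCharset(const FrameKey& key) {
    return key.attrs.find("char-encoding") == key.attrs.end();
}

// Usage sketch with the two example keys from the text.
void demoFrameKeys() {
    const FrameKey artist = parseFrameKey("artist;valtype=char*;char-encoding=UTF8");
    assert(artist.name == "artist");
    assert(artist.attrs.at("char-encoding") == "UTF8");
    assert(!isNativeCharset(artist));

    const FrameKey album = parseFrameKey("album;valtype=char*");
    assert(album.name == "album");
    assert(isNativeCharset(album));
}
```

The point to notice is that the second kind of key carries no hint about which native charset was used, so the scanner has to guess.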

However, the implementation of parseMP3() treats every non-UTF8 native charset as ISO 8859-1 and converts the text to UTF8 on that assumption. Then, in the endFile() function, possibleEncodings() is used to “compute a bit mask containing all possible encodings” (quoted from a comment in the original source file, as are the following two quotations). “If the locale encoding matches”, convertValues() is called to “untangle the utf8 and convert it back to the original bytes”, which were mistakenly treated as ISO 8859-1 earlier. The original bytes are then converted to UTF8 again using the charset converter from the ICU library. This time the conversion is based on a better estimate of the charset.
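
The round trip described above can be sketched as follows. This is only an illustration of the idea, not the code from mediascanner.cpp: latin1ToUtf8(), utf8ToOriginalBytes() and demoRoundTrip() are hypothetical names, and the final ICU conversion from the recovered bytes to proper UTF8 is left out.

```cpp
#include <cassert>
#include <string>

// Treat every byte as an ISO 8859-1 character and encode it as UTF-8.
// Bytes >= 0x80 become two-byte sequences with lead byte 0xC2 or 0xC3.
std::string latin1ToUtf8(const std::string& raw) {
    std::string out;
    for (unsigned char c : raw) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}

// Reverse of latin1ToUtf8(): recover the original bytes. Returns false if
// the input holds anything outside U+0000..U+00FF or is malformed.
bool utf8ToOriginalBytes(const std::string& utf8, std::string& raw) {
    raw.clear();
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(utf8[i]);
        if (c < 0x80) {
            raw.push_back(static_cast<char>(c));
        } else if ((c == 0xC2 || c == 0xC3) && i + 1 < utf8.size()) {
            const unsigned char next = static_cast<unsigned char>(utf8[++i]);
            if ((next & 0xC0) != 0x80) return false;  // not a continuation byte
            raw.push_back(static_cast<char>(((c & 0x03) << 6) | (next & 0x3F)));
        } else {
            return false;
        }
    }
    return true;
}

// Usage sketch: the GBK bytes D6 D0 (the character "中") survive the detour.
void demoRoundTrip() {
    const std::string gbk = "\xD6\xD0";
    const std::string tangled = latin1ToUtf8(gbk);
    assert(tangled == "\xC3\x96\xC3\x90");  // UTF-8 for the Latin-1 text "ÖÐ"

    std::string recovered;
    const bool ok = utf8ToOriginalBytes(tangled, recovered);
    assert(ok);
    assert(recovered == gbk);  // ready to be re-decoded with the right charset
}
```

Because the ISO 8859-1 detour is lossless, nothing is destroyed by the first wrong conversion; the tags only stay garbled when the second, charset-aware conversion is never triggered.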


One obvious solution is to change the locale of your ADP1. Most of my music collection uses simplified Chinese in its ID3 tags, and most of those tags are encoded in GBK rather than UTF8. However, the HTC-provided system image comes with en-US as its only available locale. To add more locales, you have to build the source code yourself and flash the resulting images to the ADP1. For getting and building the source code, flashing the phone, and including the Google applications, please refer to Johan de Koning’s five-post series “Building Android 1.5”, which starts from

After booting the ADP1 with the system image you built, change your locale in the Settings. The mediascanner service might not update the ID3 tag information right away, because it skips media files that haven’t been modified since the last scan. One way around this is to force the “scanAlways” parameter of the doScanFile() function from “frameworks/base/media/java/android/media/” to “true” in the code. However, it’s better to use this trick only temporarily, because re-scanning all media files every time consumes too many resources.

If you don’t want to change your locale but would still like your media scanner to be able to read Chinese, Japanese or Korean, an alternative is to apply a patch to “external/opencore/android/mediascanner.cpp” like this:

diff -Naur src-1.5-git-orig/external/opencore/android/mediascanner.cpp src-1.5-git/external/opencore/android/mediascanner.cpp
--- src-1.5-git-orig/external/opencore/android/mediascanner.cpp	2009-07-24 13:28:09.000000000 +0200
+++ src-1.5-git/external/opencore/android/mediascanner.cpp	2009-07-24 13:30:02.000000000 +0200
@@ -1008,10 +1008,16 @@
         // compute a bit mask containing all possible encodings
         for (int i = 0; i < mNames->size(); i++)
             encoding &= possibleEncodings(mValues->getEntry(i));
-        // if the locale encoding matches, then assume we have a native encoding.
-        if (encoding & mLocaleEncoding)
-            convertValues(mLocaleEncoding);
+        /* FIX: convert to utf8 accordingly disregard of the current locale */
+        if (encoding & kEncodingGBK)
+            convertValues(kEncodingGBK);
+        else if (encoding & kEncodingBig5)
+            convertValues(kEncodingBig5);
+        else if (encoding & kEncodingEUCKR)
+            convertValues(kEncodingEUCKR);
+        else if (encoding & kEncodingShiftJIS)
+            convertValues(kEncodingShiftJIS);
         // finally, push all name/value pairs to the client
         for (int i = 0; i < mNames->size(); i++) {

Note that this is not a proper bug fix, because a charset can never be reliably estimated from the bytes alone. It is not uncommon for the same string to be a valid candidate for GBK, Big5, EUCKR and ShiftJIS at the same time. The order of the branches in the if-statement of the patch above sets the charset priority, so adjust it to match the charsets that are most common in your music collection.
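
The ambiguity is easy to reproduce with a rough sketch. The code below is not the possibleEncodings() implementation from mediascanner.cpp: it only checks coarse, approximate lead and trail byte ranges for each charset, the numeric values of the kEncoding flags are assumed for illustration, and possibleEncodingsSketch(), pickByPriority() and demoAmbiguity() are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

using Mask = std::uint32_t;

// Flags named after the constants in the patch; the values are assumptions.
constexpr Mask kEncodingNone     = 0;
constexpr Mask kEncodingShiftJIS = 1u << 0;
constexpr Mask kEncodingGBK      = 1u << 1;
constexpr Mask kEncodingBig5     = 1u << 2;
constexpr Mask kEncodingEUCKR    = 1u << 3;

static bool in(int b, int lo, int hi) { return b >= lo && b <= hi; }

// Approximate lead-byte ranges of double-byte characters.
static bool isLead(Mask enc, int b) {
    switch (enc) {
        case kEncodingShiftJIS: return in(b, 0x81, 0x9F) || in(b, 0xE0, 0xFC);
        case kEncodingGBK:      return in(b, 0x81, 0xFE);
        case kEncodingBig5:     return in(b, 0xA1, 0xF9);
        case kEncodingEUCKR:    return in(b, 0xA1, 0xFE);
        default:                return false;
    }
}

// Approximate trail-byte ranges of double-byte characters.
static bool isTrail(Mask enc, int b) {
    switch (enc) {
        case kEncodingShiftJIS: return in(b, 0x40, 0x7E) || in(b, 0x80, 0xFC);
        case kEncodingGBK:      return in(b, 0x40, 0xFE) && b != 0x7F;
        case kEncodingBig5:     return in(b, 0x40, 0x7E) || in(b, 0xA1, 0xFE);
        case kEncodingEUCKR:    return in(b, 0xA1, 0xFE);
        default:                return false;
    }
}

// True if every byte sequence in s falls inside the coarse ranges of enc.
static bool couldBe(Mask enc, const std::string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        const int b = static_cast<unsigned char>(s[i]);
        if (b < 0x80) continue;  // ASCII is valid everywhere
        // Shift-JIS also has single-byte half-width katakana.
        if (enc == kEncodingShiftJIS && in(b, 0xA1, 0xDF)) continue;
        if (!isLead(enc, b) || i + 1 >= s.size()) return false;
        if (!isTrail(enc, static_cast<unsigned char>(s[++i]))) return false;
    }
    return true;
}

// Bit mask of every charset the bytes could plausibly belong to.
Mask possibleEncodingsSketch(const std::string& s) {
    Mask mask = kEncodingNone;
    for (Mask enc : {kEncodingShiftJIS, kEncodingGBK, kEncodingBig5, kEncodingEUCKR})
        if (couldBe(enc, s)) mask |= enc;
    return mask;
}

// Same idea as the if/else chain in the patch: the first match wins.
Mask pickByPriority(Mask mask, const std::vector<Mask>& priority) {
    for (Mask enc : priority)
        if (mask & enc) return enc;
    return kEncodingNone;
}

// Usage sketch: D6 D0 is "中" in GBK, yet several charsets accept the bytes.
void demoAmbiguity() {
    const Mask mask = possibleEncodingsSketch("\xD6\xD0");
    assert(mask & kEncodingGBK);
    assert(mask & kEncodingBig5);
    assert(mask & kEncodingEUCKR);
    assert(pickByPriority(mask, {kEncodingGBK, kEncodingBig5}) == kEncodingGBK);
    assert(pickByPriority(mask, {kEncodingBig5, kEncodingGBK}) == kEncodingBig5);
}
```

With the same two bytes, swapping the priority list changes the result, which is exactly why the order of the branches in the patch matters.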

Nevertheless, the best way to solve such problems is to convert the ID3 tags of all your media files to UTF8 beforehand. Unicode or UTF8 is really the way to go.
