It’s about a quarter to three in the afternoon on a Thursday and the twitter goes off! An open source project that I haven’t worked on in a while is having some trouble with UTF-8. I was stuck on some code that was proving to be kind of boring, so I decided to give it a whirl to try and shake my brain up a bit.
My initial solution was way off base. It didn’t deal with filenames, it dealt with data in the file. Cucumber-JVM is already setting UTF-8 in all places like it’s supposed to. That wasn’t the solution, it wasn’t even close. The adrenaline fires up as I realize this is going to be a challenge. All thoughts of the work I was doing before are gone, now this challenge is all that matters. I try my search again, looking to see what else I can find about UTF-8 and filenames.
I come up with this. It feels really close. So I wonder if Aslak’s Mac (henceforth referred to as the AslaMac) has got bogus locale settings.
15:01 <aslakhellesoy> LANG="en_US.UTF-8"
15:01 <aslakhellesoy> LC_COLLATE="en_US.UTF-8"
15:01 <aslakhellesoy> LC_CTYPE="en_US.UTF-8"
15:01 <aslakhellesoy> LC_MESSAGES="en_US.UTF-8"
15:01 <aslakhellesoy> LC_MONETARY="en_US.UTF-8"
15:01 <aslakhellesoy> LC_NUMERIC="en_US.UTF-8"
15:01 <aslakhellesoy> LC_TIME="en_US.UTF-8"
15:01 <aslakhellesoy> LC_ALL=
So I check out the repo and load it up in IDEA. Sure enough, as the compiler error barfs, the class Æøå was not in a Æøå.java file. It was in a Æøå.java file. The two names look identical on screen. Now that’s quite strange. I copy and paste the class name into a rename of the .java file, and then everything works just fine for me.
Aslak goes to pull in the pull request to see if it works for him. Git complains that the file is in the way. Wat?
Now we know something is really, really strange. It’s not a terminal locale issue, because the files are generated by a bit of Groovy code on every build. So there are no terminal issues getting in the way, the Java locale is set, the system locale is set, what the heck?
Aslak modified the Ls.java from the previously mentioned article to barf out the file name as a series of hex characters, so we can see what the heck is going on without having to pipe to hexdump, and also to ensure that we’re avoiding any possible terminal issues.
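For the curious, a hex dump of filenames can be sketched roughly like this. To be clear, this is not Aslak’s modified Ls.java: the class name LsHex and the toHex helper are made up for illustration, it assumes Java 7 or later (for StandardCharsets), and it does not try to reproduce the exact output format from the IRC session.

```java
import java.io.File;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the Ls.java used in the story.
class LsHex {
    /** Returns the UTF-8 bytes of a name as lowercase hex, two digits per byte. */
    static String toHex(String name) {
        StringBuilder sb = new StringBuilder();
        for (byte b : name.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // List the directory given as the first argument, or the current one.
        File dir = new File(args.length > 0 ? args[0] : ".");
        String[] names = dir.list();
        if (names == null) {
            System.err.println("not a directory: " + dir);
            return;
        }
        for (String name : names) {
            // Print the hex first so a confused terminal cannot hide a difference.
            System.out.println(toHex(name) + "  " + name);
        }
    }
}
```

With something like this, a precomposed å shows up as c3a5, while an a followed by a combining ring shows up as 61cc8a, even though both render the same in a terminal.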
15:42 <aslakhellesoy> 7:
15:42 <aslakhellesoy> -0000000000000003c793c479e3375d1959e899f
15:42 <aslakhellesoy> 6:
15:42 <aslakhellesoy> -000000000000000003c793c473c5ad1959e899f
15:43 <BeepDog> $ /opt/jdk6/jdk1.6.0_35/bin/java Ls hex
15:43 <BeepDog> -0000000000000003c793c479e3375d1959e899f
15:43 <BeepDog> $ java Ls hex
15:43 <BeepDog> -0000000000000003c793c479e3375d1959e899f
What the heck? We were expecting things to match for JDK6, since it came out correct for him only in that JDK, not in JDK7 at all. Sadly, we conclude that the problem isn’t Maven. It would’ve been much easier to blame the problem on someone else’s software, and not our own.
Then it was discovered that the JDK6 on the AslaMac is actually on a MacRoman encoding, not UTF-8. This is a good time to note that the JDK6 was Apple’s Java, and JDK7 is Oracle’s Java. Apple’s JDK6 is happily ignoring the UTF-8 settings and using MacRoman. This has caused problems for other projects, but it actually isn’t our problem at all, as we would come to discover.
Another gentleman showed up in IRC who had a similar problem and was using a fr_FR.UTF-8 locale, although his problem was only with Aslak’s test project, not cucumber-jvm, as he had been compiling on JDK7 for quite a while now. He ran a different flavor of Linux than I, so it gave us another test case to prove that it’s got to be something specific to the AslaMac.
We were starting to reach the point that one reaches when searching Google and all you find are your own questions or questions with no answers, except this time the question had just been posted by us :(.
In creating test files and such, Aslak figured out that no matter what, the AslaMac would create a file with decomposed Unicode characters, whether by code, or by touching the file.
Then the epiphany happens.
16:22 <BeepDog> http://stackoverflow.com/questions/9757843/unicode-encoding-for-filesystem-in-mac-os-x-not-correct-in-python
16:22 <BeepDog> oh god, I wonder if it's your filesystem
“Mac OSX uses a special kind of decomposed UTF-8 to store filenames. If you need to read in filenames and write them to a ‘normal’ UTF-8 file, you must normalize them. My understanding of this is that when you pass a name with an accented character like é, it will decompose this into e plus ‘ before saving it to the filesystem (this behavior is defined by the Unicode standard).” All about Python and Unicode (this 404’s now :( )
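To make the composed versus decomposed distinction concrete, here is a small sketch using java.text.Normalizer. The class name NormalizationForms and the codePoints helper are mine, purely for illustration; the only real API involved is Normalizer.normalize with the NFC and NFD forms.

```java
import java.text.Normalizer;

// Hypothetical demo class; only java.text.Normalizer is a real API here.
class NormalizationForms {
    /** Formats each code point of s as U+XXXX, separated by single spaces. */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String composed = "\u00e9"; // é as a single precomposed code point
        String decomposed = Normalizer.normalize(composed, Normalizer.Form.NFD);

        System.out.println(codePoints(composed));   // U+00E9
        System.out.println(codePoints(decomposed)); // U+0065 U+0301

        // Same glyph on screen, but the strings are not equal...
        System.out.println(composed.equals(decomposed)); // false
        // ...until both are brought to the same normalization form.
        System.out.println(
            composed.equals(Normalizer.normalize(decomposed, Normalizer.Form.NFC))); // true
    }
}
```

This is the same kind of mismatch we were staring at: two names that look alike, compare as different, and only agree once they share a normalization form.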
Good god. The Apple filesystem does things differently than everyone else and doesn’t behave well when you commit the file. It’s like the CR/LF problem only much, much worse.
Finally, java.text.Normalizer to the rescue. We need to always decompose the UTF-8 filenames to the same format that the AslaMac filesystem (and other Macs) will decompose them to anyway. If we don’t, the generated code will not build on Apple systems running JDK7. Other operating systems (well, we tested Linux; who develops on Windows?!?) are perfectly happy reading this decomposed UTF-8 format, and compile the source code just fine. So a simple solution is to always write out the normalized UTF-8. As it turns out, this is already being done to the Jython i18n generated code itself, but not to that filename, since Python scripts are done a bit differently than the Java classes and end up all in one
One pull request later and the problem is solved, building on both the AslaMac and my Linux box. Basically the result of it all is forcing the decomposed format for UTF-8, since it is part of the standard, on all filesystems, not just the Mac ones.
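The idea behind the fix can be sketched like this. This is not the actual code from the pull request: GeneratedFileNames and javaFileNameFor are hypothetical names, and I am assuming the class name written into the generated source goes through the same normalization as the file name.

```java
import java.text.Normalizer;

// Hypothetical sketch of the idea, not the code from the pull request.
class GeneratedFileNames {
    /** Returns the .java file name for a class, forced into decomposed (NFD) form. */
    static String javaFileNameFor(String className) {
        return Normalizer.normalize(className, Normalizer.Form.NFD) + ".java";
    }

    public static void main(String[] args) {
        String composed = "\u00c6\u00f8\u00e5";    // Æøå with a precomposed å
        String decomposed = "\u00c6\u00f8a\u030a"; // Æøå with a + combining ring above

        // Whichever form the generator starts with, the file name comes out the same.
        System.out.println(
            javaFileNameFor(composed).equals(javaFileNameFor(decomposed))); // true
    }
}
```

Since the Mac filesystem will decompose the name anyway, picking the decomposed form up front means every platform ends up with the same name.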
Moral of the story: “Java: Write once, compile on Linux, run anywhere.”