When we work with text on a computer, we usually seem to take it for granted that whatever we write is stored by the system exactly as we write it. If you are old enough to remember DOS and code-pages, then you also remember the fun one can have when trying to exchange documents across regional borders.
Each region has its own set of characters it considers important. Everyone can agree on the first 7 bits of the character set encodings, but move beyond that and you are in no-man's land.
Now it is 2014, and all of this was relevant 20 years ago. So why bother?
It matters, because old standards never die. In Java, some files used by standard functions have prescribed encodings. For instance, property files used by Java code must always be encoded as ISO-8859-1. Property files used by GWT code must be encoded as UTF-8. Source files can have any encoding, and unless you tell it otherwise the compiler will select the encoding based on your system default.
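To make the property-file rule concrete, here is a minimal sketch of what `java.util.Properties.load(InputStream)` does with the same entry stored in different encodings. The key name `salt` and the class name are made up for illustration; they are not taken from the Pentaho code base.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

final class PropertiesEncodingDemo {
    // Loads the hypothetical "salt" entry from raw file bytes.
    // Properties.load(InputStream) decodes those bytes as ISO-8859-1.
    static String loadSalt(byte[] fileBytes) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(fileBytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("salt");
    }

    // Renders a string as code points so the output itself stays pure ASCII.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("U+%04X ", (int) c));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String entry = "salt=\u00A7\n";

        // File saved as ISO-8859-1: the section sign is the single byte 0xA7.
        System.out.println(hex(loadSalt(entry.getBytes(StandardCharsets.ISO_8859_1)))); // U+00A7

        // File saved as UTF-8: bytes 0xC2 0xA7 are read back as two characters.
        System.out.println(hex(loadSalt(entry.getBytes(StandardCharsets.UTF_8)))); // U+00C2 U+00A7

        // File saved as plain ASCII with a backslash-u escape: safe everywhere.
        byte[] escaped = "salt=\\u00A7\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(hex(loadSalt(escaped))); // U+00A7
    }
}
```

The UTF-8 copy silently turns into two characters, which is exactly the kind of mismatch that is hard to spot in an editor.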
So for property files, you have to keep two separate copies if you share code between the two, and you have to tell your editor that files of this type may need a different encoding depending on how you, the human operator, want to use them. So there is a default – but the default sucks.
But worst of all: the compiler accepts everything and uses a system-dependent default. So unless you specify the encoding manually, your local compile result can differ greatly from the result produced by a different machine. Now try to debug that!
Recently I had a rather strange case to work on where this chaos mattered.
As part of the Pentaho Reporting 5.0 release I normalized all the source code of Pentaho Reporting to the safe US-ASCII character set. Remember, all character sets in common use agree on the lower 7 bits of their character range. This corresponds to the US-ASCII set of characters. Therefore, no matter where you are in the world and no matter what machine you use – the source code will look and behave the same everywhere.
As a result, I ‘fixed’ the salt-string for our password obscuration, which contained the Unicode character ‘SECTION SIGN’ (U+00A7). Apparently, when you use ISO-8859-1 as the encoding for source code, this character works just fine; in that encoding it is the single byte 0xA7. But our CI and release build machine actually uses UTF-8 as its default encoding for source code, and there a lone 0xA7 byte is invalid: in UTF-8 the character must be encoded as the 2-byte sequence 0xC2 0xA7.
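The byte-level difference is easy to reproduce. The following is only a sketch of the decoding step, not of what the compiler does internally, and the class and method names are made up for illustration. It decodes the same bytes under both charsets, the way a tool reading a source file would.

```java
import java.nio.charset.StandardCharsets;

final class SectionSignBytes {
    // Decode raw file bytes as ISO-8859-1 (every byte maps to one character).
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Decode raw file bytes as UTF-8 (malformed input becomes U+FFFD).
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Renders a string as code points so the output itself stays pure ASCII.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("U+%04X ", (int) c));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        byte[] savedAsLatin1 = {(byte) 0xA7};              // section sign in ISO-8859-1
        byte[] savedAsUtf8 = {(byte) 0xC2, (byte) 0xA7};   // section sign in UTF-8

        System.out.println(hex(decodeLatin1(savedAsLatin1))); // U+00A7
        System.out.println(hex(decodeUtf8(savedAsLatin1)));   // U+FFFD
        System.out.println(hex(decodeUtf8(savedAsUtf8)));     // U+00A7
        System.out.println(hex(decodeLatin1(savedAsUtf8)));   // U+00C2 U+00A7
    }
}
```

Only two of the four combinations give back the character that was typed. A salt string that goes through one of the other two ends up as a different string in the class file.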
Therefore, the release of Pentaho Reporting 5.0 decodes and encodes passwords differently than the previous 4.8 release. Now, if you were always good and enterprise ready, you would use JNDI datasources and none of that would matter to you. But some users want or need to access databases that are not defined via JNDI – and thus store passwords in the PRPT file.
All these reports broke when running on 5.0 with an “Invalid password” error reported by the database.
Testing with a local build, however, showed no error. The reports ran fine.
Only thanks to a detailed bug report, with a great sample of what causes the breakage, was I able to make the connection and ultimately narrow it down to a variation in the binary files produced by the build machine. Thank god that Pentaho Reporting is part of our open source offering and that no code obfuscation was used in that build. Therefore I was able to decompile the class file and see where the restored source code differed from the sources we have in Git.
Lessons learned: (1) Don’t assume your sources reflect your binary files. (2) Machine-dependent defaults suck as much today as they did 20 years ago. And (3) never assume that the user provides sensible settings; use the safest option possible instead. In our case that means encoding all files as US-ASCII, with plenty of escape sequences for characters outside of that range. It may be ugly, but it is guaranteed to work on every machine regardless of the developer’s defaults.
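As a rough illustration of lesson (3), a helper along these lines can rewrite text so that only US-ASCII remains, with everything else turned into `\uXXXX` escapes. This is a sketch with made-up names, not the tooling that was used for the Pentaho Reporting sources.

```java
final class AsciiEscaper {
    // Replaces every UTF-16 code unit outside 7-bit US-ASCII with a
    // backslash-u escape of four hex digits; ASCII passes through unchanged.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04X", (int) c));
            }
        }
        return out.toString();
    }

    // True if the text would survive any ASCII-compatible encoding unchanged.
    static boolean isPureAscii(String text) {
        return text.chars().allMatch(c -> c < 0x80);
    }

    public static void main(String[] args) {
        String original = "salt\u00A7";
        String escaped = escapeNonAscii(original);
        System.out.println(escaped);                // salt\u00A7
        System.out.println(isPureAscii(original));  // false
        System.out.println(isPureAscii(escaped));   // true
    }
}
```

A check like `isPureAscii` could also run as part of a build, so that a stray non-ASCII character is caught before it ever reaches a compiler with a different default encoding.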