The project I currently work on involves a lot of Java/JavaScript (V8 JavaScript engine) interoperability. Fortunately, Java provides JNI and V8 has a nice C++ API which make the integration process very smooth. Most of the Java-JNI-V8 type marshaling is quite straightforward but there is one exception.
The JNI uses modified UTF-8 strings to represent various string types. Modified UTF-8 strings are the same as those used by the Java VM. Modified UTF-8 strings are encoded so that character sequences that contain only non-null ASCII characters can be represented using only one byte per character, but all Unicode characters can be represented.
I was well aware of this fact since the beginning of the project but somehow I neglected it. Until recently, when one of my colleagues showed me a peculiar bug that turned out to be related to the process of marshaling a non-trivial Unicode string.
At first, I tried a few quick and dirty workarounds just to prove that the root of problem is more complex it seemed. Then I realized that jstring
type is not the best type when it comes to string interoperability with V8 engine. I decided to use jbyteArray
type instead of jstring
though I had some concerns about the performance overhead.
private static native void doSomething(byte[] strData); String s = "some string"; byte[] strData = s.getBytes("UTF-8"); doSomething(strData);
The code doesn’t look ugly though the string version looks better. I did microbenchmarks and it turned out the performance is good enough for my purposes. Nevertheless, I decided to compare the performance with Nashorn JavaScript engine. As expected, Nashorn implementation was faster because it uses the same internal string format as the JVM.