Warn when producing invalid UTF-8 output files #1704
Comments
@darwin, Hi, |
@junjiemars: for a workaround set [1] https://github.com/binaryage/dirac/blob/a9f5d1c842ff43934b4fc5ed665d3fcd903a655c/project.clj#L191 |
Thanks @darwin , |
@darwin thanks! I came here during the work with |
Dug into this issue a bit: https://github.com/pesterhazy/cljs-utf8-issue Looks like Chrome is at fault here |
Do you have a repro that doesn't involve clojurescript? Maybe upload the JS file emitted by the ClojureScript compiler. |
@MatrixFrog I've uploaded the generated js file here: https://raw.githubusercontent.com/pesterhazy/cljs-utf8-issue/master/generated/main.js You can see that, like Chrome,
I tried to reproduce using current GCC and GCL but failed - the generated file didn't contain the offending byte sequence
(as seen in less). Compiled directly with closurebuilder.py I get
|
Actually let me take that back. I've pushed a new version to https://github.com/pesterhazy/cljs-utf8-issue. I had to pass the
|
All it takes is to set the charset to |
I'm getting slightly lost. Is this an issue in the Closure Compiler, or in closurebuilder.py? |
@MatrixFrog right you are. I had to find out how to call the Closure Compiler jar, including the Closure Library, directly from the command line. I've simplified the repo: https://github.com/pesterhazy/cljs-utf8-issue/. Reproduction now requires only one command. The generated file is here: https://github.com/pesterhazy/cljs-utf8-issue/blob/master/out/output.js |
+1 to just outputting ASCII as a workaround. However, it would be good for us to fix this as well. Around https://github.com/google/closure-compiler/blob/master/src/com/google/javascript/jscomp/CodeGenerator.java#L1835 we check whether the charset encoder can encode the character, but then we just do sb.append(c) instead of letting the encoder actually encode it. The simplest fix might be to change how U+FFFF is handled in appendHexJavaScriptRepresentation: https://github.com/google/closure-compiler/blob/master/src/com/google/debugging/sourcemap/Util.java#L111 |
Is anyone aware of a way to configure this via the Closure API? I don't see any mention of the |
Internal Google issue http://b/142068222 created |
When running the compiler with --charset UTF-8, one can end up with output files that are not strictly valid. The problem is that strings in the input sources can be perfectly valid ASCII and yet encode invalid UTF-8 strings via Unicode escape sequences. One such example can be found in Closure Library itself[1].
Closure Compiler, when instructed to produce UTF-8 output, blindly assumes that such strings are valid and writes them out as raw bytes without checking. This can cause problems in some systems. In my case, for example, the Google Chrome extension content-script loading code is quite strict and uses a UTF-8 validator to reject any invalid scripts.
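To make the failure mode concrete, here is a minimal Python sketch. It assumes the strict validator rejects Unicode noncharacters such as U+FFFF in addition to malformed byte sequences; that is my reading of the behaviour described in this thread, not a confirmed description of Chrome's code. The function names are hypothetical.

```python
def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for any code point ending in FFFE or FFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def passes_strict_check(data: bytes) -> bool:
    """Hypothetical strict validator: well-formed UTF-8 containing no noncharacters."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return not any(is_noncharacter(ord(ch)) for ch in text)


# In the source file the literal "\uFFFF" is six ASCII characters.
# Written out under a UTF-8 charset it becomes three raw bytes instead.
raw = "\uffff".encode("utf-8")
print(raw, passes_strict_check(raw))
```

A lenient decoder (Python's own included) accepts these bytes, which is why the output can look fine in most tools and still be rejected by a stricter consumer.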
My proposal is to introduce a warning (which could be disabled on demand) to inform the user about this edge case. An alternative solution would be to leave string literals as-is: if they were defined with Unicode escape sequences, output them without transformation even in UTF-8 output mode.
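A rough Python sketch of a third variant, re-escaping only the problematic code points on output, is shown here for illustration. It is not how Closure Compiler is implemented; `needs_escape` and `emit_string_body` are hypothetical names, and the choice to escape surrogates and noncharacters is an assumption about what a strict consumer rejects. Quote and backslash escaping are deliberately left out.

```python
def needs_escape(cp: int) -> bool:
    """Assumed policy: escape lone surrogates and Unicode noncharacters."""
    is_surrogate = 0xD800 <= cp <= 0xDFFF
    is_nonchar = 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
    return is_surrogate or is_nonchar


def emit_string_body(s: str) -> str:
    """Return s with risky code points written as JavaScript \\uXXXX escapes."""
    out = []
    for ch in s:
        cp = ord(ch)
        if not needs_escape(cp):
            out.append(ch)  # safe to emit as raw UTF-8
        elif cp > 0xFFFF:
            # Supplementary plane: JavaScript needs a surrogate pair of escapes.
            v = cp - 0x10000
            out.append("\\u%04X\\u%04X" % (0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)))
        else:
            out.append("\\u%04X" % cp)
    return "".join(out)


print(emit_string_body("a\uffff\u00e9"))
```

With this policy ordinary non-ASCII text still benefits from compact UTF-8 output, while the few code points a strict validator might reject stay as ASCII escapes.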
For more background, you can read my comment[2] in the ClojureScript JIRA, where we discovered this issue after enabling UTF-8 output by default.
[1] https://github.com/google/closure-library/blob/d66b94513df131c9776cdf70ac476bbb1a62e5d0/closure/goog/i18n/bidi.js#L202
[2] http://dev.clojure.org/jira/browse/CLJS-1547?focusedCommentId=42617