Warn when producing invalid UTF-8 output files #1704

darwin · 2016-04-04T19:20:18Z

When running the compiler with --charset UTF-8 one can potentially end up with not strictly valid output files.

The problem is that sometimes strings in input sources can be perfectly valid ASCII strings, but they encode invalid UTF-8 strings via unicode escape sequences. One such example can be found in Closure Library itself[1].

Closure Compiler when instructed to produce output in UTF-8 blindly assumes validity of such strings and outputs them as raw byte output (without checking). This can cause problems in some systems. For example in my case Google Chrome extension content script loading code is quite strict and uses UTF-8 validator to reject any non-valid scripts.

My proposal is to introduce a warning (which could be disabled on demand) to inform user about this edge situation. An alternative solution would be to leave string literals as-is, that means if they were defined with unicode escape sequences, output them without transformation even under UTF-8 output mode.

For more background info you can read my comment[2] in ClojureScript JIRA, where we discovered this issue after enabling UTF-8 output by default.

[1] https://github.com/google/closure-library/blob/d66b94513df131c9776cdf70ac476bbb1a62e5d0/closure/goog/i18n/bidi.js#L202
[2] http://dev.clojure.org/jira/browse/CLJS-1547?focusedCommentId=42617

The text was updated successfully, but these errors were encountered:

junjiemars · 2016-05-16T09:25:41Z

@darwin, Hi,
I got the same issue when ClojureScript -> JavaScript in Chrome Extension, How about UTF-8 js output now?

darwin · 2016-05-16T09:32:40Z

@junjiemars: for a workaround set :closure-output-charset "US-ASCII" in your ClojureScript compiler options[1], or downgrade ClojureScript compiler to a version not including this commit.

[1] https://github.com/binaryage/dirac/blob/a9f5d1c842ff43934b4fc5ed665d3fcd903a655c/project.clj#L191

junjiemars · 2016-05-16T09:43:14Z

Thanks @darwin ,
:closure-output-charset "US-ASCII" fix the issue.

ajchemist · 2016-06-19T09:46:05Z

@darwin thanks! I came here during the work with binaryage/chromex.

pesterhazy · 2017-10-13T14:35:56Z

Dug into this issue a bit: https://github.com/pesterhazy/cljs-utf8-issue

Looks like Chrome is at fault here

MatrixFrog · 2017-10-13T17:35:15Z

Do you have a repro that doesn't involve clojurescript? Maybe upload the JS file emitted by the ClojureScript compiler.

pesterhazy · 2017-10-13T21:13:45Z

@MatrixFrog I've uploaded the generated js file here: https://raw.githubusercontent.com/pesterhazy/cljs-utf8-issue/master/generated/main.js

You can see that, like Chrome, isutf8 from moreutil 0.59 (you'll need to compile it from git: git://git.joeyh.name/moreutils) considers this file invalid:

$ moreutils-0.59/isutf8 generated/main.js
generated/main.js: line 272, char 1, byte offset 217: invalid UTF-8 code

I tried to reproduce using current GCC and GCL but failed - the generated file didn't contain the offending byte sequence

goog.json.Serializer.charsToReplace_=/\uffff/.test("<U+FFFF>")

(as seen in less).

Compiled directly with closurebuilder.py I get

goog.json.Serializer.charsToReplace_=/\uffff/.test("\uffff")

pesterhazy · 2017-10-13T21:21:41Z

Actually let me take that back. I've pushed a new version to https://github.com/pesterhazy/cljs-utf8-issue. I had to pass the --charset param, which defaults to us_ascii. With UTF-8, the compilation fails:

$ bin/build-js
closure-library/closure/bin/build/closurebuilder.py: Scanning paths...
closure-library/closure/bin/build/closurebuilder.py: 1579 sources scanned.
closure-library/closure/bin/build/closurebuilder.py: Building dependency tree..
closure-library/closure/bin/build/closurebuilder.py: Closure Compiler now natively understands and orders Closure dependencies and
is prefererred over using this script for performing JavaScript compilation.

Please migrate your codebase.

See:
https://github.com/google/closure-compiler/wiki/Managing-Dependencies

closure-library/closure/bin/build/closurebuilder.py: Compiling with the following command: java -client -jar compiler-latest/closure-compiler-v20170910.jar --flagfile /var/folders/rj/8lnf662s3mlgzv5mcq2r2fnc0000gn/T/tmpMRVQ0d
closure-library/closure/bin/build/closurebuilder.py: JavaScript compilation succeeded.
Traceback (most recent call last):
  File "closure-library/closure/bin/build/closurebuilder.py", line 293, in <module>
    main()
  File "closure-library/closure/bin/build/closurebuilder.py", line 285, in main
    out.write(compiled_source.encode('utf-8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 23486: ordinal not in range(128)

pesterhazy · 2017-10-13T21:22:26Z

All it takes is to set the charset to UTF-8 and require goog.net.XhrIo: https://github.com/pesterhazy/cljs-utf8-issue/blob/master/js-src/hello.js

MatrixFrog · 2017-10-13T21:39:37Z

I'm getting slightly lost. Is this an issue in the Closure Compiler, or in closurebuilder.py?

pesterhazy · 2017-10-17T20:46:26Z

@MatrixFrog right you are.

I had to find out how to call the Closure Compiler jar including the Closure Library directly from the command line.

I've simplified the repo: https://github.com/pesterhazy/cljs-utf8-issue/. Now reproduction only requires one command bin/build-js.

The generated file is here: https://github.com/pesterhazy/cljs-utf8-issue/blob/master/out/output.js

MatrixFrog · 2017-10-17T21:08:59Z

+1 to just outputting ASCII as a workaround. However, it would be good for us to fix this as well.

Around https://github.com/google/closure-compiler/blob/master/src/com/google/javascript/jscomp/CodeGenerator.java#L1835 we check if the charset encoder can encode the character, but then we just do sb.append(c) instead of actually letting the encoder actually encode it. The simplest fix might to change how U+FFFF is handled in appendHexJavaScriptRepresentation: https://github.com/google/closure-compiler/blob/master/src/com/google/debugging/sourcemap/Util.java#L111

johnryan · 2019-10-02T17:58:42Z

Is anyone aware of configuring this via the closure api ? I don't see any mention of the closure-output-charset in the docs: https://developers.google.com/closure/compiler/docs/api-ref

brad4d · 2019-10-03T18:53:34Z

Internal Google issue http://b/142068222 created

junjiemars added a commit to junjiemars/owl that referenced this issue May 16, 2016

fix google/closure-compiler#1704

50b594f

trieloff mentioned this issue Jul 3, 2017

Deployment fails with error 502 for larger script files adobe-apiplatform/openwhisk-github-deployer#7

Open

MatrixFrog self-assigned this Oct 17, 2017

MatrixFrog added the P3 label Oct 17, 2017

brad4d added the internal-issue-created An internal Google issue has been created to track this GitHub issue label Oct 3, 2019

ruimarinho mentioned this issue Oct 13, 2019

Encoding issue on 3.2.3 and above ruimarinho/google-libphonenumber#212

Closed

0xE1 mentioned this issue Feb 20, 2020

Cannot be loaded as part of extension in Chrome dymosoftware/dymo-connect-framework#3

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Warn when producing invalid UTF-8 output files #1704

Warn when producing invalid UTF-8 output files #1704

darwin commented Apr 4, 2016 •

edited

junjiemars commented May 16, 2016

darwin commented May 16, 2016 •

edited

junjiemars commented May 16, 2016

ajchemist commented Jun 19, 2016

pesterhazy commented Oct 13, 2017

MatrixFrog commented Oct 13, 2017

pesterhazy commented Oct 13, 2017 •

edited

pesterhazy commented Oct 13, 2017

pesterhazy commented Oct 13, 2017

MatrixFrog commented Oct 13, 2017

pesterhazy commented Oct 17, 2017

MatrixFrog commented Oct 17, 2017

johnryan commented Oct 2, 2019

brad4d commented Oct 3, 2019

Warn when producing invalid UTF-8 output files #1704

Warn when producing invalid UTF-8 output files #1704

Comments

darwin commented Apr 4, 2016 • edited

junjiemars commented May 16, 2016

darwin commented May 16, 2016 • edited

junjiemars commented May 16, 2016

ajchemist commented Jun 19, 2016

pesterhazy commented Oct 13, 2017

MatrixFrog commented Oct 13, 2017

pesterhazy commented Oct 13, 2017 • edited

pesterhazy commented Oct 13, 2017

pesterhazy commented Oct 13, 2017

MatrixFrog commented Oct 13, 2017

pesterhazy commented Oct 17, 2017

MatrixFrog commented Oct 17, 2017

johnryan commented Oct 2, 2019

brad4d commented Oct 3, 2019

darwin commented Apr 4, 2016 •

edited

darwin commented May 16, 2016 •

edited

pesterhazy commented Oct 13, 2017 •

edited