Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Warn when producing invalid UTF-8 output files #1704

Open
darwin opened this issue Apr 4, 2016 · 14 comments
Open

Warn when producing invalid UTF-8 output files #1704

darwin opened this issue Apr 4, 2016 · 14 comments
Assignees
Labels
internal-issue-created An internal Google issue has been created to track this GitHub issue P3

Comments

@darwin
Copy link

darwin commented Apr 4, 2016

When running the compiler with --charset UTF-8 one can potentially end up with not strictly valid output files.

The problem is that sometimes strings in input sources can be perfectly valid ASCII strings, but they encode invalid UTF-8 strings via unicode escape sequences. One such example can be found in Closure Library itself[1].

Closure Compiler when instructed to produce output in UTF-8 blindly assumes validity of such strings and outputs them as raw byte output (without checking). This can cause problems in some systems. For example in my case Google Chrome extension content script loading code is quite strict and uses UTF-8 validator to reject any non-valid scripts.

Invalid Script

My proposal is to introduce a warning (which could be disabled on demand) to inform user about this edge situation. An alternative solution would be to leave string literals as-is, that means if they were defined with unicode escape sequences, output them without transformation even under UTF-8 output mode.

For more background info you can read my comment[2] in ClojureScript JIRA, where we discovered this issue after enabling UTF-8 output by default.

[1] https://github.com/google/closure-library/blob/d66b94513df131c9776cdf70ac476bbb1a62e5d0/closure/goog/i18n/bidi.js#L202
[2] http://dev.clojure.org/jira/browse/CLJS-1547?focusedCommentId=42617

@junjiemars
Copy link

@darwin, Hi,
I got the same issue when ClojureScript -> JavaScript in Chrome Extension, How about UTF-8 js output now?

@darwin
Copy link
Author

darwin commented May 16, 2016

@junjiemars: for a workaround set :closure-output-charset "US-ASCII" in your ClojureScript compiler options[1], or downgrade ClojureScript compiler to a version not including this commit.

[1] https://github.com/binaryage/dirac/blob/a9f5d1c842ff43934b4fc5ed665d3fcd903a655c/project.clj#L191

@junjiemars
Copy link

Thanks @darwin ,
:closure-output-charset "US-ASCII" fix the issue.

junjiemars added a commit to junjiemars/owl that referenced this issue May 16, 2016
@ajchemist
Copy link

@darwin thanks! I came here during the work with binaryage/chromex.

@pesterhazy
Copy link

Dug into this issue a bit: https://github.com/pesterhazy/cljs-utf8-issue

Looks like Chrome is at fault here

@MatrixFrog
Copy link
Contributor

Do you have a repro that doesn't involve clojurescript? Maybe upload the JS file emitted by the ClojureScript compiler.

@pesterhazy
Copy link

pesterhazy commented Oct 13, 2017

@MatrixFrog I've uploaded the generated js file here: https://raw.githubusercontent.com/pesterhazy/cljs-utf8-issue/master/generated/main.js

You can see that, like Chrome, isutf8 from moreutil 0.59 (you'll need to compile it from git: git://git.joeyh.name/moreutils) considers this file invalid:

$ moreutils-0.59/isutf8 generated/main.js
generated/main.js: line 272, char 1, byte offset 217: invalid UTF-8 code

I tried to reproduce using current GCC and GCL but failed - the generated file didn't contain the offending byte sequence

goog.json.Serializer.charsToReplace_=/\uffff/.test("<U+FFFF>")

(as seen in less).

Compiled directly with closurebuilder.py I get

goog.json.Serializer.charsToReplace_=/\uffff/.test("\uffff")

@pesterhazy
Copy link

Actually let me take that back. I've pushed a new version to https://github.com/pesterhazy/cljs-utf8-issue. I had to pass the --charset param, which defaults to us_ascii. With UTF-8, the compilation fails:

$ bin/build-js
closure-library/closure/bin/build/closurebuilder.py: Scanning paths...
closure-library/closure/bin/build/closurebuilder.py: 1579 sources scanned.
closure-library/closure/bin/build/closurebuilder.py: Building dependency tree..
closure-library/closure/bin/build/closurebuilder.py: Closure Compiler now natively understands and orders Closure dependencies and
is prefererred over using this script for performing JavaScript compilation.

Please migrate your codebase.

See:
https://github.com/google/closure-compiler/wiki/Managing-Dependencies

closure-library/closure/bin/build/closurebuilder.py: Compiling with the following command: java -client -jar compiler-latest/closure-compiler-v20170910.jar --flagfile /var/folders/rj/8lnf662s3mlgzv5mcq2r2fnc0000gn/T/tmpMRVQ0d
closure-library/closure/bin/build/closurebuilder.py: JavaScript compilation succeeded.
Traceback (most recent call last):
  File "closure-library/closure/bin/build/closurebuilder.py", line 293, in <module>
    main()
  File "closure-library/closure/bin/build/closurebuilder.py", line 285, in main
    out.write(compiled_source.encode('utf-8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 23486: ordinal not in range(128)

@pesterhazy
Copy link

All it takes is to set the charset to UTF-8 and require goog.net.XhrIo: https://github.com/pesterhazy/cljs-utf8-issue/blob/master/js-src/hello.js

@MatrixFrog
Copy link
Contributor

I'm getting slightly lost. Is this an issue in the Closure Compiler, or in closurebuilder.py?

@pesterhazy
Copy link

@MatrixFrog right you are.

I had to find out how to call the Closure Compiler jar including the Closure Library directly from the command line.

I've simplified the repo: https://github.com/pesterhazy/cljs-utf8-issue/. Now reproduction only requires one command bin/build-js.

The generated file is here: https://github.com/pesterhazy/cljs-utf8-issue/blob/master/out/output.js

@MatrixFrog MatrixFrog self-assigned this Oct 17, 2017
@MatrixFrog
Copy link
Contributor

+1 to just outputting ASCII as a workaround. However, it would be good for us to fix this as well.

Around https://github.com/google/closure-compiler/blob/master/src/com/google/javascript/jscomp/CodeGenerator.java#L1835 we check if the charset encoder can encode the character, but then we just do sb.append(c) instead of actually letting the encoder actually encode it. The simplest fix might to change how U+FFFF is handled in appendHexJavaScriptRepresentation: https://github.com/google/closure-compiler/blob/master/src/com/google/debugging/sourcemap/Util.java#L111

@MatrixFrog MatrixFrog added the P3 label Oct 17, 2017
@johnryan
Copy link

johnryan commented Oct 2, 2019

Is anyone aware of configuring this via the closure api ? I don't see any mention of the closure-output-charset in the docs: https://developers.google.com/closure/compiler/docs/api-ref

@brad4d brad4d added the internal-issue-created An internal Google issue has been created to track this GitHub issue label Oct 3, 2019
@brad4d
Copy link
Contributor

brad4d commented Oct 3, 2019

Internal Google issue http://b/142068222 created

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
internal-issue-created An internal Google issue has been created to track this GitHub issue P3
Projects
None yet
Development

No branches or pull requests

7 participants