New Check: Utf8EncodingCheck #265

Closed
rdiachenko opened this issue Sep 2, 2014 · 4 comments

Comments

@rdiachenko
Contributor

Source files must be encoded in UTF-8, as required by the Google Java Style Guide: http://google-styleguide.googlecode.com/svn/trunk/javaguide.html#s2.2-file-encoding

@rdiachenko
Contributor Author

@romani commented on Jun 7

A byte order mark (BOM) is not required in UTF-8 files: http://en.wikipedia.org/wiki/Byte-order_mark#UTF-8

23:11 ~/java/git-others/checkstyle/checkstyle [master|✔] $ sudo apt-get install moreutils
.....
23:11 ~/java/git-others/checkstyle/checkstyle [master|✔] $ file -i pom.xml 
pom.xml: application/xml; charset=utf-8
23:11 ~/java/git-others/checkstyle/checkstyle [master|✔] $ file -i import-control.xml 
import-control.xml: application/xml; charset=us-ascii
23:12 ~/java/git-others/checkstyle/checkstyle [master|✔] $ isutf8 pom.xml 
23:12 ~/java/git-others/checkstyle/checkstyle [master|✔] $ isutf8 import-control.xml 
23:12 ~/java/git-others/checkstyle/checkstyle [master|✔] $ xxd pom.xml | head -2 
0000000: 3c3f 786d 6c20 7665 7273 696f 6e3d 2231  <?xml version="1
0000010: 2e30 2220 656e 636f 6469 6e67 3d22 5554  .0" encoding="UT
23:13 ~/java/git-others/checkstyle/checkstyle [master|✔] $ xxd import-control.xml | head -2 
0000000: 3c3f 786d 6c20 7665 7273 696f 6e3d 2231  <?xml version="1
0000010: 2e30 223f 3e0a 3c21 444f 4354 5950 4520  .0"?>.<!DOCTYPE 
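The xxd output above shows both files starting with 3c3f ("<?"), that is, without a BOM. The same check could be sketched in Java as below. This is only an illustration: the class and method names are hypothetical and are not part of Checkstyle.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not an existing Checkstyle class.
final class Utf8BomSketch {

    /** Returns true if the data starts with the UTF-8 BOM bytes EF BB BF. */
    static boolean startsWithUtf8Bom(byte[] data) {
        return data.length >= 3
            && (data[0] & 0xFF) == 0xEF
            && (data[1] & 0xFF) == 0xBB
            && (data[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) throws IOException {
        // Report BOM presence for every file named on the command line.
        for (String name : args) {
            byte[] data = Files.readAllBytes(Paths.get(name));
            System.out.println(name + ": " + (startsWithUtf8Bom(data) ? "BOM" : "no BOM"));
        }
    }
}
```

A file without a BOM can still be valid UTF-8, so the absence of these three bytes says nothing about the encoding by itself.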

We might need to port the "isutf8" application from C to Java; the sources are at https://joeyh.name/code/moreutils/ , file "isutf8.c".
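Before porting anything, note that the JDK can already answer the narrower question "are these bytes well-formed UTF-8?" with a strict CharsetDecoder. The sketch below is not a port of isutf8.c; it swaps in the JDK decoder, and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not an existing Checkstyle class.
final class Utf8ValiditySketch {

    /** Returns true if the bytes decode as UTF-8 without any malformed sequence. */
    static boolean isValidUtf8(byte[] data) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] ascii = "<?xml version=\"1.0\"?>".getBytes(StandardCharsets.US_ASCII);
        byte[] latin1 = {(byte) 0xE9}; // "é" in ISO-8859-1, malformed as UTF-8
        System.out.println(isValidUtf8(ascii));  // true
        System.out.println(isValidUtf8(latin1)); // false
    }
}
```

Pure ASCII input passes this check because ASCII is a subset of UTF-8. Passing only means the bytes are well-formed UTF-8, not that the author actually saved the file as UTF-8.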

Attention: we cannot require UTF-8 only! Plain ASCII is preferable and should be accepted; see my example above.

We might need to use http://jchardet.sourceforge.net/ , which could give us fully functional detection for most encodings (not only UTF-8).

@rdiachenko
Contributor Author

@maxvetrenko commented on Aug 31

I read that InputStream uses the operating system's encoding. All libraries read bytes from an InputStream, so all the bytes are already encoded in the operating system's encoding.
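For context, in the JDK an InputStream itself yields raw bytes; the platform default charset only comes into play when bytes are turned into characters without an explicit charset (for example with `new InputStreamReader(in)`). Either way, the stream carries no information about which encoding produced it. A minimal sketch, with hypothetical class and method names, showing the same two bytes decoded under two explicitly chosen charsets:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not an existing Checkstyle class.
final class StreamDecodingSketch {

    /** Reads the whole stream and decodes it with the charset chosen by the caller. */
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xC3, (byte) 0xA9};
        // Same bytes, different text depending on the charset the reader is told to use.
        System.out.println(readAll(new ByteArrayInputStream(raw), StandardCharsets.UTF_8));
        System.out.println(readAll(new ByteArrayInputStream(raw), StandardCharsets.ISO_8859_1));
        System.out.println("platform default: " + Charset.defaultCharset());
    }
}
```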
I ran into the same problem: http://stackoverflow.com/questions/8305635/javahow-can-i-get-the-encoding-from-inputstream

@rdiachenko
Contributor Author

Here are the results of my investigation of encoding detection with three tools:

  1. Linux command "file -ib <file>"
  2. juniversalchardet (https://code.google.com/p/juniversalchardet/)
  3. jChardet (http://jchardet.sourceforge.net/)

| Actual encoding | file -ib <file>      | juniversalchardet    | jChardet                             |
|-----------------|----------------------|----------------------|--------------------------------------|
| Windows-1250    | charset=unknown-8bit | WINDOWS-1252         | windows-1252                         |
| ISO8859-2       | charset=iso-8859-1   | ISO-8859-7           | ISO-8859-7                           |
| CP866           | charset=iso-8859-1   | ISO-8859-5           | windows-1252                         |
| KOI8-R          | charset=utf-8        | UTF-8                | UTF-8                                |
| GBK             | charset=iso-8859-1   | IBM866               | [UTF-16BE, Big5, GB18030, UTF-16LE]  |
| SHIFT_JIS       | charset=utf-8        | UTF-8                | UTF-8                                |
| ISO2022-KR      | charset=utf-8        | UTF-8                | UTF-8                                |
| UTF-8           | charset=us-ascii     | No encoding detected | ASCII                                |

As input I used files in each of these encodings, with content appropriate to the encoding, on Linux (Fedora). The output may differ on Windows.

We can't say for sure what a file's encoding is, so detecting it is not a task for Checkstyle.
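One reason detection is unreliable is that the same bytes are often valid in more than one encoding, so a detector can only guess. A minimal sketch of this, using only charsets every JDK must provide; the class and method names are hypothetical:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not an existing Checkstyle class.
final class AmbiguousEncodingSketch {

    /** Returns the names of the candidate charsets that decode the data without error. */
    static List<String> decodableBy(byte[] data, Charset... candidates) {
        List<String> names = new ArrayList<>();
        for (Charset candidate : candidates) {
            try {
                candidate.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
                names.add(candidate.name());
            } catch (CharacterCodingException e) {
                // The bytes are not valid in this charset; skip it.
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // "é" as UTF-8, which is also two valid ISO-8859-1 characters.
        byte[] data = {(byte) 0xC3, (byte) 0xA9};
        System.out.println(decodableBy(data,
            StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1, StandardCharsets.US_ASCII));
        // prints [UTF-8, ISO-8859-1]
    }
}
```

Every byte sequence is valid ISO-8859-1, so strict decoding can rule encodings out but can never confirm a single one.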

@rdiachenko
Contributor Author

Won't fix
