mb_convert_encoding "\" (backslash) and "~" (tilde) convert failed to Shift_JIS #8281
Comments
cc @alexdowad |
Dear @youkidearitai, thanks very much for this report. I will explain the behavior that you are observing below. However, if you have other information or references which may help to make these matters more clear, that is welcome.

The Wikipedia article on Shift-JIS is a good reference to start with. As it explains, single-byte characters in Shift-JIS are not in the ASCII character set, but rather in the JIS X 0201 character set. In JIS X 0201, 0x5C represents a Yen sign (¥, U+00A5). Also, in JIS X 0201, 0x7E represents an overline or overbar (‾, U+203E). This is different from ASCII, where 0x5C represents a backslash (\) and 0x7E represents a tilde (~).

This means that when converting UTF-8 to Shift-JIS, we cannot correctly convert ASCII 0x5C to a Shift-JIS 0x5C byte, as that would change its meaning. However, Shift-JIS can also represent characters in the JIS X 0208 character set, using 2 bytes per character. Fortunately, JIS X 0208 kuten code 0x2141 is a "wave dash", which is similar to a tilde. JIS X 0208 kuten code 0x2140 is a backslash. So we can convert U+005C to the JIS X 0208 backslash, and U+007E to the JIS X 0208 wave dash.

The problem comes when converting in the reverse direction. ASCII has (halfwidth) backslash and tilde. Unicode has both halfwidth and fullwidth backslashes and tildes. However, JIS X 0208 has only one backslash character, which is generally treated as fullwidth, and one "wave dash" character, which is also treated as fullwidth. (JIS X 0201 has neither.)

The upshot of all this is that converting from Unicode → JIS X 0201/0208 → Unicode is not a lossless conversion. If we convert JIS X 0208 0x2141 to the halfwidth tilde, then you may be happy, but others who were expecting to get a fullwidth tilde will not be. Likewise if we convert JIS X 0208 backslash to the halfwidth backslash.

You might wonder why you are only seeing this behavior on PHP 8.1.
In short, it is because

Generally, converting text back and forth between different legacy text encodings and expecting to get back what you started with is problematic. Since Unicode was designed to be a superset of all previous text encodings, it is generally best to do all processing in Unicode if possible. If that's not possible, then the best thing to do depends on the situation. If you must receive text in a legacy encoding and output it in the same legacy encoding, it may be best to do all the processing in the same encoding rather than converting to and from Unicode. On the other hand, if you receive text in legacy encodings but do not need to output it, then it would be better to convert to Unicode immediately when ingesting the text, and never convert it back.

(Note that what you are doing in the sample code is something which is almost never necessary or advisable: taking nice Unicode text, converting it to a legacy format, and then back to Unicode again. Converting from legacy → Unicode → legacy might sometimes make sense; converting from Unicode → legacy → Unicode almost never does.)

Please feel free to share any clarifying remarks, and thanks again. |
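The two readings of the single bytes 0x5C and 0x7E described in the comment above can be sketched as a small lookup. This is only an illustration: `sjis_single_byte_codepoint()` and its `$strict` flag are hypothetical names made up for this sketch, not part of mbstring, and the function only covers single bytes below 0x80.

```php
<?php
// Illustrative sketch only. sjis_single_byte_codepoint() is a hypothetical
// helper (an assumption for this example), not an mbstring function.
// It returns the Unicode code point for a single Shift-JIS byte < 0x80.
function sjis_single_byte_codepoint(int $byte, bool $strict): int
{
    if ($byte < 0x00 || $byte > 0x7F) {
        throw new InvalidArgumentException('Only bytes 0x00-0x7F are handled');
    }

    // JIS X 0201 differs from ASCII at exactly these two positions.
    $jisx0201 = [
        0x5C => 0x00A5, // YEN SIGN instead of backslash
        0x7E => 0x203E, // OVERLINE instead of tilde
    ];

    if ($strict && isset($jisx0201[$byte])) {
        return $jisx0201[$byte];
    }

    // ASCII-compatible reading: the code point equals the byte value.
    return $byte;
}

printf("strict 0x5C -> U+%04X\n", sjis_single_byte_codepoint(0x5C, true));
printf("compat 0x5C -> U+%04X\n", sjis_single_byte_codepoint(0x5C, false));
printf("strict 0x7E -> U+%04X\n", sjis_single_byte_codepoint(0x7E, true));
printf("compat 0x7E -> U+%04X\n", sjis_single_byte_codepoint(0x7E, false));
```

The `$strict` branch corresponds to the spec-following reading (JIS X 0201), and the fall-through corresponds to the reading where both bytes are treated as ASCII.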
Dear @alexdowad, thank you very much for your reply. I know that 0x5C and 0x7E in JIS X 0201 differ from the ASCII backslash (0x5C) and tilde (0x7E). But most Japanese users do not rely on a strict conversion of 0x5C to U+00A5 and 0x7E to U+203E. At the very least, there are very few cases where backslash and tilde are converted to multibyte characters. The "現実的解決" (Realistic solution) part of the Yen sign problem section on Japanese Wikipedia explains a realistic solution.
Some Japanese users see a Yen sign on screen, but the underlying code is almost always 0x5C. For this reason, Japanese software generally passes 0x5C and 0x7E through as they are, whether the text is treated as ASCII or as JIS X 0201. Reference: プログラマのための文字コード入門 (ISBN978-4-7741-4164-0), pages 316-318 |
@youkidearitai, Thanks for those references. This was particularly interesting:
That is completely insane. Whoever came up with that idea deserves an award for bad design. Thanks also for the reference to the book プログラマのための文字コード入門, though it doesn't seem there is any way I can access a copy right now. I will be in Japan in a few months (COVID-19 situation allowing) and could look it up then, but it would be nice to conclude this issue faster than that. 😆 As we consider this matter further, I think it would also help if you could explain: What is your use case here? How would treating SJIS 0x5C as a halfwidth backslash (U+005C) rather than a Yen sign (U+00A5) affect the software which you are developing or maintaining? What is the typical source of SJIS text data for PHP-based software that you are working on, or are aware of? Text files uploaded by users? Direct text inputs in entry fields on a web site, from users whose OS supports SJIS input? Are we talking about newly created data, or legacy data which has been around for a long time, but you still need to process? |
@alexdowad, thanks for the reply. I'll answer your questions. Shift_JIS is still in active use, and a common use case is importing CSV into Excel. Since Excel could only import CSV in Shift_JIS for a long time, there may still be cases where data is converted to SJIS. In other words, mb_convert_encoding is called when downloading and uploading CSV. This conversion of 0x5C and 0x7E can be confusing. In most Japanese character code implementations, it seems customary to pass 0x5C and 0x7E through untouched. Japanese PHP users may have other occasions to use Shift_JIS, so I was very interested in discussing this problem. |
OK. So you have users uploading CSVs which are Shift-JIS encoded, and you also export Shift-JIS-encoded CSVs for download? Is it correct to say that when your users upload Shift-JIS-encoded CSVs, those may include 0x5C bytes, and the users expect those to be treated as backslashes? Do you offer options for the encoding of these files? Or you export in Shift-JIS encoding by default, because that is what works for the greatest number of your users?
That's true; but remembering the context here, we are looking at use cases for conversion between Shift-JIS and Unicode, and trying to determine whether there are more cases where interpreting 0x5C/0x7E according to spec is preferable, or more cases where interpreting them as ASCII is preferable. The discussion is not general, but specific to Does Shift-JIS text that PHP-based applications receive from users typically include Windows pathnames? After converting to ASCII/UTF-8/etc., would your PHP code typically take those pathnames and interpolate them into a Windows shell command, or pass them to
It is certainly confusing. My preference is to follow published specifications when possible; I believe this tends to reduce confusion in the long term. However, if there are strong practical reasons to deviate from specifications, that can certainly be done. However, we do not want to flip-flop back and forth. To avoid flip-flopping, we need to thoroughly understand all the implications of either following the spec or deviating from it. After all factors are considered, and as many interested parties as possible are consulted, if the final decision is to change, then we should document the reason for the decision and stick to it. One of the great challenges involved in working on open-source, and especially popular projects like PHP, is that users only speak up when something is not working well for them. With proprietary, in-house software, you generally know who all the users are and can survey them to see what they think about proposed changes. With open source, you often only discover later how your users were impacted by some change. This is a good reason to carefully think changes through and gather as much information as possible before deciding. You mentioned that it seems customary to treat SJIS 0x5C and 0x7E as ASCII; if you can share as many specific examples as possible of existing software which does or doesn't do this, that would be appreciated. What is 'customary' is definitely one important factor to consider, since it shapes people's expectations. |
There are tons of cases where Unicode and Shift_JIS conversions are involved in CSV uploads and downloads, and it can be difficult to find out how many there are. However, many users will find it very difficult to have 0x5C and 0x7E converted "strictly" the moment they upgrade to PHP 8.1. This is reason enough for Japanese users to hesitate to upgrade to PHP 8.1, because they perceive the result as a conversion to a different character.
As you said, I think it is correct to follow the published specifications. However, this is a change that breaks backwards compatibility, so I feel it should be discussed through a PHP RFC or a similar process. As for convention (what is customary), at least in Python 3, when 0x5C or 0x7E is converted to Shift_JIS, it is converted as it is.
I also tried it with Ruby 2.7. It, too, converted them as they are.
|
Apparently, there are several variations of Shift JIS in use, of which some are already supported by MBString. In this case @youkidearitai is likely looking for Windows CP 932 ( I think this is just something that we should document better. |
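As a rough illustration of the suggestion above, the following sketch converts a string containing a backslash and a tilde to CP932 and back. It assumes the mbstring extension is loaded and that the `CP932` encoding passes ASCII bytes through unchanged, which is the behavior the comment above points to; the sample path is made up for the example.

```php
<?php
// Sketch under assumptions: mbstring is loaded and 'CP932' treats bytes
// below 0x80 as ASCII. The sample string is invented for illustration.
$utf8 = 'C:\\temp\\~draft.csv'; // single-quoted: each \\ is one backslash

// UTF-8 -> CP932: backslash and tilde should stay as bytes 0x5C and 0x7E.
$cp932 = mb_convert_encoding($utf8, 'CP932', 'UTF-8');

// CP932 -> UTF-8: the round trip should give back the original string.
$back = mb_convert_encoding($cp932, 'UTF-8', 'CP932');

echo bin2hex($cp932), "\n";
var_dump($cp932 === $utf8); // ASCII-only input, so the bytes should match
var_dump($back === $utf8);
```

Whether `SJIS` or `CP932` is the right name to pass depends on where the data comes from; for files produced by Windows applications such as Excel, CP932 is the likelier match.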
@cmb69, excellent point! Indeed, CP932 is Microsoft's version of Shift-JIS. |
As a Japanese user, I find it sad that this change wasn't communicated properly. |
Indeed. Probably that was because the change was considered to be a "bug fix". Although @cmb69 has made a good point (that the text encoding which you are interested in does actually still exist under a different name), I don't think that is necessarily the 'last word' on this issue. We are still open for suggestions. |
Just did a brief search for Composer packages which might be affected by this issue. @SUKOHI's FluentCsv library uses However, @gh640 has another library called sjis-zip which uses Comments from any of these developers on this GH issue would be much appreciated. Does anyone have a good way to do a text search across all Composer packages? I seem to remember that @nikic has, on occasion, mentioned that "the top 2000 Composer packages don't use such-and-such"; maybe he has something along those lines? |
Hi @alexdowad. First of all, I would like to express my gratitude and respect to you and the original developers of mbstring, as I know your refactoring achievements through this article. Conversion between multiple character sets is always difficult, and this problem has plagued the Japanese for 30 years with Unicode, and the conversion map between JIS and Unicode in the original mbstring is not just a bug in the spec. As the behavior of Ruby shows, it is based on the customs and use cases of many Japanese users from that time to the present. unicode.org provides JIS and Unicode conversion maps at https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/, but this is not part of the standard and is just reference data. Please note in particular. Replacing The difference between these ideas is also expressed in iconv and nkf. The Japanese-implemented nkf is not a standard command, but it is still used by old school Japanese UNIX users.
In addition to Wikipedia, several companies and Japanese developers explain how to implement JIS and Unicode mapping.
As mentioned in the discussion so far, Microsoft has burned an extraordinary obsession with displaying backslash as Microsoft still describes CP932 to users as Shift_JIS or "シフトJIS", so due to their efforts many Japanese are unaware of these encodings and character sets. The same is true for many Japanese PHP programmers. This is a post by a forum user, but it contains an Excel screen that says "シフトJIS" (Shift_JIS). Although Microsoft has increased Unicode support for Excel in recent years, Japanese users still believe that converting to Shift_JIS(CP932) for importing and exporting data between the Web and Excel is a safe and secure method.
These are just a few, and some of these articles show code that outputs broken CSV, but many Japanese users are more concerned about "文字化け" (mojibake). It is believed that many Japanese companies using Windows still use that method. Unfortunately, many of them are not interested in disseminating information to the tech community. (Moreover, many programmers hired by such companies may not be aware of Composer's existence...) I think changing the character encoding and conversion map should have been a careful debate, but I agree that "flip-flops" can cause further confusion. Converting all The improvement I suggest is to specify in PHP: mb_convert_encoding - Manual that the conversion map has changed in PHP 8.1 and provide a backwards compatible and secure workaround. Since the only characters converted from ASCII in PHP 8.1 are $str = strtr(mb_convert_encoding($str, 'UTF-8', 'SJIS'), ['¥' => '\\', '‾' => '~']); I am grateful to all of you for your efforts on these issues. |
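The `strtr()` workaround quoted in the comment above can be wrapped in a small function. This is a minimal sketch, not an official recommendation: `sjis_to_utf8_compat()` is a hypothetical name, the mbstring extension is assumed to be loaded, and the replacement assumes that any U+00A5 or U+203E in the converted text came from the single bytes 0x5C and 0x7E.

```php
<?php
// Hypothetical wrapper (the name is an assumption for this example) around
// the strtr() workaround quoted in the comment above.
function sjis_to_utf8_compat(string $sjis): string
{
    // Depending on the PHP version, 'SJIS' may decode 0x5C / 0x7E either as
    // backslash / tilde or as YEN SIGN / OVERLINE.
    $utf8 = mb_convert_encoding($sjis, 'UTF-8', 'SJIS');

    // Normalize the strict result back to the ASCII characters. If the
    // conversion already produced backslash and tilde, this is a no-op.
    return strtr($utf8, [
        "\u{00A5}" => '\\', // YEN SIGN -> backslash
        "\u{203E}" => '~',  // OVERLINE -> tilde
    ]);
}

// Bytes 0x5C and 0x7E embedded in otherwise plain ASCII input.
echo sjis_to_utf8_compat("a\x5Cb\x7E"), "\n";
```

Because the replacement is a no-op when the conversion already yields ASCII, the wrapper gives the same result under either mapping.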
@zonuexe Wow!!! I am floored by the quality, detail, and lucidity of your comment here. (Picks jaw up from floor.) Thanks for your contribution to this discussion. It will take me some time to digest all the references in your comment. However, if I can ask a couple of questions first...
|
@alexdowad Thank you for your response.
This is a difficult decision, but I think introducing strict mode with ini or other parameters will increase uncertainty and make it difficult to predict and convert results. Another option is to add another encoding name like "SJIS-strict" or "SJIS-compat", but it seems a bit obscure. Overall, I think it is realistic to document the current implementation. |
As a Japanese user, I think rolling back for now is also an acceptable choice. https://packagist.org/php-statistics |
There are good reasons and pains for both rolling back and not rolling back, so take my opinion as one of the judgments. As @sj-i said on Twitter, PHP's mbstring has a 20-year history, and I found it important to point out that its behavior is a not-so-small part of the SJIS conversion convention. |
I can easily imagine that there are a lot of websites and systems with features that depend on the existing mbstring behavior around handling these encodings. If they notice a BC break in the latest PHP, they may stop upgrading to newer versions. Since the real world is chaotic when it comes to Japanese character encoding, introducing "clean" solutions is very difficult. |
BTW, on the current implementation, SJIS-win is an alias of CP932, thus the return value of mb_list_encodings() never contains 'SJIS-win'. I have created a separate issue for this. #8308 @alexdowad |
Thanks for opening that issue. I have added some comments there. |
Just read a few of the articles linked to above by @zonuexe, with a few more still to read. |
@alexdowad |
The discussion here seems to have fallen quiet. I think the discussion in #8308 has convinced me that adjusting mappings for I would like to submit a PR to revert that change, but first, are there any other mappings for SJIS or SJIS variants which are a concern for @youkidearitai or other interested parties who are following this thread? Or is it just SJIS 0x5C → Unicode and SJIS 0x7E → Unicode? |
Sorry, I closed this issue by mistake. I would also like to hear what other Japanese users think. |
I've gathered various opinions on this issue, |
@youkidearitai Absolutely, that will be done! Thank you very much for reporting this issue and for following up on it. |
Thank you, everyone! |
Description
Backslash (\) and tilde (~) are converted to Shift_JIS (SJIS) using mb_convert_encoding, but the converted characters are wrong. Please see the code below and 3v4l: https://3v4l.org/nSVPB. This reproduces only on PHP 8.1.
The following code:
Resulted in this output:
But I expected this output instead:
PHP Version
PHP 8.1.4
Operating System
No response