-
Notifications
You must be signed in to change notification settings - Fork 31.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
buffer.indexOf is incorrect in utf16le encoding for odd byteOffset #26448
Comments
It would be much better if you showed exactly what the error log displays. |
@hkjackey ucs2 (utf16le) encoding is always two bytes. So byteOffset must by multiple of 2. |
Currently, indexOf with UTF16-LE is odd anyway in another respect as well, namely that we only search for the needle at even indices: > Buffer.from('00aaaa', 'hex').indexOf('\uaaaa', 'utf16le')
-1
> Buffer.from('0000aaaa', 'hex').indexOf('\uaaaa', 'utf16le')
2 I’m not sure if this is “correct”, because some people might expect this? /cc @nodejs/buffer |
I think it's correct as we wouldn't really want to break codepoint boundaries (although need to recheck if same holds for UTF-8 or any other encoding). |
You mean code unit boundaries. This isn't relevant for UTF-8 because a UTF-8 code unit is a byte. I suggest closing as expected behavior. |
No, I do mean a codepoint (1-4 bytes in case of UTF-8). |
In UTF-16 a codepoint can be represented by one or two 16-bit code units. I don't think Buffer.indexOf has any checks for that. |
Hmm that's another good question (although personally I'd prefer it to check if it doesn't). |
@mkopa Thanks for the reply. I agree that utf16le encoding is always two bytes, but buffer is not. I regarded the current behavior as a bug because according to the NodeJS specification: I would accept the current behavior not a bug if the NodeJS specification mentioned that. |
Hi everybody! Please consider the following 5 lines of code: I write a string at odd byteOffset 1 but finding it using indexOf at the same position |
That might be a breaking change - if the UTF-16 data starts at an even offset, However, the behavior in the issue description is clearly a bug - the offset should be rounded up to an even value, which is a simple fix. Thoughts? |
I would disagree – I think |
@seishun Would that work, just using |
If the offset is odd, |
@seishun Yeah, I think that’s the behaviour that I would expect – by specifying an odd offset to begin with rather than an even one, I’m saying that I want to look at odd + even pairs rather than even + odd ones. |
That seems unintuitive. Plus it wouldn't fix the original issue - |
Thank everybody for handling this issue! |
@BridgeAR Have you tried it? let buf = Buffer.from('\u6881\u6882\u6881', 'utf16le');
console.log(buf.indexOf('\u6881', 1, 'utf16le'));
console.log(buf.slice(1).indexOf('\u6881', 'utf16le')); |
@seishun Yeah, I think that’s (another) bug then. :/ |
@addaleax So what would be the expected behaviour? |
@seishun I would say in your example both methods should return -1. I think it shouldn’t matter whether a i.e., if |
I suggest returning |
So we have a conflict of opinions. @addaleax (and @BridgeAR?) thinks How should we proceed? |
I’m personally also okay with what @hkjackey’s suggestion – that’s most consistent with what |
That brings us back to my comment. If you think it's fine then I can just delete that test and my PR is practically ready. |
Previously, buffer.indexOf would look only at even indexes when the encoding is UTF-16. This behavior is undocumented and unintuitive. Detemplatize string_search.h because it is now used with only one type. Fixes: nodejs/node#26448
There's been no further discussion or activity here but it appears the originally reported behavior is still occurring. Is this something we should go ahead and fix or should we just document the issue and move on? |
I sincerely appreciate everyone's efforts over the past 6 years. |
Version: v11.10.1
Platform: Windows 10 (64-bits)
Subsystem: buffer
File Encoding: UTF8
Please consider the following 2 lines of code:
let buf = Buffer.from('\u6881\u6882\u6881', 'utf16le');
console.log(buf.indexOf('\u6881', 1, 'utf16le'));
In this example, the expected correct output should be 4.
However, the result is 0.
The text was updated successfully, but these errors were encountered: