GetCharCount either isn't simulating, or isn't simulating as described #11124
Tagging subscribers to this area: @dotnet/area-system-text-encoding
The doc is already saying
So, I've played with this some more, and I think I know what's going on. I'm laying this out mostly to put my own thoughts in order, but it's (barely) possible this might be of value to some future googler. There's a TLDR at the bottom. While we're talking about Decoder.GetCharCount, it helps if we begin with Decoder.GetChars. Let's look at a specific example:
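The code block that originally followed here was lost when the page was scraped. Based on the description below, it was presumably something close to this (the variable name `bytes` is my guess):

```csharp
// UTF-8 bytes for "😀B": four bytes for U+1F600, then one byte for 'B'.
byte[] bytes = { 0xF0, 0x9F, 0x98, 0x80, 0x42 };
```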
The example uses a UTF-8 sequence of bytes: a four-byte sequence for a smiley-face emoji, followed by the capital letter 'B' ("😀B"). When this is decoded, it will produce 3 characters: the emoji takes 2 characters (a surrogate pair) to represent, plus one more character for the letter B. So, what happens if you call:
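The original snippet is missing here; a self-contained reconstruction matching the prose (the overloads and buffer sizes are my assumptions, but `a1` is the name the text refers to):

```csharp
using System.Text;

byte[] bytes = { 0xF0, 0x9F, 0x98, 0x80, 0x42 };  // "😀B" in UTF-8
Decoder decoder = Encoding.UTF8.GetDecoder();
char[] chars = new char[8];

// Offer only the first 3 of the emoji's 4 bytes.
int a1 = decoder.GetChars(bytes, 0, 3, chars, 0);  // a1 == 0; the 3 bytes are cached
```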
It goes to decode the sequence, but it's going to need all 4 bytes in order to decode that smiley face, and I've limited the range to 3. Such being the case, it can't decode anything. a1 returns 0, and the three bytes get cached in the decoder. If I make a second call:
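The second call was also lost; something like this, continuing on the same decoder instance (the name `a2` is my guess, since the prose doesn't name the result):

```csharp
using System.Text;

byte[] bytes = { 0xF0, 0x9F, 0x98, 0x80, 0x42 };  // "😀B" in UTF-8
Decoder decoder = Encoding.UTF8.GetDecoder();
char[] chars = new char[8];

decoder.GetChars(bytes, 0, 3, chars, 0);           // returns 0; caches the 3 bytes
int a2 = decoder.GetChars(bytes, 3, 2, chars, 0);  // a2 == 3: both surrogates of 😀, plus 'B'
```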
It uses the 3 leftover bytes from the first call, along with the 2 bytes from this call, and returns us all 3 characters. With that in mind, what does this mean for GetCharCount? Let's start with flush = false:
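Reconstructing the flush = false call the prose describes (`b1` is the name the text uses):

```csharp
using System.Text;

byte[] bytes = { 0xF0, 0x9F, 0x98, 0x80, 0x42 };  // "😀B" in UTF-8
Decoder decoder = Encoding.UTF8.GetDecoder();

// Count only; no decoder state is consumed.
int b1 = decoder.GetCharCount(bytes, 0, 3, flush: false);  // b1 == 0: incomplete sequence
```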
Since it can't decode the full sequence, b1 returns 0. That's consistent with what GetChars will return, and lets us allocate a buffer of the right size to hold the GetChars output. What about:
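A likely shape for the interleaved sequence the prose walks through next (the variable names match the ones it refers to; the rest is my reconstruction):

```csharp
using System.Text;

byte[] bytes = { 0xF0, 0x9F, 0x98, 0x80, 0x42 };  // "😀B" in UTF-8
Decoder decoder = Encoding.UTF8.GetDecoder();
char[] chars = new char[8];

int b3 = decoder.GetCharCount(bytes, 0, 3, flush: false);  // 0: not enough data yet
int a3 = decoder.GetChars(bytes, 0, 3, chars, 0);          // 0, but the 3 bytes are now cached
int b4 = decoder.GetCharCount(bytes, 3, 2, flush: false);  // 3: cached bytes + these 2 complete "😀B"
int a4 = decoder.GetChars(bytes, 3, 2, chars, 0);          // 3
```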
Again, both b3 and a3 return 0, since there isn't enough data to decode any characters. However, the leftover bytes from that first GetChars call are still hanging around, so b4 and a4 are both able to return 3. Exactly what we want. Okay, that's with flush set to false. What if you set it to true?
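Presumably the flush = true call looked like this on a fresh decoder (`b5` is the name the text uses):

```csharp
using System.Text;

byte[] bytes = { 0xF0, 0x9F, 0x98, 0x80, 0x42 };  // "😀B" in UTF-8
Decoder decoder = Encoding.UTF8.GetDecoder();

// flush: true tells the decoder to treat this as the end of the data.
int b5 = decoder.GetCharCount(bytes, 0, 3, flush: true);  // b5 == 1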
Now b5 returns 1. Why does it do that? Ahh. Assume for a moment we're at the end of the stream. What should the decoder do? It's been told that it's never going to be passed the rest of the sequence, but it's got 3 bytes to decode. What it does is return 1, because that's what GetChars is going to return. And that character will be the replacement character ('�', U+FFFD), which is what the decoder emits when it encounters corrupt or incomplete data. So, by telling GetCharCount to "simulate clearing the internal state of the encoder after the calculation," we are affecting what it returns. You see that same result from GetChars (as expected):
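A self-contained reconstruction of the flush = true round trip the prose describes (variable names from the text; overloads assumed):

```csharp
using System.Text;

byte[] bytes = { 0xF0, 0x9F, 0x98, 0x80, 0x42 };  // "😀B" in UTF-8
Decoder decoder = Encoding.UTF8.GetDecoder();
char[] carray = new char[8];

int b5 = decoder.GetCharCount(bytes, 0, 3, flush: true);         // 1
int a5 = decoder.GetChars(bytes, 0, 3, carray, 0, flush: true);  // 1: carray[0] == '\uFFFD'

// The flush discarded the cached bytes, so 0x80 now arrives as an orphan continuation byte.
int b6 = decoder.GetCharCount(bytes, 3, 2, flush: true);         // 2
int a6 = decoder.GetChars(bytes, 3, 2, carray, 0, flush: true);  // 2: '\uFFFD' and 'B'
```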
Both b5 and a5 return 1, recognizing that no more data is coming to resolve the sequence. b6 and a6 both return 2, and carray contains two characters: the replacement character and 'B'. I can think of several occasions where this situation might occur:
To sum up (TLDR):
However, as we've seen, the existing text is also confusing to people who know just a little bit about how this works. Since we don't want to put all this junk into the docs, perhaps it's enough just to change the places that have a flush parameter to say:
While it still leaves a lot unsaid, at least it gives a clue about how to think about it. Might be just enough to allow the next guy to puzzle out how this all works. If this change still makes sense to you, let me know how to proceed.
Type of issue
Missing information
Description
After? What possible value could there be to simulating clearing the internal state after the calculation? Once calculation is complete, the method is done. Simulating a cleared buffer at that point makes no sense. And if we are indeed just 'simulating,' nothing about the decoder's state is actually changing so it's not like any subsequent calls are going to be impacted.
Surely you must mean BEFORE the calculation. If there's leftover data from a previous call to GetChars, whether you include any leftover bytes or not can certainly make a difference in the GetCharCount results. This is also consistent with what I see when I experiment with setting the flag.
Having said that, I just took a look at the source. And despite the fact that we're 'simulating,' there ARE changes to the state of the decoder from the GetCharCount call: two variables, named _mustFlush and _throwOnOverflow, get modified. And while neither variable is read in that source file (I'm looking at DecoderNLS.cs), the values are exposed to callers, so who knows what's up with that.
Simulating clearing the internal state of the encoder 'before' makes sense, but this whole page repeatedly asserts the opposite. And if you really do mean after, then some discussion of what would be affected by this simulation after that point would be in order.
Page URL
https://learn.microsoft.com/en-us/dotnet/api/system.text.decoder.getcharcount?view=net-9.0#system-text-decoder-getcharcount(system-byte*-system-int32-system-boolean)
Content source URL
https://github.com/dotnet/dotnet-api-docs/blob/main/xml/System.Text/Decoder.xml
Document Version Independent Id
f6d0f7c6-74c8-db97-da0e-d17770e3360e
Platform Id
d84fef13-e343-6d3c-0650-7ead5aaf0401
Article author
@dotnet-bot