GetCharCount either isn't simulating, or isn't simulating as described #11124

Open
rindlespot opened this issue Mar 27, 2025 · 3 comments
@rindlespot
Type of issue

Missing information

Description

simulate clearing the internal state of the encoder after the calculation

After? What possible value could there be to simulating clearing the internal state after the calculation? Once calculation is complete, the method is done. Simulating a cleared buffer at that point makes no sense. And if we are indeed just 'simulating,' nothing about the decoder's state is actually changing so it's not like any subsequent calls are going to be impacted.

Surely you must mean BEFORE the calculation. If there's leftover data from a previous call to GetChars, whether you include any leftover bytes or not can certainly make a difference in the GetCharCount results. This is also consistent with what I see when I experiment with setting the flag.

Having said that, I just took a look at the source. And despite the fact that we're 'simulating,' there ARE changes to the state of the decoder from the GetCharCount call. Two variables, _mustFlush and _throwOnOverflow, get modified. And while neither variable is read in that source file (I'm looking at DecoderNLS.cs), the values are exposed to callers, so who knows what's up with that.

Simulating clearing the internal state of the encoder 'before' makes sense, but this whole page repeatedly asserts the opposite. And if you really do mean after, then some discussion of what happens after that would be affected by this simulation would be in order.

Page URL

https://learn.microsoft.com/en-us/dotnet/api/system.text.decoder.getcharcount?view=net-9.0#system-text-decoder-getcharcount(system-byte*-system-int32-system-boolean)

Content source URL

https://github.com/dotnet/dotnet-api-docs/blob/main/xml/System.Text/Decoder.xml

Document Version Independent Id

f6d0f7c6-74c8-db97-da0e-d17770e3360e

Platform Id

d84fef13-e343-6d3c-0650-7ead5aaf0401

Article author

@dotnet-bot

@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Mar 27, 2025
@github-actions github-actions bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Mar 27, 2025
@jozkee jozkee added area-System.Text.Encoding and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Mar 28, 2025
Contributor

Tagging subscribers to this area: @dotnet/area-system-text-encoding

@tarekgh
Member

tarekgh commented Mar 28, 2025

The doc already says "A parameter indicates whether to clear the internal state of the decoder after the calculation." We could use that exact text to replace "simulate clearing the internal state of the encoder after the calculation". @rindlespot, are you interested in submitting a PR for this?

@tarekgh tarekgh added this to the Backlog milestone Mar 28, 2025
@tarekgh tarekgh removed the untriaged New issue has not been triaged by the area owner label Mar 28, 2025
@rindlespot
Author

So, I've played with this some more, and I think I know what's going on. I'm laying this out mostly to put my own thoughts in order, but it's (barely) possible this might be of value to some future googler. There's a TLDR at the bottom.

While we're talking about Decoder.GetCharCount, it helps if we begin with Decoder.GetChars. Let's look at a specific example:

byte[] barray = new byte[] { 0xf0, 0x9f, 0x98, 0x80, 0x42 };

This is a UTF-8 sequence of bytes. It contains a 4-byte sequence for a smiley-face emoji, followed by a capital letter 'B' ("😀B"). When this is decoded, it will produce 3 characters (UTF-16 chars): the emoji takes 2 chars to represent as a surrogate pair, plus one more for the letter B.

So, what happens if you call:

int a1 = decoder.GetChars(barray, 0, 3, carray, 0, false);

It goes to decode the sequence, but it's going to need all 4 bytes in order to decode that smiley face, and I've limited the range to 3. Such being the case, it can't decode anything. a1 returns 0, and the three bytes get cached in the decoder. If I make a second call:

int a2 = decoder.GetChars(barray, 3, 2, carray, 0, false);

It uses the 3 leftover bytes from the first call, along with the 2 bytes from this call, and returns us all 3 characters.
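For anyone who wants to run the two calls above end to end, here is a minimal sketch. It assumes a UTF-8 decoder obtained from Encoding.UTF8.GetDecoder() and an 8-element carray (the thread never shows how decoder or carray were created), and it uses C# top-level statements.

```csharp
using System;
using System.Text;

// "😀B" encoded as UTF-8: a 4-byte emoji followed by a 1-byte 'B'.
byte[] barray = { 0xf0, 0x9f, 0x98, 0x80, 0x42 };
char[] carray = new char[8];                    // assumed size; anything >= 3 works here
Decoder decoder = Encoding.UTF8.GetDecoder();   // assumed: a stateful UTF-8 decoder

// First call sees only 3 of the emoji's 4 bytes, so no char can be produced yet.
// The 3 bytes are kept inside the decoder.
int a1 = decoder.GetChars(barray, 0, 3, carray, 0, false);

// Second call supplies the remaining 2 bytes; the decoder combines them
// with the 3 bytes it held back from the first call.
int a2 = decoder.GetChars(barray, 3, 2, carray, 0, false);

string text = new string(carray, 0, a2);
Console.WriteLine($"a1={a1}, a2={a2}, text={text}");
```

The point of the sketch is only that the second call's result depends on bytes that were passed to the first call.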

With that in mind, what does this mean for GetCharCount? Let's start with flush = false:

int b1 = decoder.GetCharCount(barray, 0, 3, false);

Since it can't decode the full sequence, b1 returns 0. That's consistent with what GetChars will return, and lets us allocate a buffer of the right size for the GetChars output. What about:

    int b3 = decoder.GetCharCount(barray, 0, 3, false);
    int a3 = decoder.GetChars(barray, 0, 3, carray, 0, false);

    int b4 = decoder.GetCharCount(barray, 3, 2, false);
    int a4 = decoder.GetChars(barray, 3, 2, carray, 0, false);

Again, both b3 and a3 return 0, since there isn't enough data to decode any characters. However, the leftover bytes from that first GetChars call are still hanging around, so b4 and a4 are both able to return 3. Exactly what we want.
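A small sketch of the same pairing, with one extra check: it asks GetCharCount twice before each GetChars call, on the assumption that counting never consumes the buffered bytes. Probe is a hypothetical helper written for this illustration, and the decoder and carray setup are assumed as before.

```csharp
using System;
using System.Text;

byte[] barray = { 0xf0, 0x9f, 0x98, 0x80, 0x42 };
char[] carray = new char[8];                    // assumed size
Decoder decoder = Encoding.UTF8.GetDecoder();   // assumed: a stateful UTF-8 decoder

// Hypothetical helper: count twice, then decode, and report all three numbers.
(int first, int second, int decoded) Probe(int index, int count)
{
    int first = decoder.GetCharCount(barray, index, count, false);
    // Counting a second time should give the same answer if counting consumes nothing.
    int second = decoder.GetCharCount(barray, index, count, false);
    // Only GetChars actually moves bytes into (or out of) the decoder's buffer.
    int decoded = decoder.GetChars(barray, index, count, carray, 0, false);
    return (first, second, decoded);
}

var head = Probe(0, 3);   // first 3 bytes of the emoji
var tail = Probe(3, 2);   // last emoji byte plus 'B'
Console.WriteLine(head);
Console.WriteLine(tail);
```

If the reasoning above holds, all three numbers agree within each chunk, which is what makes GetCharCount usable for sizing the output buffer.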

OK, that's with flush set to false. What if you set flush to true?

    int b5 = decoder.GetCharCount(barray, 0, 3, true);

Now b5 returns 1. Why does it do that? Ahh.

Assume for a moment we're at the end of the stream. What should the decoder do? It's been told that it's never going to be passed the rest of the sequence, but it still has 3 bytes to decode. So it returns 1, because that's how many characters GetChars is going to return. That character will be the replacement character ('�', U+FFFD), which is what the decoder emits by default when it encounters corrupt or incomplete data.

So, by telling GetCharCount to "simulate clearing the internal state of the encoder after the calculation," we are affecting what it returns. You see the same result from GetChars (as expected):

int b5 = decoder.GetCharCount(barray, 0, 3, true);
int a5 = decoder.GetChars(barray, 0, 3, carray, 0, true);

int b6 = decoder.GetCharCount(barray, 3, 2, true);
int a6 = decoder.GetChars(barray, 3, 2, carray, 0, true);

Both b5 and a5 return 1, recognizing that it's not going to get any more data to resolve the sequence. b6 and a6 both return 2, and carray contains two characters: the unknown character and 'B'.
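To see which characters actually land in carray, the calls can be wrapped in a runnable sketch like the one below. It assumes the default UTF-8 decoder with its default replacement fallback on a recent .NET version; the code point formatting is added purely for inspection.

```csharp
using System;
using System.Text;

byte[] barray = { 0xf0, 0x9f, 0x98, 0x80, 0x42 };
char[] carray = new char[8];                    // assumed size
Decoder decoder = Encoding.UTF8.GetDecoder();   // assumed: default replacement fallback

// flush: true tells the decoder no more bytes are coming for this sequence.
int b5 = decoder.GetCharCount(barray, 0, 3, true);
int a5 = decoder.GetChars(barray, 0, 3, carray, 0, true);
char truncated = carray[0];   // what the incomplete emoji turned into

// The decoder was flushed, so 0x80 arrives with no lead byte in front of it.
int b6 = decoder.GetCharCount(barray, 3, 2, true);
int a6 = decoder.GetChars(barray, 3, 2, carray, 0, true);

Console.WriteLine($"b5={b5}, a5={a5}, char=U+{(int)truncated:X4}");
Console.WriteLine($"b6={b6}, a6={a6}, chars=U+{(int)carray[0]:X4} U+{(int)carray[1]:X4}");
```

Because the first GetChars call flushed the decoder, nothing carries over into the second pair of calls, which is why the lone continuation byte 0x80 cannot be completed.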

I can think of several occasions where this situation might occur:

  • At the end of a stream.
  • If we're about to Seek to a new position within the file.
  • If we just want to extract a short section of characters from a stream ("In energy news today, a new breakthrough...").
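The end-of-stream case can be sketched as a read loop that passes flush: true only once the stream reports no more data. This is an illustration under assumptions, not code from the thread: the MemoryStream contents, the deliberately tiny 2-byte read buffer, and the variable names are all made up for the example.

```csharp
using System;
using System.IO;
using System.Text;

// Assumed input: 'A' followed by only the first 3 bytes of the 4-byte emoji.
using var stream = new MemoryStream(new byte[] { 0x41, 0xf0, 0x9f, 0x98 });
Decoder decoder = Encoding.UTF8.GetDecoder();
var text = new StringBuilder();
byte[] bytes = new byte[2];   // tiny on purpose, so the sequence is split across reads
char[] chars = new char[8];

while (true)
{
    int read = stream.Read(bytes, 0, bytes.Length);
    bool atEnd = read == 0;

    // While data keeps arriving, incomplete sequences stay buffered in the decoder.
    // On the final (empty) call, flush: true forces whatever is left to be resolved.
    int n = decoder.GetChars(bytes, 0, read, chars, 0, atEnd);
    text.Append(chars, 0, n);

    if (atEnd) break;
}

string result = text.ToString();
Console.WriteLine(result.Length);
```

With the default replacement fallback, the dangling three bytes should come out as a single U+FFFD on the final flushing call, after the 'A' from the first read.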

To sum up (TLDR):

  • The existing docs are correct and GetCharCount does 'simulate clearing the internal state of the encoder after the calculation.' After, not before (as I expected).
  • The existing decoder code is correct. This is how things should function. No change required. While changes are made to the internal state of the decoder if flush is set to true, they don't appear to affect subsequent calls to either GetChars or GetCharCount.

However, as we've seen, the existing text is also confusing to people who know just a little bit about how this works. Since we don't want to put all this junk into the docs, perhaps it's enough just to change the places that have a flush parameter to say:

true to simulate clearing the internal state of the encoder after the calculation such as might happen at the end of a stream; otherwise, false.

While it still leaves a lot unsaid, at least it gives a clue about how to think about it. Might be just enough to allow the next guy to puzzle out how this all works.

If this change still makes sense to you, let me know how to proceed.
