Windows: Log outputs OEM format, shows strange character encodings for unsupported utf8 characters in file names #1440

Open
ember91 opened this issue Jan 25, 2025 · 2 comments


ember91 commented Jan 25, 2025

Describe the bug

Scanning a file with Unicode characters in its name outputs the name in a strange encoding that is neither UTF-8 nor the system encoding.

How to reproduce the problem

This was reproduced on a fresh Windows installation of clamav-1.4.2.win.x64.msi with system encoding CP1252.

I created an empty file called file_öταБЬℓσ.txt.

I ran the following Python script in the same directory as the test file. Make sure your editor is set to UTF-8, and replace the path to clamscan.exe with your own:

import locale
import subprocess as sp

print('Preferred encoding:', locale.getpreferredencoding())
result = sp.run([r'C:\Program Files\ClamAV\clamscan.exe', 'file_öταБЬℓσ.txt'], stdout=sp.PIPE, stderr=sp.PIPE)
print(result.stdout)

Which outputs (I cut some of the output short with "..."):

Preferred encoding: cp1252
b'C:\\Users\\User\\clamav_unicode_test\\file_\x94\xe7\xe0??l\xe5.txt: ...'

The output text is encoded as CP437, with some characters replaced by "?". Try it yourself in Python with b'\x94\xe7\xe0??l\xe5.txt'.decode('cp437'). I have no idea why. I expected the output to be either UTF-8, which would be best, or CP1252, which is the system encoding of my Windows installation.
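To make the comparison concrete, here is a minimal sketch that decodes the file-name bytes from the output above with each of the three candidate encodings. The helper name `try_decode` is only for illustration and is not part of ClamAV or the original script.

```python
# File-name bytes copied from the clamscan output shown above.
raw = b'file_\x94\xe7\xe0??l\xe5.txt'


def try_decode(data: bytes, codec: str) -> str:
    """Decode data with codec, or describe why decoding failed (hypothetical helper)."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError as exc:
        return f"<not valid {codec}: {exc.reason}>"


# Compare the OEM code page, the ANSI code page, and UTF-8.
for codec in ("cp437", "cp1252", "utf-8"):
    print(f"{codec:>7}: {try_decode(raw, codec)}")
```

Only the CP437 line reproduces recognizable parts of the original name. CP1252 decodes without an error but yields unrelated characters, and the bytes are not valid UTF-8.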

The documentation says: "As a side note, console output (stdin and stderr) will always be OEM encoded, even when redirected to a file."

Output of PowerShell [System.Text.Encoding]::Default:

IsSingleByte      : True
BodyName          : iso-8859-1
EncodingName      : Western European (Windows)
HeaderName        : Windows-1252
WebName           : Windows-1252
WindowsCodePage   : 1252
IsBrowserDisplay  : True
IsBrowserSave     : True
IsMailNewsDisplay : True
IsMailNewsSave    : True
EncoderFallback   : System.Text.InternalEncoderBestFitFallback
DecoderFallback   : System.Text.InternalDecoderBestFitFallback
IsReadOnly        : True
CodePage          : 1252

Output from clamconf.exe -n:

Checking configuration files in C:\Program Files\ClamAV

Config file: clamd.conf
-----------------------
ERROR: Please edit the example config file C:\Program Files\ClamAV\clamd.conf

Config file: freshclam.conf
---------------------------
DatabaseMirror = "database.clamav.net"

clamav-milter.conf not found

Software settings
-----------------
Version: 1.4.2
Optional features supported: MEMPOOL AUTOIT_EA06 RAR

Database information
--------------------
Database directory: C:\Program Files\ClamAV\database
WARNING: freshclam.conf and clamd.conf point to different database directories
bytecode.cvd: version 335, sigs: 86, built on Tue Feb 27 16:37:24 2024
daily.cvd: version 27528, sigs: 2072291, built on Fri Jan 24 10:40:27 2025
main.cvd: version 62, sigs: 6647427, built on Thu Sep 16 14:32:42 2021
Total number of signatures: 8719804

Platform information
--------------------
uname: Microsoft Windows 6.2 SP0.0 Build 9200
OS: Windows, ARCH: AMD64, CPU: AMD64
zlib version: 1.3.1 (1.3.1), compile flags: 65
platform id: 0x1025d4d40800000000000794

Build information
-----------------
Microsoft Visual C++: (0.7.148)
sizeof(void*) = 8
Engine flevel: 212, dconf: 212

Interestingly, running the command directly in a PowerShell terminal as & 'C:\Program Files\ClamAV\clamscan.exe' file_öταБЬℓσ.txt presents the name as file_öτα??lσ.txt. This is probably due to the encoder/decoder best-fit fallback shown above in the output of [System.Text.Encoding]::Default. It can be remedied by running e.g. [Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding("Windows-1252") or [Console]::OutputEncoding = [System.Text.Encoding]::UTF8 right before clamscan.exe.
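One way to see which characters of the test file name can turn into "?" is to check them against CP437 directly. The sketch below uses Python's built-in cp437 codec as a stand-in for the Windows conversion, which is an assumption; `unencodable_chars` is a hypothetical helper name.

```python
def unencodable_chars(text: str, codec: str) -> list[str]:
    """Return the characters of text that codec cannot represent (hypothetical helper)."""
    missing = []
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


name = "file_öταБЬℓσ.txt"

# Characters with no CP437 equivalent.
print(unencodable_chars(name, "cp437"))  # ['Б', 'Ь', 'ℓ']

# Python's codec substitutes "?" for each of them.
print(name.encode("cp437", errors="replace"))
```

Python replaces ℓ with "?", whereas the clamscan output above shows "l" in that position. Presumably the Windows conversion applies a best-fit mapping for that character, but that is an inference from the output rather than something confirmed in this thread.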

@ember91 ember91 changed the title Strange character encoding of file name Strange character encoding of file name in output Jan 25, 2025
ember91 (Author) commented Jan 26, 2025

I just looked into this and found out that nothing's wrong. Apparently the OEM code page, whatever that is, is not the same as the system (ANSI) code page. My OEM code page is 437 while my ANSI code page is 1252. The ANSI code page is returned by GetACP() and the OEM code page by GetOEMCP(). Windows is strange I guess.
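For anyone who wants to check their own machine, the two code pages can be queried from Python through ctypes. This is a minimal sketch: `windows_code_pages` is a hypothetical helper name, and it returns None on non-Windows systems, where these functions are unavailable.

```python
import ctypes
import sys


def windows_code_pages():
    """Return (ansi, oem) code page numbers, or None when not running on Windows."""
    if sys.platform != "win32":
        return None
    kernel32 = ctypes.windll.kernel32
    # GetACP() reports the ANSI code page, GetOEMCP() the OEM code page.
    return kernel32.GetACP(), kernel32.GetOEMCP()


pages = windows_code_pages()
if pages is None:
    print("Not on Windows; GetACP/GetOEMCP are unavailable.")
else:
    ansi, oem = pages
    print(f"ANSI code page: {ansi}, OEM code page: {oem}")
    # Bytes captured from clamscan could then be decoded with f"cp{oem}".
```

On the setup described in this issue, this should report 1252 and 437 respectively.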

So you can close this issue when you've read it.

val-ms (Contributor) commented Jan 29, 2025

Thanks @ember91. I read it and reproduced it. Windows is strange indeed.

It sure would be nice to 'fix' this, though I suppose it is working as documented.

@val-ms val-ms changed the title Strange character encoding of file name in output Windows: Log outputs OEM format, shows strange character encodings for unsupported utf8 characters in file names Jan 29, 2025