Unicode filename not recognized well on Windows + PowerShell #174

phu54321 opened this issue Nov 10, 2022 · 8 comments

@phu54321

(base) D:\_BMS> fclones group --cache . | fclones link
[2022-11-11 01:52:29.708] fclones.exe:  info: Started grouping
[2022-11-11 01:59:50.710] fclones.exe:  info: Scanned 6646173 file entries
[2022-11-11 01:59:50.742] fclones.exe:  info: Found 6615777 (712.2 GB) files matching selection criteria
[2022-11-11 01:59:54.307] fclones.exe:  info: Found 6301756 (312.5 GB) candidates after grouping by size
[2022-11-11 01:59:54.724] fclones.exe:  info: Found 6301756 (312.5 GB) candidates after grouping by paths
[2022-11-11 02:33:01.393] fclones.exe:  info: Found 1626299 (120.8 GB) candidates after grouping by prefix
[2022-11-11 02:33:51.369] fclones.exe:  info: Found 1616049 (116.1 GB) candidates after grouping by suffix
[2022-11-11 02:44:27.566] fclones.exe:  info: Found 1430180 (104.8 GB) redundant files
[2022-11-11 02:45:38.262] fclones.exe:  info: Started deduplicating
[2022-11-11 02:45:38.267] fclones.exe: warn: Failed to read metadata of 'D:\_BMS\_etc\ultimate\[???] DistorteD MoonlighT\0.BGA.mpg': Failed to read metadata of 'D:\_BMS\_etc\ultimate\[???] DistorteD MoonlighT\0.BGA.mpg': The filename, directory name, or volume label syntax is incorrect. (os error 123)
[2022-11-11 02:45:38.267] fclones.exe: warn: Failed to read metadata of 'D:\_BMS\BMS OF FIGHTERS\[2012] BOF2012\To Be Coontinued\[???] DistorteD MoonlighT\0.BGA.mpg': The filename, directory name, or volume label syntax is incorrect. (os error 123)
[2022-11-11 02:45:38.267] fclones.exe: warn: Could not determine files to drop in group with hash 4d4f338df94fd9a1c7a1c481c05ac489 and len 187994116: Metadata of some files could not be obtained
[2022-11-11 02:45:38.272] fclones.exe:  info: Processed 3 files and reclaimed 676.6 MB space
[2022-11-11 02:45:38.272] fclones.exe: error: Failed to read file list: Invalid path     D:\\_BMS\\BMS OF FIGHTERS\\[2018] G2R2018\\overground\\Schizophrenicpatients\\[?????????????????????????] ??????????????\movie.mp4
: 120 when decoding DecodeError { kind: UnescapedSlash, index: 120, mat: "\\" } [index=\]

The actual directory names are [π/3] DistorteD MoonlighT and [縺輔°縺阪??縺帙▽縲?縺セ縺医□] 豁サ縺ォ縺溘縺ェ縺. (Yes, that really is a filename.) It seems that fclones couldn't recognize the Unicode names here.

  • fclones 0.29.1
  • Installed via cargo install fclones, where cargo is installed with rustup.

Thanks

@pkolaczk
Owner

Can you please attach the problematic report file with duplicates?

@pkolaczk pkolaczk added the bug Something isn't working label Nov 17, 2022
@c22
c22 commented Feb 9, 2023

Here is some additional info for this issue (still present in 0.29.3)

Report file:

# Report by fclones 0.29.3
# Timestamp: 2023-02-09 14:04:44.822 +1100
# Command: 'C:\Users\c22\.cargo\bin\fclones.exe' group .
# Base dir: C:\\Users\\c22\\Desktop\\DupeTest
# Total: 12 B (12 B) in 3 files in 1 groups
# Redundant: 8 B (8 B) in 2 files
# Missing: 0 B (0 B) in 0 files
718ac45146ab06cd8f7d7c20c1ea6d66, 4 B (4 B) * 3:
    C:\\Users\\c22\\Desktop\\DupeTest\\🍔🍔🍔.txt
    C:\\Users\\c22\\Desktop\\DupeTest\\😊😊😊.txt
    C:\\Users\\c22\\Desktop\\DupeTest\\🤗🤗🤗.txt

Attempt to dedupe:

PS C:\Users\c22\Desktop\DupeTest> fclones.exe group . | fclones link
[2023-02-09 06:05:03.785] fclones.exe:  info: Started grouping
[2023-02-09 06:05:03.788] fclones.exe:  info: Scanned 4 file entries
[2023-02-09 06:05:03.789] fclones.exe:  info: Found 3 (12 B) files matching selection criteria
[2023-02-09 06:05:03.789] fclones.exe:  info: Found 2 (8 B) candidates after grouping by size
[2023-02-09 06:05:03.789] fclones.exe:  info: Found 2 (8 B) candidates after grouping by paths
[2023-02-09 06:05:03.790] fclones.exe:  info: Found 2 (8 B) candidates after grouping by prefix
[2023-02-09 06:05:03.791] fclones.exe:  info: Found 2 (8 B) candidates after grouping by suffix
[2023-02-09 06:05:03.791] fclones.exe:  info: Found 2 (8 B) redundant files
[2023-02-09 06:05:03.810] fclones.exe:  info: Started deduplicating
[2023-02-09 06:05:03.813] fclones.exe: warn: Failed to read metadata of 'C:\Users\c22\Desktop\DupeTest\????????????.txt': Failed to read metadata of 'C:\Users\c22\Desktop\DupeTest\????????????.txt': The filename, directory name, or volume label syntax is incorrect. (os error 123)
[2023-02-09 06:05:03.813] fclones.exe: warn: Failed to read metadata of 'C:\Users\c22\Desktop\DupeTest\????????????.txt': Failed to read metadata of 'C:\Users\c22\Desktop\DupeTest\????????????.txt': The filename, directory name, or volume label syntax is incorrect. (os error 123)
[2023-02-09 06:05:03.813] fclones.exe: warn: Failed to read metadata of 'C:\Users\c22\Desktop\DupeTest\????????????.txt': Failed to read metadata of 'C:\Users\c22\Desktop\DupeTest\????????????.txt': The filename, directory name, or volume label syntax is incorrect. (os error 123)
[2023-02-09 06:05:03.813] fclones.exe: warn: Could not determine files to drop in group with hash 718ac45146ab06cd8f7d7c20c1ea6d66 and len 4: Metadata of some files could not be obtained
[2023-02-09 06:05:03.813] fclones.exe:  info: Processed 0 files and reclaimed 0 B space

The result is that no files are deduplicated.

I can possibly take a stab at a fix for this if I get some time.

@pkolaczk
Owner

I tested on Windows in CMD as well as in Wine, and fclones handles the "hamburger" emojis just fine.
However, one thing the problems reported above have in common is PowerShell.

PowerShell/PowerShell#15871

It looks like PowerShell additionally reinterprets the encoding when content is piped between two programs, so fclones link doesn't receive the same content that fclones group wrote.
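
To illustrate the kind of re-encoding described here, below is a minimal Python sketch, not a trace of what PowerShell actually does internally. It assumes the shell decodes the first program's UTF-8 output using a legacy console code page (cp437 is picked purely as an example) and then re-encodes the text as ASCII for the next program. Both encodings are assumptions; the real values depend on the PowerShell version and system locale.

```python
# Assumed encodings, for illustration only: the shell decodes native output
# as code page 437 and re-encodes it as ASCII for the next program.
CONSOLE_ENCODING = "cp437"
PIPE_ENCODING = "ascii"


def through_text_pipe(data: bytes) -> bytes:
    """Decode a program's output bytes as text, then re-encode them for the next program."""
    text = data.decode(CONSOLE_ENCODING)
    # Characters that the target encoding cannot represent become "?".
    return text.encode(PIPE_ENCODING, errors="replace")


written = "🍔🍔🍔.txt".encode("utf-8")   # what the first program writes
received = through_text_pipe(written)     # what the second program reads
print(received)  # b'????????????.txt'
```

Each emoji is four bytes in UTF-8, so three of them turn into twelve question marks under these assumptions, which is the same count that appears in the "Failed to read metadata" warnings above.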

@pkolaczk pkolaczk added windows and removed bug Something isn't working labels Feb 14, 2023
@pkolaczk pkolaczk changed the title Unicode filename not recognized well on Windows Unicode filename not recognized well on Windows + PowerShell Feb 14, 2023
@phu54321
Author

Weird issue. I'm okay with using cmd, so I have two suggestions:

  • The documentation should be updated to recommend cmd on Windows.
  • fclones group followed by fclones link is probably the most-used combination in this program, so preferably link should be available as a flag to the group command, so that no piping is necessary.

@pkolaczk
Owner

I'm not saying there is nothing to do here. I'm thinking about a workaround. There are a few things I need to try. Maybe adding a BOM on Windows would help. Or I could just escape all non-ASCII characters on Windows (or as an option).
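
The escaping idea could look roughly like the Python sketch below. The `\u{...}` notation and the function names `escape_path` and `unescape_path` are hypothetical, chosen only for illustration; this is not the fclones report format or its implementation. The point is that an ASCII-only report would pass through a pipe unchanged even if the shell re-encodes it.

```python
import re


def escape_path(path: str) -> str:
    """Return an ASCII-only form of path (hypothetical notation, not fclones' format)."""
    out = []
    for ch in path:
        if ch == "\\":
            out.append("\\\\")                  # a backslash is doubled
        elif ord(ch) < 128:
            out.append(ch)                      # plain ASCII is kept as-is
        else:
            out.append("\\u{%x}" % ord(ch))     # everything else becomes \u{hex}
    return "".join(out)


# Matches either a doubled backslash or a \u{hex} escape.
_TOKEN = re.compile(r"\\(\\|u\{([0-9a-f]+)\})")


def unescape_path(text: str) -> str:
    """Invert escape_path."""
    def repl(match):
        if match.group(2) is not None:
            return chr(int(match.group(2), 16))
        return "\\"
    return _TOKEN.sub(repl, text)


original = "C:\\Users\\c22\\Desktop\\DupeTest\\🍔🍔🍔.txt"
escaped = escape_path(original)
print(escaped)
print(unescape_path(escaped) == original)
```

The trade-off is readability: the report stays safe to pipe, but paths with many non-ASCII characters become hard to read by eye, which may be why making it optional was suggested.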

@c22
c22 commented Feb 15, 2023

Good catch @pkolaczk. Turns out the issue was not what I first thought it would be, but your digging has helped me find a workaround that still allows a user to use PowerShell.

Run [Console]::OutputEncoding = [Text.UTF8Encoding]::UTF8 first.

i.e.

Without:

PS C:\Users\c22\Desktop\DupeTest> fclones group . | Out-Default
[2023-02-15 11:32:26.973] fclones.exe:  info: Started grouping
[2023-02-15 11:32:26.977] fclones.exe:  info: Scanned 4 file entries
[2023-02-15 11:32:26.977] fclones.exe:  info: Found 3 (12 B) files matching selection criteria
[2023-02-15 11:32:26.978] fclones.exe:  info: Found 2 (8 B) candidates after grouping by size
[2023-02-15 11:32:26.978] fclones.exe:  info: Found 2 (8 B) candidates after grouping by paths
[2023-02-15 11:32:26.988] fclones.exe:  info: Found 2 (8 B) candidates after grouping by prefix
[2023-02-15 11:32:26.989] fclones.exe:  info: Found 2 (8 B) candidates after grouping by suffix
[2023-02-15 11:32:26.990] fclones.exe:  info: Found 2 (8 B) redundant files
# Report by fclones 0.29.3
# Timestamp: 2023-02-15 11:32:26.991 +1100
# Command: 'C:\Users\c22\.cargo\bin\fclones.exe' group .
# Base dir: C:\\Users\\c22\\Desktop\\DupeTest
# Total: 12 B (12 B) in 3 files in 1 groups
# Redundant: 8 B (8 B) in 2 files
# Missing: 0 B (0 B) in 0 files
718ac45146ab06cd8f7d7c20c1ea6d66, 4 B (4 B) * 3:
    C:\\Users\\c22\\Desktop\\DupeTest\\🍔🍔🍔.txt
    C:\\Users\\c22\\Desktop\\DupeTest\\😊😊😊.txt
    C:\\Users\\c22\\Desktop\\DupeTest\\🤗🤗🤗.txt

With:

PS C:\Users\c22\Desktop\DupeTest> [Console]::OutputEncoding = [Text.UTF8Encoding]::UTF8
PS C:\Users\c22\Desktop\DupeTest> fclones group . | Out-Default
[2023-02-15 11:32:37.765] fclones.exe:  info: Started grouping
[2023-02-15 11:32:37.770] fclones.exe:  info: Scanned 4 file entries
[2023-02-15 11:32:37.770] fclones.exe:  info: Found 3 (12 B) files matching selection criteria
[2023-02-15 11:32:37.771] fclones.exe:  info: Found 2 (8 B) candidates after grouping by size
[2023-02-15 11:32:37.771] fclones.exe:  info: Found 2 (8 B) candidates after grouping by paths
[2023-02-15 11:32:37.781] fclones.exe:  info: Found 2 (8 B) candidates after grouping by prefix
[2023-02-15 11:32:37.781] fclones.exe:  info: Found 2 (8 B) candidates after grouping by suffix
[2023-02-15 11:32:37.782] fclones.exe:  info: Found 2 (8 B) redundant files
# Report by fclones 0.29.3
# Timestamp: 2023-02-15 11:32:37.783 +1100
# Command: 'C:\Users\c22\.cargo\bin\fclones.exe' group .
# Base dir: C:\\Users\\c22\\Desktop\\DupeTest
# Total: 12 B (12 B) in 3 files in 1 groups
# Redundant: 8 B (8 B) in 2 files
# Missing: 0 B (0 B) in 0 files
718ac45146ab06cd8f7d7c20c1ea6d66, 4 B (4 B) * 3:
    C:\\Users\\c22\\Desktop\\DupeTest\\🍔🍔🍔.txt
    C:\\Users\\c22\\Desktop\\DupeTest\\😊😊😊.txt
    C:\\Users\\c22\\Desktop\\DupeTest\\🤗🤗🤗.txt

There seems to be a documented way to set your system to always use UTF-8, but it sounds like it could cause compatibility issues.

I wonder if there is a way that fclones could a) detect it's running in PowerShell and b) set that property temporarily.

That might be asking too much, as this really seems more like a PowerShell issue.

@Mikle-Bond

Can this theoretically be solved by introducing the -i|--input parameter for link and other commands to specify the file explicitly instead of piping it into stdin?
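
A rough Python sketch of that idea follows. The program name `link-sketch` and the helper `open_report` are made up for illustration, and the `-i`/`--input` option is the one proposed above, not an existing fclones flag as far as this thread shows. Reading the report from a file with an explicit encoding avoids the shell pipe altogether.

```python
import argparse
import os
import sys
import tempfile


def open_report(argv):
    """Return a text stream for the report: the given file, or stdin by default."""
    parser = argparse.ArgumentParser(prog="link-sketch")  # hypothetical tool name
    parser.add_argument("-i", "--input", metavar="FILE",
                        help="read the report from FILE instead of stdin")
    args = parser.parse_args(argv)
    if args.input is None:
        return sys.stdin
    # The encoding is stated explicitly, so no shell or console setting is involved.
    return open(args.input, "r", encoding="utf-8")


# Round trip through a temporary file standing in for a saved report.
report_line = "🍔🍔🍔.txt\n"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "report.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(report_line)
    with open_report(["-i", path]) as report:
        loaded = report.read()

print(loaded == report_line)
```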

@Mikle-Bond

Meanwhile, I found that the Use-RawPipeline module helps. For anyone with a similar issue, here's a temporary workaround: https://github.com/GeeLaw/PowerShellThingies/tree/master/modules/Use-RawPipeline

Setting [Console]::OutputEncoding, [Console]::InputEncoding, and $OutputEncoding, as well as changing the code page, didn't help me for some reason.
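
For comparison, here is a small Python sketch of what a raw byte pipeline between two processes looks like. It is only an analogy for the general idea of passing bytes through untouched, not a description of how Use-RawPipeline is implemented. The two child commands are stand-ins invented for this example: one writes a UTF-8 file name, the other copies stdin to stdout.

```python
import subprocess
import sys

# Stand-in "producer": writes a UTF-8 encoded name to stdout as raw bytes.
PRODUCER = r"import sys; sys.stdout.buffer.write('\U0001F354\U0001F354\U0001F354.txt\n'.encode('utf-8'))"
# Stand-in "consumer": copies its stdin to stdout without decoding anything.
CONSUMER = "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())"

p1 = subprocess.Popen([sys.executable, "-c", PRODUCER], stdout=subprocess.PIPE)
# The consumer reads directly from the producer's pipe; no text layer sits in between.
p2 = subprocess.Popen([sys.executable, "-c", CONSUMER],
                      stdin=p1.stdout, stdout=subprocess.PIPE)
p1.stdout.close()  # let the producer see a closed pipe if the consumer exits early
out, _ = p2.communicate()
p1.wait()

print(out.decode("utf-8"), end="")
```

Because nothing decodes and re-encodes the stream between the two processes, the consumer sees exactly the bytes the producer wrote.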
