From what I can tell, a program I'm sending files to has trouble processing certain Unicode characters in filenames (not special characters or even things like umlauts but, for example, alternate Unicode solidus characters). Ideally, I thought it would be useful if a whitelist of characters could be defined in batch and checked against any variable, so that any characters in the variable that don't match the whitelist would be replaced.
I know the reverse is possible: defining the characters one wants replaced. But that could be a very long list given how many potential Unicode characters there are.
Is this possible?
Method of 'whitelisting' various characters then replacing any characters in strings that don't match?
Re: Method of 'whitelisting' various characters then replacing any characters in strings that don't match?
Just realized I could bypass the output naming directly by the program and instead name it something safe temporarily before using ren to rename the output afterward to the proper string with all characters intact. Obvious in hindsight.
Though if there was such a method as described in the OP I'd certainly still be interested.
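A minimal sketch of that workaround, for anyone landing here later. The program name converter.exe and its /in and /out switches are hypothetical stand-ins for the real program, and the temporary name pattern is just an assumption; only the ren step reflects what I actually did.

```bat
@echo off
setlocal disableDelayedExpansion
rem Hypothetical: converter.exe and its /in and /out switches stand in for the real program.
rem %1 = source file, %2 = desired output name (may contain Unicode characters).
set "source=%~1"
set "finalName=%~2"
rem ASCII-only temporary name that the program can write without trouble.
set "tempName=tmp_%random%.out"

converter.exe /in "%source%" /out "%tempName%"
if not exist "%tempName%" (
    echo(No output was produced.
    exit /b 1
)
rem ren takes a bare file name as its second argument, not a path.
ren "%tempName%" "%finalName%"
```

Passing the final name as an argument avoids typing Unicode characters into the script itself, which would otherwise depend on the script's encoding and the active code page.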
Re: Method of 'whitelisting' various characters then replacing any characters in strings that don't match?
Do you know for sure that the blacklist is shorter?
How many characters are in that list?
Exactly which characters does that program have issues with?
Please use "U+" notation (so U+002F is the SOLIDUS character '/') and ranges (for example U+0000 - U+10FFFF) where possible.
It should be possible, but the method depends heavily on which characters are in that blacklist and how many there are.
penpen
Re: Method of 'whitelisting' various characters then replacing any characters in strings that don't match?
Just so I understand we're referring to the same thing: did you mean 'do you know for sure that the blacklist is longer'? My initial thinking was whether a whitelist is potentially possible as opposed to a blacklist. (In the second post I realized a workaround, so in this case I have a functional alternative, but I was still curious nonetheless.)
That's the thing, I'm unsure how many Unicode characters the program has issues with outputting for filenames. I reasoned that if a small subset of characters could be defined in a whitelist (eg: only alphanumeric + special characters or U+0000 – U+007F) then it would at least ensure a known amount of supported characters and any others not defined in that list could be replaced or ignored, though I haven't read about such a thing with batch before.
The character I remember noticing issues with was the U+2215 character ( ∕ ) which turned out to be 'division slash' when I checked. If there were a batch test suite one could run to check Unicode ranges supported it would be useful but as a short test on the limited selection of example characters from this page per Unicode range the program could output successfully:
U+0000 – U+007F
U+0080 – U+00FF
U+0100 – U+017F
U+0180 – U+024F
U+0250 – U+02AF
U+02B0 – U+02FF
However I gave up past that point since the page really is rather long
Re: Method of 'whitelisting' various characters then replacing any characters in strings that don't match?
Sorry, I meant the whitelist (instead of the blacklist; I'm leaving that error in my post above).
koko wrote: ↑22 Sep 2019 03:07
That's the thing, I'm unsure how many Unicode characters the program has issues with outputting for filenames. I reasoned that if a small subset of characters could be defined in a whitelist (eg: only alphanumeric + special characters or U+0000 – U+007F) then it would at least ensure a known amount of supported characters and any others not defined in that list could be replaced or ignored, though I haven't read about such a thing with batch before.
Ok, your accepted range U+0000 – U+007F contains 128 characters.
Although that range also contains invalid filename characters, such as the '?' character (U+003F) and the control characters (U+0000 – U+001F).
You could do something like this (untested; side note: if you add tab and space to that list, they must be its last characters, with space as the very last one):
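@echo off
setlocal enableExtensions enableDelayedExpansion
rem Whitelist of accepted characters; used below as the FOR /F delimiter set.
set "valid=0123456789_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
set "filename=njavcwnA"
set "invalid="
rem Stripping every whitelisted character leaves a token only if something else is present.
for /f "delims=%valid%" %%a in ("%filename%") do set "invalid=true"
if defined invalid echo("%filename%" contains invalid characters.
goto :eof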
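The check above only detects invalid characters. To actually replace them, as asked in the first post, the same delims test could be applied to one character at a time. The following is an untested sketch under several assumptions: the replacement character "_" and the variable names clean, char and bad are my own choices, the name is passed as the first argument, and characters that are special to cmd (such as '!' and '^', which delayed expansion mangles, or ';', which FOR /F skips by default as its eol character) are not handled robustly.

```bat
@echo off
setlocal enableExtensions enableDelayedExpansion
set "valid=0123456789_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
rem Name to sanitize, taken from the first argument (an assumption of this sketch).
set "filename=%~1"
set "clean="

:nextChar
if not defined filename goto :done
rem Split off the first remaining character.
set "char=!filename:~0,1!"
set "filename=!filename:~1!"
rem The FOR body runs only if the character is not one of the delimiters,
rem that is, only if it is missing from the whitelist.
set "bad="
for /f "delims=%valid%" %%a in ("!char!") do set "bad=1"
if defined bad (set "clean=!clean!_") else set "clean=!clean!!char!"
goto :nextChar

:done
echo(!clean!
```

The delims test is used instead of a string substitution test because substitution in cmd ignores case, which would make it impossible to whitelist only lowercase or only uppercase letters.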
koko wrote: ↑22 Sep 2019 03:07
The character I remember noticing issues with was the U+2215 character ( ∕ ) which turned out to be 'division slash' when I checked. If there were a batch test suite one could run to check Unicode ranges supported it would be useful but as a short test on the limited selection of example characters from this page per Unicode range the program could output successfully:
U+0000 – U+007F
U+0080 – U+00FF
U+0100 – U+017F
U+0180 – U+024F
U+0250 – U+02AF
U+02B0 – U+02FF
However I gave up past that point since the page really is rather long
I don't like such sample-based tests, because you often miss cases that fail (your U+2215 character, for instance, lies in the Mathematical Operators block U+2200 – U+22FF, which the ranges you tested never reached).
You could test all characters using a batch script, so you might use something like the following to test blocks of Unicode characters:
viewtopic.php?t=7703&start=30#p51606.
Just create a file whose name is a small block of Unicode characters; if that fails, write the block to a file (for example "fail.txt") and then test the single characters in that block.
penpen