
Method of 'whitelisting' various characters then replacing any characters in strings that don't match?

Posted: 21 Sep 2019 02:51
by koko
From what I can tell, a program I'm sending files to has trouble processing certain Unicode characters in filenames (not special characters or even things like umlauts, but, for example, alternate Unicode solidus characters). Ideally, a whitelist of characters could be defined in batch and checked against any variable, and any characters in the variable that aren't on the whitelist would be replaced.

I know the reverse is possible: defining characters one wants to be replaced, but that could be a very long list given how many potential Unicode characters there are.

Is this possible?

Re: Method of 'whitelisting' various characters then replacing any characters in strings that don't match?

Posted: 21 Sep 2019 19:46
by koko
Just realized I could avoid having the program name the output directly: let it write to a safe temporary name, then use ren afterward to rename the output to the proper string with all characters intact. Obvious in hindsight.

Though if there was such a method as described in the OP I'd certainly still be interested.
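
A minimal sketch of that workaround, under stated assumptions: the script takes the final name as its first argument, the temporary name pattern is made up for illustration, and the `type nul` line is only a placeholder for the real program call.

```batch
@echo off
setlocal disableDelayedExpansion
rem Hypothetical usage: pass the desired final file name as the first argument.
if "%~1"=="" (
    echo Usage: %~nx0 "final file name"
    exit /b 1
)
set "finalName=%~1"
rem Temporary name built from plain ASCII only (the pattern is an assumption).
set "tempName=tmp_%random%.out"

rem Placeholder for the real program call: it only creates an empty file here.
rem Replace this line with the program writing its output to "%tempName%".
type nul > "%tempName%"

rem ren does the final naming, so the program never handles the problematic characters.
if exist "%tempName%" ren "%tempName%" "%finalName%"
```

Whether the final name survives intact still depends on how it reaches the script (console code page, script encoding), so treat this as a starting point rather than a tested solution.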

Re: Method of 'whitelisting' various characters then replacing any characters in strings that don't match?

Posted: 22 Sep 2019 01:38
by penpen
koko wrote:
21 Sep 2019 02:51
I know the reverse is possible: defining characters one wants to be replaced, but that could be a very long list given how many potential Unicode characters there are.
Do you know for sure that the blacklist is shorter?
How many characters are in that list?
Which characters exactly does that program have issues with?
Please use "U+" notation (so U+002F is the SOLIDUS character '/') and ranges (for example U+0000 - U+10FFFF) where possible.
koko wrote:
21 Sep 2019 02:51
Is this possible?
It should be possible, but the method depends heavily on which characters are in that blacklist and how many there are.


penpen

Re: Method of 'whitelisting' various characters then replacing any characters in strings that don't match?

Posted: 22 Sep 2019 03:07
by koko
penpen wrote:
22 Sep 2019 01:38
Do you know for sure that the blacklist is shorter? How many characters are in that list?

It should be possible, but the method depends heavily on which characters are in that blacklist and how many there are.
Just so I understand we're referring to the same thing, did you mean 'do you know for sure that the blacklist is longer'? My initial question was whether a whitelist is possible as opposed to a blacklist. (In the second post I realized a workaround, so in this case I have a functional alternative, but I'm still curious nonetheless.)
penpen wrote:
22 Sep 2019 01:38
Which characters exactly does that program have issues with?
That's the thing: I'm unsure how many Unicode characters the program has issues with when outputting filenames. I reasoned that if a small subset of characters could be defined in a whitelist (e.g. only alphanumeric + special characters, or U+0000 – U+007F), it would at least ensure a known set of supported characters, and any others not in that list could be replaced or ignored, though I haven't read about such a thing in batch before.
penpen wrote:
22 Sep 2019 01:38
Please use "U+" notation (so U+002F is the SOLIDUS character '/') and ranges (for example U+0000 - U+10FFFF) where possible.
The character I remember noticing issues with was U+2215 ( ∕ ), which turned out to be 'division slash' when I checked. A batch test suite that checks which Unicode ranges are supported would be useful. As a short test, using the limited selection of example characters listed per Unicode range on this page, the program could output these ranges successfully:

U+0000 – U+007F
U+0080 – U+00FF
U+0100 – U+017F
U+0180 – U+024F
U+0250 – U+02AF
U+02B0 – U+02FF

However I gave up past that point since the page really is rather long :)

Re: Method of 'whitelisting' various characters then replacing any characters in strings that don't match?

Posted: 23 Sep 2019 05:43
by penpen
koko wrote:
22 Sep 2019 03:07
Just so I understand we're referring to the same thing, did you mean 'do you know for sure that the blacklist is longer'?
Sorry, I meant the whitelist (instead of the blacklist; I'll leave that error in my post above).
koko wrote:
22 Sep 2019 03:07
That's the thing: I'm unsure how many Unicode characters the program has issues with when outputting filenames. I reasoned that if a small subset of characters could be defined in a whitelist (e.g. only alphanumeric + special characters, or U+0000 – U+007F), it would at least ensure a known set of supported characters, and any others not in that list could be replaced or ignored, though I haven't read about such a thing in batch before.
Ok, your accepted range U+0000 – U+007F contains 128 characters.
That range also contains invalid filename characters, though, such as the '?' character (U+003F) and the control characters (U+0000 – U+001F).

You could do something like this (untested; side note: if the list includes Tab and space, they must be its last characters, with space at the very end):

Code:

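@echo off
setlocal enableExtensions enableDelayedExpansion
rem Whitelist of accepted characters; for /f uses every one of them as a delimiter.
set "valid=0123456789_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
set "filename=njavcwnA"
set "invalid="
rem If anything is left after splitting on the whitelist, the loop body runs,
rem so the name contains at least one character that is not whitelisted.
for /f "delims=%valid%" %%a in ("%filename%") do set "invalid=true"

if defined invalid echo("%filename%" contains invalid characters.
goto :eof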
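
The check above only detects non-whitelisted characters. A minimal, untested sketch of the replacement step asked about in the first post could walk the string one character at a time and reuse the same `for /f` trick per character. The replacement character `_`, the sample filename, and the variable names `clean`, `char` and `keep` are assumptions for illustration.

```batch
@echo off
setlocal enableExtensions disableDelayedExpansion
set "valid=0123456789_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
set "filename=report 2019/final?.txt"
set "clean="

setlocal enableDelayedExpansion
:nextChar
if not defined filename goto :done
rem Take the first character and drop it from the remaining string.
set "char=!filename:~0,1!"
set "filename=!filename:~1!"
rem A whitelisted character is a delimiter, so for /f yields no token and keep stays set.
rem eol is set to a whitelisted character so the default eol (;) cannot hide a ';'.
set "keep=1"
for /f "eol=0 delims=%valid%" %%a in ("!char!") do set "keep="
if defined keep (set "clean=!clean!!char!") else set "clean=!clean!_"
goto :nextChar

:done
echo(!clean!
```

With the sample value this should print report_2019_final__txt, since the space, '/', '?' and '.' are not in the whitelist. Characters outside the Basic Multilingual Plane are stored as two code units, so each would probably be replaced by two underscores.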

koko wrote:
22 Sep 2019 03:07
The character I remember noticing issues with was U+2215 ( ∕ ), which turned out to be 'division slash' when I checked. A batch test suite that checks which Unicode ranges are supported would be useful. As a short test, using the limited selection of example characters listed per Unicode range on this page, the program could output these ranges successfully:

U+0000 – U+007F
U+0080 – U+00FF
U+0100 – U+017F
U+0180 – U+024F
U+0250 – U+02AF
U+02B0 – U+02FF

However I gave up past that point since the page really is rather long :)
I don't like such sample-based tests, because you often miss cases that fail. (Note that your U+2215 character lies in the Mathematical Operators block, U+2200 – U+22FF, so it is outside the ranges you tested successfully.)

You could test all characters using a batch script; something like the following could be used to test blocks of Unicode characters:
viewtopic.php?t=7703&start=30#p51606.
Just create a file whose name is a small block of Unicode characters; if that fails, write the block to a file (for example "fail.txt") and then test the single characters in that block.


penpen