Using many "tokens=..." in FOR /F command in a simple way

penpen
Expert
Posts: 1729
Joined: 23 Jun 2013 06:15
Location: Germany

Re: Using many "tokens=..." in FOR /F command in a simple way

#16 Post by penpen » 20 Feb 2017 13:00

Based on this result, I guess that you both use code page 850 and that the order depends on UTF-32/UTF-16/UCS-2:
- CP_850(173) = CP_850(0xAD) = U+00A1
- CP_850(189) = CP_850(0xBD) = U+00A2
- CP_850(156) = CP_850(0x9C) = U+00A3
- CP_850(207) = CP_850(0xCF) = U+00A4
- (i haven't checked more)
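
The four mappings above can be double-checked outside the console. The following is only a minimal sketch in Python (chosen instead of batch because it does not depend on the active console code page); it assumes that Python's built-in `cp850` codec agrees with the Windows table for these bytes, and the helper name `cp850_codepoint` is made up for illustration.

```python
# Decode single OEM bytes with Python's built-in cp850 codec and show the
# Unicode code point each one maps to.
# Assumption: the codec matches the Windows CP850 table for these bytes.
def cp850_codepoint(byte_value: int) -> int:
    return ord(bytes([byte_value]).decode("cp850"))

for b in (173, 189, 156, 207):
    print(f"CP_850({b}) = CP_850(0x{b:02X}) = U+{cp850_codepoint(b):04X}")
```

The consecutive code points U+00A1 to U+00A4 are what makes these four bytes neighbours once they are ordered by their Unicode value.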


penpen

Edit: Corrected some flaws.
Edit2: Added the other two possibilities.

aGerman
Expert
Posts: 3779
Joined: 22 Jan 2010 18:01
Location: Germany

#17 Post by aGerman » 20 Feb 2017 13:14

Yes Antonio. It works for me, too.

Code: Select all


Thread 1:
173 189 156 207 190 221 245 249 184 166 174 170 240 169 238 248 241 253 252 239 230 244 250
247 251 167 175 172 171 243 168 183 181 182 199 142 143 146 128 212 144 210 211 222 214 215
216 209 165 227 224 226 229 153 158 157 235 233 234 154 237 232 225 133 160 131 198 132 134
145 135 138 130 136 137 141 161 140 139 208 164 149 162 147 228 148 246 155 151 163 150 129
236 231 152
Thread 2:
159
Thread 3:
176 177 178
Thread 4:
179
Thread 5:
180
Thread 6:
185
Thread 7:
186
Thread 8:
187
Thread 9:
188
Thread 10:
191
Thread 11:
192
Thread 12:
193
Thread 13:
194
Thread 14:
195
Thread 15:
196
Thread 16:
197
Thread 17:
200
Thread 18:
201
Thread 19:
202
Thread 20:
203
Thread 21:
204
Thread 22:
205
Thread 23:
206
Thread 24:
213
Thread 25:
217
Thread 26:
218
Thread 27:
219
Thread 28:
220
Thread 29:
223
Thread 30:
242
Thread 31:
254
Thread 32:
255

Code: Select all

tokens=1,20,45,75,120
 A1 A20 A45 A75 A120
 B1 B20 B45 B75 B120
 C1 C20 C45 C75 C120

tokens=30,28-32,170-165
 A30 A28 A29 A30 A31 A32 A170 A169 A168 A167 A166 A165
 B30 B28 B29 B30 B31 B32 B170 B169 B168 B167 B166 B165
 C30 C28 C29 C30 C31 C32 C170 C169 C168 C167 C166 C165

tokens=


@penpen

This sounds quite logical (even if it's hard to believe that the cmd works with UTF-32 rather than UTF-16 :wink: ). What would be your suggestion then? Having a UTF-32-encoded file and somehow read the characters out of it and convert them to the current code page?

Steffen

#18 Post by penpen » 20 Feb 2017 13:36

aGerman wrote:This sounds quite logical (even if it's hard to believe that the cmd works with UTF-32 rather than UTF-16 :wink: ).
Yes, UTF32 is hard to believe... i've added UTF-16/UCS-2 above.
The most probable now is UCS-2.

aGerman wrote:What would be your suggestion then? Having a UTF-32-encoded file and somehow read the characters out of it and convert them to the current code page?
I'm unsure... depends on the purpose i think.
If you want to check this for all values, then I would probably use Java to create the source file and use code page 65001 within the source to create all needed code points.


penpen

#19 Post by aGerman » 20 Feb 2017 13:53

penpen wrote:i've added UTF-16/UCS-2 above.
The most probable now is UCS-2.

I stick with UTF-16. But that shouldn't make any difference here.

penpen wrote:depends on the purpose i think.

Well the purpose is to find the order of FOR variable names for the current OEM code page. My idea is to somehow work with TYPE and CMD /U.

Steffen

#20 Post by penpen » 20 Feb 2017 14:53

aGerman wrote:
penpen wrote:i've added UTF-16/UCS-2 above.
The most probable now is UCS-2.

I stick with UTF-16. But that shouldn't make any difference here.
Theoretically there might be a difference:
The sort order of characters in UTF-16 depends on the UTF-32 code point value, while the sort order in UCS-2 depends on the (16-bit) index.
So for two indices c and (c+1), the characters UCS_2(c) and UCS_2(c+1) are successive in UCS-2, but UTF_16(c) and UTF_16(c+1) are not necessarily successive in UTF-16.
But i don't know, if this applies to any OEM codepage.
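
To make the difference concrete, here is a small sketch in Python (not batch) that sorts the same two characters once by code point and once by their 16-bit code units. The two sample characters are picked only for illustration, the helper `utf16_units` is hypothetical, and nothing here shows which order cmd.exe really uses.

```python
def utf16_units(ch: str) -> list[int]:
    """Return the 16-bit code units of ch as integers (UTF-16)."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

# MUSICAL SYMBOL G CLEF (outside the BMP) and FULLWIDTH NOT SIGN (inside the BMP)
chars = ["\U0001D11E", "\uFFE2"]

by_code_point = sorted(chars, key=ord)          # order by UTF-32 value
by_code_units = sorted(chars, key=utf16_units)  # order by 16-bit units

print([f"U+{ord(c):04X}" for c in by_code_point])
print([f"U+{ord(c):04X}" for c in by_code_units])
```

The two orders only disagree when a character outside the BMP is involved, so for a code page that maps entirely into the BMP the distinction would not matter.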

aGerman wrote:Well the purpose is to find the order of FOR variable names for the current OEM code page. My idea is to somehow work with TYPE and CMD /U.
I have no idea how to deal with double/multibyte codepages.
If you use single-byte codepages, then I would create a table with the ANSI values 01-255 (table.dat) and load it whenever you change the codepage, using:

Code: Select all

set "table="
set /P "table=" < "table.dat"
Then I would use "cmd /u" to output all the characters to a file and use "fc /b" against a file of null bytes to get their indices (storing each, for example, in an environment variable: 'set "c_<2 byte index in array>=<cmd /u index>"').
Then just sort by index using "sort /+4".
(Somehow like that or similar.)
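
The idea of the steps above (table of bytes, conversion to wide characters, sort by the resulting value) can be sketched in Python for a single-byte code page. This is not the batch solution, only a reference to compare a batch result against; it assumes Python's codec tables match the Windows ones, and `oem_order` is a made-up name.

```python
def oem_order(codepage: str) -> list[int]:
    """Byte values 128..255 sorted by the Unicode code point they decode to."""
    pairs = []
    for b in range(128, 256):
        try:
            ch = bytes([b]).decode(codepage)
        except UnicodeDecodeError:
            continue  # byte is undefined in this codec; skipped in this sketch
        pairs.append((ord(ch), b))
    return [b for _, b in sorted(pairs)]

order = oem_order("cp850")
start = order.index(173)
print(" ".join(str(b) for b in order[start:start + 8]))
```

The sketch only orders the bytes; it says nothing about which of them cmd.exe accepts as FOR variable names.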


penpen

Thor
Posts: 43
Joined: 31 Mar 2016 15:02

#21 Post by Thor » 20 Feb 2017 15:07

Hi Antonio,

I've used your latest "FOR-F with many tokens - SP.bat" file and
changed the codepage to 850, and it works flawlessly. Don't know why.

tokens=170-180
A170 A171 A172 A173 A174 A175 A176 A177
B170 B171 B172 B173 B174 B175 B176 B177
C170 C171 C172 C173 C174 C175 C176 C177

tokens=180-170
A177 A176 A175 A174 A173 A172 A171 A170
B177 B176 B175 B174 B173 B172 B171 B170
C177 C176 C175 C174 C173 C172 C171 C170

tokens=


Whereas with my default codepage 437, some weird characters appear.

tokens=170-180
A86 %∞ %τ
B86 %∞ %τ
C86 %∞ %τ

tokens=180-170
%τ %∞ A86
%τ %∞ B86
%τ %∞ C86

tokens=

I'm using Windows 8.1 Pro 64-bit US version

#22 Post by aGerman » 20 Feb 2017 15:27

penpen wrote:But i don't know, if this applies to any OEM codepage.

Maybe for Chinese. I don't know. I'm quite familiar with UTF-16 and its surrogate concept. The reason why I assume that the cmd deals with UTF-16 is that 1) Windows deals with it 2) I found the references of MultiByteToWideChar and WideCharToMultiByte API functions in cmd.exe.

penpen wrote:If you use single-byte codepages, then I would create a table with the ANSI values 01-255 (table.dat) and load it whenever you change the codepage, using:

Code: Select all

set "table="
set /P "table=" < "table.dat"
Then I would use "cmd /u" to output all the characters to a file and use "fc /b" against a file of null bytes to get their indices (storing each, for example, in an environment variable: 'set "c_<2 byte index in array>=<cmd /u index>"').
Then just sort by index using "sort /+4".
(Somehow like that or similar.)

Yes that's basically what I try to do. It's a bit tricky because of the Little Endianness of the wide characters.
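
The little-endian issue can be illustrated with a short Python sketch (not batch): in the redirected wide-character output the low byte of each character comes first, so the two bytes have to be swapped before the value can be sorted as text. The helper `code_units_le` is hypothetical, and Python's `cp850` codec is assumed to match the Windows table.

```python
import struct

def code_units_le(raw: bytes) -> list[str]:
    """Interpret raw as UTF-16LE and return each code unit as 4 hex digits."""
    count = len(raw) // 2
    return [f"{u:04X}" for u in struct.unpack(f"<{count}H", raw[:count * 2])]

# Bytes 0xAD and 0xBD of code page 850, converted to wide characters.
raw = bytes([0xAD, 0xBD]).decode("cp850").encode("utf-16-le")

print(raw.hex(" "))        # byte order as stored in the file: low byte first
print(code_units_le(raw))  # high byte first, suitable for sorting as text
```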

Steffen

#23 Post by aGerman » 20 Feb 2017 16:44

Is this something you could work with, Antonio?
The code creates file "u-a.txt" with the unicode values (hex) and the related OEM char code (dec).

Steffen

Code: Select all

@echo off &setlocal
:: create base64 code
 >"tmp1" echo(gIGCg4SFhoeIiYqLjI2Oj5CRkpOUlZaXmJmam5ydnp+goaKjpKWmp6ipqqusra6vsLGys7S1tre4ubq7vL2+v8
>>"tmp1" echo(DBwsPExcbHyMnKy8zNzs/Q0dLT1NXW19jZ2tvc3d7f4OHi4+Tl5ufo6err7O3u7/Dx8vP09fb3+Pn6+/z9/v8=

:: decode to bytes 128 to 255
>nul certutil.exe -f -decode "tmp1" "tmp2"

:: convert the characters represented by these bytes to unicode
>"tmp1" cmd /q /u /c "type "tmp2""

:: create a file with 256 'A's for comparisons using FC
>"tmp2" (for /l %%i in (1 1 4) do <nul set /p "=AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA")

:: create the HEX dump
setlocal EnableDelayedExpansion
set "X=1"
>"dmp" (
  for /f "skip=1 tokens=1,2 delims=: " %%i in ('fc /b "tmp1" "tmp2"^|findstr /vbi "FC:"') do (
    set /a "Y=0x%%i"
    for /l %%k in (!X! 1 !Y!) do echo 41
    set /a "X=Y+2"
    echo %%j
  )
)
del "tmp1"

:: combine hex values in BE order with char codes of the related character of the OEM code page
<"dmp" >"tmp2" (
  for /l %%i in (128 1 255) do (
    set /p "low=" &set /p "high="
    echo !high!!low! %%i
  )
)
del "dmp"

:: sort
sort "tmp2" /o "u-a.txt"
del "tmp2"


//EDIT Similar, prints the characters...

Code: Select all

@echo off &setlocal
:: create base64 code
 >"tmp1" echo(gIGCg4SFhoeIiYqLjI2Oj5CRkpOUlZaXmJmam5ydnp+goaKjpKWmp6ipqqusra6vsLGys7S1tre4ubq7vL2+v8
>>"tmp1" echo(DBwsPExcbHyMnKy8zNzs/Q0dLT1NXW19jZ2tvc3d7f4OHi4+Tl5ufo6err7O3u7/Dx8vP09fb3+Pn6+/z9/v8=

:: decode to bytes 128 to 255
>nul certutil.exe -f -decode "tmp1" "tmp2"

:: save them in a variable
<"tmp2" set /p "chars="

:: convert the characters represented by these bytes to unicode
>"tmp1" cmd /q /u /c "type "tmp2""

:: create a file with 256 'A's for comparisons using FC
>"tmp2" (for /l %%i in (1 1 4) do <nul set /p "=AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA")

:: create the HEX dump
setlocal EnableDelayedExpansion
set "X=1"
>"dmp" (
  for /f "skip=1 tokens=1,2 delims=: " %%i in ('fc /b "tmp1" "tmp2"^|findstr /vbi "FC:"') do (
    set /a "Y=0x%%i"
    for /l %%k in (!X! 1 !Y!) do echo 41
    set /a "X=Y+2"
    echo %%j
  )
)
del "tmp1"

:: combine hex values in BE order with indexes of the related OEM characters in %chars%
<"dmp" >"tmp2" (
  for /l %%i in (0 1 127) do (
    set /p "low=" &set /p "high="
    echo !high!!low! %%i
  )
)
del "dmp"

:: sort
sort "tmp2" /o "index.txt"
del "tmp2"

:: print the characters
for /f "usebackq tokens=2" %%i in ("index.txt") do echo "!chars:~%%i,1!"
pause

#24 Post by penpen » 20 Feb 2017 18:30

@aGerman:
You shouldn't use "type" to convert those characters to UTF-16/UCS-2:
There may be undefined characters in a codepage that might (depending on the font used) be mapped unexpectedly to a surrogate pair.

Also the A file might be too short (in worst case 512 bytes are needed for the 128 characters if all are surrogate pairs).
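
The 512-byte worst case follows from the size of a surrogate pair (two 16-bit code units). A quick check in Python, given only as a sketch; the sample characters are arbitrary and `utf16_size` is a made-up helper:

```python
def utf16_size(text: str) -> int:
    """Number of bytes text occupies as UTF-16LE without a byte order mark."""
    return len(text.encode("utf-16-le"))

bmp_only = "\u00A1" * 128        # 128 characters inside the BMP
all_pairs = "\U00010000" * 128   # 128 characters that each need a surrogate pair

print(utf16_size(bmp_only), utf16_size(all_pairs))
```

So a comparison file of 256 bytes covers the 128 characters only as long as every one of them fits into a single code unit.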


Although this is slower, you should rather do it character by character (requires a table.dat file with the bytes [20, 01 : FF]):

Code: Select all

@echo off
cls
setlocal enableExtensions disableDelayedExpansion
if not "%~1" == "" ( set "codepage=%~1" ) else set "codepage=850"
for /f "tokens=2 delims=:." %%a in ('chcp') do set "cp=%%~a"

>nul fsutil file createnew "zero.txt" 4

>nul chcp %codepage%
set "table="
<"table.dat" set /P "table="


setlocal enableDelayedExpansion


for /l %%a in (0x20, 1, 0xFF) do (
   set /a "index=1000+%%~a"
   set "index=!index:~1!"
   cmd /e:ON /v:ON /d /u /c">"dummy.txt" echo(^!table:~%%~a,1^!"
   for /l %%b in (0, 1, 3) do set "b_0000000%%~b=00"
   for %%b in ("dummy.txt") do set /a "bytes=%%~zb-4"
   for /f "tokens=1,3 delims=: " %%b in ('fc /b "zero.txt" "dummy.txt" ^| findstr "0" ') do if %%~b lss !bytes! set "b_%%~b=%%~c"
   set "char_!index!=!table:~%%~a,1!"
   if !bytes! == 2 ( set "cp_!index!=0x!b_00000001!!b_00000000!"
   ) else            set "cp_!index!=0x!b_00000001!!b_00000000!,0x!b_00000003!!b_00000002!"
)
:: Result:
:: =======
:: Basic Multilingual Plane characters; single code units:
set "cp_" | 2>nul findstr /V "\," | sort /+7
::
:: Supplementary characters; surrogate pairs:
set "cp_" | 2>nul findstr "\," | sort /+7
::
:: referenced characters
:: set "char_"

:: creating order (with "holes")
::                   32 spaces
set "order=                                "
for /f "tokens=2 delims=_=" %%a in ('^(set "cp_" ^| 2^>nul findstr /V "\," ^| sort /+7^)^&^(set "cp_" ^| 2^>nul findstr "\," ^| sort /+7^)') do (
   set "order=!order!!char_%%~a!"
)
set order

endlocal
del "zero.txt", "dummy.txt"

>nul chcp %cp%
endlocal


penpen

Edit: Corrected some flaws.
Edit: Corrected the byte order of the surrogate pair: Thanks to aGerman for finding this bug.

#25 Post by aGerman » 20 Feb 2017 19:08

So are you saying we should take care about surrogates :?: :lol:
I see your point though. I agree that we should rather compare 4 bytes each. Although I'm a little confused. Surrogates are pairs of two bytes each. Looking at your code it seems you reverse all 4 bytes. Shouldn't it read
set "cp_!index!=0x!b_00000001!!b_00000000!,0x!b_00000003!!b_00000002!"
Maybe I'm missing something...

Steffen

#26 Post by penpen » 20 Feb 2017 20:25

aGerman wrote:So are you saying we should take care about surrogates :?: :lol:
Well, I don't know if any of the OEM codepages uses a surrogate pair, so one should better take care, just in case characters outside the BMP are in use.
Also, the code could more easily be extended to any codepage (if ever needed).

aGerman wrote:Shouldn't it read
set "cp_!index!=0x!b_00000001!!b_00000000!,0x!b_00000003!!b_00000002!"
You're absolutely right!
Actually it is too late... tonight.


penpen

#27 Post by aGerman » 21 Feb 2017 17:24

I tried to improve your code a little. (Without much success though.)

First of all I used CERTUTIL in order to create table.dat from scratch. The MAKECAB technique works great and is downward-compatible but it is terribly slow. I did the same for zero.txt because FSUTIL requires elevation on Win7 and earlier.
I kept the leading 1 of the index in order to avoid unnecessary string manipulations.
I removed the FINDSTR filter for the output of FC.
Even if the bytes read don't represent a surrogate pair it would be okay to leave zero bytes at the end of the hex string. That way you don't need to distinguish between BMP and surrogates.
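
The zero-padding idea can be sketched in Python (not batch): every character gets a fixed-width key of 8 hex digits, so a plain text sort works for single code units and surrogate pairs alike. The sample characters and the name `sort_key` are assumptions for illustration only.

```python
def sort_key(ch: str) -> str:
    """Hex of up to two UTF-16 code units, right-padded with zeros to 8 digits."""
    return ch.encode("utf-16-be").hex().upper().ljust(8, "0")

sample = ["\u00A3", "\U0001D11E", "\u00A1", "\uFFE2"]
for ch in sorted(sample, key=sort_key):
    print(sort_key(ch), f"U+{ord(ch):04X}")
```

Note that with such keys a surrogate pair sorts by its first code unit, that is between U+D7FF and U+E000, not after all BMP characters.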

However the repeated calls of CMD and FC still take a lot of time :| I wonder if there is a real risk for an appearance of surrogates...

Steffen

Code: Select all

@echo off
cls

setlocal enableExtensions disableDelayedExpansion
>"dummy.txt" (
  echo(IAECAwQFBgcICQoLDA0ODxAREhMUFRYXGBkaGxwdHh8gISIjJCUmJygpKissLS4vMDEyMzQ1Njc4OTo7PD0+P0
  echo(BBQkNERUZHSElKS0xNTk9QUVJTVFVWV1hZWltcXV5fYGFiY2RlZmdoaWprbG1ub3BxcnN0dXZ3eHl6e3x9fn+A
  echo(gYKDhIWGh4iJiouMjY6PkJGSk5SVlpeYmZqbnJ2en6ChoqOkpaanqKmqq6ytrq+wsbKztLW2t7i5uru8vb6/wM
  echo(HCw8TFxsfIycrLzM3Oz9DR0tPU1dbX2Nna29zd3t/g4eLj5OXm5+jp6uvs7e7v8PHy8/T19vf4+fr7/P3+/w==
)
>nul certutil.exe -f -decode "dummy.txt" "table.dat"

>"dummy.txt" echo(AAAAAA==
>nul certutil.exe -f -decode "dummy.txt" "zero.txt"

if not "%~1" == "" ( set "codepage=%~1" ) else set "codepage=850"
for /f "tokens=2 delims=:." %%a in ('chcp') do set "cp=%%~a"

>nul chcp %codepage%
set "table="
<"table.dat" set /P "table="


setlocal enableDelayedExpansion


for /l %%a in (0x20, 1, 0xFF) do (
   set /a "index=1000+%%~a"
   cmd /e:ON /v:ON /d /u /c ">"dummy.txt" echo(^!table:~%%~a,1^!"
   for /l %%b in (0, 1, 3) do set "b_0000000%%~b=00"
   for %%b in ("dummy.txt") do set /a "bytes=%%~zb-4"
   for /f "skip=1 tokens=1,3 delims=: " %%b in ('fc /b "zero.txt" "dummy.txt"') do if %%~b lss !bytes! set "b_%%~b=%%~c"
   set "char_!index!=!table:~%%~a,1!"
   set "cp_!index!=!b_00000001!!b_00000000!!b_00000003!!b_00000002!"
)

:: Result:
:: =======

:: set "cp_" | sort /+8
::
:: referenced characters
:: set "char_"

:: creating order (with "holes")
::                   32 spaces
set "order=                                "
for /f "tokens=2 delims=_=" %%a in ('set "cp_" ^| sort /+8') do (
   set "order=!order!!char_%%~a!"
)
set order

endlocal
del "zero.txt", "dummy.txt", "table.dat"

>nul chcp %cp%
endlocal
pause

#28 Post by penpen » 21 Feb 2017 20:31

aGerman wrote:I wonder if there is a real risk for an appearance of surrogates...
Yes, there is a risk, and indeed I have seen some custom codepages where someone used MUSICAL SYMBOL G CLEF (U+1D11E).
But to be honest, it is recommended to use the REPLACEMENT CHARACTER (which is in the BMP) for such cases, so this risk might not be that big for codepages created by Microsoft.
A higher risk should be that characters get lost if an undefined code unit is detected in a multibyte character set (which actually is not your goal, so this shouldn't happen).


I also rethought your usage of "type" to convert multiple characters at once to UCS-2/UTF-16LE, and read up on surrogate pairs again.
It is not as bad as I thought in the first place:
You can detect surrogate pairs (although I've always avoided assuming anything about UTF-16 characters, so I didn't remember it - sorry for that).
(Maybe it was also too late yesterday... same holds for now... so gn8 :) .)

Code: Select all

:isSurrogate
:: %~1   contains the code unit in hex (example "0xDF12")
:: @returns 0 if %~1 is no surrogate, 1 if it is low surrogate and 3 if it is a high surrogate code unit.
if 0xD800 leq %~1 (
  if %~1 leq 0xDBFF ( exit /b 3
  ) else if %~1 leq 0xDFFF exit /b 1
)
exit /b 0
So whenever you see a high surrogate code unit, the next one must be a low surrogate.
And you could use your above method (although still not recommended for multibyte character sets).
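
For cross-checking the batch routine, the same classification can be written as a short Python sketch. It uses the same return values (0, 1, 3) as the `:isSurrogate` label above; the function name `surrogate_kind` and the sample values are made up.

```python
def surrogate_kind(unit: int) -> int:
    """Return 0 for no surrogate, 1 for a low and 3 for a high surrogate."""
    if 0xD800 <= unit <= 0xDBFF:
        return 3
    if 0xDC00 <= unit <= 0xDFFF:
        return 1
    return 0

for unit in (0x00A1, 0xD834, 0xDD1E, 0xFFE2):
    print(f"0x{unit:04X} -> {surrogate_kind(unit)}")
```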


penpen

#29 Post by aGerman » 22 Feb 2017 15:03

penpen wrote:You could detect surrogate pairs

I don't know if you read the source code of my CONVERTCP utility. There I have to detect surrogates as well in order to make sure the whole pair is in a chunk of data read.

penpen wrote:(Maybe it was also too late yesterday... same holds for now... so gn8 :) .)

Don't worry. That happens to me every day. I'm a night owl. You don't want to see me getting up in the morning :lol:

Could you have a look at this code? I think a text comparison with DC for the high byte should be sufficient.

Steffen

Code: Select all

@echo off &setlocal enableExtensions disableDelayedExpansion

:: 0x20 ... 0xFF
>"dummy.txt" (
  echo(ICEiIyQlJicoKSorLC0uLzAxMjM0NTY3ODk6Ozw9Pj9AQUJDREVGR0hJSktMTU5PUFFSU1RVVld
  echo(YWVpbXF1eX2BhYmNkZWZnaGlqa2xtbm9wcXJzdHV2d3h5ent8fX5/gIGCg4SFhoeIiYqLjI2Oj5
  echo(CRkpOUlZaXmJmam5ydnp+goaKjpKWmp6ipqqusra6vsLGys7S1tre4ubq7vL2+v8DBwsPExcbHy
  echo(MnKy8zNzs/Q0dLT1NXW19jZ2tvc3d7f4OHi4+Tl5ufo6err7O3u7/Dx8vP09fb3+Pn6+/z9/v8=
)
>nul certutil -f -decode "dummy.txt" "table.dat"

:: 896 zero bytes
>"dummy.txt" (
  for /l %%i in (1 1 18) do echo(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
  echo(AAAAAAA=
)
>nul certutil -f -decode "dummy.txt" "zero.dat"

set "table="
<"table.dat" set /P "table="

>"dummy.txt" cmd /d /q /u /c "type "table.dat""

:: number of double bytes
for %%b in ("dummy.txt") do set /a "i=%%~zb>>1"

setlocal enableDelayedExpansion
set "X=1"


>"dump.txt" (
  for /f "skip=1 tokens=1,2 delims=: " %%i in ('fc /b "dummy.txt" "zero.dat"^|findstr /vbi "FC:"') do (
    set /a "Y=0x%%i"
    for /l %%k in (!X! 1 !Y!) do echo 00
    set /a "X=Y+2"
    echo %%j
  )
  echo 00
)

set /a "n=1032"
<"dump.txt" (
  for /l %%i in (1 1 %i%) do (
    set /p "low=" &set /p "high="
    if !high! geq DC ( REM second double byte of a surrogate pair
      for %%j in (!n!) do set "cp_%%j=!cp_%%j:~,4!!high!!low!"
    ) else ( REM BMP or first double byte of a surrogate pair
      set "cp_!n!=!high!!low!0000"
      set /a "idx=n-1032"
      for %%j in (!idx!) do set "char_!n!=!table:~%%j,1!"
      set /a "n+=1"
    )
  )
)

:: Result:
:: =======

set "cp_" | sort /+8

:: creating order (with "holes")
::                   32 spaces
set "order=                                "
for /f "tokens=2 delims=_=" %%a in ('set "cp_" ^| sort /+8') do (
   set "order=!order!!char_%%~a!"
)
set order

del "dummy.txt" "table.dat" "zero.dat" "dump.txt"
pause

#30 Post by penpen » 22 Feb 2017 19:11

aGerman wrote:I don't know if you read the source code of my CONVERTCP utility. There I have to detect surrogates as well in order to make sure the whole pair is in a chunk of data read.
No, i didn't up to now.

aGerman wrote:May you have a look at this code. I think a text comparison with DC for the high byte should be sufficient.
No, definitely not; according to the Java documentation, UTF-16 is a little bit "ugly" there.

Code: Select all

1.no surrogate in [0x0000 : 0xD7FF]
high surrogate in [0xD800 : 0xDBFF]
low  surrogate in [0xDC00 : 0xDFFF]
2.no surrogate in [0xE000 : 0xFFFF]
All non surrogate code units in [0xE000 : 0xFFFF] would be treated as low surrogates.
Your assignment in code may fail on such code units (example: "FULLWIDTH NOT SIGN" U+FFE2).
It is also true that "non surrogates < surrogate pairs" so you should first list and sort all non surrogates, and then all surrogate pairs (if the order depends on UTF-16; i don't know exactly how UCS-2 is sorted according to surrogate pairs).
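
The difference between the plain comparison with DC and a range check on the high byte can be shown with a small Python sketch. The Python string comparison only approximates what `if !high! geq DC` does in batch, and both function names are hypothetical.

```python
def is_low_by_text_compare(high: str) -> bool:
    """The shortcut: compare the high byte (two hex digits) with DC as text."""
    return high >= "DC"

def is_low_by_range(high: str) -> bool:
    """Range check: the high byte of a low surrogate lies in DC..DF."""
    return 0xDC <= int(high, 16) <= 0xDF

for unit in ("DD1E", "FFE2", "E000"):
    high = unit[:2]
    print(unit, is_low_by_text_compare(high), is_low_by_range(high))
```

Only the range check keeps code units such as FFE2 and E000 out of the low surrogate branch.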

aGerman wrote:

Code: Select all

>"dump.txt" (
  for /f "skip=1 tokens=1,2 delims=: " %%i in ('fc /b "dummy.txt" "zero.dat"^|findstr /vbi "FC:"') do (
    set /a "Y=0x%%i"
    for /l %%k in (!X! 1 !Y!) do echo 00
    set /a "X=Y+2"
    echo %%j
  )
  echo 00
)
This part may be risky if the first hex value does not equal "00"; also, the last "00" may be unneeded.

So my suggestion is something like this (hopefully I haven't messed anything up):

Code: Select all

@echo off
setlocal enableExtensions disableDelayedExpansion

:: 0x20 ... 0xFF
>"dummy.txt" (
   echo(ICEiIyQlJicoKSorLC0uLzAxMjM0NTY3ODk6Ozw9Pj9AQUJDREVGR0hJSktMTU5PUFFSU1RVVld
   echo(YWVpbXF1eX2BhYmNkZWZnaGlqa2xtbm9wcXJzdHV2d3h5ent8fX5/gIGCg4SFhoeIiYqLjI2Oj5
   echo(CRkpOUlZaXmJmam5ydnp+goaKjpKWmp6ipqqusra6vsLGys7S1tre4ubq7vL2+v8DBwsPExcbHy
   echo(MnKy8zNzs/Q0dLT1NXW19jZ2tvc3d7f4OHi4+Tl5ufo6err7O3u7/Dx8vP09fb3+Pn6+/z9/v8=
)
>nul certutil -f -decode "dummy.txt" "table.dat"

:: 896 zero bytes
>"dummy.txt" (
   for /l %%i in (1 1 18) do echo(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
   echo(AAAAAAA=
)
>nul certutil -f -decode "dummy.txt" "zero.dat"

set "table="
<"table.dat" set /P "table="

>"dummy.txt" cmd /d /q /u /c "type "table.dat""

:: number of double bytes
for %%b in ("dummy.txt") do set /a "i=%%~zb>>1"

setlocal enableDelayedExpansion

cls

>"dump.txt" (
   set "X=1"
   for /f "skip=1 tokens=1,2 delims=: " %%i in ('fc /b "dummy.txt" "zero.dat"^|findstr /vbi "FC:"') do (
      set /a "Y=0x%%i"
      for /l %%k in (!X! 1 !Y!) do echo 00
      set /a "X=Y+2"
      echo %%j
   )
   set /A "Y=i<<1"
   for /l %%k in (!X! 1 !Y!) do echo 00
)

set /a "n=1032"
<"dump.txt" (
   for /l %%i in (1 1 %i%) do (
      set /p "low=" &set /p "high="
      if 0xDC leq 0x!high! (
         if 0x!high! leq 0xDF ( set "isLowSurogate=1"
         ) else set "isLowSurogate="
      ) else set "isLowSurogate="
      if defined isLowSurogate ( REM second double byte of a surrogate pair
         for %%j in (!n!) do set "cp_%%j=!cp_%%j!!high!!low!"
      ) else ( REM BMP or first double byte of a surrogate pair
         set "cp_!n!=!high!!low!"
         set /a "idx=n-1032"
         for %%j in (!idx!) do set "char_!n!=!table:~%%j,1!"
         set /a "n+=1"
      )
   )
)

:: Result:
:: =======
:: Basic Multilingual Plane characters; single code units:
set "cp_" | 2>nul findstr /V "\=........" | sort /+8
:: Supplementary characters; surrogate pairs:
set "cp_" | 2>nul findstr    "\=........" | sort /+8

:: creating order (with "holes")
::                     32 spaces
set "order=                                "
for /f "tokens=2 delims=_=" %%a in ('^(set "cp_" ^| findstr /V "\=........" ^| sort /+8^)^&^(set "cp_" ^| 2^>nul findstr "\=........" ^| sort /+8^)') do (
   set "order=!order!!char_%%~a!"
)
set order

rem del "dummy.txt" "table.dat" "zero.dat" "dump.txt"
pause

If you want to do that for any other codepage, too, then we need to find out how to list all character units in a codepage.
(Sad to say, I currently only have a rough idea of how one could get such information for old DOS codepages.)


penpen
