DosTips.com

A Forum all about DOS Batch
All times are UTC-06:00




Posted: 09 Dec 2016 04:50
Expert

Joined: 22 Jan 2010 18:01
Posts: 2904
Location: Germany
Please have a look at the list of code pages:
https://msdn.microsoft.com/en-us/library/dd317756.aspx
There are already code pages like 037, 500, 1026, 1047, 1140-1149. If you still have some EBCDIC data you may do some tests.

Steffen


Posted: 09 Dec 2016 17:44
Expert

Joined: 22 Jan 2010 18:01
Posts: 2904
Location: Germany
More or less by accident I found a serious bug that could occur while reading UTF-8. It is fixed in v1.3.2.

Steffen


Posted: 28 Dec 2016 09:04
Expert

Joined: 22 Jan 2010 18:01
Posts: 2904
Location: Germany
UTF-16 big endian is supported as of version 1.4.0 (something that batch can't handle natively). Use code page ID 1201.
You can now also specify the source and destination files directly using the options /i and /o. Redirections still work, of course.
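
For readers wondering what makes big endian special: UTF-16 BE stores the high byte of each 16-bit code unit first, whereas UTF-16 LE (code page 1200, which Windows uses internally) stores the low byte first. The following is only a minimal C sketch of that byte swap, not the code CONVERTCP uses. The function name utf16_swap_bytes is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Swap the two bytes of every UTF-16 code unit in place.
   This turns big endian data into little endian data and vice versa.
   len is the buffer size in bytes and must be even. */
void utf16_swap_bytes(uint8_t *buf, size_t len)
{
    assert(buf != NULL && len % 2 == 0);
    for (size_t i = 0; i + 1 < len; i += 2) {
        uint8_t tmp = buf[i];
        buf[i] = buf[i + 1];
        buf[i + 1] = tmp;
    }
}
```

Applied to the bytes FE FF 00 41 (big endian byte order mark followed by "A") this yields FF FE 41 00, the little endian form.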

Steffen


Posted: 23 Jan 2017 09:05
Expert

Joined: 22 Jan 2010 18:01
Posts: 2904
Location: Germany
I did a little code profiling on the weekend. The outcome is that threading the conversion isn't as important as I expected. It makes more sense to separate reading and writing on the file system because these are the slow operations. I changed the behavior so that writing is done in a parallel thread while the next chunk of data is being read. Surprisingly, I got the best performance when converting and writing run together in one thread.
To cut it short: the performance gain is small but real, so I'm sharing it as version 1.4.1.

Steffen


Posted: 26 Jan 2017 11:33

Joined: 28 Jun 2010 03:46
Posts: 278
Thanks for the update.

Saso


Posted: 29 Jan 2017 02:37

Joined: 20 Aug 2010 13:57
Posts: 430
Location: Chile
Great tool. I reduced the executable size to 8 KB. PM sent.


Posted: 29 Jan 2017 03:30
Expert

Joined: 22 Jan 2010 18:01
Posts: 2904
Location: Germany
Thank you Carlos!

I will definitely try some of the compiler options in order to reduce the size of the tool. Unfortunately the tool you sent me was immediately removed by Avira (free antivirus) :( There are some good reasons why my tool has a few extra KBs. I'll explain it via PM.

Steffen


Posted: 29 Jan 2017 08:26
Expert

Joined: 22 Jan 2010 18:01
Posts: 2904
Location: Germany
I managed to add Carlos's size improvements. See the comments on the DECREASE_SIZE_GCC macro in the source code. That way the size of the utility was reduced by half (without a noticeable performance gain, though).
In order to preserve cross-compiler support I added a few pre-processor directives for retrieving the arguments UTF-16-encoded.

Since I don't have any experience with this kind of size optimization yet, I would like you to report if the new version causes false positives in your antivirus software.

Steffen


Posted: 30 Jan 2017 11:52
Expert

Joined: 22 Jan 2010 18:01
Posts: 2904
Location: Germany
I tested the executables uploaded with version 1.4.2 at VirusTotal and they do not cause any findings. I hope this holds true in the real world, too.

Steffen

https://www.virustotal.com/en/file/a8d6 ... 485797283/
https://www.virustotal.com/en/file/7562 ... 485797365/


Posted: 02 Feb 2017 12:21
Expert

Joined: 22 Jan 2010 18:01
Posts: 2904
Location: Germany
Version 1.4.3 adds the option /a to append to an existing file. See the initial post.
Again I checked the executables on virustotal. No false positives detected.
https://www.virustotal.com/en/file/53c0 ... 486055380/
https://www.virustotal.com/en/file/f552 ... 486055433/

As always - the updated file can be found in the initial post of this thread.

Now I'm out of ideas (and am tired of reading the source code repeatedly). I'll archive it and leave it alone :wink:

Steffen


Posted: 23 Feb 2017 11:58
Expert

Joined: 22 Jan 2010 18:01
Posts: 2904
Location: Germany
Quoted from there:
viewtopic.php?f=3&t=7703&p=51312#p51310
penpen wrote:
I have tested your CONVERTCP utility, and read the source code:
I saw no error, but I noticed that your tool does more than just convert between code pages: it also approximates characters that are not in the target code page (which is not that bad, because cmd.exe does the same, but I would mention it somewhere).
For example I created a file "string.txt" with this content (I hope it is not corrupted) encoded using UTF-8:
Code: Select all
ĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩ

If you convert it to codepage 850 you get:
Code: Select all
AaAaAaCcCcCcCcDdDdEeEeEeEeEeGgGgGgGgHhHhIi

The recommended behaviour I know of for such cases is to use the REPLACEMENT CHARACTER, a question mark, a square, or a question mark in a square.

This is by design and actually wanted behavior.

1) https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130(v=vs.85).aspx
Quote:
Code: Select all
int WideCharToMultiByte(
  _In_      UINT    CodePage,
  _In_      DWORD   dwFlags,
  _In_      LPCWSTR lpWideCharStr,
  _In_      int     cchWideChar,
  _Out_opt_ LPSTR   lpMultiByteStr,
  _In_      int     cbMultiByte,
  _In_opt_  LPCSTR  lpDefaultChar,
  _Out_opt_ LPBOOL  lpUsedDefaultChar
);

...
lpDefaultChar [in, optional]
...
For the CP_UTF7 and CP_UTF8 settings for CodePage, this parameter must be set to NULL. Otherwise, the function fails with ERROR_INVALID_PARAMETER.

lpUsedDefaultChar [out, optional]
...
For the CP_UTF7 and CP_UTF8 settings for CodePage, this parameter must be set to NULL. Otherwise, the function fails with ERROR_INVALID_PARAMETER.
...

That means at least for UTF-7 and UTF-8 I'm not even able to define a default character.
I noted this behavior in my first reply to Dave:
viewtopic.php?f=3&t=7570#p50285

2) The reason why I don't even want to work around it is that the utility was requested by miskox. He told me via email:
Quote:
I 'patched' original .exe to make another .exe version with NOCSZ (that is NOČŠŽ) which replaces ČŠŽĐĆ characters with ordinary CZSDC - depending on the input code page.

That's why I called it "wanted behavior".
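
To make the difference between the two behaviors concrete, here is a small portable C sketch. It is not taken from the CONVERTCP source and does not call WideCharToMultiByte. The table BEST_FIT and the function to_single_byte are hypothetical and cover only a handful of code points from penpen's test string; a real code page mapping is far larger and also contains non-ASCII characters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical miniature best-fit table: a few code points from the test
   string, mapped to the unaccented letters seen in the code page 850 output. */
struct best_fit {
    uint32_t code_point;
    char approx;
};

static const struct best_fit BEST_FIT[] = {
    { 0x0100, 'A' }, /* Ā */
    { 0x0101, 'a' }, /* ā */
    { 0x0106, 'C' }, /* Ć */
    { 0x0107, 'c' }, /* ć */
    { 0x010C, 'C' }, /* Č */
    { 0x0110, 'D' }, /* Đ */
};

/* Map n code points to single bytes. ASCII passes through unchanged.
   If allow_best_fit is nonzero, code points listed in BEST_FIT are
   approximated. Everything else becomes default_char.
   Returns how many characters were replaced by default_char. */
size_t to_single_byte(const uint32_t *in, size_t n, char *out,
                      int allow_best_fit, char default_char)
{
    size_t replaced = 0;
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) {
        int mapped = 0;
        if (in[i] < 0x80) {
            out[i] = (char)in[i];
            mapped = 1;
        } else if (allow_best_fit) {
            for (size_t k = 0; k < sizeof BEST_FIT / sizeof BEST_FIT[0]; ++k) {
                if (BEST_FIT[k].code_point == in[i]) {
                    out[i] = BEST_FIT[k].approx;
                    mapped = 1;
                    break;
                }
            }
        }
        if (!mapped) {
            out[i] = default_char;
            ++replaced;
        }
    }
    return replaced;
}
```

With best fit enabled, Č becomes C as miskox wanted. With best fit disabled and '?' as the default character, the same input yields a question mark, which is the behavior penpen described.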

Steffen


Posted: 18 Mar 2017 08:15
Expert

Joined: 22 Jan 2010 18:01
Posts: 2904
Location: Germany
I was asked to add another option in order to automatically replace the original file content with the converted content. I won't do so.

The utility was designed to convert big files. That means it doesn't read the whole content into memory before it begins the conversion, both to avoid running out of RAM and to be able to read and convert data in parallel threads. Concurrent access to the same file could cause data loss, especially if the converted data is bigger than the data read.
Of course I could let the tool automatically write to a temporary file and replace the original file after the conversion has finished. But as soon as the temporary file and the original file are saved on different volumes, this would cause a physical copying of the data, which wastes time and resources.

Thus, I would rather leave it in your hands. Moving a file within the same logical drive only changes the file system entry; the data itself is not copied. Example:
Code: Select all
convertcp 1 65001 /b /i "test.txt" /o "test.txt.temp~"
if not errorlevel 1 move /y "test.txt.temp~" "test.txt"
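
Why the converted data can be bigger than the data read is easy to see with a single-byte source code page and UTF-8 as the target. The following is only a portable C sketch under simplifying assumptions, not the CONVERTCP code: it assumes ISO-8859-1 input, because there every byte value equals its Unicode code point, and the function name latin1_chunk_to_utf8 is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert one chunk of ISO-8859-1 bytes to UTF-8.
   out must provide room for 2 * n bytes, which is the worst case.
   Returns the number of bytes written to out. */
size_t latin1_chunk_to_utf8(const uint8_t *in, size_t n, uint8_t *out)
{
    size_t written = 0;
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) {
        if (in[i] < 0x80) {
            /* ASCII stays a single byte. */
            out[written++] = in[i];
        } else {
            /* 0x80..0xFF needs a two-byte UTF-8 sequence. */
            out[written++] = (uint8_t)(0xC0 | (in[i] >> 6));
            out[written++] = (uint8_t)(0x80 | (in[i] & 0x3F));
        }
    }
    return written;
}
```

Three input bytes such as 61 E4 FC ("aäü") turn into five output bytes. If the output were written over the input file chunk by chunk, the write position would overtake the read position and destroy data that hasn't been read yet.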


Steffen


Posted: 27 May 2017 09:44
Expert

Joined: 22 Jan 2010 18:01
Posts: 2904
Location: Germany
I didn't like having only a link to the list of Code Page Identifiers in the help message. That's why I added the option /l, which displays a list of the code pages installed on your computer, together with their descriptions and information on how they can be used as input code page (see section "additional information" of the initial post).

Virustotal didn't find any false positives for version 1.4.4.
x86: https://www.virustotal.com/en/file/33108943bf6f8575a49873c44d0eef7ce30ffdd4af7f8564f6c2f8339171581c/analysis/
x64: https://www.virustotal.com/en/file/961bf49a7e624709742cde83ae5739f8e1f949a6e08e0e1a9f29e1f075afa9a4/analysis/

Steffen


Posted: 25 Sep 2017 16:41
Expert

Joined: 12 Feb 2011 21:02
Posts: 1916
Location: United States (east coast)
Great tool Steffen.

I have another option for you: the new JREPL.BAT version 7 (currently v7.4) can also be used to transform a text file from one encoding to another. I believe it is more restrictive on which character sets can be used because it only supports your machine's native code page, plus UTF-16LE, plus code pages that have valid internet character set names. EDIT - Actually it is not that bad. Here is a page that lists code pages along with their internet (.NET) names. Most of the code pages have a valid name.

Here is an example that transforms 1252 to UTF-8:
Code: Select all
jrepl "^" "" /f "source.txt|Windows-1252" /o "destination.txt|UTF-8"

But JREPL has a significant advantage in that you can provide custom transformations for source characters that do not exist in the target character set. This could satisfy Saso's "custom character set" request. This is probably easiest to accomplish by using the JREPL /T option.

One thing that is pretty cool is that with the /X option, you can specify a character using the \xnn escape sequence, where nn is the hexadecimal byte code for the relevant character set. Within a search string it uses the input character set. Within a replacement string it uses the output character set. The \xnn sequence only works properly if the character set is a single byte character set.

With the /T "FILE" option, you can place all your search terms in one file, one per line, and all your replacement (transform) terms in a 2nd file. This helps prevent out of control command line lengths. Another cool feature is you can specify that the search file matches the input character set, and the replacement file matches the output character set.

There is no need for the transformations to involve just single characters. One input character can be transformed into multiple output characters, and vice versa.

Here is an example of what a custom transformation could look like (without specifying the actual custom transformations):
Code: Select all
jrepl "1252to1250find.txt|Windows-1252" "1252to1250repl.txt|Windows-1250" /x /t file /f "source.txt|Windows-1252" /o "destination.txt|Windows-1250"


Dave Benham


Posted: 26 Sep 2017 10:18
Expert

Joined: 22 Jan 2010 18:01
Posts: 2904
Location: Germany
Dave

I'm quite interested in JREPL.BAT as you know. Using ADO streams was a huge improvement. As far as I can tell it's also a good alternative to CONVERTCP.

Of course everything has pros and cons. What I really like is that JREPL doesn't need any 3rd party tools. That's something CONVERTCP can't compete with. To compensate for this deficiency a little I used C and WinAPI (which runs natively and isn't dependent on .NET or Java), I provided the source code (to enable people to read or edit the source and compile the tool themselves) and I added a program flow chart (because an executable is like a black box where you can't see how it works).
On the other hand the main scopes of JREPL and CONVERTCP are quite different. JREPL is able (and designed) to do customized replacements while CONVERTCP can't do that. In return CONVERTCP is much faster for big files (9s vs. 307s with JREPL for 360MB of text from Windows-1252 to UTF-8 in my tests) because it efficiently converts and writes in parallel threads. Converting big files was one of Saso's original requirements.

I don't want to make a fuss about CONVERTCP. Initially it was a gift to Saso, who suggested publishing it. So why not :wink: It seems to be helpful for some people. Version 1.4.4 has been downloaded ~120 times so far. That's approx. once a day, maybe 10 times more than I would ever have expected, but not comparable to the attention JREPL gets :wink: In the end there is no "nec plus ultra". The users have to decide which tool meets their needs best. For that reason I really appreciate that you left some notes and a link to JREPL.BAT in this thread. This will give the users more opportunities to find the right tool for their tasks :D

Steffen

