CONVERTCP.exe - Convert text from one code page to another

#46 Post by **dbenham** » 14 Apr 2018 08:19

aGerman wrote: ↑
14 Apr 2018 07:07
Right now I discovered that the output of CONVERTCP and JREPL (as well as using various text editors) are still different when I converted Lubomir's file.
CONVERTCP does not change line endings automatically. E.g. things like double line feeds (LF LF) can be found in Lubomir's file. CONVERTCP leaves it as LF LF while other software may automatically convert it to CR LF CR LF.

Regarding JREPL - It depends on the options used.

By default, all End-Of-Lines (EOLs) will be written as CR LF

With the /U option, all EOLs will be written as LF

Both of the above are processed one line at a time, so JREPL can process arbitrarily large files.

With the /M option, all original EOLs will be preserved (unless explicitly modified by the find/replace of course)

The /M option loads the entire file into memory, so it is limited to what can fit. I believe the practical limit is somewhere around 1 GB.
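The three EOL behaviors described above can be illustrated with a small Python sketch. This is not how JREPL is implemented; the function name `rewrite_eols` and its mode names are hypothetical and only mimic the default, /U, and /M outcomes on a byte string. It assumes that only CR LF and a lone LF count as line endings.

```python
import re

def rewrite_eols(data: bytes, mode: str = "crlf") -> bytes:
    """Rewrite line endings in data (hypothetical helper, not JREPL code).

    mode "crlf"     -> every EOL becomes CR LF (like the JREPL default)
    mode "lf"       -> every EOL becomes LF    (like /U)
    mode "preserve" -> EOLs are left untouched (like /M without a replace)
    """
    if mode == "preserve":
        return data
    eol = b"\r\n" if mode == "crlf" else b"\n"
    # Match CR LF first so it counts as one EOL, then a lone LF.
    return re.sub(rb"\r\n|\n", eol, data)

sample = b"one\n\ntwo\r\nthree"
print(rewrite_eols(sample, "crlf"))      # LF LF becomes CR LF CR LF
print(rewrite_eols(sample, "lf"))
print(rewrite_eols(sample, "preserve"))  # LF LF stays LF LF
```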

The performance of compiled CONVERTCP will always blow the pants off the interpreted JREPL

But JREPL can be extremely useful if you want to change content at the same time as changing the encoding, or if you want to take control of inexact translations when the destination encoding is missing a source character.

Dave Benham

#47 Post by **aGerman** » 14 Apr 2018 08:28

Thanks Dave! I was hoping you would clarify how to avoid that effect using JREPL.

(Otherwise I would have pointed you to it via PM.)

Steffen

#48 Post by **aGerman** » 15 Apr 2018 05:49

I added a "Known issues" paragraph to my initial post.

Steffen

#49 Post by **aGerman** » 18 Apr 2018 15:54

After a long run of converting files in a loop and comparing them using FC /B, I still got corrupted files every now and then. Days of unsuccessful investigation, dozens of internet articles, every possible setting, changing the chunks to the size of a drive sector ... nothing helped. It really drove me crazy.

Eventually I resorted to random trial-and-error tests. Moving the character conversion from the thread function to the reading loop in the main routine solved the issue, even though that is still not logical to me. What I found during my research is that flushing the buffer is unnecessary as long as no concurrent write or read operations are in progress. So forget about my former explanations of the technical reason for the bug.

Steffen

Virustotal scans of version 2.2:
x86: https://www.virustotal.com/en/file/84a4 ... /analysis/
x64: https://www.virustotal.com/en/file/726b ... /analysis/

#50 Post by **aGerman** » 20 Apr 2018 10:24

I added option /f to version 3.0. It flushes the output file buffer before the file is closed.

Some explanations about the reason for the new option:

As written above, flushing is basically not needed. Written data is buffered by the file system because physical writing to the drive is slow, which improves performance enormously. There is no drawback as long as the file is not accessed concurrently. Appending new data to the buffer by another write operation will not corrupt the data already in the buffer. That's why CONVERTCP by default does not flush the buffer, not even before the file is closed. Converting multiple files in a loop stays very fast that way.

That said, there is still a risk if you access the new file immediately while there might be unwritten data left in the buffer. E.g. if you convert a file using CONVERTCP in a script and want to process the new file in the very next line of the script, you might run into trouble because physically writing the data to the drive may still take a few hundred milliseconds (even if the file was already closed). Option /f forces the buffer to be flushed before CONVERTCP terminates, which should protect you from unwritten data. Even so, this doesn't necessarily mean that the data has already been written to the physical drive memory. Drives may have an additional buffer, and it depends on the driver settings whether the request from Windows to flush that additional buffer is ignored. The latter is nothing I can influence, and such behavior of a drive would also cause issues for other programs that write data to the drive.
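For readers who want to see the idea of "flush before close" in code, here is a minimal Python sketch. It is not CONVERTCP's C source; the helper name `write_and_flush` and the file name are made up for illustration. `os.fsync` asks the operating system to write its cached data for the file to the device (on Windows, Python documents that it uses the MS `_commit()` function), which is roughly the effect option /f is described as requesting.

```python
import os
import tempfile

def write_and_flush(path: str, data: bytes) -> None:
    """Write data and request that OS buffers reach the device before closing."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # empty Python's own user-space buffer
        os.fsync(f.fileno())  # ask the OS to write its cached data to the device

# Hypothetical output file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "converted.txt")
write_and_flush(path, b"converted text")

with open(path, "rb") as f:
    print(f.read())
```

As noted above, even this cannot guarantee that a drive's own internal cache has been written out.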

Steffen

Virustotal scans of version 3.0:
x86: https://www.virustotal.com/en/file/11b5 ... /analysis/
x64: https://www.virustotal.com/en/file/5dfd ... /analysis/

#51 Post by **aGerman** » 20 Apr 2018 13:16

Carlos wrote me a PM because he had some questions about the source code. Rereading the code made me realize that the wait for the thread should be moved to another position.

Thanks for pointing that out, Carlos!

Steffen

Virustotal scans of version 3.1:
x86: https://www.virustotal.com/en/file/364b ... /analysis/
x64: https://www.virustotal.com/en/file/a871 ... /analysis/

#52 Post by **aGerman** » 22 Apr 2018 10:26

I moved the project along with the download to SourceForge https://sourceforge.net/projects/convertcp/.
Of course I'll keep you updated in this thread.

I just hope to get a little more feedback about the source code since I think there are more members over there that are experienced in C.
Also people don't have to worry about whether or not questions about the code would be rather off-topic.

Steffen

#53 Post by **miskox** » 23 Apr 2018 02:57

aGerman wrote: ↑
20 Apr 2018 10:24
That said, there is still a risk if you access the new file immediately while there might be unwritten data left in the buffer. E.g. if you convert a file using CONVERTCP in a script and want to process the new file in the very next line of the script, you might run into trouble because physically writing the data to the drive may still take a few hundred milliseconds (even if the file was already closed). Option /f forces the buffer to be flushed before CONVERTCP terminates, which should protect you from unwritten data. Even so, this doesn't necessarily mean that the data has already been written to the physical drive memory. Drives may have an additional buffer, and it depends on the driver settings whether the request from Windows to flush that additional buffer is ignored. The latter is nothing I can influence, and such behavior of a drive would also cause issues for other programs that write data to the drive.

Steffen! Thanks for all the updates and the information!

My understanding (how I see things) is that even if a file has not been physically written to the disk and I access it with another program/process, the system makes sure the data is consistent: the system knows that there is still some data to be written to the disk, so it provides the correct data to the requesting process (either from the disk or from the memory/cache).

Can anyone more knowledgeable about this give more info on this matter?

Thanks.
Saso

#54 Post by **aGerman** » 23 Apr 2018 04:08

This is all quite confusing, Saso. It's not only the operating system but also the file system the drive was formatted with, and of course the design of the drive itself.
According to MSDN, a mechanism like the one you described may exist on NTFS-formatted drives. See
https://msdn.microsoft.com/en-us/librar ... s.85).aspx
I can't say anything about how other file systems behave.

However, as stated in the second half of my answer that you quoted, the drive itself may behave unexpectedly.
https://blogs.msdn.microsoft.com/oldnew ... 0/?p=95505

Steffen

#55 Post by **Squashman** » 26 Apr 2018 09:31

Hi Steffen,

Is your program similar to the GNU utility ICONV?

As you may remember, I work on a mainframe, and the EBCDIC character set is all single-byte. So when we start working with clients we tell them that they have to send us files in a single-byte character set. Our sales people sold a job and neglected to tell the client about this requirement. Now we are stuck trying to figure out the best way to convert the UTF-8 file they sent us.

When we upload a file to the mainframe, the conversion from ASCII to EBCDIC happens automatically on the fly. It also strips off the CR LF. If we know there are extended ASCII characters in the file, we then send a quote command to the mainframe to tell it the file is ISO8859-1 and to convert it to IBM-1047. We have never had any luck trying to do the conversion with FTP when the file is in a multi-byte character set.

So I am wondering whether it is better to convert the file from UTF-8 to EBCDIC and then do a binary upload of the file to the mainframe, or to convert the file from UTF-8 to ISO8859-1 and then upload the file to the mainframe.

#56 Post by **Squashman** » 26 Apr 2018 10:58

So I ended up using your program to convert the file.

CONVERTCP 65001 28591 /i "APPEAL.txt" /o "APPEAL_converted.txt"

The converted file was only 3 bytes smaller, so all the conversion did was remove the BOM. This tells me there were no multi-byte characters in the file.

#57 Post by **aGerman** » 26 Apr 2018 12:15

I never used ICONV but since the purpose seems to be the same I assume the behavior is at least similar. I don't know how the character conversion of ICONV was implemented. In CONVERTCP I use the MultiByteToWideChar and WideCharToMultiByte API functions that internally access the code page files installed on your computer.

As to your example, do you think the result was right? Sure, the BOM was removed, as it has to be for ISO 8859-1. If the file contained nothing but 7-bit ASCII characters, the UTF-8 and ISO 8859-1 byte sequences are identical, so you can't tell them apart.
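The size argument can be checked with a short Python sketch. It uses Python's own codecs rather than CONVERTCP or the Windows API, and the helper name `utf8_to_latin1` and the sample strings are invented for illustration. Unlike a code page conversion that may substitute a replacement character, this sketch raises an error for characters that ISO 8859-1 cannot represent.

```python
BOM = b"\xef\xbb\xbf"  # UTF-8 byte order mark, 3 bytes

def utf8_to_latin1(data: bytes) -> bytes:
    """Convert UTF-8 bytes (with or without BOM) to ISO 8859-1 bytes."""
    # "utf-8-sig" decodes UTF-8 and drops a leading BOM if present.
    text = data.decode("utf-8-sig")
    # Raises UnicodeEncodeError for characters outside ISO 8859-1.
    return text.encode("latin-1")

ascii_only = BOM + b"APPEAL 123\r\n"
with_umlaut = BOM + "Müller\r\n".encode("utf-8")

# Pure ASCII: only the BOM disappears.
print(len(ascii_only) - len(utf8_to_latin1(ascii_only)))    # 3
# One two-byte character shrinks to one byte, plus the BOM.
print(len(with_umlaut) - len(utf8_to_latin1(with_umlaut)))  # 4
```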

Steffen

#58 Post by **aGerman** » 26 Apr 2018 12:26

Added option /n in version 4.0.

Virustotal scans of version 4.0:
x86: https://www.virustotal.com/en/file/f1aa ... /analysis/
x64: https://www.virustotal.com/en/file/f710 ... /analysis/

The /n is for "no threading". But what does it mean?
For a better understanding let's step back to the defaults. As already written, CONVERTCP reads the incoming stream chunk-wise. The advantage is that an already converted chunk can be written by an asynchronous thread while the next chunk of text is being read and converted. That leads to good performance. Furthermore, the memory usage is limited to the buffer size the chunks need. This size does not increase even if very large files are converted. Sounds like a good concept, doesn't it? That's the reason why threading is used by default.
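The default concept described above (convert in the reading loop, write in a separate thread) can be sketched in Python as follows. This is only an illustration of the pattern, not a port of CONVERTCP; the function name `convert_stream` and the one-slot queue are assumptions, and error handling inside the writer thread is omitted.

```python
import io
import queue
import threading

CHUNK_SIZE = 1024 * 1024  # 1 MB, the default chunk size mentioned in this thread

def convert_stream(src, dst, src_enc, dst_enc, chunk_size=CHUNK_SIZE):
    """Read and convert chunk by chunk while a worker thread does the writing."""
    pending = queue.Queue(maxsize=1)  # bounded, so memory use stays limited

    def writer():
        while True:
            block = pending.get()
            if block is None:  # sentinel: no more data
                break
            dst.write(block)

    worker = threading.Thread(target=writer)
    worker.start()
    try:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            # Naive: assumes no character is split across a chunk boundary.
            pending.put(chunk.decode(src_enc).encode(dst_enc))
    finally:
        pending.put(None)
        worker.join()

# Single-byte source code page, so any chunk boundary is safe.
src = io.BytesIO("Grüße\n".encode("cp1252") * 4)
dst = io.BytesIO()
convert_stream(src, dst, "cp1252", "utf-8", chunk_size=8)
print(dst.getvalue().decode("utf-8"))
```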
Things get complicated if the end of the read chunk falls somewhere inside a sequence of multiple bytes that represents a single character. There are charsets that have rules to recognize where a character ends. I already use these rules for the processing of UTF-8 and UTF-16 streams in order to adjust the chunk boundaries accordingly. But there are charsets without such rules. Such streams could get corrupted if their size exceeds 1 MB (which is the default chunk size). If you list the code pages using option /l, the second column tells you whether or not you can convert incoming streams greater than 1 MB without the risk of damaging their content. So the conclusion is that threading is not always as good as it seems to be.
That's the point where option /n comes in. With this option the whole file is read into the buffer. Due to internal limits of the API functions used, the size of the incoming stream is still restricted. But now the limit is 511 MB rather than only 1 MB. The needed buffer size might increase tremendously. If you have little RAM, the tool may crash (even though that should be quite unlikely on modern computers).
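To illustrate what "rules to recognize where a character ends" means for UTF-8, here is a minimal Python sketch that finds the last safe cut position in a chunk. It is not taken from the CONVERTCP source; the function name `complete_utf8_length` is hypothetical. The bytes after the returned position would be carried over to the beginning of the next chunk.

```python
def complete_utf8_length(chunk: bytes) -> int:
    """Return how many leading bytes of chunk end on a UTF-8 character boundary."""
    # Walk back over at most 3 trailing continuation bytes (bit pattern 10xxxxxx).
    i = len(chunk)
    steps = 0
    while i > 0 and steps < 3 and (chunk[i - 1] & 0xC0) == 0x80:
        i -= 1
        steps += 1
    if i == 0:
        return len(chunk)  # no lead byte found; leave the chunk alone
    lead = chunk[i - 1]
    if lead < 0x80:
        need = 1           # ASCII
    elif lead >> 5 == 0b110:
        need = 2
    elif lead >> 4 == 0b1110:
        need = 3
    elif lead >> 3 == 0b11110:
        need = 4
    else:
        return len(chunk)  # invalid lead byte; leave it to the converter
    have = len(chunk) - (i - 1)  # bytes present for the last character
    return len(chunk) if have >= need else i - 1

data = "a€".encode("utf-8")            # b'a\xe2\x82\xac'
print(complete_utf8_length(data[:3]))  # 1: the euro sign is cut, keep only "a"
print(complete_utf8_length(data))      # 4: the chunk ends on a boundary
```

Charsets without such a self-describing byte pattern offer no comparable check, which is why their streams cannot be cut at arbitrary chunk boundaries.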

tl;dr
Option /n might be of interest especially for large text encoded in UTF-7 or in East Asian multi-byte code pages (Chinese, Japanese, Korean, for example). If you are unsure whether you need to pass option /n, first run CONVERTCP with option /l. If you find a "No" in the second column for the code page of your incoming stream AND the stream size (file size) is greater than 1 MB, use option /n for the conversion. If you need to convert a stream greater than 511 MB from such a code page, don't use CONVERTCP at all.
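The rule of thumb above can be written down as a small decision function. This Python sketch only restates the rule from this post; `choose_mode` and its return values are hypothetical names, and `chunk_safe` stands for a "Yes" in the second column of the /l listing.

```python
MB = 1024 * 1024
CHUNK_LIMIT = 1 * MB             # default chunk size
NO_THREADING_LIMIT = 511 * MB    # limit stated for option /n

def choose_mode(chunk_safe: bool, size: int) -> str:
    """Pick a conversion mode from the /l column value and the stream size in bytes."""
    if chunk_safe or size <= CHUNK_LIMIT:
        return "default"      # threaded, chunk-wise conversion is fine
    if size <= NO_THREADING_LIMIT:
        return "/n"           # read the whole stream into the buffer
    return "unsupported"      # too large for a code page without boundary rules

print(choose_mode(True, 700 * MB))
print(choose_mode(False, 20 * MB))
print(choose_mode(False, 600 * MB))
```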

Steffen

#59 Post by **carlos** » 26 Apr 2018 16:42

I think the user should not have to care about things like the internal way the work is done.
I think it is better that the program internally makes the best decision to get an accurate result.
If the conditions are met, it should use a thread; otherwise it should not. But this decision should be made by the program, not by the user.

#60 Post by **aGerman** » 26 Apr 2018 17:00

I forgot to update the number of allowed options

Virustotal scans of version 4.1:
x86: https://www.virustotal.com/en/file/a641 ... /analysis/
x64: https://www.virustotal.com/en/file/1c0f ... /analysis/

@Carlos
Thanks for your feedback!
I already thought about that. The reason why I refrained from adding that feature in the first place was that I thought I had to enumerate all code pages in a callback every time the tool is called (similar to option /l). Meanwhile I think it can be done more easily. In the VerifyCp function I already call the GetCPInfo function without using the filled CPINFO structure. I think I can fork the program flow if I evaluate the MaxCharSize member ... It will certainly be done in the next update.

Steffen
