JREPL.BAT, ADO Streams and big files

#1 Post by **dbenham** » 27 Sep 2017 21:34

This discussion continues a topic that was started here and that became tangential to the CONVERTCP topic.

aGerman wrote:On the other hand, the main scopes of JREPL and CONVERTCP are quite different. JREPL is able (and designed) to do customized replacements, while CONVERTCP can't do that. But this is also why CONVERTCP is so much faster for big files (307s JREPL vs. 9s CONVERTCP for 360MB of Windows-1252 text converted to UTF-8 in my tests): it efficiently converts and writes in parallel threads. Converting big files was one of Saso's original requirements.

Thanks for those comparative timings, Steffen. The CONVERTCP timings are impressive :!:

In most circumstances, CONVERTCP is clearly the better option.

The only reasons I can think of for ever using JREPL instead are:
1) If your work environment disallows 3rd party exe files
2) If you already have JREPL available, and don't want to bother getting CONVERTCP
3) If you need custom transformations.

I expected JREPL to be slower, but I was a bit shocked to see that JREPL was 30 times slower than CONVERTCP :shock:

I did some timings to see what effect ADO has on reading and writing vs. using the native FileSystemObject text stream.

Code: Select all

JREPL timings for reading and writing
  a 155MB Windows-1252 encoded file

       Read   Write
      Method  Method  Seconds
      -----------------------
      Native  Native      87
      Native  ADO         71
      ADO     Native     280
      ADO     ADO        245

So there is a severe performance penalty when using ADO to read (a factor of 3+). Yet writing with ADO was modestly (but consistently) faster, by roughly 12% to 18% in these runs.
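
As a quick sanity check, the ratios can be computed directly from the table. This is only a sketch in plain JavaScript; the object keys and the helper names `ratio` and `percentSaved` are made up for illustration and are not part of JREPL.

```javascript
// Timings in seconds, copied from the table above (155MB Windows-1252 file).
var timings = {
  nativeReadNativeWrite: 87,
  nativeReadAdoWrite: 71,
  adoReadNativeWrite: 280,
  adoReadAdoWrite: 245
};

// How many times slower "slow" is than "fast", rounded to two decimals.
function ratio(slow, fast) {
  return Math.round((slow / fast) * 100) / 100;
}

// Percentage of time saved by "fast" relative to "slow", rounded to one decimal.
function percentSaved(slow, fast) {
  return Math.round((1 - fast / slow) * 1000) / 10;
}

// Cost of reading with ADO: compare rows that share the same write method.
var readPenaltyNativeWrite = ratio(timings.adoReadNativeWrite, timings.nativeReadNativeWrite);
var readPenaltyAdoWrite = ratio(timings.adoReadAdoWrite, timings.nativeReadAdoWrite);

// Gain from writing with ADO: compare rows that share the same read method.
var writeGainNativeRead = percentSaved(timings.nativeReadNativeWrite, timings.nativeReadAdoWrite);
var writeGainAdoRead = percentSaved(timings.adoReadNativeWrite, timings.adoReadAdoWrite);
```

With the numbers from the table, the read penalty comes out at 3.22 and 3.45, and the write saving at 18.4% and 12.5%.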

Dave

#2 Post by **aGerman** » 28 Sep 2017 01:40

Dave

First of all, I think in most cases it doesn't matter much if JREPL is slower. For a lot of people, converting text files with hundreds of MB of text will be the exception rather than the rule, although speed could become relevant again if you have to convert a bunch of smaller text files. Another problem Saso faced was that he was running out of memory with the first version I wrote, because I read the entire text into memory before I began to convert it. That was the point where I had to start reading the file chunk-wise. This also improved the speed: not only did it make it possible to convert in parallel threads, it also avoided reallocations of RAM.
I think this is one of the main reasons why ADO is slow. I assume that it allocates a certain amount of space and, if the file is bigger, reallocates more space again and again. Depending on the internal implementation of the buffer, it might require a contiguous range in RAM, which means all of the data already read has to be copied to another position in memory where the system was able to allocate enough space in a single block. In the worst case this happens several times.
Currently I don't have any good idea how you could work around it. Reading line-wise could be worth a try. If (and only if) ADO does not release the block of memory that it has already allocated, the speed could be improved tremendously. A little pitfall is that ADO always adds the BOM for multibyte codepages. You have to skip it except for the first line.
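
To make the reallocation argument concrete, here is a small model of how much data gets copied while one contiguous buffer grows. It only illustrates the hypothesis above; how ADO really manages its buffer is not known, and the function name and both growth policies are assumptions. Sizes are in MB to keep the numbers readable.

```javascript
// Total amount of data copied while growing one contiguous buffer up to
// totalSize, assuming every growth step has to move the (full) buffer to a
// newly allocated block. nextCapacity is a hypothetical growth policy that
// maps the current capacity to the next, larger one.
function amountCopiedWhileGrowing(totalSize, initialCapacity, nextCapacity) {
  var capacity = initialCapacity;
  var copied = 0;
  while (capacity < totalSize) {
    copied += capacity;                 // worst case: the whole buffer is moved
    capacity = nextCapacity(capacity);
  }
  return copied;
}

// Hypothetical policy 1: grow by a fixed 1 MB each time.
function growByOne(capacity) { return capacity + 1; }

// Hypothetical policy 2: double the capacity each time.
function growByDoubling(capacity) { return capacity * 2; }

// A 155 MB file, starting from a 1 MB buffer.
var copiedFixedStep = amountCopiedWhileGrowing(155, 1, growByOne);
var copiedDoubling = amountCopiedWhileGrowing(155, 1, growByDoubling);
```

In this model, fixed-step growth copies 11935 MB in total to hold a 155 MB file, while doubling copies 255 MB. Reading chunk-wise into a buffer that is reused needs no such copies at all, which is the point of the chunked approach.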

Steffen

#3 Post by **dbenham** » 28 Sep 2017 05:50

JREPL always logically reads and writes one line at a time (unless the /M multi-line option is used).

When using the native FileSystemObject, the physical reads and writes are also one line at a time, though I imagine there may be some buffering going on behind the scenes. A text stream is opened either for reading or writing, but never both.

But when ADO reads or writes a line, everything is done via memory only (presumably lots of virtual memory with large files). A single ADO stream is always bidirectional, supporting both reads and writes. After JREPL opens an input file with ADO, I explicitly load the file content into ADO memory, and the starting stream position is at the beginning of the data, ready to begin reading lines.
Conversely, JREPL writes each output line to memory in a different ADO stream, and then just before closing, I explicitly save the output stream to file.

I now realize there is one additional advantage of JREPL when dealing with multi-byte input encodings - I never have to worry about chunk boundaries, so there is no restriction on input size for any of the supported encodings (character sets).
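
The chunk boundary problem can be shown in a few lines. This sketch uses the standard TextEncoder/TextDecoder objects found in modern JavaScript runtimes (they are not available in JScript under Windows Script Host), and it is not taken from JREPL or CONVERTCP; the function names are made up for illustration.

```javascript
// 'n' followed by U+00E9 (e with acute accent), which takes two bytes in UTF-8.
var text = 'n\u00e9';
var bytes = new TextEncoder().encode(text);   // 3 bytes: 0x6E 0xC3 0xA9

// Split so that the two bytes of U+00E9 land in different chunks.
var chunks = [bytes.subarray(0, 2), bytes.subarray(2)];

// Naive approach: decode every chunk independently.
function decodeChunksIndependently(parts) {
  var out = '';
  for (var i = 0; i < parts.length; i++) {
    out += new TextDecoder('utf-8').decode(parts[i]);
  }
  return out;
}

// Boundary-aware approach: one decoder keeps the incomplete byte sequence
// between calls when the stream option is set.
function decodeChunksAsStream(parts) {
  var decoder = new TextDecoder('utf-8');
  var out = '';
  for (var i = 0; i < parts.length; i++) {
    out += decoder.decode(parts[i], { stream: true });
  }
  return out + decoder.decode();   // flush anything still pending
}

var naiveResult = decodeChunksIndependently(chunks);
var streamedResult = decodeChunksAsStream(chunks);
```

The naive result no longer equals the original text because the split character is replaced, whereas the streamed result does. A tool that loads the whole file at once never has to deal with this case.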

Dave Benham

#4 Post by **aGerman** » 28 Sep 2017 11:53

I hope you agree with my decision to split the above posts away from the CONVERTCP topic, Dave. I did it for two reasons:
1) It doesn't make much sense to discuss JScript and ADO in a thread about a C utility.
2) My hope is that we reach some more people with ADO experience if we have a thread title that matches our actual discussion.

Even if the theme is not directly about Batch I think it's okay to leave it in the public part of the forum because the objective is to improve an excellent command line tool.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Back to the topic ...

The source code of JREPL is quite long and you know best how you implemented the ADO handling. So I'm a little dependent on your confirmations or denials.

My observation was that ADO reads the whole content of the file into memory. At least the memory usage of CSCRIPT kept growing the whole time. Please correct me if I'm wrong, but as I understand your code, it loads the entire text into a single stream object. In my last reply I wrote about my assumptions regarding ADO's internal memory handling. What I believe is that a lot of memory reallocations and additional copy operations are performed internally in order to keep the whole stream buffer in a contiguous memory range. That's most likely the reason why it takes such a long time and why it could be better to read only chunks of the file. My proposal was to read and write line-wise, always using the same stream object. I just hope that reallocations are only performed when the current line is longer than the line before, and that the stream buffer is not allocated anew every time. But the more I think about it, the more I'm afraid that this won't work: the SaveToFile method has no option to append to an existing file.

Steffen

#5 Post by **dbenham** » 28 Sep 2017 18:20

aGerman wrote:I hope you agree with my decision to split the above posts away from the CONVERTCP topic, Dave. I did it for two reasons:
1) It doesn't make much sense to discuss JScript and ADO in a thread about a C utility.
2) My hope is that we reach some more people with ADO experience if we have a thread title that matches our actual discussion.

No problem. I was worried about hijacking your thread when I posted, so I think it is good that the discussion is migrated here. And I also think it is good that there is still a post within your CONVERTCP topic that references JREPL as a viable alternative, with its own pros and cons.

aGerman wrote:Back to the topic ...

I think you missed my point as to how ADO works (or else I have a misconception).

When you open up an ADO stream, it starts out empty, and it is not associated with any file. The only way to load file content into the stream is to use adoStream.LoadFromFile("filename"). This loads the entire file into memory; there is no mechanism to load just a portion of the file. It also destroys any content that was in the stream before the file was loaded.

Once loaded, you can freely set the stream position anywhere within the stream (memory) content. There are methods to read entire lines, or a fixed number of characters, from the current stream position. There are also methods to write to the current stream position, overwriting any existing content that may be at that position.

Conversely, the only way to write stream content to a file is to use adoStream.SaveToFile("filename"). It writes the entire stream content to disk in one operation, with no mechanism to write chunks.

There is an alternate way of using ADO that requires that you treat the file as a database, with each line representing a record. But I have to believe that would have even more overhead and be less efficient. It also would not be as easy to plug into my existing JREPL architecture. The code for emulating a FileSystemObject text stream object would be more complex.

Dave Benham

#6 Post by **aGerman** » 29 Sep 2017 10:40

dbenham wrote:so I think it is good that the discussion is migrated here. And I also think it is good that there is still a post within your CONVERTCP topic that references JREPL as a viable alternative, with its own pros and cons.

Thanks for adding the cross references to the posts. Please feel free to post updates for any new JREPL feature. As I said, I like to give users the information they need to choose the right tool for their needs.

dbenham wrote:I think you missed my point as to how ADO works (or else I have a misconception).

I'm quite certain I understood how it works, Dave. Admittedly I don't know how the ADO stream class is implemented because it's not open source. That's why everything I wrote is only an assumption.
As to my suggestion, most likely I wasn't quite clear. I assume my wording doesn't sound correct to a native speaker :lol: (and the different time zones don't make it easier to clarify things quickly). So I'll try to write some lines of code to illustrate what I was thinking about (even though it's not working).

*.js

Code: Select all

var inName = 'source.txt',
    outName = 'destination.txt',
    outCp = 'UTF-8',
    utf8BomLen = 3;

var objInfile = WScript.CreateObject('Scripting.FileSystemObject').OpenTextFile(inName),
    objAdoS = WScript.CreateObject('ADODB.Stream'),
    isFirst = true;

objAdoS.Type = 2; // adTypeText (1 would be adTypeBinary)
objAdoS.CharSet = outCp;
objAdoS.LineSeparator = -1; // CrLf
objAdoS.Open();

while (!objInfile.AtEndOfStream) {
  // the disadvantage of ReadLine is that you can't specify the codepage of the text read
  objAdoS.WriteText(objInfile.ReadLine(), 1); // adWriteLine
  if (isFirst) {
    objAdoS.SaveToFile(outName, 2); // adSaveCreateOverWrite
    isFirst = false;
  } else {
    // doesn't work, always the whole stream will be written
    objAdoS.Position = utf8BomLen;
    // doesn't work because either adSaveCreateOverWrite truncates the content
    //  or adSaveCreateNotExist creates a new file
    //  there's no option to append to an existing file
    objAdoS.SaveToFile(outName, 1); // adSaveCreateNotExist
  }
  objAdoS.Position = 0;
}

Steffen

#7 Post by **dbenham** » 29 Sep 2017 12:48

Yes. Based on your code comments, we seem to be in agreement as to ADO limitations regarding reading/writing text files.

ADO always reads or writes the entire file. There is no way to work with a chunk or line at the file level, only within memory.

Given that limitation, I don't see any avenue for read/write optimization.

#8 Post by **aGerman** » 29 Sep 2017 14:34

Too bad

DosTips.com
