
Copying Single Source to Multiple Destinations in Parallel

Posted: 05 Jun 2019 22:33
by Samir
Most of us who have worked with computers have run into this problem in some form or another--you have one source that needs to be copied to multiple destinations. The normal solution is to run these copy operations serially, i.e. one at a time, so the execution time is N*NumberOfDestinations.

This is fine and dandy with small batch copies or high-bandwidth copies. But what about when the source is massive and/or the bandwidth is limited? Having to repeat N for each destination can drag a 12hr copy to 5 destinations out into a 60hr, week-long job.

But in my research there is actually a solution that no one and no program seems to take advantage of--drive caching.

Whenever a file is read and then read again shortly after, there's a strong chance the second read will be served from the cache at lightning speed. It's simple to try: just copy a file that's 20MB or so and then compare it--the compare finishes almost instantly, especially if you repeat the compare command.
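For example, something like this at the prompt (the file name is just a placeholder):

Code:

rem Placeholder file name -- the copy reads the file from disk once,
rem then each fc re-reads both files; the compares finish almost
rem instantly because the data is already in the file system cache.
copy C:\temp\testfile.bin C:\temp\testfile.tmp
fc /b C:\temp\testfile.bin C:\temp\testfile.tmp
fc /b C:\temp\testfile.bin C:\temp\testfile.tmp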

When copies are done serially, each file is read multiple times because each copy job is treated separately. But what if you copied each file in parallel? Each subsequent read of the same file would be a cache hit, so the file could be read at much faster speeds for all the other copies. In fact, the only additional time would be the write time on the destinations, and since those writes also happen in parallel, the total time would only grow by the time it takes to write the file to the slowest destination.

For example, I have source A that I want to copy to drives 1, 2, and 3. To do so the logic would be something like:
copy sourceA drive1
copy sourceA drive2
copy sourceA drive3

Each of these would run only after the previous one finishes.

Now these can be parallelized by using the start (or maybe even call?) command:
start copy sourceA drive1
start copy sourceA drive2
start copy sourceA drive3

This would start each copy in parallel. However, in my experience, when there are a large number of source items, the simultaneous copy jobs can drift out of sync with each other, causing cache misses. As this progresses, instead of improving cache hits, you've drastically increased source drive seek times, which actually makes the whole operation slower. (I haven't tried this with any SSDs; those shouldn't exhibit the seek problem, just the same cache misses.)

But what if the parallelization were done on a file-by-file basis? Then you can almost surely get a cache hit, and the multiple copies won't get out of sync. This is pretty easy to implement when you have a single directory of files, as you can just loop through them with for, launching the copies for each file in parallel. At the prompt, I think the logic would be something like this (in a batch file the %f would be %%f):
for %f in (*.*) do (
start copy %f drive1
start copy %f drive2
start copy %f drive3
)

But once you introduce multiple directories, sub-directories, etc., I don't know how this would be implemented. I found the 'Additional Task for each File to be copied' thread here to be a useful guide on where to start by changing the logic to create a file list first, but I can't figure out what that logic would look like for traversing an entire source drive or tree branch. Perhaps using 'for /r' would be a solution here--something like the sketch below is what I have in mind--but I'm just not sure.
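This is completely untested and all the paths are just placeholders, but roughly:

Code:

@echo off
setlocal enabledelayedexpansion
rem Placeholder source root
set "src=D:\source"

rem Walk the whole source tree; for every file, rebuild its relative path
rem on each destination and fire off the three copies at the same time so
rem the second and third reads of the file should come out of the cache.
for /r "%src%" %%F in (*) do (
    set "rel=%%~dpF"
    set "rel=!rel:%src%=!"
    for %%D in ("E:\backup" "F:\backup" "G:\backup") do (
        if not exist "%%~D!rel!" md "%%~D!rel!"
        start "" /b cmd /c copy /y "%%~fF" "%%~D!rel!"
    )
)

This just fires off the copies without any waiting or throttling, though, so I have no idea how well the caching would actually hold up with it.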

And then there's also the question of which copy command to use on each file: copy? xcopy? robocopy? Which would be best for the overall goal of reducing the execution time from N*NumberOfDestinations down to N*SlowestDestination?

Any thoughts or ideas appreciated.

Re: Copying Single Source to Multiple Destinations in Parallel

Posted: 06 Jun 2019 08:28
by pieh-ejdsch
You can cut the scenario down to just a couple of copy passes.
Divide and conquer.
After each copy you have twice as many sources, so you can then copy from two different sources to two different destinations at once. The fast source serves the fast target.
You first make a full copy to the fastest disk.
Like this:

Code:

setlocal
if %1. == :next. shift & goto :nextcopy
set source="D:\folder1"
set destin="E:\copy1" "F:\copy2" "G:\copy3" "H:\copy4" "I:\copy5" "J:\copy6"
Call :nextcopy %source% %destin%
pause
Exit /b

:nextcopy
rem Copy the source (%1) to the first destination (%2); that destination
rem now becomes a second source for the remaining destinations.
robocopy %1 %2
set #1=%1
set #2=%2
if "%~3" == "" exit /b
set "x2=%~4"

:split
rem Deal the remaining destinations alternately onto the two lists:
rem #1 keeps the original source, #2 starts from the fresh copy.
shift
shift
set #1=%#1% %1
set #2=%#2% %2
if NOT "%~3" == "" goto :split

:next
rem Launch a new instance of this script for the first list (flagged with
rem :next) and handle the second list here; each level splits again, so
rem the number of simultaneous sources doubles with every pass.
start "1" cmd /c "%~f0" :next %#1%
if defined x2 call :nextcopy %#2%
exit /b
[edit] corrected line 5 to Call :nextcopy [/edit]

Re: Copying Single Source to Multiple Destinations in Parallel

Posted: 06 Jun 2019 09:15
by Samir
That is an interesting idea, although the code is going to take me some time to understand.

The only issue I see is with bandwidth. If running two simultaneous copies means each copy only gets half the bandwidth, then the net speed will still end up being the same.

Re: Copying Single Source to Multiple Destinations in Parallel

Posted: 06 Jun 2019 14:41
by pieh-ejdsch
I changed line 5 to Call :nextcopy ...
(That's what happens when you write from the phone.)
Of course it makes no sense to send the data over the network several times.
Maybe a batch file could be started from the server / network side so the data does not have to be resent from the client--kicked off with a trigger for automatic execution.
Robocopy has the options /ipg:n or /mt:n to work better over the network.
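For example, something like this (the share names are made up):

Code:

rem Made-up paths. /ipg:10 adds a 10 ms inter-packet gap to free up
rem bandwidth on a slow link; /mt:8 copies the files with 8 threads.
robocopy D:\folder1 \\server\copy1 /e /ipg:10
robocopy D:\folder1 \\server\copy2 /e /mt:8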

Re: Copying Single Source to Multiple Destinations in Parallel

Posted: 06 Jun 2019 20:48
by Samir
pieh-ejdsch wrote:
06 Jun 2019 14:41
I changed line 5 to Call :nextcopy ...
(That's what happens when you write from the phone.)
Of course it makes no sense to send the data over the network several times.
Maybe a batch file could be started from the server / network side so the data does not have to be resent from the client--kicked off with a trigger for automatic execution.
Robocopy has the options /ipg:n or /mt:n to work better over the network.
I never looked at all the switches for robocopy before. It looks like there is an option to monitor the source (/mon:n), which, if run in parallel for each destination, might actually cause files to always copy in parallel, since each instance would detect the new files at very nearly the same time.
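So in theory, something like this (the paths are made up, and I haven't tried it) might keep all the destinations watching the same source:

Code:

rem Made-up paths -- one monitoring robocopy per destination, all started
rem in parallel; each instance keeps watching the source and runs again
rem when it detects changes, so a new file should get picked up by all
rem three at nearly the same time.
start "" robocopy D:\source E:\backup1 /e /mon:1
start "" robocopy D:\source F:\backup2 /e /mon:1
start "" robocopy D:\source G:\backup3 /e /mon:1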