RAID 1.3 via DOS?

#1 Post by **Samir** » 20 Jul 2013 17:00

So here's the dilemma. RAID 1 is really great for redundancy--except when one of the bits on one of the drives gets corrupted. Then you don't know which is the 'correct' file!

I no longer use automated RAID1 because of this. So instead I manually mirror using xxcopy to 3 drives--what I dub RAID 1.3. So now if a bit changes on one of the drives, a simple compare to the other two drives usually sniffs out the bad copy. The problem is automating this task.

Because you guys are the sharpest batch file people I've ever seen, I'd like to hear your ideas on implementing a batch operation to do this 'error-checking'.

It would basically be given a set of three drive letters/paths. It would use one as the source. It would traverse down the tree and compare each file with the same file (with the same path since it's a mirror copy) on the other drives. It will note any discrepancies and automatically go into a comparison mode with all three drives being source to determine the bad file in a set of discrepancies. It will then, as an option, automatically replace the bad file with a good copy, with another option to rename the bad file to *.BAD. This would have to work on long filenames and be able to run in win9x/2000/xp environments or just xp if it's not feasible for the win9x/2000 command.com.

I know this is actually beyond my own batch file skills, but I'd love to attempt it with your help. Thank you!
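
The check described above is essentially a majority vote over three copies. As a minimal sketch of that decision only (Python rather than batch, purely so the logic is easy to run; `classify` and its return values are hypothetical names, and each copy is assumed to fit in memory as `bytes`, with `None` standing for a missing copy):

```python
from typing import Optional, Tuple


def classify(a: Optional[bytes], b: Optional[bytes],
             c: Optional[bytes]) -> Tuple[str, Optional[int]]:
    """Majority vote over three copies of one file.

    Returns ("ok", None) when all three agree, ("repair", i) when copy i
    is the odd one out, and ("unverifiable", None) when no two existing
    copies agree.
    """
    copies = (a, b, c)
    if a is not None and a == b == c:
        return ("ok", None)
    for bad in range(3):
        # The two copies that remain when `bad` is set aside.
        x, y = (copies[i] for i in range(3) if i != bad)
        if x is not None and x == y:
            return ("repair", bad)
    return ("unverifiable", None)
```

With two matching copies, the third can be replaced (or renamed to *.BAD first). With no matching pair there is nothing trustworthy to copy from.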

#2 Post by **penpen** » 21 Jul 2013 12:54

Doing this in batch is not recommended.
If you want a fast and reliable way to do this, I recommend learning C++ or something similar, plus a little about CRC32 and MD5. With those you can detect even small corrupted parts of a file. You also need to read far fewer bytes from disk than with batch jobs. C++ is faster anyway and has other advantages, but I don't want to say too much about C++, as it is far from scripting.
Besides, files easily become unverifiable if two disks produce errors.

If I were forced to write it in batch, I would intuitively do it this way or similarly (warning: untested):

Code:

:main
    @echo off
   cls
   setlocal enableDelayedExpansion

::   drive set
   set "driveSet=X Y Z"
   set "hdds=%driveSet: =%"
   for %%a in (0 1 2) do set "hdd[%%a]=!hdds:~%%a,1!"

   set "directories=directories.txt"
   set "files=files.txt"
   set "tmpFile=temp.txt"
   set "log=log.txt"



::   build %files% and %directories% to verify, eliminating multiple entries
   (rem:)>"%log%"

   (
      echo building %directories%

      (for %%a in (0 1 2) do for /f "tokens=1,* delims=:" %%b in ('dir !hdd[%%a]!:\ /A:D /B /O:N /S') do echo %%c) > "%directories%"
      sort "%directories%" /O "%tmpFile%"
      (
         set "lastLine="
         for /f "tokens=* delims=" %%b in ('findstr "^" "%tmpFile%"') do (
            if not "%%b" == "!lastLine!" (
               echo %%~b
               set "lastLine=%%~b"
            )
         )
      ) > "%directories%"

      del "%tmpFile%"
      echo building %directories%: finished
   ) >> "%log%"

   (
      echo building %files%

      (for %%a in (0 1 2) do for /f "tokens=1,* delims=:" %%b in ('dir !hdd[%%a]!:\ /A:-D /B /O:N /S') do echo %%c) > "%files%"
      sort "%files%" /O "%tmpFile%"
      (
         set "lastLine="
         for /f "tokens=* delims=" %%a in ('findstr "^" "%tmpFile%"') do (
            if not "%%a" == "!lastLine!" (
               echo %%~a
               set "lastLine=%%~a"
            )
         )
      ) > "%files%"
      del "%tmpFile%"

      echo building %files%: finished
   ) >> "%log%"


::   check for directory-file-name collisions and missing directories; create them if possible
   (
      echo checking directories

      set "errors="
      set "arg="
      for /f "tokens=* delims=" %%b in ('findstr "^" "%directories%"') do (
         set "args="
         for %%a in (0 1 2) do (
            for %%c in ("!hdd[%%a]!:%%~b") do (
               if not exist %%c md %%c
               if not exist %%c (   
                  set "args=!args!F"
               ) else (
                  set "arg=%%~ac"
                  set "args=!args!!arg:~0,1!"
               )
            )
         )

         if not "!args:F=!" == "!args!" (
            echo checking directories: could not create missing directory: %%b
            set "errors=true"
            set "args=!args:F=!"
         )

         if defined args if not "!args:D=!" == "" (
            echo checking directories: detected directory-file-name collision: %%b
            set "errors=true"
         )
      )

      if defined errors set "errors=with errors"
      echo checking directories: finished !errors!
   ) >> "%log%"

   if defined errors (
      echo There were errors that cannot be fixed automatically.
      echo See %log% for further information.
      exit /b 1
   )

::   checking files, create files if needed and possible
::   using coded ok in positive logic: bit 2: ok20, bit 1: ok12, bit 0: ok01
::   source: space(ok) --> driveSet
   set "source(1)=%hdd[1]%"
   set "source(2)=%hdd[2]%"
   set "source(3)=%hdd[2]%"
   set "source(4)=%hdd[0]%"
   set "source(5)=%hdd[0]%"

::   target: space(ok) --> driveSet
   set "target(1)=%hdd[2]%"
   set "target(2)=%hdd[0]%"
   set "target(3)=%hdd[0]%"
   set "target(4)=%hdd[1]%"
   set "target(5)=%hdd[1]%"

   (
      echo checking files

      for /f "tokens=* delims=" %%b in ('findstr "^" "%files%"') do (
         set "ok=7"

         if not exist "%hdd[0]%:%%b" set /A "ok&=2"
         if not exist "%hdd[1]%:%%b" set /A "ok&=4"
         if not exist "%hdd[2]%:%%b" set /A "ok&=1"

         if !ok! GEQ 4 (
            fc /B "%hdd[0]%:%%b" "%hdd[2]%:%%b" > nul
            if errorlevel 1 set /A "ok&=3"
         )

         set /A "check=!ok!&2"
         if !check! == 2 (
            fc /B "%hdd[1]%:%%b" "%hdd[2]%:%%b" > nul
            if errorlevel 1 set /A "ok&=5"
         )

         rem pair 0/1 only needs checking when no pair involving drive 2 is left
         if !ok! == 1 (
            fc /B "%hdd[0]%:%%b" "%hdd[1]%:%%b" > nul
            if errorlevel 1 set /A "ok&=6"
         )

         if not !ok! == 7 (
            if !ok! GTR 0 (
               for %%c in (!ok!) do (
                  set "source=!source(%%c)!:%%b"
                  set "target=!target(%%c)!:%%b"
               )

               rem source and target are set inside this block, so they need delayed expansion
               if not exist "!target!" (rem:)>"!target!"
               xcopy "!source!" "!target!" /V /H /R /K /O /X /Y > nul
               if errorlevel 1 fc /B "!source!" "!target!" > nul
               if errorlevel 1 (
                  echo checking files: updating failed: source: !source!
                  echo checking files: updating failed: target: !target!
                  set "errors=true"
               )
            ) else (
               echo checking files: unverifiable %%b
               set "errors=true"
            )
         )
      )

      if defined errors set "errors=with errors"
      echo checking files: finished !errors!
   ) >> "%log%"

   if defined errors (
      echo There were errors that cannot be fixed automatically.
      echo See %log% for further information.
      exit /b 1
   )

   echo Finished software raid 1.3 update successfully.
   goto :eof

penpen
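
For anyone tracing the `ok` bitmask in the script above, here is a rough Python model of the same decision table. It is an illustration and not part of penpen's script: `plan_repair` is a hypothetical name, and file contents are passed as `bytes`, with `None` for a missing file. Bit 2 means drives 0 and 2 agree, bit 1 means drives 1 and 2 agree, bit 0 means drives 0 and 1 agree.

```python
# Which drive to copy from and to for each surviving value of `ok`,
# mirroring the source(n) / target(n) tables in the batch script.
SOURCE = {1: 1, 2: 2, 3: 2, 4: 0, 5: 0}
TARGET = {1: 2, 2: 0, 3: 0, 4: 1, 5: 1}


def plan_repair(copies):
    """Return None if all copies agree, (source, target) drive indices
    for a repair, or "unverifiable" if no trustworthy pair exists."""
    ok = 7
    # A missing file invalidates both pairs it takes part in.
    if copies[0] is None:
        ok &= 2
    if copies[1] is None:
        ok &= 4
    if copies[2] is None:
        ok &= 1
    # Compare 0 with 2, then 1 with 2, as the script does with fc.
    if ok >= 4 and copies[0] != copies[2]:
        ok &= 3
    if ok & 2 and copies[1] != copies[2]:
        ok &= 5
    # Pair 0/1 is only compared when it is the last candidate left.
    if ok == 1 and copies[0] != copies[1]:
        ok &= 6
    if ok == 7:
        return None
    if ok == 0:
        return "unverifiable"
    return (SOURCE[ok], TARGET[ok])
```

For example, `plan_repair([b"x", b"y", b"x"])` ends with `ok == 5`, which the table maps to copying from drive 0 to drive 1.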

#3 Post by **Samir** » 21 Jul 2013 13:51

penpen wrote:Doing this in batch is not recommended.
If you want a fast and reliable way to do this, I recommend learning C++ or something similar, plus a little about CRC32 and MD5. With those you can detect even small corrupted parts of a file. You also need to read far fewer bytes from disk than with batch jobs. C++ is faster anyway and has other advantages, but I don't want to say too much about C++, as it is far from scripting.
Besides, files easily become unverifiable if two disks produce errors.

I like batch because it's stable, available, and relatively fast. I've taken some courses in C, but it gets really complicated and just as IO intensive if the entire file is being read. There are MD5 generator/verifiers out there, but that doesn't really do the job of comparing one file to another. And if an md5 file itself gets corrupted, well that becomes another issue in itself.

Definitely an amazing amount of batch to read! Thank you for sharing! It's going to take me weeks if not months to have enough time to parse all the way through it, and I'm sure I'll have some questions. But this is a great start on this project.

#4 Post by **penpen** » 21 Jul 2013 14:42

Samir wrote:... but it gets really complicated and just as IO intensive if the entire file is being read.

No. If you have file F on 3 drives A, B, C, then in batch you have to compare it pairwise, so you read this file (from different locations) at least 4 times if there is no difference, or up to 6 times if one copy is corrupted.
For example:

Code:

:: A:\F ?= B:\F read two times:
fc /b A:\F B:\F

:: B:\F ?= C:\F read two times:
fc /b B:\F C:\F

:: (not A:\F == B:\F) and (not B:\F == C:\F) because B:\F differs, then read another 2 times
fc /b A:\F C:\F

Using C++ or similar lets you read this file only 3 times, as you can stop reading the file at any position and compare the data parts in buffers: much faster.

Samir wrote:There are MD5 generator/verifiers out there, but that doesn't really do the job of comparing one file to another.

That is only the simplest use of these checksums, and not the only one:
You can apply them to blocks of bytes smaller than whole files, and you may overlap these blocks, and so on.
Done properly, you may even be able to restore the data of blocks of up to 128 bytes. That is impossible in simple batch; maybe it can be done with VBS or JScript, but I wouldn't bet on it.

Samir wrote:And if an md5 file itself gets corrupted, well that becomes another issue in itself.

No, it is the same problem, because the CRC32 checksum, for example, is part of the protected data:
the crc32 algorithm on (data, 0x00000000) gives you the crc32 checksum, and
the crc32 algorithm on (data, crc32-checksum) gives you 0x00000000 as the new crc32 checksum.
Edit: Sorry, with a pure MD5 checksum it is indeed another issue, and not least because of that, CRC32 should be used in addition.

penpen
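
The self-checking property described above can be tried with Python's `zlib.crc32`. One caveat: the result is exactly zero only for a plain CRC with a zero start value and no final inversion. The common CRC-32 used by zlib inverts its output, so running it over data plus its own checksum (stored low byte first) gives a fixed constant, 0x2144DF1C, instead of zero. The helper names `append_crc` and `is_intact` are made up for this sketch.

```python
import struct
import zlib

# Constant left over when zlib's CRC-32 runs across data followed by
# that data's own CRC-32 in little-endian byte order.
CRC32_RESIDUE = 0x2144DF1C


def append_crc(data: bytes) -> bytes:
    """Return data with its CRC-32 appended (4 bytes, little-endian)."""
    return data + struct.pack("<I", zlib.crc32(data))


def is_intact(blob: bytes) -> bool:
    """Check data and checksum together in a single pass."""
    return zlib.crc32(blob) == CRC32_RESIDUE
```

Because the checksum is covered by the same pass, a flipped bit in the stored checksum is caught just like a flipped bit in the data.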

#5 Post by **Samir** » 21 Jul 2013 22:09

True, but if most of the files are the same, you still have to read them completely. The only time saved will be on corrupt files--which we hope would not be many!

Oh, I'm sure there's all sorts of sophisticated ways to do it even to a block or sector level, but that starts getting into RAID3/4/5. And of course, not really a way to do it with batch unless someone is using debug.

With a crc as part of the file, that does alleviate some of the problem, but it can affect file usability depending on the file format and how the application/os is going to read the file.

#6 Post by **penpen** » 22 Jul 2013 01:39

Even in the good case you read only 75% as much data with C++ or similar as with pure batch.

And the checksum need not really be part of the file: you can compute the MD5 hash and CRC32 checksum for every 2024 (just an example) bytes of file data and store them somewhere else, so they don't affect the file's usability.

penpen
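
A rough sketch of that sidecar idea in Python (not penpen's code): checksums are computed per block and kept apart from the file, so a later pass can point at the damaged block. The names `block_checksums` and `damaged_blocks` are hypothetical, the default block size of 2024 is just the example figure from the post, and the data is held in memory for brevity.

```python
import hashlib
import zlib


def block_checksums(data: bytes, block_size: int = 2024):
    """Return one (crc32, md5 hex digest) pair per block of data."""
    return [
        (zlib.crc32(data[i:i + block_size]),
         hashlib.md5(data[i:i + block_size]).hexdigest())
        for i in range(0, len(data), block_size)
    ]


def damaged_blocks(data: bytes, stored, block_size: int = 2024):
    """Return the indices of blocks whose checksums no longer match.

    `stored` is the list produced earlier by block_checksums and kept
    somewhere other than the file itself.
    """
    current = block_checksums(data, block_size)
    return [i for i, (now, then) in enumerate(zip(current, stored))
            if now != then]
```

Only the blocks reported as damaged would then need to be fetched from another drive, rather than the whole file.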

#7 Post by **Squashman** » 22 Jul 2013 05:52

Samir wrote:I like batch because it's stable, available, and relatively fast. I've taken some courses in C, but it gets really complicated and just as IO intensive if the entire file is being read.

Going to have to totally disagree with you on that. Any compiled language like C is going to be tenfold faster at performing file operations than batch.

#8 Post by **Samir** » 09 Sep 2013 16:18

Squashman wrote:
Samir wrote:I like batch because it's stable, available, and relatively fast. I've taken some courses in C, but it gets really complicated and just as IO intensive if the entire file is being read.

Going to have to totally disagree with you on that. Any compiled language like C is going to be tenfold faster at performing file operations than batch.

Depends on the compiler. If I want to compile this for pure DOS, it's just ancient turbo c, which I've found is no faster than batch on pure file reads.

#9 Post by **Samir** » 09 Sep 2013 16:21

penpen wrote:Even in the good case you read only 75% as much data with C++ or similar as with pure batch.

And the checksum need not really be part of the file: you can compute the MD5 hash and CRC32 checksum for every 2024 (just an example) bytes of file data and store them somewhere else, so they don't affect the file's usability.

penpen

I'm a bit puzzled by how, in the good case, it's only 75% of the reading. If I understand correctly, you're saying that a C read will stop when it encounters an error, whereas batch will continue reading. If the files are all the same, wouldn't C and batch read the entire file?

Computing an md5 and crc32 sidecar file would work, but then there's even more files. Although the comparison would be much quicker since the files would be smaller.

#10 Post by **penpen** » 09 Sep 2013 17:08

Samir wrote:Depends on the compiler. If I want to compile this for pure DOS, it's just ancient turbo c, which I've found is no faster than batch on pure file reads.

If it is an ANSI C/C++ compiler, then it depends on the knowledge (and iron will) of the programmer,
since you can use assembler code, or at least raw opcodes, and construct the file streams and buffers yourself.

Samir wrote:I'm a bit puzzled by how, in the good case, it's only 75% of the reading. If I understand correctly, you're saying that a C read will stop when it encounters an error, whereas batch will continue reading. If the files are all the same, wouldn't C and batch read the entire file?

No, that is not what I meant.

You are using fc to compare files, and the good case is that all the files are the same.
Let's assume these files with the same content are named A, B and C.
With fc you have to run one of the following cases to determine whether they are equal:

Code:

:: case 1: compare {A, B}, compare {A, C}
:: case 2: compare {A, B}, compare {B, C}
:: case 3: compare {A, C}, compare {A, B}
:: case 4: compare {A, C}, compare {C, B}
:: case 5: compare {B, C}, compare {B, A}
:: case 6: compare {B, C}, compare {C, A}
:: all cases are equivalent, so you have to do this (shown here for case 1):
fc /b A B
fc /b A C

So you read file A twice, and files B and C once each, from the HDD:
in total you have read this file 4 times (from different locations).

With ANSI C, C++, or any other language with the needed capabilities, you just read the files (in full or in parts) from disk into RAM.
Then you compare the contents of the (RAM) buffers with each other.
Done this way, you read the (same) file (from different locations) only 3 times from disk.

So you read only 3/4 = 75% of the data from disk.
And because HDDs are slow compared to RAM, you also need only about 3/4 of the time.

penpen
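
The arithmetic above, written out as a small Python check. It assumes, as the thread does, that every fc-style comparison reads both of its files in full; the function names and the 100 MB file size are made up for illustration.

```python
def pairwise_bytes(size: int, comparisons: int) -> int:
    """Bytes read when each comparison reads both of its files in full."""
    return 2 * comparisons * size


def buffered_bytes(size: int, copies: int = 3) -> int:
    """Bytes read when each copy is read exactly once into a buffer."""
    return copies * size


size = 100 * 1024 * 1024  # hypothetical 100 MB file

# Good case: two comparisons are enough to show all copies are equal.
good_case = buffered_bytes(size) / pairwise_bytes(size, comparisons=2)
# Bad case: a third comparison is needed to find the odd copy.
bad_case = buffered_bytes(size) / pairwise_bytes(size, comparisons=3)

print(good_case, bad_case)  # prints "0.75 0.5"
```

So the saving is 25% when everything matches and 50% when one copy has to be tracked down.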

#11 Post by **Samir** » 24 Nov 2013 21:27

Ahh, I know what you mean about the compilers as assembly can be tedious but brutally efficient.

And your 75% case makes much more sense now. But given that most file sizes are within the cache limits of the drive or operating system, would it be safe to bet on file A (in your example) being a cache hit vs being re-read?

#12 Post by **penpen** » 25 Nov 2013 06:20

It is always unsafe to bet on the state of any cache, unless you are writing a cache driver, which should know its own internal state.
But I didn't bet on any cache state in my example above;
I used buffering instead of caching:

penpen wrote:Then you compare the contents of the (RAM) buffers with each other.

So there are no hit/(re)read problems.

penpen

#13 Post by **Samir** » 26 Nov 2013 11:55

Makes sense, except you'd need enough RAM for 2x the largest file size.

#14 Post by **penpen** » 26 Nov 2013 14:05

There is no need to create a buffer that can hold any file on your system.
A buffer large enough to hold the HDD cache contents suffices.
So on most systems 4 MB per file is all you need to compare the file contents piecewise.

penpen
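
A minimal Python sketch of such a piecewise comparison (the thread has C++ in mind; Python is used here only so the idea can be run as-is). Each of the three copies is read once, one chunk at a time, so memory use stays at three chunks no matter how large the file is. `compare_three` and its return values are hypothetical names, and the 4 MB default simply follows the figure in the post. The streams are assumed to be ordinary binary file objects, which return full chunks until end of file.

```python
import io


def compare_three(fa, fb, fc, chunk_size=4 * 1024 * 1024):
    """Compare three binary streams, reading each one only once.

    Returns ("ok", None) if all agree, ("repair", i) if stream i is the
    odd one out, or ("unverifiable", None) if no two streams agree.
    """
    ab = bc = ac = True  # does each pair still match so far?
    while ab or bc or ac:  # stop early once every pair has diverged
        a = fa.read(chunk_size)
        b = fb.read(chunk_size)
        c = fc.read(chunk_size)
        if not a and not b and not c:
            break  # all three streams are exhausted
        ab, bc, ac = ab and a == b, bc and b == c, ac and a == c
    if ab and bc:
        return ("ok", None)
    if ab:
        return ("repair", 2)
    if bc:
        return ("repair", 0)
    if ac:
        return ("repair", 1)
    return ("unverifiable", None)
```

With real files this would be called as `compare_three(open(p0, "rb"), open(p1, "rb"), open(p2, "rb"))`, where the three paths point at the same file on each drive. The whole file is still read from every drive, but only once per drive.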

#15 Post by **Samir** » 26 Nov 2013 19:48

Forgive me for not completely understanding, but if it's a piecewise comparison, won't the whole file, minus whatever is in the buffer and cache, still have to be read?
