robust line counter

Posted: 03 May 2013 13:09
by Sponge Belly
Below is a little program to count the number of lines in a text file. It has the following features:
  • Supports Windows and Unix line endings.
  • Handles long lines.
  • Copes with notorious show-stoppers like Control Z and the Null Character.
  • Works with Unicode as well as ANSI text files.
  • Can count lines in files that have more lines than the largest Batch integer (2,147,483,647).
Jump to this post for the latest version of the subroutine.

Re: robust line counter

Posted: 03 May 2013 17:35
by Ranguna173
I use this in my batches:

Code: Select all

findstr /R /N "^" %file% | find /C ":">lines
< lines set /p "num="
del lines

Re: robust line counter

Posted: 03 May 2013 18:05
by Squashman
I just use good old find /C. It works for 99.9% of the data I work with.

Re: robust line counter

Posted: 03 May 2013 23:06
by foxidrive
Sponge Belly wrote:Hello All! :-)

Below is a little program to count the number of lines in a text file. It has the following features:

  • Supports Windows and Unix line endings (but not MacOS 9 or earlier).
  • Handles extremely long lines.
  • Copes with notorious show-stoppers like Control Z and the Null Character.



Can you test this on your suite of text files?

Code: Select all

find /c /v "" <file

Re: robust line counter

Posted: 04 May 2013 05:42
by Sponge Belly
Hi Again!

Thanks for your replies.

@Foxi: I did too run tests. :P One thing I discovered is that find turns null characters into newlines which causes an incorrect count.

@Squashman: You’re right, of course. Good old find /c /v "" < file is just fine… 99% of the time. My code is an attempt to deal with those less than 1% of edge cases where you may find yourself working with text that you can’t make any assumptions about.

@Ranguna173: D’oh! :oops: That’s so clever and so simple. Wish I’d thought of it! :-) And it isn’t tripped up by null characters, for some reason. Once again I am reminded of how much I have yet to learn…

- SB

Re: robust line counter

Posted: 04 May 2013 07:27
by foxidrive
Sponge Belly wrote:@Foxi: I did too run tests. :P One thing I discovered is that find turns null characters into newlines which causes an incorrect count.


Thanks, confirmed in Win 8 too.


But so does this when there's a null at the beginning of the line anyway.

Code: Select all

findstr /r /n "^" a.bat | find /c ":"


Your code also fails here. This is the test file, which should report 3 lines when checked; the three techniques here report 4 lines: http://www.astronomy.comoj.com/testbat.zip

Re: robust line counter

Posted: 04 May 2013 14:20
by Sponge Belly
Hi Foxi!

Had a look at your test file. It has four lines. The fourth line doesn’t end with a CR+LF, but it still counts. Get a hex dump from certutil if you don’t believe me. ;-)
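
For cross-checking outside Batch, the counting rule described here (every LF ends a line, and an unterminated tail still counts as a line) can be sketched in Python. This is only a reference sketch, not the code from the opening post; the function name `count_lines` and the sample bytes are made up for illustration.

```python
def count_lines(data: bytes) -> int:
    """Count lines in raw file contents.

    Assumed rule: each LF terminates a line (this covers both CRLF and LF
    endings), and any bytes left after the last LF form one more line.
    """
    terminated = data.count(b"\n")
    # An unterminated final line still counts as a line.
    has_tail = len(data) > 0 and not data.endswith(b"\n")
    return terminated + (1 if has_tail else 0)


if __name__ == "__main__":
    # Three CRLF-terminated lines plus a fourth line with no terminator.
    sample = b"one\r\ntwo\r\nthree\r\nfour"
    print(count_lines(sample))  # the unterminated fourth line is counted too
```

Counting only CR+LF pairs in that sample would give 3, which is the mismatch discussed in this thread. NUL bytes are left alone here, so they never change the count.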

I tried both mine and Ranguna173's code on a test file with null characters at the beginning, middle and end of lines. They weren’t fooled. Windows 7 Home Premium, fwiw.

Btw, Ranguna173's golden nugget can be rewritten as:

Code: Select all

for /f %%l in ('
findstr /n "^" "%~1" ^| find /c ":"
') do set lines=%%l
echo(file "%~1" has %lines% lines


Hope this helps! :-)

- SB

Re: robust line counter

Posted: 04 May 2013 15:00
by Squashman
What does it matter if the last line doesn't have a CRLF? If the previous line does, the last one should still be counted as a line.

Re: robust line counter

Posted: 04 May 2013 21:44
by foxidrive
Sponge Belly wrote:Hi Foxi!
Had a look at your test file. It has four lines. The fourth line doesn’t end with a CR+LF, but it still counts. Get a hex dump from certutil if you don’t believe me. ;-)


Yes, you are right. I was counting CRLF and didn't see the obvious.

Sponge Belly wrote:Btw, Ranguna173's golden nugget can be rewritten as:

Code: Select all

for /f %%l in ('
findstr /n "^" "%~1" ^| find /c ":"
') do set lines=%%l
echo(file "%~1" has %lines% lines



That technique fails with a.txt in this file: http://www.astronomy.comoj.com/testfile.zip
It reports 4 lines instead of 3 because of the NULL issue in FIND.EXE.


This works, however:

Code: Select all

@echo off
for /f "delims=:" %%a in ('findstr /n "^" "%~1"') do set lines=%%a
echo %lines% lines
pause

Re: robust line counter

Posted: 05 May 2013 15:04
by Sponge Belly
Dammit, Foxi. You’re right. :cry:

If a colon comes after a null character on a line of output from findstr /n "^" "%~1", find /c ":" will count it as two lines. That torpedoes Ranguna173’s otherwise elegant solution. :-(
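
The double count can be reproduced with a small Python model. It is a sketch built on the behaviour reported in this thread (a NUL acting like a line break for find), not a description of FIND.EXE internals; `count_matching_lines` and the sample data are hypothetical.

```python
def count_matching_lines(data: bytes, needle: bytes = b":",
                         nul_breaks_line: bool = True) -> int:
    """Count lines containing `needle`, loosely imitating `find /c`.

    nul_breaks_line=True models the behaviour reported in this thread,
    where a NUL is treated as if it were a line break. That is an
    assumption based on the posts, not documented behaviour.
    """
    if nul_breaks_line:
        data = data.replace(b"\x00", b"\n")
    return sum(1 for line in data.split(b"\n") if needle in line)


if __name__ == "__main__":
    # Hypothetical `findstr /n "^"` output for a 3-line file whose second
    # line has a colon somewhere after a NUL.
    numbered = b"1:alpha\r\n2:be\x00ta:x\r\n3:gamma\r\n"
    print(count_matching_lines(numbered, nul_breaks_line=False))  # 3
    print(count_matching_lines(numbered, nul_breaks_line=True))   # 4
```

In this model a NUL with no colon after it does not inflate the count, because the extra fragment it creates contains no colon. That would be consistent with the trick surviving some NUL test files and failing on others.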

As to your second point, try this (note the null character on line 4):

Code: Select all

findstr /n "^" a2z.txt

Output:
1:the quick
2:brown fox
3:jumps over
4:the<NUL>lazy
5:dog


But if you wrap it up in a for /f loop, you’re in for a surprise:

Code: Select all

for /f delims^= %%l in ('
findstr /n "^" a2z.txt
') do echo(%%l

Output:
1:the quick
2:brown fox
3:jumps over
4:the5:dog


Near as I can tell, if a null character is output inside the in (...) clause of a for /f loop, the null character and anything following it is discarded up until the end of line. The newline is suppressed and the next line is appended instead. This all happens in one iteration of the loop, not two. In fact, the line to be output will keep growing so long as a null character is found on each successive line. :twisted:
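
That guess can be written down as a small Python model and checked against the output shown above. It is a sketch of the observed effect only, not of how cmd.exe works internally; `simulate_for_f` is a made-up name, and what happens when the very last line contains a NUL is an assumption.

```python
def simulate_for_f(lines):
    """Model the merging effect described above.

    For each line: if it contains a NUL, keep only the text before the NUL,
    drop the line break, and glue the next line onto it.
    """
    out = []
    pending = ""
    for line in lines:
        head, nul, _rest = line.partition("\x00")
        pending += head
        if not nul:
            # No NUL on this line: emit the accumulated text as one iteration.
            out.append(pending)
            pending = ""
    if pending:
        # Assumption: text left over from a NUL on the last line is still emitted.
        out.append(pending)
    return out


if __name__ == "__main__":
    numbered = ["1:the quick", "2:brown fox", "3:jumps over",
                "4:the\x00lazy", "5:dog"]
    for item in simulate_for_f(numbered):
        print(item)  # last item printed is 4:the5:dog
```

Five input lines become four iterations in the model, so any counter that relies on one iteration per line would come up short on such a file.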

But the good news is that the code in the OP still holds up. ;-)

- SB

Re: robust line counter

Posted: 06 May 2013 00:00
by foxidrive

That's interesting. Another gotcha to recall when needed. :)

Re: robust line counter

Posted: 20 Oct 2014 07:44
by Sponge Belly
Jump to this post for the latest version of the subroutine.

Re: robust line counter

Posted: 20 Oct 2014 12:36
by siberia-man

Code: Select all

more FILENAME | find /c /v ""


this works, as well

Re: robust line counter

Posted: 20 Oct 2014 15:47
by Squashman
siberia-man wrote:

Code: Select all

more FILENAME | find /c /v ""


this works, as well

Try that on a file with more than 65K lines.

Re: robust line counter

Posted: 20 Oct 2014 16:36
by foxidrive
I see that the method used by SB is supposed to handle NULs etc.

This is effective for plain text.

Code: Select all

find /c /v "" < FILENAME