robust line counter

Discussion forum for all Windows batch related topics.

Moderator: DosItHelp

Sponge Belly
Posts: 196
Joined: 01 Oct 2012 13:32
Location: Ireland

#1 Post by Sponge Belly » 03 May 2013 13:09

Below is a little program to count the number of lines in a text file. It has the following features:
  • Supports Windows and Unix line endings.
  • Handles long lines.
  • Copes with notorious show-stoppers like Control Z and the Null Character.
  • Works with Unicode as well as ANSI text files.
  • Can count lines in files with more lines than the maximum Batch integer (2,147,483,647).
Jump to this post for the latest version of the subroutine.
Last edited by Sponge Belly on 28 Feb 2018 13:59, edited 4 times in total.

Ranguna173
Posts: 104
Joined: 28 Jul 2011 17:32

Re: robust line counter

#2 Post by Ranguna173 » 03 May 2013 17:35

I use this in my batches:

Code:

findstr /R /N "^" %file% | find /C ":">lines
< lines set /p "num="
del lines

Squashman
Expert
Posts: 4107
Joined: 23 Dec 2011 13:59

Re: robust line counter

#3 Post by Squashman » 03 May 2013 18:05

I just use good old find /C. It works for 99.9% of the data I work with.

foxidrive
Expert
Posts: 6033
Joined: 10 Feb 2012 02:20

Re: robust line counter

#4 Post by foxidrive » 03 May 2013 23:06

Sponge Belly wrote:Hello All! :-)

Below is a little program to count the number of lines in a text file. It has the following features:

  • Supports Windows and Unix line endings (but not MacOS 9 or earlier).
  • Handles extremely long lines.
  • Copes with notorious show-stoppers like Control Z and the Null Character.



Can you test this on your suite of text files?

Code:

find /c /v "" <file

Re: robust line counter

#5 Post by Sponge Belly » 04 May 2013 05:42

Hi Again!

Thanks for your replies.

@Foxi: I did too run tests. :P One thing I discovered is that find turns null characters into newlines which causes an incorrect count.

@Squashman: You’re right, of course. Good old find /c /v "" < file is just fine… 99% of the time. My code is an attempt to deal with the remaining 1% of edge cases, where you’re working with text you can’t make any assumptions about.

@Ranguna173: D’oh! :oops: That’s so clever and so simple. Wish I’d thought of it! :-) And it isn’t tripped up by null characters, for some reason. Once again I am reminded of how much I have yet to learn…

- SB

Re: robust line counter

#6 Post by foxidrive » 04 May 2013 07:27

Sponge Belly wrote:@Foxi: I did too run tests. :P One thing I discovered is that find turns null characters into newlines which causes an incorrect count.


Thanks, confirmed in Win 8 too.


But so does this when there's a null at the beginning of the line anyway.

Code:

findstr /r /n "^" a.bat | find /c ":"


And your code also fails here. This is the test file, which should report 3 lines when checked; all three techniques here report 4 lines. http://www.astronomy.comoj.com/testbat.zip

Re: robust line counter

#7 Post by Sponge Belly » 04 May 2013 14:20

Hi Foxi!

Had a look at your test file. It has four lines. The fourth line doesn’t end with a CR+LF, but it still counts. Get a hex dump from certutil if you don’t believe me. ;-)

I tried both mine and Ranguna173's code on a test file with null characters at the beginning, middle and end of lines. They weren’t fooled. Windows 7 Home Premium, fwiw.
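
For anyone who wants to sanity-check a count outside of batch, here is a rough Python sketch of the counting rule argued for in this post: every LF ends a line (so CR+LF and bare LF both work), a final line with no line ending still counts, and NUL or Ctrl-Z bytes are treated as ordinary data. The function name count_lines and the sample bytes are made up for illustration; this is not the subroutine from the OP.

```python
def count_lines(data: bytes) -> int:
    """Count lines in raw bytes: LF-terminated lines plus an unterminated tail."""
    if not data:
        return 0
    count = data.count(b"\n")
    # A final line without a line ending still counts as a line.
    if not data.endswith(b"\n"):
        count += 1
    return count


# Hypothetical sample: three CR+LF lines, then a fourth with no line ending.
# The NUL (0x00) and Ctrl-Z (0x1A) bytes are treated as ordinary data.
sample = b"one\r\n\x00two\r\nth\x1aree\r\nfour"
print(count_lines(sample))  # prints 4
```

Counting raw LF bytes avoids any text-mode reading, which is where Ctrl-Z and NUL tend to cause trouble.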

Btw, Ranguna173's golden nugget can be rewritten as:

Code:

for /f %%l in ('
findstr /n "^" "%~1" ^| find /c ":"
') do set lines=%%l
echo(file "%~1" has %lines% lines


Hope this helps! :-)

- SB

Re: robust line counter

#8 Post by Squashman » 04 May 2013 15:00

What does it matter if the last line doesn't have a CRLF? If the previous line does, it should still be counted as a line.

Re: robust line counter

#9 Post by foxidrive » 04 May 2013 21:44

Sponge Belly wrote:Hi Foxi!
Had a look at your test file. It has four lines. The fourth line doesn’t end with a CR+LF, but it still counts. Get a hex dump from certutil if you don’t believe me. ;-)


Yes, you are right. I was counting CRLF and didn't see the obvious.

Sponge Belly wrote:Btw, Ranguna173's golden nugget can be rewritten as:

Code:

for /f %%l in ('
findstr /n "^" "%~1" ^| find /c ":"
') do set lines=%%l
echo(file "%~1" has %lines% lines



That technique fails with a.txt in this file: http://www.astronomy.comoj.com/testfile.zip
where it reports 4 lines instead of 3 because of the NUL issue in FIND.EXE.


This works however:

Code:

@echo off
rem findstr /n prefixes each line with "N:"; keep only N, so the last value seen is the count.
for /f "delims=:" %%a in ('findstr /n "^" "%~1"') do set lines=%%a
echo %lines% lines
pause

Re: robust line counter

#10 Post by Sponge Belly » 05 May 2013 15:04

Dammit, Foxi. You’re right. :cry:

If a colon comes after a null character on a line of output from findstr /n "^" "%~1", find /c ":" will count it as two lines. That torpedoes Ranguna173’s otherwise elegant solution. :-(
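
To make the failure concrete, here is a small Python model of what the thread describes: find treats a null character as a line break, so one line of findstr /n output with a colon after the NUL gets counted twice. It only imitates the behaviour reported above and is not how FIND.EXE is actually implemented; the name find_count_colon and the sample text are invented for illustration.

```python
import re


def find_count_colon(findstr_output: bytes) -> int:
    """Model of `find /c ":"` as described here: NUL acts like a line break."""
    # Split on LF and on NUL, then count the pieces that contain a colon.
    pieces = re.split(rb"[\n\x00]", findstr_output)
    return sum(1 for piece in pieces if b":" in piece)


# Hypothetical `findstr /n "^"` output for a 3-line file whose second line
# is "b<NUL>c:d", i.e. it has a colon after the null character.
output = b"1:a\r\n2:b\x00c:d\r\n3:e\r\n"
print(find_count_colon(output))  # prints 4, although the file has 3 lines
```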

As to your second point, try this (note the null character on line 4):

Code:

findstr /n "^" a2z.txt

Output:
1:the quick
2:brown fox
3:jumps over
4:the<NUL>lazy
5:dog


But if you wrap it up in a for /f loop, you’re in for a surprise:

Code:

for /f delims^= %%l in ('
findstr /n "^" a2z.txt
') do echo(%%l

Output:
1:the quick
2:brown fox
3:jumps over
4:the5:dog


Near as I can tell, if a null character is output inside the in (...) clause of a for /f loop, the null character and anything following it is discarded up until the end of line. The newline is suppressed and the next line is appended instead. This all happens in one iteration of the loop, not two. In fact, the line to be output will keep growing so long as a null character is found on each successive line. :twisted:
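
The rule described above can be restated as a short Python model, just to pin it down: cut each line at the first NUL, suppress that line's newline, and glue the next line on. It reproduces the a2z.txt output shown earlier. The name for_f_lines is invented, and this only models the behaviour reported in this post, not how cmd.exe works internally.

```python
def for_f_lines(text: str) -> list:
    """Model the for /f behaviour described above for NUL characters."""
    results = []
    pending = ""  # text carried over from lines that were cut at a NUL
    for line in text.splitlines():
        head, nul, _discarded = line.partition("\x00")
        if nul:
            # Drop the NUL and the rest of the line, suppress the newline,
            # and let the next line be appended to what was kept.
            pending += head
        else:
            results.append(pending + line)
            pending = ""
    if pending:
        # The last line contained a NUL, so nothing followed to complete it.
        results.append(pending)
    return results


findstr_output = "1:the quick\n2:brown fox\n3:jumps over\n4:the\x00lazy\n5:dog\n"
for item in for_f_lines(findstr_output):
    print(item)
```

Five lines go in and four come out, the last being 4:the5:dog, which is why taking the final line-number prefix inside a for /f loop can undercount.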

But the good news is that the code in the OP still holds up. ;-)

- SB

Re: robust line counter

#11 Post by foxidrive » 06 May 2013 00:00

Sponge Belly wrote:try this (note the null character on line 4):

Code:

findstr /n "^" a2z.txt

Output:
1:the quick
2:brown fox
3:jumps over
4:the<NUL>lazy
5:dog


But if you wrap it up in a for /f loop, you’re in for a surprise:

Code:

for /f delims^= %%l in ('
findstr /n "^" a2z.txt
') do echo(%%l

Output:
1:the quick
2:brown fox
3:jumps over
4:the5:dog


Near as I can tell, if a null character is output inside the in (...) clause of a for /f loop, the null character and anything following it is discarded up until the end of line. The newline is suppressed and the next line is appended instead. This all happens in one iteration of the loop, not two. In fact, the line to be output will keep growing so long as a null character is found on each successive line. :twisted:

But the good news is that the code in the OP still holds up. ;-)



That's interesting. Another gotcha to recall when needed. :)

Re: robust line counter

#12 Post by Sponge Belly » 20 Oct 2014 07:44

Jump to this post for the latest version of the subroutine.
Last edited by Sponge Belly on 28 Feb 2018 14:00, edited 2 times in total.

siberia-man
Posts: 126
Joined: 26 Dec 2013 09:28

Re: robust line counter

#13 Post by siberia-man » 20 Oct 2014 12:36

Code:

more FILENAME | find /c /v ""


This works as well.

Re: robust line counter

#14 Post by Squashman » 20 Oct 2014 15:47

siberia-man wrote:

Code:

more FILENAME | find /c /v ""


This works as well.

Try that on a file with more than 65K lines.

Re: robust line counter

#15 Post by foxidrive » 20 Oct 2014 16:36

I see that the method used by SB is supposed to handle NULs etc.

This is effective for plain text.

Code:

find /c /v "" < FILENAME
