findstr regex

Sponge Belly
Posts: 216
Joined: 01 Oct 2012 13:32
Location: Ireland

findstr regex

#1 Post by Sponge Belly » 26 Mar 2014 19:30

Dear DosTippers,

I’m trying to compose a findstr regex that wraps around a line break: it starts with a line feed, then matches any run of characters other than a line feed, followed by zero or more carriage returns and another line feed.

Trouble is, "." (dot) matches everything except LF or CR, and "[^!lf!]*" isn’t working the way I hoped it would. So, is there a way to match every character (including CR), except LF?

Why? Because if I ever find this Holy Grail of a regex, I'll be able to do this:

Code: Select all

findstr /nrv "regex" filename.txt


which will hopefully automagically spit out the number and contents of the last line of a file, regardless of size or length, etc.

TIA! :-)

- SB

dbenham
Expert
Posts: 2461
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: findstr regex

#2 Post by dbenham » 26 Mar 2014 21:29

Great idea :D

I cannot get a perfect result because a FINDSTR bug allows the dot to match end-of-file if the line is not terminated by a lineFeed.

But I can get very close. :D
The script will print an extra line if the last line is not terminated by a linefeed.

Code: Select all

@echo off
setlocal enableDelayedExpansion

:: Define LF to contain a lineFeed
set ^"LF=^

^" The empty line above is critical - DO NOT REMOVE

:: Define CR to contain a carriageReturn
for /f %%A in ('copy /Z "%~dpf0" nul') do set "CR=%%A"

:: The following will print out either:
::   - the last line if the last line is terminated by <LF>
::   - the last two lines if the last line is not terminated by <LF>
findstr /nrv /c:"!lf!.*!cr!*!lf!" file.txt

Note that FINDSTR must open the file directly. It will not work if you pipe the output of TYPE into FINDSTR.

The above technique can be used to build an efficient TAIL.BAT script that supports display of up to 25 lines on XP, and 50 lines on more modern versions of Windows. Again, one additional line will be displayed if the last line is not terminated by a linefeed. An optional /N parameter can be used to prefix the displayed lines with their line number.

Code: Select all

:: tail.bat [/N]  File  [LineCount]
::
::   Print the last LineCount lines of File.
::
::   LineCount defaults to 1 if not specified.
::
::   The maximum LineCount is 25 on XP, and 50 on later Windows versions.
::
::   If the last line of File is not terminated by LineFeed then
::   LineCount+1 lines will be printed.
::
::   If the /N option is specified, then each line is prefixed
::   with the line number.
::
@echo off
setlocal enableDelayedExpansion

:: Define LF to contain a lineFeed
set ^"LF=^

^" The empty line above is critical - DO NOT REMOVE

:: Define CR to contain a carriageReturn
for /f %%A in ('copy /Z "%~dpf0" nul') do set "CR=%%A"

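:: Optional /N switch: remember it, then shift so that File becomes %1
set "n="
if /i "%~1"=="/N" (
  set "n=n"
  shift /1
)

:: LineCount is now %2; default to 1 if omitted
set "cnt=%~2"
if not defined cnt set "cnt=1"

:: Build a regex that spans cnt line breaks: <LF> then cnt repetitions of .*<CR>*<LF>
set "search=!lf!"
for /l %%N in (1 1 %cnt%) do set "search=!search!.*!cr!*!lf!"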

findstr /rv%n% "!search!" "%~1"
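
For illustration, a hedged usage sketch of the script above. It assumes the script is saved as tail.bat somewhere on the PATH and that file.txt is a placeholder for a text file in the current directory; neither name comes from a tested setup.

```batch
@echo off
rem Hypothetical usage of tail.bat; "file.txt" is a placeholder name.

rem Print the last line (LineCount defaults to 1)
call tail.bat "file.txt"

rem Print the last 5 lines, each prefixed with its line number
call tail.bat /N "file.txt" 5
```

CALL is used so that control returns to the calling script after tail.bat finishes. As the header comments say, one extra line is printed when the file does not end with a linefeed.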


Dave Benham

Sponge Belly
Posts: 216
Joined: 01 Oct 2012 13:32
Location: Ireland

Re: findstr regex

#3 Post by Sponge Belly » 27 Mar 2014 07:03

Hi Dave,

Great idea! Why didn’t I think of that? :oops:

Only problem is, embedded CRs will trigger a false positive. That’s why I was asking if it was possible to specify a range of all characters (including CR) except for LF.

- SB

dbenham
Expert
Posts: 2461
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: findstr regex

#4 Post by dbenham » 27 Mar 2014 08:16

Good catch regarding embedded <CR>. :)

No, I am almost sure it is impossible to construct an expression that matches any character other than <LF> for the following two reasons:

1) FINDSTR does not support alternation (a|b), so the only other possibility is a character set.

2) A character set in square brackets will fail if it includes character 0xFF, regardless of whether the character is explicit or implicit via a range.


Dave Benham

penpen
Expert
Posts: 1991
Joined: 23 Jun 2013 06:15
Location: Germany

Re: findstr regex

#5 Post by penpen » 28 Mar 2014 17:18

dbenham wrote:1) FINDSTR does not support alternation (a|b), so the only other possibility is a character set.
This is wrong: findstr supports a rudimentary form of alternation.

Code: Select all

findstr /R /C:"a" /C:"b" "test.txt"
This searches for a or b within "test.txt".
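
As a small sketch of the same idea: FINDSTR also treats space-separated terms in a single search argument as alternatives, while /C: keeps a term containing spaces together. The file name test.txt is carried over from the example above, and the search terms are only placeholders.

```batch
@echo off
rem Lines containing "a" OR "b": one /C: option per term
findstr /R /C:"a" /C:"b" "test.txt"

rem Same search: space-separated terms in one argument are ORed
findstr /R "a b" "test.txt"

rem One literal term that contains a space (not an OR)
findstr /L /C:"a b" "test.txt"
```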

But you won't find a regular expression that will work as you expected, Sponge Belly:
The LF character can never be part of such a regular expression, as findstr only works on single lines (without the ending "\r\n" or "\n"). Inner '\r' characters seem to work:

Code: Select all

rem Ending \r and \r\n will be removed, inner \r seems to be processed
rem content of test.txt
rem hex: 0D 0D 0A 61 0A 62 0D 63 0D 0A 64 0D 0A
rem str: "\r\r\na\nb\rc\r\nd\r\n"
(
   for /F "tokens=* delims=" %%a in ('findstr /R /C:"^" "test.txt"') do (
      <nul set /P "=%%a"
      pause >nul
   )
)
rem Commenting out the pause and appending >"abc.txt" after the block produces:
rem "\rab\rcd" in abc.txt
So the lines recognized are "\r", "a", "b\rc", "d" (tested on Win XP only).

Since '$' (end of line) does not match at end of file (tested on Win XP only),
and provided you have no "irritating" '\r' characters and your last line does not end with "\r\n",
you could do this to get the last line:

Code: Select all

findstr /V /R /C:"$" "test.txt"
(I assume this is the best you can get with a single command.)

penpen

Edit: Removed some flaws.

dbenham
Expert
Posts: 2461
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: findstr regex

#6 Post by dbenham » 28 Mar 2014 19:01

penpen wrote:
dbenham wrote:1) FINDSTR does not support alternation (a|b), so the only other possibility is a character set.
This is wrong: findstr supports a rudimentary form of alternation.

Code: Select all

findstr /R /C:"a" /C:"b" "test.txt"
This searches for a or b within "test.txt".

I suppose you could make that argument, but I was explicitly referencing alternation that could be enclosed in parentheses as part of a larger regex. Of course, parentheses aren't even supported by FINDSTR. :(

penpen wrote:But you won't find a regular expression that will work as you expected, Sponge Belly:
The LF character can never be part of such a regular expression, as findstr only works on single lines (without the ending "\r\n" or "\n"). Inner '\r' characters seem to work: ...

This would be a logical conclusion, but it is wrong :wink:

I demonstrate how to search across line breaks at:

1) My initial response to Sponge Belly in the 2nd post of this thread

2) My extensive documentation of FINDSTR at http://stackoverflow.com/questions/8844 ... 73#8844873. See the section titled Searching across line breaks about 2/3 down in the answer.

Also, see aGerman's response to Squashman's question at viewtopic.php?p=12961#p12961


Dave Benham

penpen
Expert
Posts: 1991
Joined: 23 Jun 2013 06:15
Location: Germany

Re: findstr regex

#7 Post by penpen » 29 Mar 2014 16:23

dbenham wrote:This would be a logical conclusion, but it is wrong

I demonstrate how to search across line breaks at:

1) My initial response to Sponge Belly in the 2nd post of this thread
I would bet it wasn't a conclusion but something I had read somewhere; I never doubted or tested it... :oops: .
Sorry for that!
I hadn't noticed your first post... I don't know how that happened. Big sorry for that, too!
Especially as it seems that it was the best possible solution.
Besides: 0x0B (vertical tab) is also not matched by the "any character" wildcard (.), at least on Win XP Home.
Strange as such files may be, nothing strictly requires a text file to be free of lone 0x0D (carriage return) or 0x0B characters.
And big thanks for the links.

penpen

Sponge Belly
Posts: 216
Joined: 01 Oct 2012 13:32
Location: Ireland

Re: findstr regex

#8 Post by Sponge Belly » 29 Mar 2014 17:12

Hi Again! :-)

I’m still looking for that elusive regex. The snippet below assumes the specified file ends with a Line Feed:

Code: Select all

@echo off & setlocal enableextensions enabledelayedexpansion
(set lf=^

)

findstr /n "^" "%~1" | findstr /rv "!lf![^123456789]"

endlocal & goto :EOF


Looked promising… until I tested it on a 5.66 MB text file with 127,824 lines. Sure, it gave me the number and content of the last line, but 20 false positives as well. And the results varied from one run to the next. I think it’s a buffering issue. Can I make the second findstr in the pipe wait until the first one is finished before it starts processing?

- SB

dbenham
Expert
Posts: 2461
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: findstr regex

#9 Post by dbenham » 29 Mar 2014 20:20

Brilliant idea, but ...
:shock: You really threw me for a loop for a while. I was thinking that should not work at all.

But I think I have most everything figured out.

Your implementation is not what you think it is, but it works :lol:

1) First off, it does not matter if the last line ends with line feed for two reasons:

- The regex would work regardless of whether the last line ends with line feed

- Whenever data is piped into FINDSTR, <CR><LF> is automatically appended to the data anyway if the last line does not end with line feed.


2) Your regex is not what you think.

You want to find lines that are not followed by a new line beginning with a digit. The correct syntax is

Code: Select all

findstr /n "^" "%~1" | findstr /rv "!lf![123456789]"

Your code actually amounts to the same thing because delayed expansion requires caret literals to be escaped if the line also contains an exclamation point. You didn't escape it, so [^123456789] becomes [123456789]. The proper way to escape a quoted caret is "^^", but we really don't want the caret anyway :wink:

Note that the left hand side of the pipe works because the parser breaks the line into two at the pipe. The left side doesn't have an exclamation point, so the quoted caret is safe.
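
The caret behaviour described in point 2 can be seen in isolation with a minimal sketch. The variable x and the bracket expression are placeholders chosen for illustration; they are not part of the original script.

```batch
@echo off
setlocal enableDelayedExpansion
set "x=1"

rem No exclamation point on the line: the quoted caret is kept
rem prints "[^abc]"
echo "[^abc]"

rem An exclamation point on the line: delayed expansion consumes the caret
rem prints "[abc] 1"
echo "[^abc] !x!"

rem Doubling the quoted caret leaves one caret after delayed expansion
rem prints "[^abc] 1"
echo "[^^abc] !x!"
```

The second ECHO mirrors what happened to [^123456789] in the piped FINDSTR command: the line contains an exclamation point, so the caret is removed before FINDSTR ever sees the search string.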


3) The LF variable is expanded in the main script, before the pipe is executed. I'm pleasantly surprised this works. Behind the scenes, the FINDSTR command actually becomes:

Code: Select all

cmd /c findstr /rv "<LF>[123456789]"
I would have thought passing a quoted line feed literal would have caused problems, but apparently not :? :)

The following also works, and it makes more sense to me. The expansion is delayed until when the right side of the pipe is actually executed.

Code: Select all

@echo off
setlocal disableDelayedExpansion
(set lf=^

)

findstr /n "^" "%~1" | cmd /v:on /c findstr /rv "!lf![123456789]"
In this case, the right side becomes

Code: Select all

cmd /c cmd /v:on /c findstr /rv "!lf![123456789]"


4) And now for the odd behavior with the large file. I was able to reproduce your problem with my own 12 MB file. I too was seeing inconsistent behavior.

I believe the problem has to do with FINDSTR input line length limits when piping data. When data is piped into FINDSTR, it will fail to match any line that exceeds 8191 characters. (Note that this is a feature of FINDSTR, not of pipes in general.) I think this is exacerbated by the fact that we are attempting to search across line breaks. So even if no line exceeds 8191 characters, the buffer may be exceeded when FINDSTR attempts to read the next line. I think your idea of a pipe buffering/timing issue may also be coming into play, hence the inconsistent behavior. I don't think it would ever be a problem if we were not searching across line breaks.

FINDSTR has no line length limit when it reads a file directly. So the solution is simple. Ditch the pipe and use a temporary file. :)

Code: Select all

@echo off
setlocal enableDelayedExpansion
(set LF=^

)

findstr /n "^" "%~1" >"%~1.temp"
findstr /rv "!LF![123456789]" "%~1.temp"
del "%~1.temp"
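
A possible variation, sketched under the assumption that the folder holding the input file may not be writable: put the intermediate file in %TEMP% instead. The variable name tempFile and the naming scheme are illustrative only, and %RANDOM% does not guarantee a unique name.

```batch
@echo off
setlocal enableDelayedExpansion
(set LF=^

)

rem Hypothetical temp file name: input base name plus a random number
set "tempFile=%temp%\%~n1_%random%.tmp"

rem Pass 1: prefix every line with its line number and store the result
findstr /n "^" "%~1" >"%tempFile%"

rem Pass 2: print only the line that is not followed by another numbered line
findstr /rv "!LF![123456789]" "!tempFile!"

rem Clean up the intermediate file
del "!tempFile!"
```

As with the script above, paths containing exclamation points would still be mangled while delayed expansion is enabled.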


Dave Benham

Sponge Belly
Posts: 216
Joined: 01 Oct 2012 13:32
Location: Ireland

Re: findstr regex

#10 Post by Sponge Belly » 31 Mar 2014 03:07

Hi Dave,

Thanks for the detailed reply. I’m still trying to process your explanation. Is there no end to findstr’s enigmatic behaviour? :?:

I was trying to avoid temporary files, but I guess sometimes there’s no other practical way.

- SB
