Using RegEx on a pair of lines at same time?

MicrosoftIsKillingMe
Posts: 55
Joined: 11 Dec 2017 09:08

#1 Post by MicrosoftIsKillingMe » 09 Feb 2021 22:26

I have a .BAT that reports every found "a.txt", plus directory names, via something like
dir /a /od /s *.* | FINDSTR /I /R /C:"a.txt" /C:" Directory of "
(As you may have guessed, I'm using this construct to avoid "false positives" on short file names. It's actually more complex than this, but this will suffice for my question. I actually have %1 et al. logic instead of a.txt, and /R actually does get used.)
Unfortunately, since I know of no DIR form that shows paths, I ask for both the lines matching the filespec and every " Directory of " line.
Output is like

Code: Select all

 Directory of C:\
 Directory of C:\AL
 Directory of C:\BOB
06/30/2017  03:37 AM               730 a.txt 
 Directory of C:\CAL
 Directory of C:\DOG
Can I exclude - or even just detect - consecutive lines that begin with { Directory of } ? The goal is to eliminate the first of each pair: get rid of the C:\, \AL and \CAL lines but preserve the \BOB line.

Just to attempt to identify them (excluding is another matter!) I tried appending a pipe
| findstr /R /C:"^ Dir.*$^ Dir"
seeking to report, via RegEx, lines beginning with { Dir} whose EOL (the $) is immediately followed by a BOL (the ^) and { Dir}. As expected, it fails, even without the seemingly redundant ^.

More basically, are there tricks available with two lines, such that you can seek a line with something, then "peek forward" for something on the BOL of the next line? If I must concede that FINDSTR only works on one line at a time, any other approach? Thx. Win 10, Win 7, XP.

(If not, my grotesque fallback is to redirect to a file and run a trivial VBA macro in Excel, which is always open. For this "project" my only programming avenues are batch files and VBA.)

MicrosoftIsKillingMe
Posts: 55
Joined: 11 Dec 2017 09:08

Re: Using RegEx on a pair of lines at same time?

#2 Post by MicrosoftIsKillingMe » 09 Feb 2021 23:52

Note: I found Dave's https://ss64.com/nt/findstr-linebreaks.html "FINDSTR - Searching across Line Breaks", but it is confusing where he says "I can't graphically represent the characters", and I don't understand why he shows !CR!*!LF! instead of $^. Anyway, that didn't work either: it did not report anything. Note that the DIR output puts the D in Directory at position 2, which is why I precede Dir with a space above.
" Dir".*$ Dir"
" Dir".*$^ Dir"

I'm also having a separate problem with my weak understanding of regular expressions. I thought
findstr /I /R /C:"M.*[^ ].*[ ]5"
would find the first letter M on a line, then the next nonspace after that, then the next space after that. But both [^ ] and [ ] seem to be ignored, so it just finds the first 5 after the M.

dbenham
Expert
Posts: 2461
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: Using RegEx on a pair of lines at same time?

#3 Post by dbenham » 10 Feb 2021 15:13

Best to work on file systems that have short names disabled - they generally aren't needed unless you deal with very old legacy programs. I imagine that would greatly simplify your problem. But sometimes you don't have control over the file system configuration.
MicrosoftIsKillingMe wrote:I know no DIR form that shows paths
The DIR /B option, when combined with /S, displays the full path of each file, but does not give timestamps or file sizes. If you don't need those values and/or you don't really need to sort by timestamp, then it should serve you well.

The FINDSTR search across linebreaks technique does work.

The code below finds all lines that precede a line ending with " a.txt", as well as all lines that end with " a.txt".

Code: Select all

@echo off
setlocal
::Define LF variable containing a linefeed (0x0A)
set ^"LF=^

^" ::Above blank line is critical - do not remove

::Define CR variable containing a carriage return (0x0D)
for /f %%a in ('copy /Z "%~dpf0" nul') do set "CR=%%a"

setlocal enableDelayedExpansion
dir /a /od /s *.* | findstr /i /r /c:"^ Directory of" /c:" a\.txt$" | findstr /i /r /c:"!cr!!lf!.* a\.txt$" /c:" a\.txt$"
It seems like the following change to the last line ought to work just as well. It should return all lines that precede a line beginning with a 0 or 1 (the beginning of the date field's month), as well as any line that begins with 0 or 1. But there appears to be a new FINDSTR bug that I have not documented yet. The following partially works, but gives some extra unwanted folders in my hands:

Code: Select all

REM This does not work, even though it should!
dir /a /od /s *.* | findstr /i /r /c:"^ Directory of" /c:" a\.txt$" | findstr /i /r /c:"!cr!!lf![01]" /c:"^[01]"
But I imagine there is a better way to solve your problem, if only you fully described what you are really trying to accomplish. I'm assuming a given folder may return multiple files that you are interested in. The DIR /OD will sort by last modified date only within each folder, not across folders. Maybe that is what you want, but it seems unlikely.

You probably could use WMIC to accomplish your goal, but dealing with paths and timestamps is a pain in the ass with that system.

I suspect you could get what you want very elegantly with my JREN.BAT utility. Short names do not interfere with its use. It was initially designed to provide sophisticated file/folder renaming capabilities using regular expressions, but it can also be used to search and format directory listings in most any way you want. To take advantage of that capability you must have some rudimentary JScript knowledge.

For example, the following will find all examples of "a.txt" in the folder hierarchy rooted at "c:\test", sorted by last modified date.

Code: Select all

jren "^" "ts({dt:'modified',fmt:'{iso-ts} '}) + parent() + '\\'" /list /s /p "c:\test" /fm "a.txt" /j | sort
Output would look something like

Code: Select all

2018-08-31T14:10:27.783-04:00 C:\test\viboras\a.txt
2019-04-08T10:33:58.160-04:00 C:\test\this and that\a.txt
2019-05-21T14:49:08.432-04:00 C:\test\xyz\a.txt
2020-04-06T10:51:47.538-04:00 C:\test\test\a.txt
2021-02-10T11:56:55.131-05:00 C:\test\a.txt

One obscure issue is that the sort order could be wrong if you had multiple timestamps around 2 AM on a transition day between daylight saving and standard time. That could be fixed by using UTC times instead.

Code: Select all

jren "^" "ts({dt:'modified',tz:0,fmt:'{iso-ts} '}) + parent() + '\\'" /list /s /p "c:\test" /fm "a.txt" /j | sort
--OUTPUT--

Code: Select all

2018-08-31T18:10:27.783+00:00 C:\test\viboras\a.txt
2019-04-08T14:33:58.160+00:00 C:\test\this and that\a.txt
2019-05-21T18:49:08.432+00:00 C:\test\xyz\a.txt
2020-04-06T14:51:47.538+00:00 C:\test\test\a.txt
2021-02-10T16:56:55.131+00:00 C:\test\a.txt
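The daylight-saving sort issue mentioned above can be seen without JREN. The following is only a sketch in Python, used here for illustration; the two timestamps and the fixed -04:00/-05:00 offsets are made-up values for a fall-back day, not output from any real listing. It shows that sorting local ISO strings as text can put the later file first, while sorting the UTC strings does not.

```python
from datetime import datetime, timedelta, timezone

# Assumed offsets for illustration: daylight time (-04:00) and standard time (-05:00).
EDT = timezone(timedelta(hours=-4))
EST = timezone(timedelta(hours=-5))

# Two hypothetical modification times on a fall-back transition day.
earlier = datetime(2020, 11, 1, 1, 30, tzinfo=EDT)  # 05:30 UTC
later = datetime(2020, 11, 1, 1, 10, tzinfo=EST)    # 06:10 UTC

# ISO strings as they would appear with local offsets vs. converted to UTC.
local_strings = [t.isoformat() for t in (earlier, later)]
utc_strings = [t.astimezone(timezone.utc).isoformat() for t in (earlier, later)]

# A plain text sort (what "| sort" does) compares the strings character by character.
print(sorted(local_strings))  # later timestamp sorts first: wrong order
print(sorted(utc_strings))    # chronological order preserved
```

The text sort only goes wrong when the wall-clock times overlap across the offset change, which is why the problem is limited to the hour around the transition.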
As long as your locale formats timestamps in a way that can be parsed by JScript, then you can improve performance by using dt:'fsomodified' instead of dt:'modified'. That form will truncate all timestamps to seconds (milliseconds will be .000), and is much faster. But it does not work in some locales.

One of the nice features of JREN is that you can use standard wildcards in your file mask. For example, an option of /FM "t*t.txt" specifies all .txt files whose base name begins and ends with t. Or if that is not sophisticated enough, you can use a JScript (ECMA) regular expression. For example, /RFM "^t.*t\.txt$" is the equivalent of /FM "t*t.txt".

Suppose you wanted to find all t*t.txt files, sorted by file name, then modified date. For this I pad each file name to a constant width to make it easy to read and sort.

Code: Select all

jren "^.*" "name('                            ')+ts({dt:'modified',fmt:'{iso-ts} '})+parent()+'\\'" /list /s /fm "t*t.txt" /j | sort
--OUTPUT--

Code: Select all

test.bat.txt                2020-01-29T18:13:59.689-05:00 C:\test\
test.txt                    2018-08-31T14:10:27.783-04:00 C:\test\viboras\
test.txt                    2019-04-08T10:33:58.160-04:00 C:\test\this and that\
test.txt                    2019-05-21T14:49:08.432-04:00 C:\test\xyz\
test.txt                    2020-04-06T10:51:47.538-04:00 C:\test\test\
test.txt                    2021-02-10T11:56:55.131-05:00 C:\test\
tsc_call_layout.txt         2015-06-22T15:11:24.583-04:00 C:\test\
Use JREN /? to get help, and JREN /?ts() to get help on all the many options of the ts() (timestamp) function.


Dave Benham

MicrosoftIsKillingMe
Posts: 55
Joined: 11 Dec 2017 09:08

Re: Using RegEx on a pair of lines at same time?

#4 Post by MicrosoftIsKillingMe » 10 Feb 2021 17:00

What a work of art! I'll need to come back to this when I can put in some time but I'll mention a few things -

Yes, /OD within each folder is what I'm accustomed to, and my initial intent was to see the /OD sorts grouped by folder, though the other, everything-at-once way is SOMETIMES what I prefer (which I've at least gotten, with varying frustration, from Windows search).

My full objective is to show every DIR entry for long file names that contain "a.txt", including date and size, and the directory.
Another variant I considered for later on might find exactly a.txt; but as written, it would also catch "ba.txt" with the find string.

Yes, I misspoke: DIR /B does show the path (IF /S is specified), but yes, I'd like path AND date/time AND size.

I'll try again your way with the set variables, though disappointed that simple $ and ^ don't work.
By the way, this time I can follow you better. The confusing part on the other webpage was the /G part with 0x0A bracketing which I should have just disregarded.

My undeveloped thought on my shown attempt was that if I could even identify the pair here
Directory of C:\AL
Directory of C:\BOB
... a.txt
then maybe use that logic in another pipe, so that (if possible?) I'd return the _complement_ of such matches, effectively ridding the line for AL, which I don't want since there is no AL\a.txt.
I foresaw that it might still be unachievable even if I got the identification and complement logic, as it might not resolve 3 or more consecutive " Directory of " lines.
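For what it's worth, the one-line "peek forward" rule can be pinned down in a few lines. This is a minimal sketch in Python, used only to state the logic (this project is limited to batch and VBA, so it would need porting); the function name and the sample list are hypothetical. It drops a " Directory of " line whenever the very next line is also a " Directory of " line, which copes with runs of three or more, and it leaves a trailing directory line alone because that line is not the first of a pair.

```python
def drop_first_of_directory_pairs(lines, prefix=" Directory of "):
    """Keep every line except a directory line immediately followed by another one."""
    kept = []
    for i, line in enumerate(lines):
        is_dir = line.startswith(prefix)
        # Peek at the next line, if there is one.
        next_is_dir = i + 1 < len(lines) and lines[i + 1].startswith(prefix)
        if is_dir and next_is_dir:
            continue  # first of a consecutive pair: no file was listed under it
        kept.append(line)
    return kept


# Sample taken from the output shown in post #1.
sample = [
    " Directory of C:\\",
    " Directory of C:\\AL",
    " Directory of C:\\BOB",
    "06/30/2017  03:37 AM               730 a.txt ",
    " Directory of C:\\CAL",
    " Directory of C:\\DOG",
]

for line in drop_first_of_directory_pairs(sample):
    print(line)
```

On the sample this keeps the \BOB line, the a.txt line and the final \DOG line; dropping a trailing directory line as well would need one extra check for "last line".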

Meanwhile, I'm perplexed about the question at the end of my second thread message. Even on SS I've been unable to locate a good explanation or any examples of findstr's "character class", though there's extensive discussion of ranges like [0-9]; so I even went to trying [ - ] (space dash space) and even [\ -\ ] . I was expecting that the nonrange [ ] or [\ ] would find a space (and adding a caret would find a nonspace), but nope. Scratching my head, I now note one thing I didn't do that you're doing both here and on the other page ... delayed expansion.

JREN appears impressive. I'll need to spend some time enjoying that. (I started out with something close to that using FORFILES though one snag I hit was not catching hidden and system files.)

dbenham
Expert
Posts: 2461
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: Using RegEx on a pair of lines at same time?

#5 Post by dbenham » 10 Feb 2021 20:11

You can easily add the file size to the JREN output by incorporating the size() function into the replacement value. You can left pad the number to a constant width by passing in a string of spaces with the desired width. For example:

Code: Select all

size("               ")
pads the size to length 15 (no commas)

FINDSTR character classes work mostly as you expect. Your problem is that "[ ]" is treated as two separate search expressions, "[" and "]", because by default FINDSTR supports multiple expressions delimited by spaces. If you want to include a space as a literal, then you must use the /C:search option. That defaults to a literal search, but you can override that with the /R option to force a regular expression search.

So FINDSTR /R /C:"[ ]" will indeed match a space, and FINDSTR /R /C:"[^ ]" will match any character other than space.
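To make the splitting concrete, here is a rough model in Python. It is only an illustration: findstr_default_terms is a made-up helper that mimics the "delimited by space" rule described above, and Python's re module is not FINDSTR's engine. The two class checks at the end show what "[ ]" and "[^ ]" mean once the pattern reaches the regex engine intact.

```python
import re


def findstr_default_terms(search):
    """Hypothetical model: without /C, the search string is split on spaces."""
    return search.split()


# Without /C, "[ ]" falls apart into two separate search terms.
print(findstr_default_terms("[ ]"))  # ['[', ']']

# Kept whole (as /C: does), the class behaves as expected in a regex engine.
print(re.search(r"[ ]", "a b") is not None)  # a space is found
print(re.search(r"[^ ]", "   ") is None)     # nothing but spaces, so no nonspace match
```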

The other gotcha is described in my StackOverflow post that includes a discussion of class ranges. They are not collated as you expect. The SO post gives the sequence, but it is generally simpler and more reliable to explicitly list all characters in the range.

For example, [A-Z] does not represent upper case letters only. For that you would need to use [ABCDEFGHIJKLMNOPQRSTUVWXYZ].

Even [0-9] is not as you would expect because it also matches some simple fractional value characters. Though that does not typically cause problems, the precise expression should be [0123456789].
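Typing those explicit lists by hand invites typos, so they can be generated instead. A small sketch in Python, for illustration only; the helper name explicit_class is made up, and it is only meant for plain letters and digits (characters such as ], ^, - or \ would need extra care inside a class).

```python
import string


def explicit_class(chars):
    """Build a character class that lists every character instead of using a range."""
    return "[" + chars + "]"


upper = explicit_class(string.ascii_uppercase)
digits = explicit_class(string.digits)

print(upper)   # [ABCDEFGHIJKLMNOPQRSTUVWXYZ]
print(digits)  # [0123456789]
```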


Dave Benham

MicrosoftIsKillingMe
Posts: 55
Joined: 11 Dec 2017 09:08

Re: Using RegEx on a pair of lines at same time?

#6 Post by MicrosoftIsKillingMe » 11 Feb 2021 02:32

Just a quick note, suddenly underwater here on other projects...

Thank you for your [as always] steady, precise and thorough, and insightful performance. What can I say.

I solved my issue with character class -- user error, surprise, surprise. And I can explain why I reached the silly-sounding conclusion that [ ] was being ignored.
echo "AM 730 a.txt" | findstr /I /R /C:"M.*[^ ].*[ ]7*"
produces
AM 730 a.txt
What?! So I interpreted it as a character class failure. I read my /C string as saying: find the M, then a nonspace (the 7), then a space (the one after the a) and then a 7 - or in fact, any number of 7s (7*). Since it [undesirably] returns the string, it appeared as though it was just ignoring my [^ ] and [ ] and just finding the first 7.

I now conclude that my downfall was the brain interpreting "0 or more" as "1 or more". As I NOW read it, "7*" or even "2*" will always "match". Now I wonder why one would ever bother to use * in RegEx except following a period - right? (And if so, why even require the period anyway, vs. doing it like filespecs!) At any rate, removing my * at the end gives the expected results, or changing "*" to ".*" (though ending the /C string with ".*" seems redundant).

Reading "zero or more" as "one or more" is a mistake I've made often and for years. My brain wants to think 7* means any number of 7s in RegEx. Sorry to make you suffer for my little disability there. By the way, my brain has to stop and ponder whenever a RegEx situation calls for ".*" when my inclination is just to say "*". That likely originated with filespecs, where my understanding is that "*" means what ".*" means in RegEx, while specifying a period with filespecs refers to a 'literal' period. I suppose my issue would never have arisen if I could just train myself on a foolproof mental rule: use * in filespecs; use .* in RegEx. (Unless, as I'm bracing to hear, my conclusion is not strictly true.)
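The "zero or more" trap is easy to reproduce outside FINDSTR. The sketch below uses Python's re module purely as a stand-in (it is a different engine, but *, . and the [ ] and [^ ] classes behave the same way for this case). It runs the pattern from the echo test with and without the trailing *.

```python
import re

line = "AM 730 a.txt"

# Trailing 7* may match zero 7s, so the pattern succeeds on the space before "a".
with_star = re.search(r"M.*[^ ].*[ ]7*", line)
print(repr(with_star.group()))  # 'M 730 '

# Without the *, a literal 7 must follow the space, and a nonspace must sit
# between the M and that space. Nothing in the line satisfies both.
without_star = re.search(r"M.*[^ ].*[ ]7", line)
print(without_star)  # None

# A lone "2*" matches the empty string, so it "finds" something in any line.
print(repr(re.search(r"2*", line).group()))  # ''
```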

MicrosoftIsKillingMe
Posts: 55
Joined: 11 Dec 2017 09:08

Re: Using RegEx on a pair of lines at same time?

#7 Post by MicrosoftIsKillingMe » 11 Feb 2021 02:47

BTW, the reason I ended up with "7*" anyway was from a batch %1. The user went
foo.bat 7*
under the understandable notion that it would return anything with a 7 in it.

Methinks I'll need to disallow asterisks in %1 for this project. There's too much risk that the user will either use .* when * is necessary, or vice versa. If I were to allow them, I suppose I would want to alter "any asterisk following a nonperiod" to "a {period and asterisk} following that nonperiod", but I'm inclined to chicken out by disallowing them.
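If the asterisks were allowed after all, the rewrite rule could look roughly like this. It is a sketch in Python only to state the rule precisely (the real thing would have to be done in batch or VBA), and the function name is hypothetical. One assumption beyond the rule as worded: an asterisk at the very start of the argument is also converted, since nothing precedes it. Literal periods in the argument (as in a.txt) are left untouched, so they would still act as regex wildcards.

```python
import re


def filespec_star_to_regex(arg):
    """Turn each * that is not already preceded by a period into .* """
    # (?<!\.) is a negative lookbehind: the character before the * must not be a period.
    return re.sub(r"(?<!\.)\*", ".*", arg)


print(filespec_star_to_regex("7*"))    # 7.*
print(filespec_star_to_regex("a.*b"))  # a.*b  (already regex style, unchanged)
print(filespec_star_to_regex("*x"))    # .*x
```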
