JREPL.BAT help


sochuffed
Posts: 5
Joined: 20 Dec 2016 23:41

JREPL.BAT help

#1 Post by sochuffed » 21 Dec 2016 18:46

Hey guys,

So I'm using jrepl to run a find and replace on a batch of files using regular expressions. This is what I have:

Code:

@echo off
echo begin find and replace


for %%F in (*.txt) do (
  call jrepl "\r\n|STUDENT: [^\r\n]{1,200}|_[A-z].{1,300}|-{2,2000}"^
             "\r\n\r\n|STUDENT:||" /m /x /t "|" /f "%%F" /o -
)


:end
echo Press any key to exit.
pause > nul


The bulk of which Dave Benham gave me, though I've added my own little bits to force the .bat to pause instead of automatically closing so I can see the error messages and such that pop up.

Basically, it seems to only want to run '-{2,2000}' replaced with nothing - the rest of the regex doesn't seem to work.

If I try and edit it down to just running the first (\r\n replaced with \r\n\r\n) I get "JScript runtime error: Mismatched search and replace /T expressions" for both the files I'm testing on.

The find and replaces I'm trying to run, for better readability, are as follows:

FIND: \r\n
REPLACE: \r\n\r\n

FIND: STUDENT: [^\r\n]{1,200}
REPLACE: STUDENT:

FIND: _[A-z].{1,300}
REPLACE: [nothing]

FIND: -{2,2000}
REPLACE: [nothing]

These are designed to add double paragraph breaks, remove any student dialogue (these are transcripts of classroom recordings), find and replace essentially the first 300 characters (which is a bunch of header stuff we don't need), and then remove any dashes that aren't singular, up to 2000 in a row (also part of the header/footer of the original documents, avoiding removal of dashes in hyphenated words).

Hope all of this makes sense. Ask as many questions as you need, but I'm totally lost. (cue internal rage at employers for expecting me to be able to do things I have no training in).
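A side note on the `[A-z]` class used in the third search: the range runs by character code, so it also takes in the six punctuation characters that sit between `Z` and `a`. The short Python sketch below only illustrates that range behaviour (Python's `re` is a stand-in here, not JREPL itself); `[A-Za-z]` is the usual letters-only spelling.

```python
import re

loose = re.compile(r"[A-z]")      # range from "A" (0x41) through "z" (0x7A)
strict = re.compile(r"[A-Za-z]")  # letters only

# Every character whose code lies between "A" and "z" inclusive
candidates = [chr(code) for code in range(ord("A"), ord("z") + 1)]

# Characters accepted by the loose class but not by the strict one
extra = [ch for ch in candidates if loose.fullmatch(ch) and not strict.fullmatch(ch)]
print(extra)
```

Since `_` is one of the extra characters, `_[A-z]` would also match a double underscore.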

sochuffed
Posts: 5
Joined: 20 Dec 2016 23:41

Re: JREPL.BAT help

#2 Post by sochuffed » 21 Dec 2016 22:45

Okay so for some reason the above works now, aside from '_[A-z].{1,300}'. If this were run in Notepad++, '. matches newline' would be ticked so the search can find everything that runs over a line break, not just individual lines. Currently this is removing any individual line starting with _ up to 300 characters, but I need it to remove anything INCLUDING line breaks up to 300 characters.

Alternatively, I can remove this and write in something that will simply remove the first 8 lines of my files, which I'm trying to work out but, as stated, I have no training in this. I legit don't know what I'm doing. I'm wading through a mess my brain can't comprehend 99% of the time!
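For anyone reading later, the '. matches newline' difference can be sketched in Python. This is only an illustration of the regex behaviour, not a JREPL command: the sample text is made up, it uses plain `\n` line endings to keep the two engines comparable, and `[\s\S]` is the common workaround for engines such as JScript that have no "dot matches newline" switch. In JREPL the whole file would also have to be searched as one string (the /m option in my first post) for a match to cross lines.

```python
import re

# Hypothetical sample: a two-line header followed by the body
text = "_Header one\nline two\nBody"

# "." stops at the line break, so only the first line is removed
line_bound = re.sub(r"_[A-Za-z].{1,300}", "", text)

# "[\s\S]" matches any character including line breaks, so the match
# runs on for up to 300 characters and here swallows the whole sample
any_char = re.sub(r"_[A-Za-z][\s\S]{1,300}", "", text)

print(repr(line_bound))
print(repr(any_char))
```

The second result shows why the upper bound matters: a greedy `[\s\S]{1,300}` always takes as many characters as it can, header or not.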

dbenham
Expert
Posts: 2461
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: JREPL.BAT help

#3 Post by dbenham » 22 Dec 2016 06:44

Give this a try:

Code:

@echo off
echo begin find and replace

for %%F in (*.txt) do (
  call jrepl "^" "" /k 0 /exc 1:8 /f "%%F" /o -
  call jrepl "--+|STUDENT:.{1,200}|$" "||\r\n" /x /t "|" /f "%%F" /o -
)

echo Press any key to exit
pause >nul

The first JREPL line removes the first 8 lines. The second one does the rest.

Rather than using multi-line mode to match \r\n and double it, I use line mode, match on $ (the end of the line), and replace it with \r\n. The original \r\n of the line is preserved, so you get your doubling.
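A rough Python sketch of that line-mode idea, for illustration only. It is not how JREPL is implemented: the function names are made up, and pairing each alternative with its own replacement through capture groups is an assumption about what the "|"-delimited /T lists amount to.

```python
import re

# One search term per replacement, in the same order as the "|" lists
SEARCHES = [r"--+", r"STUDENT:.{1,200}", r"$"]
REPLACEMENTS = ["", "", "\r\n"]

# Wrap each term in its own group so we can tell which one matched
PATTERN = re.compile("|".join(f"({term})" for term in SEARCHES))

def translate_line(line: str) -> str:
    """Edit one line that has already had its line terminator removed."""
    def pick(match: re.Match) -> str:
        # lastindex is the 1-based number of the group that matched
        return REPLACEMENTS[match.lastindex - 1]
    return PATTERN.sub(pick, line)

def translate_text(text: str) -> str:
    """Line mode: edit each line, then write the original "\r\n" back."""
    return "".join(translate_line(line) + "\r\n" for line in text.splitlines())

sample = "one -- two\r\nSTUDENT: hi there\r\nthree\r\n"
print(repr(translate_text(sample)))
```

Because `$` matches the empty position at the end of every line, each line gains one extra \r\n on top of the terminator it already had, which is where the doubling comes from.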

I simplified "-{2,2000}" to "--+", which matches any string of two or more hyphens.

The following code does the same thing, but may be a bit faster:

Code:

@echo off
echo begin find and replace

for %%F in (*.txt) do (
  rem The second JREPL reads the piped output of the first, so it takes no /f option
  jrepl "^" "" /k 0 /exc 1:8 /f "%%F" | jrepl "--+|STUDENT:.{1,200}|$" "||\r\n" /x /t "|" /o "%%F.new"
  move /y "%%F.new" "%%F" >nul
)

echo Press any key to exit
pause >nul


Dave Benham

sochuffed
Posts: 5
Joined: 20 Dec 2016 23:41

Re: JREPL.BAT help

#4 Post by sochuffed » 22 Dec 2016 16:48

You, sir, are a life saver. I'd been blundering around trying to write this stuff for about 2 weeks and you've helped me finish it off in essentially 2 days. And I appreciate that you explain everything as you go - helps me make the tweaks I need to get it working exactly how I want it to. So thank you!

My only problem now is that when I insert it into my docx to txt batch conversion file the output is ANSI and not UTF-8 like it was previously.

And now they've given me a much bigger job. Seriously, I'm a captioning coordinator not a programmer! :lol:

sochuffed
Posts: 5
Joined: 20 Dec 2016 23:41

Re: JREPL.BAT help

#5 Post by sochuffed » 22 Dec 2016 20:37

Here's what I have in my .bat:

Code:

@echo off
setlocal enabledelayedexpansion

:: Confirm pandoc is installed, or check for a drop-in "pandoc.exe"
WHERE pandoc >nul 2>nul
IF %ERRORLEVEL% NEQ 0 (
   IF EXIST pandoc.exe (
      set pandoc=pandoc.exe
   ) ELSE (
      echo ERROR: Pandoc is not installed.
      echo  - Installation: http://pandoc.org/installing.html
      echo  - Or drop pandoc.exe into this directory
      goto:end
   )
) ELSE (
   set pandoc=pandoc
)

echo %pandoc%

:: Script welcome message
echo ------------------------------------------------------------
echo Press any key to convert "input/*.docx" to "output/*.txt"
echo ------------------------------------------------------------
pause > nul
echo.

:: We'll count the number of files converted
set /A Counter=1

:: Loop input files and convert
echo File conversions started at %date% %time%
echo.
for /r %%i in (input/*.docx) do (
   %pandoc% -t plain -s "%cd%\input\%%~nxi" -o "%cd%\output\%%~ni.txt"
   echo !Counter!. Converting "%%~nxi"
   set /A Counter+=1
)


for %%F in (output\*.txt) do (
  call jrepl "^" "" /k 0 /exc 1:8 /f "%%F" /o -
  call jrepl ":|\[[A-z]{1,100}\]?.|$|--+" "||\r\n|" /x /t "|" /f "%%F" /o -
)


echo.

:: Completion message
echo ------------------------------------------------------------
echo File conversions completed at %date% %time%
goto:end


:end
echo Press any key to exit.
pause > nul


It converts from docx to txt using pandoc (which uses UTF-8), then runs the find and replace that I'd been working on (different to the ones I had help with, but copied the formula obviously).

If I've not worked out how to make it stay utf-8 by the end of the day, I'm not back at work until Jan 3 so they can wait or do it themselves! :) but any input from anyone would be greatly appreciated.

dbenham
Expert
Posts: 2461
Joined: 12 Feb 2011 21:02
Location: United States (east coast)

Re: JREPL.BAT help

#6 Post by dbenham » 22 Dec 2016 21:40

JREPL only knows how to work with ANSI (extended ASCII). It does not understand UTF-8. That being said, UTF-8 and ANSI are generally (almost) compatible. However, any single multi-byte UTF-8 character will be interpreted as a sequence of multiple extended ASCII characters. This could throw off any regular expression term that has the potential to match a non-ASCII character.

But I think your most likely problem is that the first JREPL call, which removes the first 8 lines, also removes the UTF-8 BOM at the beginning of the file. The BOM can be re-inserted by simply adding the following 3rd JREPL call within your loop:

Code:

call jrepl "^" "\xEF\xBB\xBF" /x /inc 1 /f "%%F" /o -
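The `\xEF\xBB\xBF` in that call is the three-byte UTF-8 BOM. A small Python sketch of why dropping leading lines loses it and how prepending restores it; the helper names and sample bytes are hypothetical, and it assumes the file started with a BOM in the first place.

```python
import codecs

BOM = codecs.BOM_UTF8  # the bytes EF BB BF

def drop_first_lines(data: bytes, count: int) -> bytes:
    """Remove the first `count` lines; a leading BOM goes with line 1."""
    lines = data.splitlines(keepends=True)
    return b"".join(lines[count:])

def ensure_bom(data: bytes) -> bytes:
    """Prepend the BOM unless the data already starts with it."""
    return data if data.startswith(BOM) else BOM + data

original = BOM + b"header\r\nbody\r\n"   # made-up file contents
trimmed = drop_first_lines(original, 1)  # BOM is gone along with the header
restored = ensure_bom(trimmed)

print(trimmed.startswith(BOM), restored.startswith(BOM))
```

The BOM sits in front of the first character of line 1, so anything that discards line 1 discards it too.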

You seem to have removed the STUDENT: search, and replaced it with some others. I suspect that some of the regular expressions you are currently using are not exactly what you want. If you post some example input text, and what you want the final result to be, then I might be able to suggest some improvements.


Dave Benham

sochuffed
Posts: 5
Joined: 20 Dec 2016 23:41

Re: JREPL.BAT help

#7 Post by sochuffed » 22 Dec 2016 22:03

I've probably confused you seeing as I'm writing two separate files for two separate find and replace jobs.

But it's fine! I sorted it out myself by writing in some extra lines.

This is what was messing me up:

Code:

\[[A-z]{1,100}\]?.


Remove the ?. and it remained utf-8, so I just had to find another way to get it to remove everything between brackets including potential standalone characters outside the closed bracket.
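One plausible explanation, not confirmed anywhere in this thread: if the `.` landed on the first byte of a multi-byte UTF-8 character, removing that single byte would leave an orphaned byte behind and the file would no longer be valid UTF-8. The Python sketch below illustrates the idea under loud assumptions: the "ANSI" code page is taken to be Windows-1252, the sample text is invented, and the pattern is a simplified version of mine.

```python
import re

ANSI = "cp1252"  # assumption: the ANSI code page is Windows-1252

original = "[laughs]é rest"            # hypothetical transcript fragment
utf8_bytes = original.encode("utf-8")  # "é" becomes the two bytes C3 A9

# A byte-oriented tool sees one character per byte
as_ansi = utf8_bytes.decode(ANSI)

# "\]." removes the bracket plus ONE following character, i.e. one byte
edited = re.sub(r"\[[A-Za-z]{1,100}\].", "", as_ansi)
result_bytes = edited.encode(ANSI)

def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(as_ansi, result_bytes, is_valid_utf8(result_bytes))
```

If that is what happened, an editor opening the result would fall back to ANSI because the leftover byte cannot be read as UTF-8.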

It may be messy but I did the following:

Code:

  call jrepl "^" "" /k 0 /exc 1:8 /f "%%F" /o -
  call jrepl ":|$|--+" "|\r\n|" /x /t "|" /f "%%F" /o -
  call jrepl "\[[A-z]{1,100}\]\.|\?" "" /x /t ":" /f "%%F" /o -
  call jrepl "\[[A-z]{1,100}\]" "" /x /t "|" /f "%%F" /o -


I've been working with transcripts of classroom recordings. One of the conversions needed to be suitable for an internal program we have that aligns transcripts to dialogue and produces a rough caption file, which has to be UTF-8 and can't have anything in brackets or any weird standalone characters attributed to a speaker. The second conversion needed to be a simple, clean transcript that removed all student dialogue (for legal reasons) and improved aesthetics/readability. The student dialogue was the one I was initially asking for help with and I've edited it to do the other task too, which is the one that presented the UTF-8 issue.

There's probably a cleaner way to write this, but to be honest, it does what I want it to do AND I understand why it does what it does so that's all I need!

Thanks again for all your help. No doubt I'll be back with 10000 more questions in the new year.
