HTML file parsing with batch

Discussion forum for all Windows batch related topics.

Moderator: DosItHelp

miskox
Posts: 564
Joined: 28 Jun 2010 03:46

HTML file parsing with batch

#1 Post by miskox » 16 Apr 2012 05:25

I did some testing (with some success) of HTML parsing with a batch program.

The idea was to first dump the HTML file (based on this viewtopic.php?p=14361#p14361 and this viewtopic.php?p=14704#p14704) and then read the dump byte by byte and do whatever we want with it.

I needed the info between

Code:

<A


and

Code:

</A


To make things a little easier I used gsar.exe (gnuwin32.sourceforge.net/packages/gsar.htm) to replace <A with 0x1 and </A with 0x2. This way I read the dump record by record, and when I reach 0x1 or 0x2 I know what to do; there is no need for additional checks on the next one or two records.

Yes, it is slow, but it works.

Of course this is just an idea with a limited implementation; maybe somebody will make a science out of it.
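
For readers who want to see the marker idea in isolation, here is a minimal sketch in Python rather than batch. It is not the batch program from this post; the function names `mark_anchors` and `spans_between_markers` and the sample input are made up for illustration. It assumes upper-case tags as written above and that the bytes 0x01 and 0x02 do not otherwise occur in the file.

```python
START, END = 0x01, 0x02  # single-byte markers, as described in the post


def mark_anchors(html: bytes) -> bytes:
    """Replace the two tag prefixes with one-byte markers (the gsar step).

    Caveat: "<A" also matches the start of other tags such as <ADDRESS,
    so a real run would need additional checks.
    """
    return html.replace(b"</A", bytes([END])).replace(b"<A", bytes([START]))


def spans_between_markers(data: bytes) -> list:
    """Scan byte by byte and collect everything between START and END."""
    spans = []
    current = None  # None means "not inside an anchor right now"
    for byte in data:
        if byte == START:
            current = bytearray()
        elif byte == END:
            if current is not None:
                spans.append(bytes(current))
            current = None
        elif current is not None:
            current.append(byte)
    return spans


# Hypothetical sample input, not taken from the original file.
sample = b'<P>x <A HREF="a.htm">First</A> y <A HREF="b.htm">Second</A></P>'
print(spans_between_markers(mark_anchors(sample)))
```

Because each marker is a single byte, the scanning loop never has to look ahead at the following records, which is the simplification described above.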

Saso

P.S. This post www.dostips.com/forum/viewtopic.php?f=3&t=3210 would help make this batch program run faster.

!k
Expert
Posts: 378
Joined: 17 Oct 2009 08:30
Location: Russia

Re: HTML file parsing with batch

#2 Post by !k » 16 Apr 2012 08:34

Code:

sed -r -n "/<[Aa]\s/{s/.*<[Aa]\s//;s/<\/[Aa]>.*//;p}" index.htm
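
For anyone without sed, the command can be read as follows: print nothing by default (-n); for every line that contains <A or <a followed by whitespace, delete everything up to and including the last such opening on the line, then delete everything from the first </A> or </a> to the end of the line, and print what is left. A rough Python equivalent is sketched below; the function name `extract_anchor_text` and the sample lines are hypothetical, and it assumes GNU sed semantics for -r and \s.

```python
import re


def extract_anchor_text(lines):
    """Mimic the sed one-liner on an iterable of text lines."""
    results = []
    for line in lines:
        line = line.rstrip("\n")  # sed works on the line without its newline
        if re.search(r"<[Aa]\s", line):
            # s/.*<[Aa]\s//  (greedy: removes through the LAST "<a " on the line)
            line = re.sub(r".*<[Aa]\s", "", line, count=1)
            # s/<\/[Aa]>.*//  (removes from the first "</a>" to end of line)
            line = re.sub(r"</[Aa]>.*", "", line, count=1)
            results.append(line)  # the trailing "p" in the sed script
    return results


# Hypothetical sample lines for illustration only.
sample_lines = [
    "<p>no link here</p>\n",
    '<li><a href="a.htm">First</a> and <a href="b.htm">Second</a></li>\n',
]
print(extract_anchor_text(sample_lines))
```

Note that, like the sed version, this works line by line and keeps only the last anchor on each line, attributes included, so an anchor that spans several lines is not handled.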

miskox
Posts: 564
Joined: 28 Jun 2010 03:46

Re: HTML file parsing with batch

#3 Post by miskox » 16 Apr 2012 09:05

!k wrote:

Code:

sed -r -n "/<[Aa]\s/{s/.*<[Aa]\s//;s/<\/[Aa]>.*//;p}" index.htm


I don't have sed installed, so please translate this to English :).

My original post lacks some information: I don't need everything that *is* between <A and </A. What I need is in there, but some other conditions have to be checked first, and only a small piece of the information is extracted.

Thanks,
Saso
