Clean html source file

Discussion forum for all Windows batch related topics.

Moderator: DosItHelp

darioit
Posts: 230
Joined: 02 Aug 2010 05:25

Clean html source file

#1 Post by darioit » 22 Dec 2012 10:04

Is it possible to clean an HTML file from:

H:\www.google.com\list.html:<a href="//http://cdimage.debian.org/debian-cd/6.0.6/i386/iso-cd/debian-6.0.6-i386-CD-1.iso" target="_blank">http://cdimage.debian.org/debian-cd/6.0.6/i386/iso-cd/debian-6.0.6-i386-CD-1.iso.html</a>
H:\www.google.com\list.html:<a href="//http://cdimage.debian.org/debian-cd/6.0.6/i386/iso-cd/debian-6.0.6-i386-CD-2.iso" target="_blank">http://cdimage.debian.org/debian-cd/6.0.6/i386/iso-cd/debian-6.0.6-i386-CD-2.iso.html</a>


to:

http://cdimage.debian.org/debian-cd/6.0.6/i386/iso-cd/debian-6.0.6-i386-CD-1.iso
http://cdimage.debian.org/debian-cd/6.0 ... 6-CD-2.iso


Regards
Dario

abc0502
Posts: 1007
Joined: 26 Oct 2011 22:38
Location: Egypt

Re: Clean html source file

#2 Post by abc0502 » 22 Dec 2012 10:11

Is the file exactly like that, or is this just a sample of the HTML file?

If you are willing to use a third-party tool, you can do that with regular expressions in sed,
or consider another language such as Python with the help of the "BeautifulSoup" library.

Re: Clean html source file

#3 Post by darioit » 22 Dec 2012 10:17

What I need is exactly like that example, but with many lines.

I can also use sed or awk to manipulate strings.

Re: Clean html source file

#4 Post by abc0502 » 22 Dec 2012 11:48

If, as you say, it is just like this but with many lines, it can be done using only batch files. Just one question:
is the text "H:\www.google.com\list.html:" part of the content of the HTML file or not?

And is it all on one line? I mean, is each pair of lines above actually one line?

Re: Clean html source file

#5 Post by abc0502 » 22 Dec 2012 12:21

This code assumes that each pair of lines is in fact one line.

Change the "file" variable (on the 2nd line of the script) to match the location of your file.

Code:

@echo off
Set "file=D:\test\file.txt"
Setlocal EnableDelayedExpansion
For /F "tokens=2 delims= " %%a in ('TYPE "%file%"') Do (
   Set "line=%%a"
   Echo !line:~8,-1!>>Modified_list.txt
   )
pause
Here we use the space as the delimiter, then remove the first 8 characters and the trailing quote.
This generates a list of the links in a text file in the current directory (normally the folder containing the batch file).

Re: Clean html source file

#6 Post by darioit » 22 Dec 2012 12:56

abc0502 wrote: If, as you say, it is just like this but with many lines, it can be done using only batch files. Just one question:
is the text "H:\www.google.com\list.html:" part of the content of the HTML file or not?

And is it all on one line? I mean, is each pair of lines above actually one line?


That is the name of the source file, "H:\www.google.com\list.html". Before that I ran:

findstr.exe "\<iso*" C:\http_source.txt > list.txt

Re: Clean html source file

#7 Post by darioit » 22 Dec 2012 12:59

Sorry, my goal is to clean this raw line:

H:\www.google.com\list.html:<a href="//http://cdimage.debian.org/debian-cd/6.0.6/i386/iso-cd/debian-6.0.6-i386-CD-1.iso" target="_blank">http://cdimage.debian.org/debian-cd/6.0.6/i386/iso-cd/debian-6.0.6-i386-CD-1.iso.html</a>

to this:

http://cdimage.debian.org/debian-cd/6.0 ... 6-CD-1.iso

without
H:\www.google.com\list.html:<a href="

and without
" target="_blank">http://cdimage.debian.org/debian-cd/6.0.6/i386/iso-cd/debian-6.0.6-i386-CD-1.iso.html</a>


But the problem is that the positions are not fixed; the length can change on each line.
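
Because the lengths vary, fixed character offsets are fragile. A minimal sketch of one alternative, assuming every line has the shape shown above: split on the double quote instead of the space, so the href value is always token 2 whatever its length. The input name list.txt comes from the findstr command earlier in the thread; the output name links.txt is a placeholder.

```bat
@echo off
setlocal EnableDelayedExpansion
rem Assumed file names; adjust to your setup.
set "infile=list.txt"
set "outfile=links.txt"
rem Start with an empty output file.
type nul > "%outfile%"
rem Use the double quote as the delimiter. Token 1 is everything up to
rem the opening quote of href, token 2 is the href value itself.
for /f tokens^=2^ delims^=^" %%a in ('type "%infile%"') do (
    set "url=%%a"
    rem Drop a leading // if present, as in the sample line.
    if "!url:~0,2!"=="//" set "url=!url:~2!"
    >>"%outfile%" echo(!url!
)
endlocal
```

Lines without a double quote produce no token 2 and are skipped. URLs containing an exclamation mark would be mangled by delayed expansion; this sketch does not handle that case.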

Re: Clean html source file

#8 Post by darioit » 22 Dec 2012 13:08

Now it works better:

call clean.bat fileinp.txt fileout.txt

Code:

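@echo off
rem Usage: clean.bat fileinp.txt fileout.txt
rem %~1 and %~2 strip any surrounding quotes from the arguments.
Set "file=%~1"
Setlocal EnableDelayedExpansion
rem Token 2 of each space-delimited line is the href attribute.
For /F "tokens=2 delims= " %%a in ('TYPE "%file%"') Do (
   Set "line=%%a"
   rem Skip the first 6 characters and drop the trailing quote.
   Echo !line:~6,-1!>>"%~2"
   )
pause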


Thanks to all
Regards
Dario

foxidrive
Expert
Posts: 6031
Joined: 10 Feb 2012 02:20

Re: Clean html source file

#9 Post by foxidrive » 22 Dec 2012 17:03

This free tool is useful for extracting URLs:

Geturls.exe

Version 1.0, Copyright (C)2001 Frank P. Westlake
Extracts URL's beginning with "http://" or "ftp://" from a stream and
prints to STDOUT.

program | GetURLs [/e:"string"] [/s:"string"]

/e Additional ending delimiters.
/s An additional beginning string to match.

The URL is expected to be ended with white space, a double quote, one of the
angle brackets '<' or '>', or one of the characters in the string identified by
the /e switch.

Squashman
Expert
Posts: 4488
Joined: 23 Dec 2011 13:59

Re: Clean html source file

#10 Post by Squashman » 22 Dec 2012 17:37

Frank has made a lot of good tools over the years.

Re: Clean html source file

#11 Post by foxidrive » 22 Dec 2012 19:36

+1 on that
