Discussion forum for all Windows batch related topics.
Moderator: DosItHelp
-
darioit
- Posts: 230
- Joined: 02 Aug 2010 05:25
#1
Post
by darioit » 22 Dec 2012 10:04
Is it possible to clean an HTML file, going from:
H:\www.google.com\list.html:<a href="//http://cdimage.debian.org/debian-cd/6.0.6/i386/iso-cd/debian-6.0.6-i386-CD-1.iso" target="_blank">http://cdimage.debian.org/debian-cd/6.0.6/i386/iso-cd/debian-6.0.6-i386-CD-1.iso.html</a>
H:\www.google.com\list.html:<a href="//http://cdimage.debian.org/debian-cd/6.0.6/i386/iso-cd/debian-6.0.6-i386-CD-2.iso" target="_blank">http://cdimage.debian.org/debian-cd/6.0.6/i386/iso-cd/debian-6.0.6-i386-CD-2.iso.html</a>
to:
Regards
Dario
-
abc0502
- Posts: 1007
- Joined: 26 Oct 2011 22:38
- Location: Egypt
#2
Post
by abc0502 » 22 Dec 2012 10:11
Is the file only like that, or is this just a sample of the HTML file?
If you are willing to use a 3rd-party tool, you can use regular expressions with SED to do that,
or consider another language such as Python with the help of the "BeautifulSoup" library.
-
darioit
- Posts: 230
- Joined: 02 Aug 2010 05:25
#3
Post
by darioit » 22 Dec 2012 10:17
What I need is exactly like that example, but with many lines.
I can also use Sed or awk to manipulate strings.
-
abc0502
- Posts: 1007
- Joined: 26 Oct 2011 22:38
- Location: Egypt
#4
Post
by abc0502 » 22 Dec 2012 11:48
If it is as you say, just like this but with many lines, it can be done using only a batch file. Just one question:
is the text "H:\www.google.com\list.html:" part of the content of the HTML file or not?
And is it all on one line? I mean, is each pair of lines above really one line?
-
abc0502
- Posts: 1007
- Joined: 26 Oct 2011 22:38
- Location: Egypt
#5
Post
by abc0502 » 22 Dec 2012 12:21
This code assumes that each pair of lines above is in fact one line.
Change the "file" variable (2nd line) to match the location of your file.
Code:
@echo off
Set "file=D:\test\file.txt"
Setlocal EnableDelayedExpansion
rem Split each line on spaces; token 2 is href="//http://...iso"
For /F "tokens=2 delims= " %%a in ('TYPE "%file%"') Do (
Set "line=%%a"
rem Skip the first 8 characters (href="//) and drop the closing quote
Echo !line:~8,-1!>>Modified_list.txt
)
pause
Here we use the space as the delimiter, then remove the first 8 characters and the closing quote.
This generates a list of the links in a text file in the current directory (normally the folder the batch file is run from).
-
darioit
- Posts: 230
- Joined: 02 Aug 2010 05:25
#6
Post
by darioit » 22 Dec 2012 12:56
abc0502 wrote:If it is as you say, just like this but with many lines, it can be done using only a batch file. Just one question:
is the text "H:\www.google.com\list.html:" part of the content of the HTML file or not?
And is it all on one line? I mean, is each pair of lines above really one line?
"H:\www.google.com\list.html" is the source file. Beforehand I ran:
findstr.exe "\<iso*" C:\http_source.txt > list.txt
-
darioit
- Posts: 230
- Joined: 02 Aug 2010 05:25
#7
Post
by darioit » 22 Dec 2012 12:59
Sorry, my goal is to clean this line:
H:\www.google.com\list.html:<a href="//http://cdimage.debian.org/debian-cd/6.0.6/i386/iso-cd/debian-6.0.6-i386-CD-1.iso" target="_blank">http://cdimage.debian.org/debian-cd/6.0.6/i386/iso-cd/debian-6.0.6-i386-CD-1.iso.html</a>
to this:
http://cdimage.debian.org/debian-cd/6.0 ... 6-CD-1.iso
without
H:\www.google.com\list.html:<a href="
and without
" target="_blank">http://cdimage.debian.org/debian-cd/6.0.6/i386/iso-cd/debian-6.0.6-i386-CD-1.iso.html</a>
But the problem is that the positions are not fixed; the length can change on each line.
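Since the length varies, one option is to split each line on the double quote instead of counting characters, so the href value is picked up whatever its length. A minimal sketch, assuming every line looks like the sample above; list.txt and urls.txt are placeholder names, not names from this thread:

```batch
@echo off
rem Sketch only: src and dst are placeholder file names.
setlocal EnableDelayedExpansion
set "src=list.txt"
set "dst=urls.txt"

rem Splitting on " makes token 2 the href value, e.g. //http://...iso
for /f usebackq^ tokens^=2^ delims^=^" %%a in ("%src%") do (
    set "url=%%a"
    rem Drop the leading // when it is present.
    if "!url:~0,2!"=="//" set "url=!url:~2!"
    >>"%dst%" echo(!url!
)
endlocal
```

Because delayed expansion is on while %%a is expanded, a URL containing ! would lose that character; the sample URLs have none.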
-
darioit
- Posts: 230
- Joined: 02 Aug 2010 05:25
#8
Post
by darioit » 22 Dec 2012 13:08
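Now it's better:
call clean.bat fileinp.txt fileout.txt
Code:
@echo off
rem %~1 = input file, %~2 = output file (surrounding quotes stripped)
Set "file=%~1"
Setlocal EnableDelayedExpansion
For /F "tokens=2 delims= " %%a in ('TYPE "%file%"') Do (
Set "line=%%a"
rem Skip the first 6 characters (href=") and drop the closing quote
Echo !line:~6,-1!>>"%~2"
)
pause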
Thanks to all
Regards
Dario
-
foxidrive
- Expert
- Posts: 6031
- Joined: 10 Feb 2012 02:20
#9
Post
by foxidrive » 22 Dec 2012 17:03
This free tool is useful for extracting URLs:
Geturls.exe
Version 1.0, Copyright (C)2001 Frank P. Westlake
Extracts URL's beginning with "http://" or "ftp://" from a stream and
prints to STDOUT.
program | GetURLs [/e:"string"] [/s:"string"]
/e Additional ending delimiters.
/s An additional beginning string to match.
The URL is expected to be ended with white space, a double quote, one of the
angle brackets '<' or '>', or one of the characters in the string identified by
the /e switch.
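Going by the usage line above, the tool reads from a pipe, so a call might look like the sketch below. It assumes Geturls.exe is in the current folder or on the PATH; list.txt and urls.txt are placeholder names.

```batch
@echo off
rem Feed the file to GetURLs through a pipe and save what it prints.
type "list.txt" | Geturls.exe > "urls.txt"
```

Each sample line carries the address twice (once in href and once as the link text), so both copies would presumably show up in the output.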
-
Squashman
- Expert
- Posts: 4488
- Joined: 23 Dec 2011 13:59
#10
Post
by Squashman » 22 Dec 2012 17:37
Frank has made a lot of good tools over the years.