please help extracting links from a source-code of html file

bars143
Posts: 87
Joined: 01 Sep 2013 20:47

#1 Post by bars143 » 25 Sep 2013 02:27

Hi to all the experts online,

Can you help me create a .bat script that uses this partial one-line HTML source code:


</div><div>The following commands are available in <b>FBCMD</b>: </div><br /><div><table border="1" bordercolor="#888" cellspacing="0" style="border-collapse:collapse;border-top-color:rgb(136,136,136);border-right-color:rgb(136,136,136);border-bottom-color:rgb(136,136,136);border-left-color:rgb(136,136,136);border-top-width:1px;border-right-width:1px;border-bottom-width:1px;border-left-width:1px"><tbody><tr><td style="width:117px;height:19px"> <a href="http://fbcmd.dtompkins.com/commands/addalbum">ADDALBUM</a></td><td style="width:530px;height:19px"> Create a new photo album</td></tr><tr><td> <a href="http://fbcmd.dtompkins.com/commands/addperm">ADDPERM</a></td><td> (Launch a website to) grant <b>FBCMD</b> extended permissions.</td></tr><tr><td style="width:117px;height:19px"> <a href="http://fbcmd.dtompkins.com/commands/addpic">ADDPIC</a></td><td style="width:530px;height:19px"> Upload (add) a photo to an album</td></tr><tr><td> <a href="http://fbcmd.dtompkins.com/commands/addpicd">ADDPICD</a></td><td> Upload (add) all *.jpg files in a directory to an album</td></tr><tr><td style="width:117px;height:19px"> <a href="http://fbcmd.dtompkins.com/commands/albums">ALBUMS</a></td><td style="width:530px;height:19px"> List all your photo albums (or for your friends)</td></tr><tr><td style="width:117px;height:19px"> <a href="http://fbcmd.dtompkins.com/commands/allinfo">ALLINFO</a></td><td style="width:530px;height:19px"> List all available profile information for friend(s)</td></tr>


and extracts all the links to download-list.txt, for use with my wget.exe:

Code:

wget -i download-list.txt


One example of the links I would like to extract is:

Code:

"http://fbcmd.dtompkins.com/commands/allinfo"





Thanks in advance for any reply or answer. :)

From the Philippines

Aacini
Expert
Posts: 1885
Joined: 06 Dec 2011 22:15
Location: México City, México

Re: please help extracting links from a source-code of html

#2 Post by Aacini » 25 Sep 2013 05:24

You may use my FindRepl.bat program to do that, for example:

Code:

C:> < input.html FindRepl "http://(\w+).(\w+).(\w+)(/\w+)+" /$:0 > download-list.txt

C:> type download-list.txt
 "http://fbcmd.dtompkins.com/commands/addalbum"
 "http://fbcmd.dtompkins.com/commands/addperm"
 "http://fbcmd.dtompkins.com/commands/addpic"
 "http://fbcmd.dtompkins.com/commands/addpicd"
 "http://fbcmd.dtompkins.com/commands/albums"
 "http://fbcmd.dtompkins.com/commands/allinfo"


You may copy the FindRepl.bat program from this site.
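
A side note on the pattern itself: the dots in `(\w+).(\w+).(\w+)` are unescaped, so each one matches any character, not only a literal dot. The following is a minimal Python sketch (not FindRepl, and not part of the original answer) comparing that pattern with a stricter variant that escapes the dots. The shortened `html` sample and the variable names are made up for illustration.

```python
import re

# Shortened sample based on the HTML fragment in the first post.
html = (
    '<td><a href="http://fbcmd.dtompkins.com/commands/addalbum">ADDALBUM</a></td>'
    '<td><a href="http://fbcmd.dtompkins.com/commands/addperm">ADDPERM</a></td>'
)

# The pattern as posted: "." matches any single character.
loose = re.compile(r"http://(\w+).(\w+).(\w+)(/\w+)+")

# Stricter variant (an assumption about what is wanted): dots escaped,
# groups made non-capturing so findall() returns the whole match.
strict = re.compile(r"http://\w+\.\w+\.\w+(?:/\w+)+")

links = strict.findall(html)
for link in links:
    print(link)

# The loose pattern also accepts a host written with other separators.
print(bool(loose.fullmatch("http://a-b-c/d")))   # loose pattern accepts it
print(bool(strict.fullmatch("http://a-b-c/d")))  # strict pattern rejects it
```

For the sample HTML in this thread both patterns find the same links; the difference only matters when the input contains look-alike text.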

Antonio

foxidrive
Expert
Posts: 6031
Joined: 10 Feb 2012 02:20

Re: please help extracting links from a source-code of html

#3 Post by foxidrive » 25 Sep 2013 06:55

That's a good plan, Aacini. I tried it and got slightly different results (the output below is wrapped by the DOS window).

Code:

d:\abc>FindRepl "http://(\w+).(\w+).(\w+)(/\w+)+" /$:0 <file.txt >a

d:\abc>type a
 "http://fbcmd.dtompkins.com/commands/addalbum" "http://fbcmd.dtompkins.com/commands/addperm" "http:
//fbcmd.dtompkins.com/commands/addpic" "http://fbcmd.dtompkins.com/commands/addpicd" "http://fbcmd.d
tompkins.com/commands/albums" "http://fbcmd.dtompkins.com/commands/allinfo"

I'm curious how the surrounding quotes are included in the matches. Does FindRepl always return the outer quotes?

bars143
Posts: 87
Joined: 01 Sep 2013 20:47

Re: please help extracting links from a source-code of html

#4 Post by bars143 » 25 Sep 2013 07:15

Many thanks, Aacini.
Got it, as shown below:

Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.

C:\Users\Jocelyn\Desktop2>< input.html FindRepl "http://(\w+).(\w+).(\w+)(/\w+)+" /$:0 > download-list.txt

C:\Users\Jocelyn\Desktop2>type download-list.txt
"http://fbcmd.dtompkins.com/commands/addalbum"
"http://fbcmd.dtompkins.com/commands/addperm"
"http://fbcmd.dtompkins.com/commands/addpic"
"http://fbcmd.dtompkins.com/commands/addpicd"
"http://fbcmd.dtompkins.com/commands/albums"
"http://fbcmd.dtompkins.com/commands/allinfo"

C:\Users\Jocelyn\Desktop2>


Now that's easy!

I would like to study more now, even though my netbook only connects at GPRS speed, using a Nokia S40 phone as its modem.
That's why I use the mobile format in desktop browsers to surf the internet at GPRS speed. I found wget.exe to be a fast downloader and fbcmd.bat a fast way to post Facebook status updates.

Anyway, cheers to DosTips! :D

Simply bars

Endoro
Posts: 244
Joined: 27 Mar 2013 01:29
Location: Bozen

Re: please help extracting links from a source-code of html

#5 Post by Endoro » 25 Sep 2013 07:40

Using wget and grep:

Code:

wget -O- "http://fbcmd.dtompkins.com/commands/allinfo" 2>nul|grep -Eo "http://fbcmd.(\w+).(\w+)(/\w+)+"


This needs some more work:

Code:

http://fbcmd.dtompkins.com/_/rsrc/1301314156558/config/fbcmd138x52
http://fbcmd.dtompkins.com/parameters/flist
http://fbcmd.dtompkins.com/syntax
http://fbcmd.dtompkins.com/preferences/parameter
http://fbcmd.dtompkins.com/preferences/general
http://fbcmd.dtompkins.com/_/tz
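
One possible piece of that extra work is to keep only the links under `/commands/`, which is the kind of link asked for in the first post. The following is a minimal Python sketch under that assumption. The `found` list mixes a few of the links above with one command link from the first post, and all names are illustrative.

```python
import re

# A few links as printed by the wget | grep pipeline, plus one command
# link from the first post for comparison.
found = [
    "http://fbcmd.dtompkins.com/_/rsrc/1301314156558/config/fbcmd138x52",
    "http://fbcmd.dtompkins.com/parameters/flist",
    "http://fbcmd.dtompkins.com/syntax",
    "http://fbcmd.dtompkins.com/commands/allinfo",
]

# Assumption: only links directly under /commands/ are wanted.
command_link = re.compile(r"http://fbcmd\.dtompkins\.com/commands/\w+")

# fullmatch() requires the whole URL to fit the pattern.
wanted = [url for url in found if command_link.fullmatch(url)]
print(wanted)
```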

Aacini
Expert
Posts: 1885
Joined: 06 Dec 2011 22:15
Location: México City, México

Re: please help extracting links from a source-code of html

#6 Post by Aacini » 25 Sep 2013 12:19

foxidrive wrote: That's a good plan, Aacini. I tried it and got slightly different results (the output below is wrapped by the DOS window).

Code:

d:\abc>FindRepl "http://(\w+).(\w+).(\w+)(/\w+)+" /$:0 <file.txt >a

d:\abc>type a
 "http://fbcmd.dtompkins.com/commands/addalbum" "http://fbcmd.dtompkins.com/commands/addperm" "http:
//fbcmd.dtompkins.com/commands/addpic" "http://fbcmd.dtompkins.com/commands/addpicd" "http://fbcmd.d
tompkins.com/commands/albums" "http://fbcmd.dtompkins.com/commands/allinfo"

I'm curious how the surrounding quotes are included in the matches. Does FindRepl always return the outer quotes?

I suggest you copy the program again; it is possible that you have a previous version that shows the matched subexpressions on the same line.

About the quotes: "each subexpression is shown enclosed in quotes or in the character given in /Q switch". For example:

Code:

C:> < input.html FindRepl "http://(\w+).(\w+).(\w+)(/\w+)+" /$:0 /Q:
 http://fbcmd.dtompkins.com/commands/addalbum
 http://fbcmd.dtompkins.com/commands/addperm
 http://fbcmd.dtompkins.com/commands/addpic
 http://fbcmd.dtompkins.com/commands/addpicd
 http://fbcmd.dtompkins.com/commands/albums
 http://fbcmd.dtompkins.com/commands/allinfo

Antonio

foxidrive
Expert
Posts: 6031
Joined: 10 Feb 2012 02:20

Re: please help extracting links from a source-code of html

#7 Post by foxidrive » 25 Sep 2013 17:55

Aacini wrote: I suggest you copy the program again; it is possible that you have a previous version that shows the matched subexpressions on the same line.


That fixed it. I had a version from early July. Cheers

bars143
Posts: 87
Joined: 01 Sep 2013 20:47

Re: please help extracting links from a source-code of html

#8 Post by bars143 » 25 Sep 2013 23:32

Thanks, Aacini, for the other code:

Code:

< input.html FindRepl "http://(\w+).(\w+).(\w+)(/\w+)+" /$:0 /Q:



Does that really exclude the space too, or only the quotes?
I will try it soon...


Anyway, I already made another .bat script to remove the space and quotes, as wget will not work with quoted URLs in a text file.
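
That script was not posted. As an illustration of the clean-up step only, here is a minimal sketch in Python rather than batch. The `clean_line` helper and the sample lines are hypothetical, modelled on the quoted FindRepl output shown earlier in the thread.

```python
def clean_line(line: str) -> str:
    """Remove surrounding whitespace, then surrounding double quotes."""
    return line.strip().strip('"')


# Lines shaped like the quoted FindRepl output: leading space, quoted URL.
raw_lines = [
    ' "http://fbcmd.dtompkins.com/commands/addalbum"\n',
    ' "http://fbcmd.dtompkins.com/commands/addperm"\n',
    "\n",
]

# Skip blank lines so the result holds one URL per entry.
cleaned = [clean_line(line) for line in raw_lines if line.strip()]
print("\n".join(cleaned))
```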


Sorry for the late reply; I struggled with a wrong password when I logged in yesterday.

bars :D

------------------------------------------------------------------------------------
Edited:

It does work, even with white space in front of the URLs in the text file used for downloading with wget.exe.

Thanks and cheers! :D

bars
