How to extract data from website?

Discussion forum for all Windows batch related topics.

Moderator: DosItHelp

PaperTronics
Posts: 118
Joined: 02 Apr 2017 06:11

How to extract data from website?

#1 Post by PaperTronics » 30 Apr 2017 08:01

Hey Guys!

I want to extract all the links beginning with "http://www.mediafire.com" from this file: http://www.mediafire.com/file/4yks2b0u18auy69/Doc.txt
I tried the findstr and find commands, but they won't do the trick.

Please help!

Thanks,
PaperTronics
Last edited by PaperTronics on 21 May 2017 01:14, edited 1 time in total.

aGerman
Expert
Posts: 4654
Joined: 22 Jan 2010 18:01
Location: Germany

Re: How to extract data from website?

#2 Post by aGerman » 30 Apr 2017 11:13

You need a utility that supports regular expressions better than FINDSTR does. Either use a third-party tool, or else use dbenham's JREPL hybrid batch script; I'm virtually certain it will work, too.
viewtopic.php?f=3&t=6044

Steffen

igor_andreev
Posts: 16
Joined: 25 Feb 2017 12:55
Location: Russia

Re: How to extract data from website?

#3 Post by igor_andreev » 30 Apr 2017 15:15

Code:

grep -P -o "http\:\/\/www\.mediafire\.com[^\x22]*" Doc.txt

or

Code:

type Doc.txt | geturls | find "mediafire"


geturls.zip (~32 KB) is available here: http://ss64.net/westlake/nt/index.html
Last edited by igor_andreev on 01 May 2017 06:57, edited 1 time in total.


Re: How to extract data from website?

#4 Post by aGerman » 01 May 2017 06:14

Using JREPL:

Code:

@echo off &setlocal
cmd /c ""jrepl.bat" "\bhttp://www\.mediafire\.com[^^\x22]*" "" /F "Doc.txt" /I /MATCH"
pause


Also possible:

Code:

@jrepl.bat "\bhttp://www\.mediafire\.com[^\x22]*" "" /F "Doc.txt" /O "mediafire.txt" /I /MATCH


Steffen

Thor
Posts: 43
Joined: 31 Mar 2016 15:02

Re: How to extract data from website?

#5 Post by Thor » 01 May 2017 10:29

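Code:

@echo off
setlocal enableDelayedExpansion
rem Read url.txt line by line, then walk each line one space-separated token at a time.
for /f "tokens=*" %%i in (url.txt) do (
  set "line=%%i"
  rem Examine at most 20 tokens per line.
  for /l %%k in (1 1 20) do (
    for /F "tokens=1* delims= " %%A in ("!line!") do (
      set "nextToken=%%A"
      rem Skip the 7 characters of http:// and compare the next 17 with the host name.
      if "!nextToken:~7,17!" == "www.mediafire.com" echo %%A
      rem Keep the rest of the line for the next pass.
      set "line=%%B"
    )
  )
)
endlocal
exit /b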

"url.txt" file:

Code:

This is line 1 ab: http://www.mediafire.com/file/1yks2b0u18auy01/Doc.htm This is line 1 end.
This is line 2: http://www.abc.com/file/4yks2b0u18auy69/Doc.txt
This is line 3 ab cd: http://www.mediafire.com/file/2yks2b0u18auy02/Doc.bmp This is line 2 end.
This is line 4: http://www.def.com/file/4yks2b0u18auy69/Doc.txt
This is line 5 ab cd ef: http://www.mediafire.com/file/3yks2b0u18auy03/Doc.gif This is line 3 end.
This is line 6: http://www.ghi.com/file/4yks2b0u18auy69/Doc.txt
This is line 7 ab cd ef gh: http://www.mediafire.com/file/4yks2b0u18auy04/Doc.jpg This is line 4 end.
This is line 8: http://www.jkl.com/file/4yks2b0u18auy69/Doc.txt
This is line 9 ab cd ef gh ij: http://www.mediafire.com/file/5yks2b0u18auy05/Doc.png This is line 5 end.
This is line 10: http://www.mno.com/file/4yks2b0u18auy69/Doc.txt
This is line 11 ab cd ef gh ij kl: http://www.mediafire.com/file/6yks2b0u18auy06/Doc.tif This is line 6 end.
This is line 12: http://www.pqr.com/file/4yks2b0u18auy69/Doc.txt
This is line 13 ab cd ef gh ij kl mn: http://www.mediafire.com/file/7yks2b0u18auy07/Doc.docx This is line 7 end.
This is line 14: http://www.stu.com/file/4yks2b0u18auy69/Doc.txt
This is line 15 ab cd ef gh ij kl mn op: http://www.mediafire.com/file/8yks2b0u18auy08/Doc.xlsx This is line 8 end.
This is line 16: http://www.wxy.com/file/4yks2b0u18auy69/Doc.txt
This is line 17 ab cd ef gh ij kl mn op qr: http://www.mediafire.com/file/9yks2b0u18auy09/Doc.ptsx This is line 9 end.
This is line 18: http://www.zab.com/file/4yks2b0u18auy69/Doc.txt
This is line 19 ab cd ef gh ij kl mn op qr st: http://www.mediafire.com/file/10yks2b0u18auy10/Doc.txt This is line 10 end.
This is line 20: http://www.zde.com/file/4yks2b0u18auy69/Doc.txt
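
A note on how the check in the batch code works, sketched in JavaScript (the language family of the JScript examples in this thread) rather than batch: `!nextToken:~7,17!` takes 17 characters starting at offset 7, that is, the text right after "http://", and compares them with the 17-character host name. The function name `filterTokens` is made up for this illustration, and unlike the batch loop the sketch does not stop after 20 tokens or merge consecutive spaces.

```javascript
// Sketch of the batch logic in JavaScript; filterTokens is a hypothetical name.
function filterTokens(line) {
  var tokens = line.split(" "), hits = [], i;
  for (i = 0; i < tokens.length; i++) {
    // Same test as "!nextToken:~7,17!": skip the 7 characters of "http://",
    // then compare the next 17 characters with the host name.
    if (tokens[i].substr(7, 17) === "www.mediafire.com") {
      hits.push(tokens[i]);
    }
  }
  return hits;
}

// First two lines of the url.txt sample above.
var hit = filterTokens("This is line 1 ab: http://www.mediafire.com/file/1yks2b0u18auy01/Doc.htm This is line 1 end.");
var miss = filterTokens("This is line 2: http://www.abc.com/file/4yks2b0u18auy69/Doc.txt");
```

Here `hit` holds the one mediafire link from the first sample line and `miss` is empty. Because the test depends on the link being its own space-separated token that starts with "http://", it is likely to miss links embedded in HTML attributes.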
Last edited by Thor on 03 May 2017 17:09, edited 3 times in total.

Aacini
Expert
Posts: 1885
Joined: 06 Dec 2011 22:15
Location: México City, México

Re: How to extract data from website?

#6 Post by Aacini » 01 May 2017 11:43

The 3-line Batch file below (save it with a .BAT extension) takes less than 1 second to generate the output file with the 56 result lines from your data:

Code:

@set @a=0 // & cscript //nologo //E:JScript "%~F0" < Doc.txt > output.txt & goto :EOF

var search = /http:\/\/www\.mediafire\.com[^"]*/g, file = WScript.StdIn.ReadAll(), match;
while ( match = search.exec(file) ) WScript.Stdout.WriteLine(match[0]);

Output:

Code:

http://www.mediafire.com/file/dbu0pgraknjfma3/Snaper_1.0_By_Lego_Stoppro.zip
http://www.mediafire.com/file/wo15pswxydfkaa5/Hover_Test.zip
http://www.mediafire.com/download/a9yyp9vnmlmhxal/Example_1.zip
. . . . .
http://www.mediafire.com/download/dpm0yti5f8q29fh/swap_Mouse_Buttons.zip
http://www.mediafire.com/download/d1vu3csnlh6i2yi/Rights_Modifier_by_Kvc.zip
http://www.mediafire.com/view/c0cge2ks8i676n2/Hiding_data.bat
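
To see what the two JScript lines above do, here is the same loop as a plain JavaScript function. This is only a sketch: `extractLinks` and the sample string are made up for illustration, and the function returns an array instead of reading StdIn and writing to StdOut.

```javascript
// Sketch: collect every match of a global regex, the way the exec loop above does.
// extractLinks is a hypothetical helper name, not part of the original script.
function extractLinks(text) {
  // [^"]* stops the match at the closing quote of an href="..." attribute.
  var search = /http:\/\/www\.mediafire\.com[^"]*/g, match, links = [];
  // With the g flag, each exec call resumes after the previous match
  // and returns null when nothing is left, which ends the loop.
  while ((match = search.exec(text)) !== null) {
    links.push(match[0]);
  }
  return links;
}

// Made-up sample input for illustration only.
var sample = '<a href="http://www.mediafire.com/file/abc/One.zip">One</a>' +
             '<a href="http://www.example.com/Two.zip">Two</a>' +
             '<a href="http://www.mediafire.com/view/def/Three.bat">Three</a>';
var links = extractLinks(sample);
```

With this sample, `links` contains the two mediafire URLs and skips the example.com one.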

Antonio


Re: How to extract data from website?

#7 Post by PaperTronics » 03 May 2017 10:15

@Thor: Nice coding but it's kind of slow.

@Aacini: Your example isn't working. I've put it in the same folder as Doc.txt. Am I doing something wrong?


Re: How to extract data from website?

#8 Post by Thor » 03 May 2017 11:53

PaperTronics wrote: @Thor: Nice coding but it's kind of slow.

Try my code again; it should run pretty decently now. :D


Re: How to extract data from website?

#9 Post by Aacini » 03 May 2017 11:54

PaperTronics wrote: @Aacini: Your example isn't working. I've put it in the same folder as Doc.txt. Am I doing something wrong?

Did you save the code with a .BAT extension? Did you check whether the output.txt file was created? You can also test it after removing the "> output.txt" part. If it still doesn't work, please copy the output from the command-line window and paste it here...

Antonio


Re: How to extract data from website?

#10 Post by PaperTronics » 04 May 2017 10:26

Aacini wrote:
Did you save the code with a .BAT extension? Did you check whether the output.txt file was created? You can also test it after removing the "> output.txt" part. If it still doesn't work, please copy the output from the command-line window and paste it here...

Antonio


I wasn't able to read it clearly because CMD closed every time the error occurred. I saved it with a .BAT extension, and output.txt was just a blank file. CMD says something like "Conditional Compiling is turned off".

PaperTronics


Re: How to extract data from website?

#11 Post by PaperTronics » 04 May 2017 10:31

Try my code again; it should run pretty decently now. :D


It did get a slight bit faster :)


Re: How to extract data from website?

#12 Post by Aacini » 04 May 2017 11:37

PaperTronics wrote: I wasn't able to read it clearly because CMD closed every time the error occurred. I saved it with a .BAT extension, and output.txt was just a blank file. CMD says something like "Conditional Compiling is turned off".

PaperTronics


A couple of points here:

First of all, you should run any problematic Batch file from a cmd.exe window (the way to open one varies by Windows version): execute a CD command to change to the directory where the Batch file is, then run the file by entering its name. This way any message remains on the screen, so you can copy it: right-click -> Mark, select the desired text with the Shift key or the left mouse button, and press Enter to finish. After that, you can paste the text here. Do NOT run the Batch file from Explorer by double-clicking it.


According to the documentation, this error should not occur:
the documentation wrote: Conditional compilation is activated by using the @cc_on statement, or using an @if or @set statement.

Please try this version of the code:

Code:

@if (@CodeSection == @Batch) @then

@echo off
cscript //nologo //E:JScript "%~F0" < Doc.txt > output.txt
goto :EOF

@end

var search = /http:\/\/www\.mediafire\.com[^"]*/g, file = WScript.StdIn.ReadAll(), match;
while ( match = search.exec(file) ) WScript.Stdout.WriteLine(match[0]);

If it still doesn't work, post the output from the command-line window...

Antonio


Re: How to extract data from website?

#13 Post by PaperTronics » 09 May 2017 07:08

The error states:

Code:

C:\Users\pratik\Desktop\BatchStore\DummyBase.bat(1, 6) Microsoft JScript compilation error: Conditional compilation is turned off

Hackoo
Posts: 103
Joined: 15 Apr 2014 17:59

Re: How to extract data from website?

#14 Post by Hackoo » 10 May 2017 12:35

Hi :D
Just give this batch file a try:

Code:

@echo off
Title Extract Mediafire href links by Hackoo 2017
mode con cols=70 lines=3 & color 9E
Set "vbsfile=%tmp%\%~n0.vbs"
Set "InputFile=Doc.txt"
Set "OutPutFile=All_Links.txt"
set "MediaFireLinks=MediaFireLinks.txt"
echo(
echo      Please wait a while ... Extracting is in progress ...
Call :ExtractLinks "%InputFile%" "%OutPutFile%"
Type "%OutPutFile%" | find /i "mediafire" > "%MediaFireLinks%"
start "" "%MediaFireLinks%"
exit
::****************************************************
:ExtractLinks <InputData> <OutPutData>
(
   echo InputFile = wscript.Arguments(0^)
   echo OutPutFile = wscript.Arguments(1^)
   echo Call ExtractLinks(InputFile,OutPutFile^)
   echo Function ExtractLinks(inputfile,outfile^)
   echo      Set fso = CreateObject("Scripting.FileSystemObject"^)
   echo      Set Link = fso.OpenTextFile(OutPutFile,2,True,-1^)
   echo      Set f = Fso.OpenTextFile(InputFile,1^)
   echo      Data = f.ReadAll
   echo      Set reLink = New RegExp
   echo      reLink.Global = True
   echo      reLink.IgnoreCase = True 
   echo      reLink.Pattern = "<a\b[^>]*\bhref=(?:([""'])([\s\S]+?)\1|([^\s>]*))[^>]*>([\s\S]+?)</a>"                     
   echo      Set reText = New RegExp
   echo      reText.GLobal = True
   echo      reText.Pattern = "<[^>]*>"     
   echo      For Each Match in reLink.Execute(Data^)
   echo          HREF = Match.SubMatches(1^) ^& Match.SubMatches(2^)
   echo          'InnerText = reText.Replace(Match.SubMatches(3^), ""^)
   echo          Link.WriteLine HREF
   echo      Next 
   echo End Function
)>"%vbsfile%"
cscript /nologo "%vbsfile%" "%~1" "%~2"
exit /b
::**********************************************************************************


Re: How to extract data from website?

#15 Post by Hackoo » 11 May 2017 06:38

Hi :)
Here is another tweaked version that extracts all links from the source code of a website. The results can also be filtered by search strings such as "Mediafire", "Aacini", and "Thebateam". If you also want to extract the InnerText, uncomment the part of this line after HREF (remove the single quote):
Link.WriteLine HREF '^& " ========> " ^& InnerText
becomes

Code:

Link.WriteLine HREF ^&  " ========> " ^& InnerText
or simply write it like this:

Code:

Link.WriteLine HREF
So the whole code of ExtractLinks.bat is:

Code:

@echo off
Title Extracting HREF links from website source code by Hackoo 2017
REM Extract all links from source code of a website, and also, can be filtered by string to be searched
mode con cols=75 lines=3 & color 9E
set "vbsfile=%tmp%\%~n0.vbs"
set "InputFile=Doc.txt"
If Not exist "%InputFile%" (
   Color 0C
   echo(
   echo  The "%InputFile%" does not exist,please check it and re-run this batch again
   pause>nul
   exit
)
Set "OutPutFile=All_Links.txt"
set Filter_Strings="Mediafire" "Aacini" "Thebateam"
echo(
echo       Please Wait a While ... Extracting Links is in Progress ....
Call :ExtractLinks "%InputFile%" "%OutPutFile%"
For %%a in (%Filter_Strings%) Do (
   Type "%OutPutFile%" | find /I "%%~a" > "%~dp0%%~a_Links.txt"
   If exist "%~dp0%%~a_Links.txt" Start "" "%~dp0%%~a_Links.txt"
)
start "" "%OutPutFile%" & Exit
::*************************************************************************************************
:ExtractLinks <InputFile> <OutPutFile>
(
   echo InputFile = Wscript.Arguments(0^)
   echo OutPutFile = Wscript.Arguments(1^)
   echo Call ExtractLinks(InputFile,OutPutFile^)
   echo '-------------------------------------------------------------------------------------------
   echo Function ExtractLinks(InputFile,OutPutFile^)
   echo      Set fso = CreateObject("Scripting.FileSystemObject"^) 
   echo      Set f = Fso.OpenTextFile(InputFile,1^)
   echo      Set Link = fso.OpenTextFile(OutPutfile,2,True,-1^)
   echo      Data = f.ReadAll
   echo      Set reLink = New RegExp
   echo      reLink.Global = True
   echo      reLink.IgnoreCase = True 
   echo      reLink.Pattern = "<a\b[^>]*\bhref=(?:([""'])([\s\S]+?)\1|([^\s>]*))[^>]*>([\s\S]+?)</a>"
   echo      Set reText = New RegExp
   echo      reText.GLobal = True
   echo      reText.Pattern = "<[^>]*>"     
   echo      For Each Match in reLink.Execute(Data^)
   echo          HREF = Match.SubMatches(1^) ^& Match.SubMatches(2^)
   echo          InnerText = reText.Replace(Match.SubMatches(3^), ""^)
   echo 'If you want to extract the InnerText just uncomment this line after HREF (get rid from quote^)
   echo          Link.WriteLine HREF '^&  " ========> " ^& InnerText
   echo      Next 
   echo End Function
   echo '-------------------------------------------------------------------------------------------
)>"%vbsfile%"
Cscript /nologo "%vbsfile%" "%~1" "%~2"
exit /b
::*************************************************************************************************
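
For readers wondering how the pattern in the VBScript part maps to HREF and InnerText, here is a sketch of the same regex in plain JavaScript. VBScript's SubMatches collection is zero-based, so SubMatches(1), SubMatches(2) and SubMatches(3) correspond to capture groups 2, 3 and 4 below. The names `parseAnchors`, `sampleHtml` and `anchors` are made up for this illustration, and the sample HTML is invented.

```javascript
// Sketch of the href pattern from the VBScript above, ported to a JavaScript regex.
// parseAnchors is a hypothetical helper name used only for this illustration.
function parseAnchors(html) {
  // Group 1: opening quote, group 2: quoted href, group 3: unquoted href,
  // group 4: everything between <a ...> and </a>.
  var reLink = /<a\b[^>]*\bhref=(?:(["'])([\s\S]+?)\1|([^\s>]*))[^>]*>([\s\S]+?)<\/a>/gi;
  var reText = /<[^>]*>/g; // strips nested tags from the link text
  var results = [], m;
  while ((m = reLink.exec(html)) !== null) {
    results.push({
      // Only one of groups 2 and 3 takes part in a match, so joining
      // them yields the href whether or not it was quoted.
      href: (m[2] || "") + (m[3] || ""),
      innerText: m[4].replace(reText, "")
    });
  }
  return results;
}

// Invented sample: one quoted href with nested markup, one unquoted href.
var sampleHtml = '<a class="x" href="http://www.mediafire.com/file/abc/One.zip"><b>One</b></a> ' +
                 "<a href=http://www.example.com/two>Two</a>";
var anchors = parseAnchors(sampleHtml);
```

Both anchors are found: the first yields the quoted mediafire URL with the text "One" (the nested tag is stripped), and the second yields the unquoted URL with the text "Two".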
Attachments
ExtractLinks_by_Hackoo.rar
Extract all links from source code of a website and can be filtered by string to be searched
