Command line script to mirror website

#1 Post by Squashman » 29 Jul 2014 08:59

We have an internal website here at work that Corporate I.T. is going to shut down. I asked them if they planned on moving all the documents on the website to a shared folder on the network, and their answer was NO. Of course they don't realize that a lot of people still rely on many of the documents on that website.

So I am trying to use HTTRACK to pull down just the documents from the website, specifically the .BAT files and the text and Word documents that are on it.

I started using the HTTRACK GUI to mirror the site, but it doesn't seem to be pulling down any of the documents, and of course it is pulling down way more than I need. Now I see there is a command-line version of HTTRACK, so I am playing with that to pull down just the documents I need. I am not sure this is possible, because I think if I put in a filter for it to only look at *.bat, it will not parse the HTML files to find the links to the .BAT files.
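
For what it's worth, HTTrack's command line takes the same +/- filter patterns as the GUI's scan rules. A rough, untested sketch (the output folder and the exact filter list are assumptions, and whether the HTML pages still get fetched and parsed for links with such a whitelist is exactly the open question):

Code: Select all

REM Untested sketch: -O sets the output folder, the quoted +patterns are
REM HTTrack scan-rule filters whitelisting the wanted document types
httrack "https://www.domain.com/clients/clients.nsf?OpenDatabase" -O "C:\mirror" "+*.bat" "+*.doc" "+*.docx" "+*.txt"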

Anyone ever use HTTRACK or something similar to parse a website for specific document types? Does WGET or CURL have this functionality?

Re: Command line script to mirror website

#2 Post by Squashman » 29 Jul 2014 09:44

OK, for some reason HTTrack is choking on the 3rd level and giving me an error message saying I am not authorized to get to the link. That makes no sense, because it can't get to the first or 2nd levels without my username and password, and I did put my username and password into HTTrack.

Basically this internal site is a listing of all our clients, and it has documents in each client's site.
The initial top-level link I gave HTTrack to parse is:

Code: Select all

https://www.domain.com/clients/clients.nsf?OpenDatabase

That link basically opens up to a webpage with the letters A through Z across the screen. So if I want to get to ABC Company, I just click A.
The link for that is just:

Code: Select all

https://www.domain.com/clients/clients.nsf/View%20Form%20A?OpenForm

That opens up just fine and shows me all the client names that begin with the letter A.
Now when HTTrack tries to open any links from here, it gives an error message.

Code: Select all

09:45:45   Error:    "Unauthorized" (401) at link https://www.domain.com/Clients/ABCcompany/ABCcompany.nsf?OpenDatabase (from https://www.domain.com/clients/clients.nsf/View Form A?OpenForm)

That makes no sense, because it could not have gotten to the first two levels if I wasn't authorized.

The links to all the documents are mostly on this 3rd level.

Re: Command line script to mirror website

#3 Post by Squashman » 29 Jul 2014 10:23

OK, I figured out why HTTrack was choking on the 3rd level.
I originally had the main link to parse set as HTTP and not HTTPS. The redirection from HTTP to HTTPS was working for the first two levels, but it would not work for the third level. But I am still pulling down more than I want. I really only want to pull down a few document types and not the whole damn website.

Re: Command line script to mirror website

#4 Post by foxidrive » 29 Jul 2014 18:02

Wget has the ability to mirror a site and restrict the download to certain file types, and you can stop it from parsing links to offsite servers.

It's been many years since I used it, and I can't readily test this:


Code: Select all

WGET.EXE --convert-links --page-requisites --html-extension --recursive --no-parent --tries=3 --force-directories --user-agent=FGET --accept .bat,.doc,.txt --secure-protocol=auto --user=USER --password=PASSWORD https://domain.com/toplevelfolder/
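
To keep it from wandering off to other servers: wget's recursive mode stays on the starting host unless -H/--span-hosts is given, and --domains pins the allowed hosts down explicitly. A variation of the command above (same placeholder host and credentials, untested):

Code: Select all

REM Variation of the command above: -D/--domains restricts recursion to the
REM listed hosts (wget stays on the starting host anyway unless -H is used)
WGET.EXE --recursive --no-parent --domains=domain.com --accept .bat,.doc,.txt --user=USER --password=PASSWORD https://domain.com/toplevelfolder/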



--convert-links
After the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-html content, etc.

--page-requisites
This option causes Wget to download all the files that are necessary to properly display a given html page. This includes such things as inlined images, sounds, and referenced stylesheets.

--html-extension
If a file of type application/xhtml+xml or text/html is downloaded and the URL does not end with the regexp \.[Hh][Tt][Mm][Ll]?, this option will cause the suffix .html to be appended to the local filename. This is useful, for instance, when you're mirroring a remote site that uses .asp pages, but you want the mirrored pages to be viewable on your stock Apache server. Another good use for this is when you're downloading CGI-generated materials. A URL like http://site.com/article.cgi?25 will be saved as article.cgi?25.html.
Note that filenames changed in this way will be re-downloaded every time you re-mirror a site.

--recursive
Turn on recursive retrieving. See {Recursive Download}, for more details.

--level=depth
Specify recursion maximum depth level depth (see {Recursive Download}). The default maximum depth is 5.
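
A minimal illustration, reusing the placeholder URL from the example above; the depth value here is only a guess based on the three site levels described earlier in the thread:

Code: Select all

REM Cap recursion at 3 levels instead of the default 5 (3 is just a guess
REM based on the site structure described earlier in the thread)
WGET.EXE --recursive --level=3 --no-parent https://domain.com/toplevelfolder/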

--no-parent
The simplest, and often very useful, way of limiting directories is disallowing retrieval of the links that refer to the hierarchy above the beginning directory, i.e. disallowing ascent to the parent directory/directories.
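
For example (placeholder host, with the /clients/ path from earlier in the thread):

Code: Select all

REM Starting at /clients/, --no-parent keeps wget from following links back
REM up to the site root or any other directory above /clients/
WGET.EXE --recursive --no-parent https://domain.com/clients/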

--tries=number
Set number of retries to number. Specify 0 or inf for infinite retrying. The default is to retry 20 times, with the exception of fatal errors like connection refused or not found (404), which are not retried.

--user-agent=agent-string
Identify as agent-string to the http server.

--accept acclist / --reject rejlist
-A acclist / -R rejlist
Specify comma-separated lists of file name suffixes or patterns to accept or reject (see {Types of Files}). Note that if any of the wildcard characters *, ?, [ or ] appear in an element of acclist or rejlist, it will be treated as a pattern rather than a suffix.
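
To illustrate the suffix-versus-pattern distinction with the same placeholder URL (untested):

Code: Select all

REM Suffix form: each entry only has to match the end of the file name
WGET.EXE --recursive --accept .bat,.doc,.txt https://domain.com/toplevelfolder/

REM Pattern form: the * wildcard turns each entry into a full pattern,
REM so *.doc* also picks up .docx files
WGET.EXE --recursive --accept "*.bat,*.doc*,*.txt" https://domain.com/toplevelfolder/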

--secure-protocol=protocol
Choose the secure protocol to be used. Legal values are auto, SSLv2, SSLv3, and TLSv1. If auto is used, the SSL library is given the liberty of choosing the appropriate protocol automatically, which is achieved by sending an SSLv2 greeting and announcing support for SSLv3 and TLSv1. This is the default.
Specifying SSLv2, SSLv3, or TLSv1 forces the use of the corresponding protocol. This is useful when talking to old and buggy SSL server implementations that make it hard for OpenSSL to choose the correct protocol version. Fortunately, such servers are quite rare.
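
If the automatic negotiation ever fails against one of those old servers, the protocol can be pinned explicitly, for example (same placeholder URL and credentials, untested):

Code: Select all

REM Only needed if --secure-protocol=auto fails against an old or buggy
REM SSL server implementation
WGET.EXE --secure-protocol=TLSv1 --user=USER --password=PASSWORD https://domain.com/toplevelfolder/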




INFO

--no-clobber
If a file is downloaded more than once in the same directory, Wget's behavior depends on a few options, including -nc. In certain cases, the local file will be clobbered, or overwritten, upon repeated download. In other cases it will be preserved.

When running Wget without -N, -nc, -r, or -p, downloading the same file in the same directory will result in the original copy of file being preserved and the second copy being named file.1. If that file is downloaded yet again, the third copy will be named file.2, and so on. When -nc is specified, this behavior is suppressed, and Wget will refuse to download newer copies of file. Therefore, “no-clobber” is actually a misnomer in this mode: it's not clobbering that's prevented (as the numeric suffixes were already preventing clobbering), but rather the multiple version saving that's prevented.
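
A quick illustration of the numbering behavior described above (placeholder URL again, untested):

Code: Select all

REM Without -nc, a repeated download of the same page is saved alongside
REM the first copy as index.html.1, index.html.2, and so on
WGET.EXE https://domain.com/toplevelfolder/
WGET.EXE https://domain.com/toplevelfolder/

REM With -nc, the existing local copy is kept and the new download is skipped
WGET.EXE -nc https://domain.com/toplevelfolder/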
