Wget has the ability to mirror and restrict to certain filetypes, and you can stop it from parsing links to offsite servers.
It's been many years since I used it and I can't readily test this
Code: Select all
WGET.EXE --convert-links --page-requisites --html-extension --recursive --no-parent --tries=3 --force-directories --user-agent=FGET --accept .bat,.doc,.txt --secure-protocol=auto --user=USER --password=PASSWORD https://domain.com/toplevelfolder/
--convert-links
After the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-html content, etc.
--page-requisites
This option causes Wget to download all the files that are necessary to properly display a given html page. This includes such things as inlined images, sounds, and referenced stylesheets.
--html-extension
If a file of type application/xhtml+xml or text/html is downloaded and the URL does not end with the regexp \.[Hh][Tt][Mm][Ll]?, this option will cause the suffix .html to be appended to the local filename. This is useful, for instance, when you're mirroring a remote site that uses .asp pages, but you want the mirrored pages to be viewable on your stock Apache server. Another good use for this is when you're downloading CGI-generated materials. A URL like
http://site.com/article.cgi?25 will be saved as article.cgi?25.html.
Note that filenames changed in this way will be re-downloaded every time you re-mirror a site
--recursive
Turn on recursive retrieving. See {Recursive Download}, for more details.
--level=depth
Specify recursion maximum depth level depth (see {Recursive Download}). The default maximum depth is 5.
--no-parent
The simplest, and often very useful way of limiting directories is disallowing retrieval of the links that refer to the hierarchy above than the beginning directory, i.e. disallowing ascent to the parent directory/directories.
--tries=number
Set number of retries to number. Specify 0 or inf for infinite retrying. The default is to retry 20 times, with the exception of fatal errors like connection refused or not found (404), which are not retried.
--user-agent=agent-string
Identify as agent-string to the http server.
--accept list
2.11 Recursive Accept/Reject Options
-A acclist --accept acclist
-R rejlist --reject rejlist
Specify comma-separated lists of file name suffixes or patterns to accept or reject (see {Types of Files}). Note that if any of the wildcard characters, *, ?, [ or ], appear in an element of acclist or rejlist, it will be treated as a pattern, rather than a suffix.
--secure-protocol=protocol
Choose the secure protocol to be used. Legal values are auto, SSLv2, SSLv3, and TLSv1. If auto is used, the SSL library is given the liberty of choosing the appropriate protocol automatically, which is achieved by sending an SSLv2 greeting and announcing support for SSLv3 and TLSv1. This is the default.
Specifying SSLv2, SSLv3, or TLSv1 forces the use of the corresponding protocol. This is useful when talking to old and buggy SSL server implementations that make it hard for OpenSSL to choose the correct protocol version. Fortunately, such servers are quite rare.
INFO--no-clobber
If a file is downloaded more than once in the same directory, Wget's behavior depends on a few options, including -nc. In certain cases, the local file will be clobbered, or overwritten, upon repeated download. In other cases it will be preserved.
When running Wget without -N, -nc, -r, or p, downloading the same file in the same directory will result in the original copy of file being preserved and the second copy being named file.1. If that file is downloaded yet again, the third copy will be named file.2, and so on. When -nc is specified, this behavior is suppressed, and Wget will refuse to download newer copies of file. Therefore, “no-clobber” is actually a misnomer in this mode-it's not clobbering that's prevented (as the numeric suffixes were already preventing clobbering), but rather the multiple version saving that's prevented.