(Note: I’m still in the process of writing the hints and requirements for this assignment, though you’re welcome to start drafting ideas.) Link rot, i.e. hyperlinks that point to pages that have since moved or been deleted, or that were mistyped in the first place, is a common problem in web publishing. In this assignment, you’ll write a practical tool that uses curl to quickly check which hyperlinks on a given web page are broken.
The url-checker.sh script takes one argument: a URL from which to gather hyperlinks. The script then visits each hyperlink and retrieves its HTTP status code, e.g.
```
bash url-checker.sh http://www.example.com
```
The output of url-checker.sh is a comma-delimited list containing two columns, the URL and its HTTP status code, sorted alphabetically by URL:
```
http://en.wikipedia.org/,200
http://www.example.com/broken,404
http://www.example.com/hello,200
https://www.facebook.com,200
```
The url-checker.sh script should check each URL on a given page exactly once. If relative URLs are found on the page, url-checker.sh will have to resolve them to absolute URLs before visiting them.
You should partition your url-checker.sh script into a few phases:
Use the pup tool to extract all of the hrefs from the given page.
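As a sketch, assuming pup is installed, a pipeline along these lines could pull out the hrefs (pup's attr{href} display filter prints the value of the href attribute for each matched tag):

```bash
# Fetch the page quietly, then print the href attribute of every <a> tag
curl -s "http://www.example.com" | pup 'a attr{href}'
```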
Some of the URLs may be relative, e.g. if you visit "http://www.whitehouse.gov", you might find an href that points to the relative URL "/pictures/index.html". Your script will need to translate this URL into its absolute form, i.e. "http://www.whitehouse.gov/pictures/index.html".
Some URLs may be repeated on the page, so use uniq to create a list of unique URLs. Keep in mind that uniq only removes adjacent duplicates, so the list must be sorted first, which also takes care of the required alphabetical ordering.
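For example, where urls.txt is a stand-in name for wherever you've stored the resolved URLs:

```bash
# Sort so duplicates become adjacent, then collapse them
sort urls.txt | uniq
# sort -u urls.txt does the same thing in one step
```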
Use curl and its various options to find the HTTP response code for a given URL. Do not output the content of each visited link; the purpose of url-checker.sh is to find the HTTP status of each URL, not to save its content.
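curl's -o /dev/null option discards the response body, and -w '%{http_code}' prints just the status code. A sketch of this final phase, assuming the deduplicated URLs live in a hypothetical unique-urls.txt file:

```bash
# For each URL, print "URL,status" without saving any page content
while read -r url; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
  echo "${url},${code}"
done < unique-urls.txt
```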