First things first

The idea behind this article comes from my experience building Kelsier, a project whose idea in turn comes from this article.

The code I’m going to show below almost certainly has aspects that can be improved. So please, if you see something that doesn’t seem entirely ideal, or you know a better way of doing it, let me know by opening an issue in the repository of this site.

What is a Web scraper

It all starts with data scraping, which is basically an operation in which a computer program extracts information from the output of another program, output that was originally intended to be displayed to an end-user, not passed to another program.

Data scraping applied to the Web is called Web scraping, and I think you can pretty much see where things are going. Web scraping could be defined as the act of extracting and processing information from a website that, for example, does not provide an easier way of getting at its data, such as an API. In our case, what we want is to extract all the links on a page and check whether they are dead and therefore need to be fixed.

Web scraping is actually a much deeper topic than simply extracting links from a web page, so I’ll try to leave you some links at the end of the article if you want to explore more about the subject.

A dead link is basically a link that doesn’t work anymore. This usually happens when the page or resource the link was pointing to no longer exists.

One of the main reasons to find and fix all the dead links on a website is their negative effect on SEO.

Building our Web scraper

In our case, the web scraper we are going to build is very simple. First, it will receive a URL which, after a quick sanity check, will be used to make an HTTP GET request. Next, we’ll use the body of the response to extract all the links we can find. Finally, once we have finished extracting links, we’ll go through them one by one, checking each one’s status with another HTTP GET request and reporting it in the terminal. Easy, right?

Getting our main URL and parsing it

To get our main URL we have two options:

  • Use the flag package.
  • Use the os package.

Since, at the moment, the only thing we want to receive is a simple URL with no optional parameters, we are going to use the os package.
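
Just to show the alternative, here’s a minimal sketch of what the flag-based version could look like, using a hypothetical -url flag (we won’t use this anywhere else in the article):

package main

import (
	"flag"
	"fmt"
)

func main() {
	// Hypothetical -url flag; it defaults to an empty string.
	url := flag.String("url", "", "the URL to scan for dead links")
	flag.Parse()

	if *url == "" {
		fmt.Println("no URL provided")
		return
	}

	fmt.Println(*url)
}

You would run it with something like go run main.go -url google.com. From here on, we’ll stick with os.Args.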

So let’s start working in our main.go file:

package main

import (
    "fmt"
    "os"
)

func main() {
    fmt.Println(os.Args)
}

If we now save our main.go file and execute it:

go run main.go google.com

We should see an output like this:

[/tmp/go-build412963032/b001/exe/web-scraper google.com]

We are basically printing two things here:

  1. The name of our program.
  2. The first and only argument we passed to our program.

As you can read in the os package documentation: “Args hold the command-line arguments, starting with the program name.”

Since the only thing we need is the first argument, we can change our code to look like this:

package main

import (
	"fmt"
	"os"
)

func main() {
	url := os.Args[1] // It will panic if there aren't enough arguments.
	fmt.Println(url)
}

And after saving and executing it again with the same parameter we should get:

google.com

Perfect, isn’t it? Now let’s make an HTTP request:

package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
	"os"
)

func main() {
	// First, we check that we have enough arguments.
	if len(os.Args) <= 1 {
		log.Printf("not enough arguments provided")
		os.Exit(1)
	}

	url := os.Args[1]
	url = "https://" + url // We "parse" our URL.

	// Then, we create a new HTTP GET request.
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		log.Printf("while creating a new request for %q: %v", url, err)
		os.Exit(1)
	}

	// We also create an HTTP client to make the request.
	client := &http.Client{}
	resp, err := client.Do(req)
	if err != nil {
		log.Printf("while making request to %q: %v", url, err)
		os.Exit(1)
	}

	// We try to read the response's body.
	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		log.Printf("while reading the response body from %q: %v", url, err)
		os.Exit(1)
	}

	// And finally, if everything went okay, we just convert body from a []byte
	// to a string and print it.
	fmt.Println(string(body))
}

If we now save and execute it again, we should get a very large blob of HTML code, which is a big step forward but not quite what we want.
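
By the way, prepending “https://” is a very rough way of “parsing” the URL. If you wanted a slightly stricter check, a small sketch using the standard net/url package could look like this (purely illustrative, we won’t add it to the final code):

package main

import (
	"fmt"
	"net/url"
	"os"
)

func main() {
	if len(os.Args) <= 1 {
		fmt.Println("not enough arguments provided")
		os.Exit(1)
	}

	// Note: in the rest of the article we store the raw string in a
	// variable called url, which would shadow this package name.
	u, err := url.Parse("https://" + os.Args[1])
	if err != nil || u.Host == "" {
		fmt.Printf("%q doesn't look like a valid URL: %v\n", os.Args[1], err)
		os.Exit(1)
	}

	fmt.Println(u.String())
}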

Extracting the URLs

What we really want is to extract all the links in the HTML response. But before doing that, it’s worth being clear about the structure of a link in HTML:

<a href="/" alt="home">Home</a>

As you can see, in HTML links are defined with the a tag and normally carry an href attribute.

So now we are going to improve our code: instead of converting the response’s body into a string and printing it, we are going to parse it and extract all the links we find:

package main

import (
	"fmt"
	"log"
	"net/http"
	"os"

	"golang.org/x/net/html"
)

func main() {
	if len(os.Args) <= 1 {
		log.Printf("not enough arguments provided")
		os.Exit(1)
	}

	url := os.Args[1]
	url = "https://" + url

	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		log.Printf("while creating a new request for %q: %v", url, err)
		os.Exit(1)
	}

	client := &http.Client{}
	resp, err := client.Do(req)
	if err != nil {
		log.Printf("while making request to %q: %v", url, err)
		os.Exit(1)
	}
	// We defer the closing of our response's body.
	defer resp.Body.Close()

	// Then, we check that the status code of the response
	// is 200 OK. If it's not we exit.
	if resp.StatusCode != http.StatusOK {
		log.Printf("response status is not %v", http.StatusOK)
		os.Exit(1)
	}

	// Now we parse our response's body.
	doc, err := html.Parse(resp.Body)
	if err != nil {
		log.Printf("while parsing the response body from %q: %v", url, err)
		os.Exit(1)
	}

	// And finally we extract all the links found and print them one
	// by one.
	links := findLinks(nil, doc)
	for i, l := range links {
		fmt.Println(i, l)
	}
}

// findLinks is a recursive function, which means that it calls itself.
// If you look at the code carefully you'll see that it's not really that complicated.
func findLinks(links []string, n *html.Node) []string {
	// We check that our HTML node is of type html.ElementNode and that
	// our HTML node's data (tag) is a link ("a").
	if n.Type == html.ElementNode && n.Data == "a" {
		// Then, we range over the attributes of our node and when we find
		// an "href" attribute we append its value to our links slice, which
		// will be returned later.
		for _, a := range n.Attr {
			if a.Key == "href" {
				links = append(links, a.Val)
			}
		}
	}

	// Here we are basically ranging over the node's children and calling
	// our function recursively on each one.
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		links = findLinks(links, c)
	}

	// When we are done looping, we simply return all the links found.
	return links
}

When we parse our response’s body we get a tree. There are ancestors, descendants, parents and children. But trees have a complication: it is not as easy to iterate over all their elements as it would be in, for example, a slice or a linked list. That’s why we are using recursion instead of a plain for loop.
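
To make that point concrete, here is a minimal sketch of what the same traversal would look like without recursion, using an explicit stack of nodes (just an illustration, the article sticks with the recursive version):

func findLinksIterative(n *html.Node) []string {
	var links []string
	stack := []*html.Node{n}

	for len(stack) > 0 {
		// Pop the last node pushed onto the stack.
		node := stack[len(stack)-1]
		stack = stack[:len(stack)-1]

		// Same check as in the recursive version.
		if node.Type == html.ElementNode && node.Data == "a" {
			for _, a := range node.Attr {
				if a.Key == "href" {
					links = append(links, a.Val)
				}
			}
		}

		// Push the children so they get visited too.
		for c := node.FirstChild; c != nil; c = c.NextSibling {
			stack = append(stack, c)
		}
	}

	return links
}

The order of the links may differ from the recursive version, but the set of links found is the same; having to manage the stack ourselves is exactly the complication that recursion hides from us.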

If you want to learn more about golang.org/x/net/html or about the html.Parse function, you can check the package documentation here. Also, the findLinks function is mainly taken from the code examples of the gopl book, which, by the way, I recommend.

Formatting our URLs

Now that we have our links, it’s time to make sure they’re well formatted so we don’t get false negatives. To format them, we are going to apply the following rules:

  • If the link is /, we’re going to change it to our base URL.
  • If the link is something like /about, we’re going to prepend our base URL.
  • If the link is an anchor like #social, we’re going to prepend our base URL and a slash.
  • If the link has no scheme at all, we’re going to prepend https://.

To do this, we’re going to create a function called formatURL:

// formatURL receives our base URL and the link that
// we want to format.
func formatURL(base, url string) string {
	base = strings.TrimSuffix(base, "/")

	// As you can see, we are applying the rules
	// we defined above to format our links.

	switch {
	case strings.HasPrefix(url, "/"):
		return base + url
	case strings.HasPrefix(url, "#"):
		return base + "/" + url
	case !strings.HasPrefix(url, "http"):
		return "https://" + url
	default:
		return url
	}
}
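
To get a feel for what formatURL does, these are the results we would expect (example.com is just a made-up base URL):

formatURL("https://example.com", "/")                  // "https://example.com/"
formatURL("https://example.com", "/about")             // "https://example.com/about"
formatURL("https://example.com", "#social")            // "https://example.com/#social"
formatURL("https://example.com", "blog.example.com")   // "https://blog.example.com"
formatURL("https://example.com", "https://golang.org") // "https://golang.org" (left as is)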

Now it’s time to check that each link works. For this, we’re going to make an HTTP GET request per link and print the received status code next to the link in question:

package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	if len(os.Args) <= 1 {
		log.Printf("not enough arguments provided")
		os.Exit(1)
	}

	url := os.Args[1]
	url = "https://" + url // We "parse" our URL.

	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		log.Printf("while creating a new request for %q: %v", url, err)
		os.Exit(1)
	}

	client := &http.Client{}
	resp, err := client.Do(req)
	if err != nil {
		log.Printf("while making request to %q: %v", url, err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		log.Printf("response status is not %v", http.StatusOK)
		os.Exit(1)
	}

	doc, err := html.Parse(resp.Body)
	if err != nil {
		log.Printf("while parsing the response body from %q: %v", url, err)
		os.Exit(1)
	}

	links := findLinks(nil, doc)

	for _, l := range links {
		fl := formatURL(url, l) // Our formatted link
		req, err := http.NewRequest(http.MethodGet, fl, nil)
		if err != nil {
			log.Printf("while creating a new request for %q: %v", fl, err)
			continue
		}

		resp, err := client.Do(req)
		if err != nil {
			log.Printf("while making request to %q: %v", fl, err)
			continue
		}
		resp.Body.Close()

		fmt.Println(resp.StatusCode, fl)
	}
}

You could extract all the code inside the for loop into a function called check, for example. I won’t do that here for the sake of brevity, but a possible sketch is shown below.
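
In case you’re curious, a minimal version of that hypothetical check function could look something like this; it simply packages up the body of the loop:

// check formats a single link, requests it and prints its status code
// next to it. It is just a refactoring of the body of the loop above.
func check(client *http.Client, base, link string) {
	fl := formatURL(base, link)

	req, err := http.NewRequest(http.MethodGet, fl, nil)
	if err != nil {
		log.Printf("while creating a new request for %q: %v", fl, err)
		return
	}

	resp, err := client.Do(req)
	if err != nil {
		log.Printf("while making request to %q: %v", fl, err)
		return
	}
	resp.Body.Close()

	fmt.Println(resp.StatusCode, fl)
}

With that in place, the loop in main would simply become:

	for _, l := range links {
		check(client, url, l)
	}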

Our final code

Our final code should look something like:

package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	if len(os.Args) <= 1 {
		log.Printf("not enough arguments provided")
		os.Exit(1)
	}

	url := os.Args[1]
	url = "https://" + url // We "parse" our URL.

	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		log.Printf("while creating a new request for %q: %v", url, err)
		os.Exit(1)
	}

	client := &http.Client{}
	resp, err := client.Do(req)
	if err != nil {
		log.Printf("while making request to %q: %v", url, err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		log.Printf("response status is not %v", http.StatusOK)
		os.Exit(1)
	}

	doc, err := html.Parse(resp.Body)
	if err != nil {
		log.Printf("while parsing the response body from %q: %v", url, err)
		os.Exit(1)
	}

	links := findLinks(nil, doc)

	for _, l := range links {
		fl := formatURL(url, l) // Our formatted link
		req, err := http.NewRequest(http.MethodGet, fl, nil)
		if err != nil {
			log.Printf("while creating a new request for %q: %v", fl, err)
			continue
		}

		resp, err := client.Do(req)
		if err != nil {
			log.Printf("while making request to %q: %v", fl, err)
			continue
		}
		// We close the body right away: using defer inside a loop would keep
		// every response body open until main returns.
		resp.Body.Close()

		fmt.Println(resp.StatusCode, fl)
	}
}

func findLinks(links []string, n *html.Node) []string {
	if n.Type == html.ElementNode && n.Data == "a" {
		for _, a := range n.Attr {
			if a.Key == "href" {
				links = append(links, a.Val)
			}
		}
	}

	for c := n.FirstChild; c != nil; c = c.NextSibling {
		links = findLinks(links, c)
	}

	return links
}

func formatURL(base, url string) string {
	base = strings.TrimSuffix(base, "/")

	switch {
	case strings.HasPrefix(url, "/"):
		return base + url
	case strings.HasPrefix(url, "#"):
		return base + "/" + url
	case !strings.HasPrefix(url, "http"):
		return "https://" + url
	default:
		return url
	}
}

Optional steps

The code in this post has many aspects that you can improve, for example:

  • Concurrency.
  • Tests.
  • Better error handling.

In future posts I’ll talk about these three points, but if you want to improve the code already, you can just get on with it. As a small taste, here’s a sketch of the concurrency idea.
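
A minimal (and deliberately naive) sketch of checking the links concurrently with goroutines and a sync.WaitGroup could look like this; it reuses the client, url, links and formatURL from the final code and would replace the for loop in main:

	var wg sync.WaitGroup

	for _, l := range links {
		wg.Add(1)

		go func(link string) {
			defer wg.Done()

			fl := formatURL(url, link)
			resp, err := client.Get(fl)
			if err != nil {
				log.Printf("while making request to %q: %v", fl, err)
				return
			}
			resp.Body.Close()

			fmt.Println(resp.StatusCode, fl)
		}(l)
	}

	// Wait until every link has been checked.
	wg.Wait()

You would also need to add sync to the imports. Keep in mind that firing off one goroutine per link can hammer a site with many simultaneous requests, so in practice you’d probably want to limit the amount of concurrency.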