Distributed scraping

Distributed scraping can be implemented in different ways depending on the requirements of the scraping task. Most of the time it is enough to scale the network communication layer, which can be easily achieved using proxies and Colly’s proxy switchers.

Proxy switchers

With proxy switchers, scraping remains centralized while the HTTP requests are distributed among multiple proxies. Colly supports proxy switching via its SetProxyFunc() member. Any custom function with the signature func(*http.Request) (*url.URL, error) can be passed to SetProxyFunc().

SSH servers can be used as SOCKS5 proxies with the -D flag of the ssh client. For example, ssh -D 1337 user@host opens a local SOCKS proxy on port 1337.

Colly has a built-in proxy switcher which rotates a list of proxies on every request.

Usage

package main

import (
	"log"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/proxy"
)

func main() {
	c := colly.NewCollector()

	// Rotate the listed proxies on every request.
	p, err := proxy.RoundRobinProxySwitcher(
		"socks5://127.0.0.1:1337",
		"socks5://127.0.0.1:1338",
		"http://127.0.0.1:8080",
	)
	if err != nil {
		// Do not silently continue without proxies.
		log.Fatal(err)
	}
	c.SetProxyFunc(p)
	// ...
}

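Implementing a custom proxy switcher:

import (
	"math/rand"
	"net/http"
	"net/url"
)

// proxies lists the proxy endpoints to choose from.
// An explicit scheme makes the proxy type unambiguous.
var proxies = []*url.URL{
	{Scheme: "http", Host: "127.0.0.1:8080"},
	{Scheme: "http", Host: "127.0.0.1:8081"},
}

// randomProxySwitcher picks a random proxy for every request.
func randomProxySwitcher(_ *http.Request) (*url.URL, error) {
	return proxies[rand.Intn(len(proxies))], nil
}

// ...
c.SetProxyFunc(randomProxySwitcher)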
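
A custom switcher may be called from several goroutines when the collector runs asynchronously, so any state it keeps should be safe for concurrent use. The following is a minimal sketch of a round-robin switcher built only on the standard library. The roundRobin type and the newRoundRobin and GetProxy names are hypothetical and not part of Colly; this is an illustration of the technique, not Colly's built-in implementation.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/url"
	"sync/atomic"
)

// roundRobin is a hypothetical switcher type (not part of Colly).
type roundRobin struct {
	proxies []*url.URL
	next    uint32 // request counter, advanced atomically
}

// newRoundRobin parses the given proxy URLs once, up front.
func newRoundRobin(rawURLs ...string) (*roundRobin, error) {
	if len(rawURLs) == 0 {
		return nil, errors.New("at least one proxy URL is required")
	}
	proxies := make([]*url.URL, 0, len(rawURLs))
	for _, raw := range rawURLs {
		u, err := url.Parse(raw)
		if err != nil {
			return nil, fmt.Errorf("parsing proxy %q: %w", raw, err)
		}
		proxies = append(proxies, u)
	}
	return &roundRobin{proxies: proxies}, nil
}

// GetProxy matches func(*http.Request) (*url.URL, error) and returns
// the proxies in order, wrapping around at the end of the list.
func (r *roundRobin) GetProxy(_ *http.Request) (*url.URL, error) {
	i := atomic.AddUint32(&r.next, 1) - 1
	return r.proxies[i%uint32(len(r.proxies))], nil
}

func main() {
	rr, err := newRoundRobin("http://127.0.0.1:8080", "http://127.0.0.1:8081")
	if err != nil {
		fmt.Println(err)
		return
	}
	// Three consecutive calls cycle through the two proxies.
	for i := 0; i < 3; i++ {
		u, _ := rr.GetProxy(nil)
		fmt.Println(u.Host)
	}
}
```

Because the method value rr.GetProxy has the required signature, it can be passed directly: c.SetProxyFunc(rr.GetProxy).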

Distributed scrapers

To manage independent and distributed scrapers, the best approach is to wrap the scraper in a server. The server can be any kind of service, such as an HTTP server, a TCP server, or Google App Engine. Use custom storage to achieve centralized and persistent cookie and visited URL handling.

Colly has built-in Google App Engine support. Don't forget to call Collector.Appengine(*http.Request) if you use Colly from the App Engine standard environment.

An example implementation can be found here.

Distributed storage

Visited URL and cookie data are stored in memory by default. This is handy for short-lived scraper jobs, but it can be a serious limitation when dealing with large-scale or long-running crawling jobs.

Colly can replace the default in-memory storage with any storage backend that implements the colly/storage.Storage interface. Check out existing storages.