Colly goes 1.0

We are happy to announce that the first major release of Colly is here. Our goal was to create a scraping framework to speed up development and let its users concentrate on collecting relevant data. There is no need to reinvent the wheel when writing a new collector. Scrapers built on top of Colly support different storage backends, dynamic configuration and running requests in parallel out of the box. It is also possible to run your scrapers in a distributed manner.

Facts about the development

Development started in September 2017 and has not stopped since. Colly has attracted numerous developers who helped by providing valuable feedback and contributing new features. Let's see the numbers. In the last seven months 30 contributors have created 338 commits. Users have opened 78 issues; 74 of those were resolved within a few days. Contributors have opened 59 pull requests, and all but one of them have been either merged or closed. We would like to thank all of our supporters who contributed code, wrote blog posts about Colly, or helped the development in other ways. We would not be here without you.

Milestones

  • 2017-09-29: First commit
  • 2017-10-05: First external contribution
  • 2017-10-08: Caching implemented
  • 2017-11-10: Declarative HTML parsing
  • 2017-11-11: Repository moved to gocolly organization
  • 2017-11-14: Debugger interface added
  • 2018-01-13: XPath and XML support
  • 2018-01-20: Async crawling
  • 2018-02-09: Storage interface implemented
  • 2018-03-03: Performance optimizations (+25% speed)
  • 2018-03-12: Extensions added
  • 2018-03-13: Twitter account created
  • 2018-04-13: Request queue interface
  • 2018-05-13: Version 1.0.0

Release v1.0.0

You might ask why it is being released now. Our experience with various production deployments shows that Colly provides a stable and robust platform for developing and running scrapers, both locally and in multi-server configurations. The feature set is complete and ready to support even complex use cases. What are those features? They are listed below; short code sketches for each of them follow after the list.

  • Rate limiting When scraping, controlling the number of requests sent to the target site can be crucial. We do not want to disrupt the service by overloading it with too many requests: that is bad for the operators of the site and for us as well, because the data we would like to collect becomes inaccessible. Thus, the number of requests must be limited. The collector provided by Colly can be configured to send only a limited number of requests in parallel.

  • Request caching To relieve the load on external services and decrease the number of outgoing requests, response caching is supported.

  • Configurable via environment variables To eliminate rebuilding your scraper during fine-tuning, Colly can read configuration options from environment variables, so you can modify its settings without a Go development environment.

  • Proxies/proxy switchers If the address of the scraper has to be hidden, proxies can be used to make the requests instead of the machine running the scraping job. Furthermore, to scale Colly without running multiple scraper instances, proxy switchers can be used: collectors support proxy switchers that distribute requests among multiple servers. Processing the collected pages is still done on the machine running the scrapers, but the network traffic is moved to different hosts.

  • Storage backend and storage interface During scraping, various pieces of data need to be stored and sometimes shared. To access these objects Colly provides a storage interface; you can create your own storage and use it in your scraper by implementing the required interface. By default Colly keeps everything in memory. Additional backend implementations are available for Redis and SQLite3.

  • Request queue Scraping pages asynchronously, in parallel, is a must-have feature. Colly maintains a request queue where URLs found during scraping are collected. Worker threads of your collector take these URLs and create the requests.

  • Goodies The package named extensions provides multiple helpers for collectors. These are common functions implemented in advance, so you don't have to bloat your scraper code with general-purpose implementations. An example extension is RandomUserAgent, which generates a random User-Agent for every request. You can find the full list of goodies at https://godoc.org/github.com/gocolly/colly/extensions

  • Debuggers Debugging can be painful. Colly tries to ease the pain by providing debuggers to inspect your scraper. You can simply write debug messages to the console using LogDebugger. If you prefer web interfaces, we've got you covered: Colly comes with a web debugger, which you can use by initializing a WebDebugger. See how debuggers can be used here: https://godoc.org/github.com/gocolly/colly/debug
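
To make the rate limiting item above concrete, here is a minimal sketch of a limited, asynchronous collector; the domain glob, parallelism and delay values are placeholders, not recommendations.

package main

import (
	"time"

	"github.com/gocolly/colly"
)

func main() {
	// Async collector, so the parallelism limit is actually exercised.
	c := colly.NewCollector(colly.Async(true))

	// Allow at most 2 parallel requests per matching domain and insert a
	// random delay of up to 5 seconds between them.
	c.Limit(&colly.LimitRule{
		DomainGlob:  "*",
		Parallelism: 2,
		RandomDelay: 5 * time.Second,
	})

	c.Visit("http://example.com/")
	c.Wait()
}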
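
Response caching is a single collector option; ./colly_cache below is an arbitrary example path.

package main

import "github.com/gocolly/colly"

func main() {
	// Responses are written to ./colly_cache; repeated requests for the
	// same URL are then served from disk instead of the network.
	c := colly.NewCollector(
		colly.CacheDir("./colly_cache"),
	)
	c.Visit("http://example.com/")
}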
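
A rough sketch of the environment-based configuration: the COLLY_USER_AGENT and COLLY_MAX_DEPTH names mirror the collector fields of the same name, but please check the documentation for the exact variable list before relying on them.

package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	// The variables are normally set outside the binary, e.g.
	//   COLLY_USER_AGENT="my-bot/1.0" COLLY_MAX_DEPTH=2 ./scraper
	// NewCollector picks up COLLY_* settings, so no rebuild is needed.
	c := colly.NewCollector()
	fmt.Println(c.UserAgent, c.MaxDepth)
}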
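
The proxy switcher comes from the proxy subpackage; the two SOCKS5 addresses below are placeholders.

package main

import (
	"log"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/proxy"
)

func main() {
	c := colly.NewCollector()

	// Rotate outgoing requests between two proxies in round-robin fashion;
	// parsing still happens locally, only the traffic is moved.
	rp, err := proxy.RoundRobinProxySwitcher(
		"socks5://127.0.0.1:1337",
		"socks5://127.0.0.1:1338",
	)
	if err != nil {
		log.Fatal(err)
	}
	c.SetProxyFunc(rp)

	c.Visit("http://example.com/")
}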
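
Swapping the default in-memory storage for the Redis backend looks roughly like this; the field names follow the gocolly/redisstorage package, and the connection details are example values.

package main

import (
	"log"

	"github.com/gocolly/colly"
	"github.com/gocolly/redisstorage"
)

func main() {
	c := colly.NewCollector()

	// Visited URLs and cookies are kept in Redis, so several collector
	// instances can share state.
	storage := &redisstorage.Storage{
		Address:  "127.0.0.1:6379",
		Password: "",
		DB:       0,
		Prefix:   "colly",
	}
	if err := c.SetStorage(storage); err != nil {
		log.Fatal(err)
	}

	c.Visit("http://example.com/")
}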
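
The request queue lives in the queue subpackage; here is a sketch with two consumer threads and the built-in in-memory queue storage.

package main

import (
	"log"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/queue"
)

func main() {
	// Two consumer threads, in-memory storage for up to 10000 queued URLs.
	q, err := queue.New(2, &queue.InMemoryQueueStorage{MaxSize: 10000})
	if err != nil {
		log.Fatal(err)
	}

	q.AddURL("http://example.com/")
	q.AddURL("http://example.com/about")

	c := colly.NewCollector()
	// Run blocks until the workers have drained the queue.
	q.Run(c)
}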
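
Goodies are applied by passing the collector to an extension; RandomUserAgent is the example named above, and Referer is another helper from the same package.

package main

import (
	"github.com/gocolly/colly"
	"github.com/gocolly/colly/extensions"
)

func main() {
	c := colly.NewCollector()

	// Send a different, randomly chosen User-Agent with every request
	// and fill the Referer header automatically.
	extensions.RandomUserAgent(c)
	extensions.Referer(c)

	c.Visit("http://example.com/")
}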
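
Debuggers are attached as a collector option; LogDebugger writes events to the console, while WebDebugger serves them over HTTP.

package main

import (
	"github.com/gocolly/colly"
	"github.com/gocolly/colly/debug"
)

func main() {
	// Log every collector event (requests, responses, errors) to the console.
	// For the web interface, use colly.Debugger(&debug.WebDebugger{}) instead.
	c := colly.NewCollector(
		colly.Debugger(&debug.LogDebugger{}),
	)

	c.Visit("http://example.com/")
}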

We, the team behind Colly, believe that it has become a stable and mature scraping framework capable of supporting complex use cases. We are hoping for an even more productive future. Last but not least, thank you for your support and contributions.

2018.05.13
@kvch