Scrape the Web with Go


As part of a project I’ve been involved in, I released as open source a scraper in Go that I called gopherscraper. It extracts information from ecommerce sites, but you can extrapolate and extract information from any website with ‘items’, such as news, videos, and so on.

The project comes with a REST API, which you can check on GitHub for more detail, and it is based on CSS selectors. To store the items, I use Redis and ElasticSearch.

Let me share a few insights from the development of the library:

  • I’m using goquery to extract information based on CSS selectors.
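
For instance, here is a minimal standalone snippet in the spirit of what the library does; the URL and the .item/.title selectors are just placeholders, not part of gopherscraper:

package main

import (
    "fmt"
    "log"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    // fetch and parse the page
    doc, err := goquery.NewDocument("http://example.com/products")
    if err != nil {
        log.Fatal(err)
    }

    // select every 'item' with a CSS selector and extract a few fields
    doc.Find(".item").Each(func(i int, s *goquery.Selection) {
        title := s.Find(".title").Text()
        link, _ := s.Find("a").Attr("href")
        fmt.Printf("%d: %s -> %s\n", i, title, link)
    })
}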

  • The interface for scraping is really simple: just one function.

type ScrapperItems interface {
    Scrap(selector ScrapSelector) (string, chan ItemResult, error)
}
  • The interface to store the resulting items is also very simple. You can store the items in Redis, ElasticSearch, or as a file.
type StorageItems interface {
    StoreItem(it ItemResult)
}

// store items as a File
func (sto FileStorage) StoreItem(it ItemResult) {
    if it.Err != nil {
        return
    }
    WriteJsonToDisk(sto.baseDir, it.Item)
}
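
Wiring the two interfaces together is just a matter of draining the channel returned by Scrap. Here is a usage sketch; the scrapAndStore helper is illustrative, not part of the library:

func scrapAndStore(sc ScrapperItems, sto StorageItems, selector ScrapSelector) (string, error) {
    jobId, items, err := sc.Scrap(selector)
    if err != nil {
        return "", err
    }
    // the channel is closed by the scrapper once every page is done
    for it := range items {
        sto.StoreItem(it)
    }
    return jobId, nil
}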
  • It was easy to create a recursive scraper, using composition, based on the normal scraper; a rough sketch of the idea follows.
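
Roughly, the composition looks like this. The RecursiveScrapper type and its inner field are assumptions for illustration, not the library's actual implementation:

// A scrapper that wraps another ScrapperItems and reuses it per page.
type RecursiveScrapper struct {
    inner ScrapperItems // typically a DefaultScrapper
}

func (r RecursiveScrapper) Scrap(selector ScrapSelector) (string, chan ItemResult, error) {
    // Delegate the first level to the wrapped scrapper. A real recursive
    // implementation would then build new ScrapSelectors from the links found
    // in the scraped items, call r.inner.Scrap on each of them, and merge the
    // resulting channels into the one it returns.
    return r.inner.Scrap(selector)
}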

  • I started with no concurrency at all, and after making it work, I added goroutines and WaitGroups to synchronize the completion of the scraper.

func (d DefaultScrapper) Scrap(selector ScrapSelector) (string, chan ItemResult, error) {
    wg := &sync.WaitGroup{}
    err := validateSelector(selector)
    if err != nil {
        return "", nil, err
    }

    items := make(chan ItemResult, bufferItemsSize)

    jobId := "D" + GenerateStringKey(selector)
    pages := paginatedUrlSelector(selector)

    // one goroutine per paginated URL; the WaitGroup tracks their completion
    wg.Add(len(pages))
    for i := range pages {
        go doScrapFromUrl(jobId, pages[i], items, wg)
    }

    // close the items channel once every page has been scraped
    go closeItemsChannel(jobId, items, wg)

    return jobId, items, err
}

func doScrapFromUrl(jobId string, s ScrapSelector, items chan ItemResult, wg *sync.WaitGroup) {
    defer wg.Done()
    doc, err := fromUrl(s)
    if err != nil {
        log.Printf("ERROR [%s] Scraping %v with message %v", jobId, s.Url, err.Error())
        return
    }
    DocumentScrap(jobId, s, doc, items)
}

func closeItemsChannel(jobId string, items chan ItemResult, wg *sync.WaitGroup) {
    wg.Wait()
    close(items)
}
  • I limited the number of concurrent connections with a buffered channel used as a semaphore.
func fromUrl(selector ScrapSelector) (*goquery.Document, error) {
    lockLimitConnections()
    defer unlockLimitConnections()

    req, err := http.NewRequest("GET", selector.Url, nil)
    if err != nil {
        return nil, err
    }

    req.Header.Add("User-Agent", defaultUserAgent)

    res, err := httpClient().Do(req)
    if err != nil {
        return nil, err
    }
    return goquery.NewDocumentFromResponse(res)
}

func UseMaxConnections(max int) {
    semaphoreMaxConnections = make(chan struct{}, max)
}

func lockLimitConnections() {
    // acquire a slot; blocks when max connections are already in flight
    semaphoreMaxConnections <- struct{}{}
}

func unlockLimitConnections() {
    // release a slot
    <-semaphoreMaxConnections
}
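
Sends on the buffered channel block once all the slots are taken, so the concurrency cap is just the buffer size. For example, to allow at most four simultaneous requests you would call something like:

// cap the scraper at 4 simultaneous HTTP requests, before starting a scrap
UseMaxConnections(4)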

In the end, it was really fun doing what looks like a tedious job, and I get a clean JSON document even when there is no API available to use.


December 3, 2014