As part of a project I’ve been involved in, I released as open source a scraper written in Go that I called gopherscraper. It extracts information from ecommerce sites, but you can extrapolate and extract information from any website with ‘items’, such as news, videos, and so on.
The project comes with a REST API, which you can check out on GitHub for more detail, and is based on CSS selectors. To store the items, I use Redis and ElasticSearch.
Let me share a few insights from the development of the library:
I’m using goquery to extract information based on CSS selectors.
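For context, here is a minimal goquery sketch (not gopherscraper code; the URL and selector are made up) that prints one field per matching node:

package main

import (
    "fmt"
    "log"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    doc, err := goquery.NewDocument("http://example.com/catalog")
    if err != nil {
        log.Fatal(err)
    }
    // every node matching the CSS selector becomes one extracted field
    doc.Find(".product .title").Each(func(i int, s *goquery.Selection) {
        fmt.Println(strings.TrimSpace(s.Text()))
    })
}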
The interface for scraping is really simple: just one function.
type ScrapperItems interface {
    Scrap(selector ScrapSelector) (string, chan ItemResult, error)
}

type StorageItems interface {
    StoreItem(it ItemResult)
}

// store items as files on disk
func (sto FileStorage) StoreItem(it ItemResult) {
    if it.Err != nil {
        return
    }
    WriteJsonToDisk(sto.baseDir, it.Item)
}
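Any type with a StoreItem method can be plugged in. As an illustration (this type is not part of the library), an in-memory implementation that collects successful results, handy for tests:

// MemoryStorage is an illustrative StorageItems implementation that
// keeps scraped results in a slice instead of writing them to disk.
type MemoryStorage struct {
    Items []ItemResult
}

func (m *MemoryStorage) StoreItem(it ItemResult) {
    if it.Err != nil {
        return // skip failed results, just like FileStorage does
    }
    m.Items = append(m.Items, it)
}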
It was easy to create a recursive scraper, using composition, based on the normal scraper.
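As a rough sketch of what that composition could look like (this is not the library’s actual implementation; it assumes the item payload exposes a Link field pointing at a detail page):

// RecursiveScrapper embeds the normal scraper and, for every item it
// finds, follows the item's link and scraps that page as well.
type RecursiveScrapper struct {
    DefaultScrapper
}

func (r RecursiveScrapper) Scrap(selector ScrapSelector) (string, chan ItemResult, error) {
    jobId, items, err := r.DefaultScrapper.Scrap(selector)
    if err != nil {
        return jobId, items, err
    }
    out := make(chan ItemResult, bufferItemsSize)
    go func() {
        defer close(out)
        for it := range items {
            out <- it
            // hypothetical: follow the item's link with the embedded
            // scraper and forward the nested items too
            if it.Err != nil || it.Item.Link == "" {
                continue
            }
            detail := selector
            detail.Url = it.Item.Link
            if _, nested, err := r.DefaultScrapper.Scrap(detail); err == nil {
                for n := range nested {
                    out <- n
                }
            }
        }
    }()
    return jobId, out, nil
}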
I started with no concurrent code at all, and after making it work, I added goroutines and WaitGroups to synchronize the completion of the scraper.
func (d DefaultScrapper) Scrap(selector ScrapSelector) (string, chan ItemResult, error) {
    wg := &sync.WaitGroup{}
    err := validateSelector(selector)
    if err != nil {
        return "", nil, err
    }

    items := make(chan ItemResult, bufferItemsSize)
    jobId := "D" + GenerateStringKey(selector)
    pages := paginatedUrlSelector(selector)

    // one goroutine per paginated URL
    wg.Add(len(pages))
    for i := range pages {
        go doScrapFromUrl(jobId, pages[i], items, wg)
    }
    // close the items channel once every page has been scraped
    go closeItemsChannel(jobId, items, wg)
    return jobId, items, err
}
func doScrapFromUrl(jobId string, s ScrapSelector, items chan ItemResult, wg *sync.WaitGroup) {
    defer wg.Done()
    doc, err := fromUrl(s)
    if err != nil {
        log.Printf("ERROR [%s] Scrapping %v with message %v", jobId, s.Url, err.Error())
        return
    }
    DocumentScrap(jobId, s, doc, items)
}
func closeItemsChannel(jobId string, items chan ItemResult, wg *sync.WaitGroup) {
    wg.Wait()
    close(items)
}
func fromUrl(selector ScrapSelector) (*goquery.Document, error) {
    // respect the global limit on simultaneous connections
    lockLimitConnections()
    defer unlockLimitConnections()

    req, err := http.NewRequest("GET", selector.Url, nil)
    if err != nil {
        return nil, err
    }
    req.Header.Add("User-Agent", defaultUserAgent)

    res, err := httpClient().Do(req)
    if err != nil {
        return nil, err
    }
    return goquery.NewDocumentFromResponse(res)
}
func UseMaxConnections(max int) {
    semaphoreMaxConnections = make(chan struct{}, max)
}

// acquire a slot; blocks when max connections are already in flight
func lockLimitConnections() {
    semaphoreMaxConnections <- struct{}{}
}

// release the slot so another fetch can start
func unlockLimitConnections() {
    <-semaphoreMaxConnections
}
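These three functions are the classic buffered-channel counting semaphore: each fetch takes a slot before the HTTP request and gives it back afterwards, so at most max requests are in flight at any time. The limit is set up front; the value below is only an example:

// allow at most 8 simultaneous HTTP connections across all scrap jobs
UseMaxConnections(8)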
In the end it was really fun, doing what looks like a tedious job. And I get clean JSON documents even when there is no API available to use.