25 August 2016

My first encounter with web scraping/crawling was with R: I read ‘Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining’ by Simon Munzert et al. It is a really well-written, thorough book, and I would definitely recommend having a read.

However, I have to admit the potential for scraping the web with Python is far greater than with R. To do anything sophisticated, one shouldn’t burden oneself with R.

About a month or two ago I had the idea of extracting the publication records of the staff in the department and then performing some data analysis on them. Recently I have slowly started writing the code, a bit at a time.

The crawler technology required to achieve this is rather simple, so it is a good way to familiarise myself with the Pythonic way of crawling. More importantly, I would like to take this course on edX about node.js that uses MongoDB as its back end, and the data I extract from this experiment will be a good way for me to get used to such a NoSQL database.
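
As a small preview of what that will look like, the stored records can later be queried with pymongo along these lines (ICchem and literatures are the database and collection names used further down in this post; the researcher name is of course just a placeholder):

from pymongo import MongoClient

client = MongoClient()
lit = client.ICchem.literatures

# total number of publication records stored so far
print lit.count()

# every record scraped from one researcher's page ('from' is set in registerPaper below)
for paper in lit.find({'from' : 'Some Researcher'}):
    print paper['title']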

The procedure is quite straightforward:

  1. All the staff in the IC Chemistry department are listed on this page, and researchers who have publications have a URL linking to their homepage. So I need to extract the relevant HTML element that connects me to each researcher’s homepage (see the quick check after this list).

  2. The relevant researchers have a publications tab on their homepage. Once I have navigated to it, I can iterate through the pages, extract the required information for each publication record, and store it.
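
Before writing the full scraper it is worth checking the structure of the staff page interactively; a minimal check along these lines confirms the elements to target (the class names people list and name-link are the ones the page used at the time of writing and may of course change):

import requests
from bs4 import BeautifulSoup

url = "https://www.imperial.ac.uk/chemistry/about/contacts/all-staff/"
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# staff are listed inside a <ul class="people list"> element;
# each entry has an <a class="name-link"> carrying the relative link to the homepage
people = soup.find('ul', attrs={'class' : 'people list'})
print len(people.find_all('a', attrs={'class' : 'name-link'}))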

The libraries required are listed below. Obviously MongoDB needs to be installed separately as well.

import requests
from bs4 import BeautifulSoup
import codecs
import csv
import re
from pymongo import MongoClient

The following corresponds to step 1.

# fetch the raw html of a url; return None if the request fails
def getHTML(url):
    r = requests.get(url)
    if r.ok:
        return r.content
    else: return None

# extract the name, job title and homepage link of every staff member on the page
def getStaffs(page):
    div_people_list = page.find('ul', attrs={'class' : 'people list'})
    a_s = div_people_list.find_all('a', attrs={'class' : "name-link"})

    staffList = []
    for a in a_s:
        name = a.find('span', attrs={'class' : 'person-name'}).get_text()
        title = a.find('span', attrs={'class' : 'job-title'}).get_text()
        link = a['href']
        staffList.append({
            'name' : name,
            'title' : title,
            'link' : link,
        })
    return staffList

Using the code below, I stored all the members that have a homepage. I kept the names of the staff, their titles and the (relative) links to their homepages in a csv flat file. Not all members on the list will have a publications tab.

# soup is the parsed staff page (see main() below)
staffList = getStaffs(soup)

# title bar for the csv file, followed by one row per staff member
outcome = [["name", "title", "link"]]
for staff in staffList:
    outcome.append([ staff['name'], staff['title'], staff['link'] ])
with codecs.open("./staffList.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerows(outcome)

Step 2 is a bit more involved. I need functions to:

  • turn page -> nextPage()
  • get all the publication records on that page -> getPapers()
  • extract the relevant information in the correct fields for a given publication html div element, then store it in a MongoDB collection -> registerPaper()

The last function is tricky, since these publication records do not adhere to a strict schema. The way I am doing it here is not perfect.

def nextPage(page):
    tmp = page.find('a', attrs = {'class' : 'icon next'})
    if tmp:
        return tmp['href']
    else:
        return False

# get the list of paper on a page
def getPapers(page):
    papers = page.find_all('div', attrs = {'class' : 'publication'})
    return papers

# map key value pairs of a paper given the div object
# fromWho shows which author's page the paper is from
def registerPaper(paper, fromWho):
    string = paper.p.get_text()

    # remove spaces and split the long string into different fields (e.g. author, year, title etc.)
    strings = string.strip().split(',')
    fields = [s.strip() for s in strings]

    # locate the year field; it anchors the rest of the parsing
    try:
        year_index = [i for i, item in enumerate(fields) if re.search('^[12][0-9]{3}$', item)][0]
    except IndexError:
        # no year field found: fall back to storing authors and title only
        paperObj = {
            'from' : fromWho,
            'author(s)' : [author.upper() for author in fields[ : -1]],
            'title' : fields[-1],
            'publication-type' : paper.find('span', attrs = {'class' : 'publication-type'}).get_text(),
        }
        return paperObj

    # remove potential et al. suffix after the last author
    fields[year_index - 1] = fields[year_index - 1].replace('et al.', '')

    num_citations = paper.find('span', attrs = {'class' : 'publication-citations'})
    if num_citations:
        num_citations = int(num_citations.get_text().split(':')[1])
    else:
        num_citations = 0

    paperObj = {
        'from' : fromWho,
        'author(s)' : [author.upper() for author in fields[ : year_index]],
        'year' : fields[year_index], # used as a category so it doesn't need to be type integer
        'publication-type' : paper.find('span', attrs = {'class' : 'publication-type'}).get_text(),
        'num_citations' : num_citations,
    }
    try:
        # the journal name usually sits in an <em> tag; use it to locate the end of the title
        journal = paper.em.get_text().split(',')[0]
        journal_index = fields.index(journal)
        paperObj['title'] = ','.join(fields[year_index + 1 : journal_index])
        paperObj['journal'] = journal
        paperObj['others'] = fields[journal_index + 1 : ]
    except (IndexError, AttributeError, ValueError):
        # no <em> tag, or the journal string doesn't match a field: keep just the field after the year
        paperObj['title'] = fields[year_index + 1]

    return paperObj
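
To make the splitting logic a bit more concrete, here is a toy example with a made-up record string whose title contains no commas (real records vary in format, which is exactly why the try/except blocks above are needed):

import re

# hypothetical record text, as extracted from the <p> tag of a publication div
sample = "Doe J, Smith A, 2015, An example paper title, Journal of Examples, Vol: 12, Pages: 1-10"
fields = [s.strip() for s in sample.split(',')]

year_index = [i for i, item in enumerate(fields) if re.search('^[12][0-9]{3}$', item)][0]
print fields[ : year_index]     # ['Doe J', 'Smith A'] -- the authors
print fields[year_index]        # '2015'
print fields[year_index + 1]    # 'An example paper title'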

To thread everything together:

def main():
    # Database prep
    client = MongoClient()
    db = client.ICchem
    db.literatures.drop()
    lit = db.literatures

    # Get all staff that have their own page
    HOST = "https://www.imperial.ac.uk"
    staffPage = HOST + "/chemistry/about/contacts/all-staff/"
    html = getHTML(staffPage)
    soup = BeautifulSoup(html,'html.parser')
    staffList = getStaffs(soup)

    # Obtain papers for each staff
    for staff in staffList:
        print staff['name']
        start = HOST + staff['link'] + '/publications.html'
        url = start
        while True:
            try:
                html = getHTML(url)
            except requests.ConnectionError:
                break
            if not html:    # some staff members have no publications page at all
                break
            soup = BeautifulSoup(html, 'html.parser')
            for i in getPapers(soup):
                lit.insert_one(registerPaper(i, staff['name']))

            url = nextPage(soup)
            if not url : break
            url = start + url
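
And finally, the usual entry point so the script can be run directly:

if __name__ == '__main__':
    main()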

The complete repo for this project is here.

In the next post I switch to R for data cleaning and for looking at some interesting statistics. I chose R for a few reasons:

  • I like RStudio; it is very interactive
  • I like R for exploratory data analysis purposes
  • I like ggplot2; it is way more beautiful than matplotlib
  • Good practice with Rmarkdown

