Having crawled the data in the previous post, I can extract different portions of it from my database and cleanse it to generate some plots and summary statistics.

First, load packages and establish connection to MongoDB (here my db is located on the default port on my localhost).

library(rmongodb)
library(stringr)
library(ggplot2)
library(knitr)
library(gridExtra)
library(reshape)
library(igraph)

mongo <- mongo.create()
if(mongo.is.connected(mongo) == TRUE) {
    db <- mongo.get.databases(mongo)
    coll <- mongo.get.database.collections(mongo, db)
}

Staffs

The records are obtained from 120 staffs. We can look at the relative proportion of Doctors and Professors in the department.

df <- data.frame(names = mongo.distinct(mongo, coll, 'from'))
df$isProf <- str_detect(toupper(df$names), "PROFESSOR")
ggplot(df, aes(x = factor(1), fill = isProf) ) + geom_bar()  + coord_polar(theta = "y")+ theme(axis.title.x=element_blank(),        
        axis.text.x=element_blank(),
        axis.title.y=element_blank(),
        axis.text.y=element_blank())

plot of chunk unnamed-chunk-2

Publication

There are in total 9417 entries. The entries are of the following types:

## [1] "JOURNAL ARTICLE"  "CONFERENCE PAPER" "BOOK CHAPTER"    
## [4] "PATENT"           "REPORT"           "SOFTWARE"        
## [7] "OTHER"            "BOOK"             "POSTER"

We can take a look at how many entries of each publication type are there, note the y-axis is on a log scale:

df <- mongo.find.all(mongo, coll, fields = list('publication-type' = 1, '_id' = 0), data.frame = T)
ggplot(df, aes(x = publication.type) ) + geom_bar() +  geom_text(aes(label = ..count..), stat = "count", vjust = -1)+ theme(axis.text.x = element_text(angle = 90, hjust = 1)) + scale_y_log10(breaks = c(1,10,100,1000,10000))

plot of chunk unnamed-chunk-4

It is most interesting to look at only journal & conference publications, which are the vast majority of entries anyways. There are 8953 such entries. Bear in mind there are duplicate entries, where IC chemistry staffs collaborated to produce a piece of work.

main <- mongo.find.all(mongo, coll, query=list("$or" = list(list("publication-type" = "JOURNAL ARTICLE"), list("publication-type" = "CONFERENCE PAPER"))), fields = list('publication-type' = 1, 'title' = 1, 'num_citations' = 1, 'journal' = 1, 'year' = 1, 'from' = 1, '_id' = 0))


#  Only keeps data entries that have all the header fields defined.
list2df <- function(list, headers){
    output = list()
    for(i in 1:length(list)){
        record = T
        for(j in headers){
            if(is.null(list[[i]][[j]])){
                record = F
                break
            }
        }
        if(record == T){
            for(j in headers)
                output[[j]] <-  c(output[[j]], list[[i]][[j]])
        }
    }
    return(as.data.frame(output))
}

df <- list2df(main, c('journal', 'title'))

The 8116 journal submissions are delivered to 1102 different journals. A coarse look at the number of submissions to each of the journals. Have a guess what is the most popular journal(the tallest bar). Bear in mind multiple academics could be have drafted the same paper, this needs to be accounted for. I have listed the top 10 most and (some of the) least popular journals.

df <- df[!duplicated(df$title), ]
ggplot(df, aes(x = journal) ) + geom_bar() + theme(axis.text.x=element_blank(),axis.ticks=element_blank())

plot of chunk unnamed-chunk-7

q <- aggregate(df$journal, by=list(df$journal), length)
colnames(q) <- c("Journal.Names", "Submission")
q <- q[order(q$Submission, decreasing = T),]
kable(tail(q, 10), row.names = F)

Journal.Names	Submission
VIBRATIONAL SPECTROSCOPY	1
WATER ENVIRONMENT RESEARCH	1
WATER SCIENCE AND TECHNOLOGY	1
WEAR	1
WILEY INTERDISCIPLINARY REVIEWS-COMPUTATIONAL MOLECULAR SCIENCE	1
WILEY INTERDISCIPLINARY REVIEWS-ENERGY AND ENVIRONMENT	1
ZEITSCHRIFT FUR ANORGANISCHE UND ALLGEMEINE CHEMIE	1
ZEITSCHRIFT FUR PHYSIK B-CONDENSED MATTER	1
Zeitschrift für Physikalische Chemie	1
ZHURNAL EKSPERIMENTALNOI I TEORETICHESKOI FIZIKI	1

kable(head(q, 10), row.names = F)

Journal.Names	Submission
CHEMICAL COMMUNICATIONS	249
JOURNAL OF THE AMERICAN CHEMICAL SOCIETY	231
JOURNAL OF THE CHEMICAL SOCIETY-CHEMICAL COMMUNICATIONS	167
JOURNAL OF ORGANIC CHEMISTRY	165
ABSTRACTS OF PAPERS OF THE AMERICAN CHEMICAL SOCIETY	135
TETRAHEDRON LETTERS	135
JOURNAL OF CHEMICAL PHYSICS	123
PHYSICAL CHEMISTRY CHEMICAL PHYSICS	117
JOURNAL OF THE CHEMICAL SOCIETY-PERKIN TRANSACTIONS 1	112
JOURNAL OF THE CHEMICAL SOCIETY-DALTON TRANSACTIONS	106

Let’s look at the number of submissions to the top 20 most popular journals:

q$Journal.Names <- factor(q$Journal.Names, levels = q$Journal.Names)
ggplot(q[1:20, ], aes(x = Journal.Names, y = Submission)) + geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 90, hjust = 1))

plot of chunk unnamed-chunk-8

There does not seem to be a significant bias towards any of the three areas of chemistry.

Citations

df <- list2df(main, c('title', 'num_citations'))
df$title <- str_trim(df$title)
df <- df[!duplicated(df$title), ]
df <- df[order(df$num_citations, decreasing = T), ]
ggplot(df, aes(x = num_citations)) + geom_bar() + xlim(0, 250) + ylim(0,300)

plot of chunk unnamed-chunk-9

There are some ‘citation beasts’ - outliers not included in the above bar chart:

title	num_citations
A 2ND GENERATION FORCE-FIELD FOR THE SIMULATION OF PROTEINS,NUCLEIC-ACIDS,AND ORGANIC-MOLECULES	7810
The path forward for biofuels and biomaterials	2373
A strong regioregularity effect in self-organizing conjugated polymer films and high-efficiency polythiophene: fullerene solar cells	1647
The search for new-generation olefin polymerization catalysts: Life beyond metallocenes	1608
N-Heterocyclic Carbenes in Late Transition Metal Catalysis	1438
Liquid-crystalline semiconducting polymers with high charge-carrier mobility	1271
TETRAPROPYLAMMONIUM PERRUTHENATE,PR4N+RUO4-,TPAP - A CATALYTIC OXIDANT FOR ORGANIC-SYNTHESIS	1243
FERROCENE-MEDIATED ENZYME ELECTRODE FOR AMPEROMETRIC DETERMINATION OF GLUCOSE	1224
ORGANOMETALLIC COMPOUNDS FOR NONLINEAR OPTICS - THE SEARCH FOR EN-LIGHT-ENMENT	1221
Synthesis and function of 3-phosphorylated inositol lipids	1101
N-heterocyclic carbenes as organocatalysts	966
Development of a dispersion process for carbon nanotubes in an epoxy matrix and the resulting electrical properties	944
Ultra-low electrical percolation threshold in carbon-nanotube-epoxy composites	904
Charge Photogeneration in Organic Solar Cells	879
An improved experimental determination of external photoluminescence quantum efficiency	856
Fabrication and characterization of carbon nanotube/poly(vinyl alcohol) composites	856
STRUCTURE OF THE INVERTED HEXAGONAL (HII) PHASE,AND NON-LAMELLAR PHASE-TRANSITIONS OF LIPIDS	837
PHASE-EQUILIBRIA BY SIMULATION IN THE GIBBS ENSEMBLE - ALTERNATIVE DERIVATION,GENERALIZATION AND APPLICATION TO MIXTURE AND MEMBRANE EQUILIBRIA	825
Materials and Applications for Large Area Electronics: Solution-Based Approaches	814
Novel olefin polymerization catalysts based on iron and cobalt	809

Actually Prof Tom Welton’s ionic liquid review has a higher citation number than the force-field paper. However, he was no longer on the chemistry departmental staff list at the time when I crawled it.

Prof vs Doc

I would expect there is some correlation between the year of publication and the number of citations (the highest cited force-field paper removed):

df <- list2df(main, c('title', 'year', 'num_citations', 'from'))
ggplot(df, aes(year, num_citations)) + geom_point() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + ylim(0, 2500)

plot of chunk unnamed-chunk-11

Nope, doesn’t seem like it. But maybe professors have in a general a higher citation rate than docs?

df$isProf <- str_detect(toupper(df$from), "PROFESSOR")
ggplot(data = df, mapping = aes(x = num_citations)) + facet_grid(isProf~., scale = 'free') + geom_histogram(data = subset(df, isProf == F), binwidth = 1, fill = "red", alpha = 0.5) + geom_histogram(data = subset(df, isProf == T), binwidth = 1 , fill = "blue", alpha = 0.5) + xlim(0, 250) + ylim(0, 300)

plot of chunk unnamed-chunk-12

A clear indication of difference would be a shift of the bottom (professor num_citations) distribution to the right w.r.t the top distribution, which is not seen. However, there is a more visible tail for the bottom distribution than the top. This indicates professors tend to have more higher cited papers.

In addition, the bottom distribution is taller than the top one, mind you this is despite the fact that only about one third of the department are professors. This means an indicator of professors is that they tend to have published more than doctors. A clearer way to demonstrate this is to compare the distributions of the number of publications for professors and doctors.

q <- aggregate(df$from, by = list(df$from), length)
colnames(q) <- c('name', 'num_publications')
q$isProf <- str_detect(toupper(q$name), "PROFESSOR")
ggplot(data = q, mapping = aes(x = num_publications)) + facet_grid(isProf~., scale = 'free') + geom_histogram(data = subset(q, isProf == F), fill = "red", alpha = 0.5) + geom_histogram(data = subset(q, isProf == T), fill = "blue", alpha = 0.5) + ylab("Number of Profs / Docs")

plot of chunk unnamed-chunk-13

Internal Collaborations

Is there a lot of internal collaborations between staffs? Who collaborates with whom within IC chem? One way to look at that is to tally which staffs have the same publications.

df <- list2df(main, c('from', 'title'))
df$title <- iconv(df$title,"WINDOWS-1252","UTF-8")
df$title <- toupper(df$title)
q <- unique(df$title)
q <-sapply(q, function(i){
    length(unique(df$from[df$title == i]))
})
p <- names(q[q > 1])
s <- unique(df$from[df$title %in% p])

Actually, 1162 pieces of work were collaborative efforts. That is quite a large proportion. 100 staffs were involved in internal collaborations, almost the whole department.

A heatmap can show which researchers work together. Each row and column with a index represents a staff, where the staffs are sorted alphabetically by name. The table below the heatmap shows the index and name mapping. The heatmap is symmetrical w.r.t the diagonal. In the heatmap, the lighter the colour, the more collaboration between a pair of staffs, also the higher the number in the cell.

heatmap <- function(matrix, xlabel = NULL, ylabel = NULL, fill = NULL, color= NULL, text_size = 7){
    matrix <- as.data.frame(matrix)
    colors <- c("steelblue", "seagreen", "darkorchid", "maroon")
    melt <- melt(matrix)
    if(ncol(melt) == 2){
        melt$variable1 <- melt$variable
        melt$variable2 <- rep(rownames(matrix), nrow(matrix))
        melt$variable1 <- factor(melt$variable1, levels = (unique(as.character(rownames(matrix)))))
        melt$variable2 <- factor(melt$variable2, levels = rev(unique(as.character(rownames(matrix)))))
    }
    p <- ggplot(melt,aes(x = variable1, y = variable2)) + geom_tile(aes(fill = value), colour = "white") + scale_fill_gradient(high = "white", low = "black")
    p <- p + geom_text(aes(fill = value, label = round(value,2)), size = text_size) + labs(x = xlabel, y = ylabel, fill = fill) + theme(axis.text.x = element_text(angle = 90)) + theme(legend.position = "none")
}

staffs <- as.character(sort(s))
matrix <- matrix(0, nrow = length(staffs), ncol = length(staffs))
colnames(matrix) <- rownames(matrix) <- staffs
collaborations <- lapply(p,function(i) df$from[(df$title == i)] )
for(i in collaborations){
    for(j in 1:(length(i) - 1)){
        for(k in (j + 1):length(i)){
            r <- which(i[j] == staffs)
            c <- which(i[k] == staffs)
            if(r != c)
                matrix[r, c] <- matrix[c, r] <- 1 + matrix[r, c]
        }
    }
}    
#diag(matrix) <- 0
diagnol <- data.frame(x1 = 100, y1 = 1, x2 = 1, y2 = 100)
colnames(matrix) <- rownames(matrix) <- 1:length(staffs)
p <- heatmap(matrix, text_size = 3)
p + geom_segment(data = diagnol, aes(x = x1, y = y1, xend = x2, yend = y2, colour = "white"))

plot of chunk unnamed-chunk-15

kable(data.frame(Name = staffs, Index = (1:length(staffs))), row.names = F)

Name	Index
Dr Abby Casey	1
Dr Adam R J Clancy	2
Dr Agostino Cilibrizzi	3
Dr Aleksandar Ivanov	4
Dr Alexander J W Thom	5
Dr Alexandra Simperler FHEA	6
Dr Andrew E Ashley	7
Dr Artem Bakulin	8
Dr Balaji Purushothaman	9
Dr Barbara Montanari	10
Dr Beinn Muir	11
Dr Charles R E Romain	12
Dr Colin R Crick	13
Dr David A Widdowson	14
Dr Duncan R Casey	15
Dr Elena M Barreiro	16
Dr Garry Rumbles	17
Dr George J P Britovsek	18
Dr Giuseppe Mallia	19
Dr Guido Bolognesi	20
Dr Hannah S Leese	21
Dr Ian R Gould	22
Dr James A Bull	23
Dr James D Wilton-Ely	24
Dr James H Bannock	25
Dr Jennifer A Garden	26
Dr Joao Pedro B C Malhado	27
Dr Jordi Bures Amat	28
Dr Joshua B Edel	29
Dr Laura M C Barter	30
Dr Leonora Velleman	31
Dr Luke X Reynolds	32
Dr Marina K Kuimova	33
Dr Mark R Crimmin	34
Dr Markus Ritzefeld	35
Dr Matthew J Fuchter	36
Dr Mimi Hii	37
Dr Mustafa K Bayazit	38
Dr Nicholas J Brooks	39
Dr Olga Kuzmina	40
Dr Oscar Ces	41
Dr Patricia A Hunt	42
Dr Paul Wilde	43
Dr Philip Miller	44
Dr Pierre Boufflet	45
Dr Remigiusz A Serwa	46
Dr Rob Davies	47
Dr Robert Menzel	48
Dr Robert V Law	49
Dr Rudiger Woscholski	50
Dr Saif A Haque	51
Dr Sanjiv Sharma	52
Dr Silvia Diez-Gonzalez	53
Dr Stoichko Dimitrov	54
Dr Thomas Lanyon-Hogg	55
Dr Thomas Rath	56
Dr Tim Albrecht	57
Dr Xiaoe Li	58
Dr Yuval Elani	59
Dr Zhuping Fei	60
Emeritus Professor Andrew B Holmes AM FRS FAA FTSE	61
Emeritus Professor David Phillips	62
Emeritus Professor Henry S Rzepa	63
Emeritus Professor Michael Spiro	64
Emeritus Professor William P Griffith	65
Mr Yoni Weiner	66
Ms Elisa Collado Fregoso	67
Professor Alan Armstrong	68
Professor Alan Spivey	69
Professor Alexei A Kornyshev	70
Professor Anthony G M Barrett FRS, FMedSci	71
Professor Anthony R J Kucernak	72
Professor Charlotte K Williams	73
Professor Chris Braddock	74
Professor David M L Goodgame	75
Professor David R Klug	76
Professor David Scheschkewitz	77
Professor Ed Tate BSc PhD FRSC FRSB	78
Professor Fernando Bresme	79
Professor Iain McCulloch	80
Professor James R Durrant	81
Professor John C de Mello	82
Professor John M Seddon	83
Professor Keith Willison	84
Professor Martin J Heeney	85
Professor Michael J Bearpark	86
Professor Mike Robb FRS	87
Professor Milo S P Shaffer	88
Professor Nicholas J Long	89
Professor Nicholas M Harrison	90
Professor Nick Quirke	91
Professor Paul D Lickiss	92
Professor Philip J Parsons	93
Professor Ramon Vilar Compte	94
Professor Richard Templer	95
Professor Robin J Leatherbarrow	96
Professor Sophia Yaliraki	97
Professor Sue Gibson	98
Professor Tony Cass	99
Professor Tony Gee	100

One can see Prof Heaney and Prof McCulloch has collaborated the most. Most of the collaborations were duos. In the most extreme case, 6 staffs collaborated on 5 different piece of work:

ggplot(data.frame(x = sapply(collaborations, function(i) length(i))), aes(x) ) + geom_bar() +  geom_text(aes(y = ..count.. , label = ..count..), stat = "count", vjust = -1) + scale_y_log10() + scale_x_discrete() + xlab("Number of staffs contributed to the paper")

plot of chunk unnamed-chunk-16

Another way to view the data is through a graph network, where nodes represents the staffs and edges indicates whether there is at least one collaboration between the two nodes. The network is obviously too dense to see the details, but a general rule is that peripheral nodes have less collaborators (e.g. node 64 and 65, whom, as the records suggest have only collaborated with each other and no other staffs). While central nodes represent staffs who collaborate with many different people.

adjMat <- matrix > 0
net <- graph_from_adjacency_matrix(adjMat)
plot(net, edge.arrow.size = 0,  vertex.size = 5, vertex.label.cex = 0.5)

plot of chunk unnamed-chunk-17

There is a lot a lot a lot more that can be done with the data, employing more sophisticated analysis, which I might venture into if one day I feel like it. The natural language processing done here is very crude and hence the numbers here are by no means absoluatly accurate. But I believe the general inferences are sound.

To see the complete Rmarkdown code for this analysis visit here.

To have this published on my blog the Rmarkdown needs to be translated into markdown. For this I used and modified Andrew Brooks’ R function.

Until next time…

Exploratory Data Analysis of IC Chemistry Publication Records

Staffs

Publication

Citations

Prof vs Doc

Internal Collaborations