Having crawled the data in the previous post, I can extract different portions of it from my database and cleanse it to generate some plots and summary statistics.
First, load packages and establish connection to MongoDB (here my db is located on the default port on my localhost).
library(rmongodb)
library(stringr)
library(ggplot2)
library(knitr)
library(gridExtra)
library(reshape)
library(igraph)
mongo <- mongo.create()
if(mongo.is.connected(mongo) == TRUE) {
db <- mongo.get.databases(mongo)
coll <- mongo.get.database.collections(mongo, db)
}
Staffs
The records are obtained from 120 staffs. We can look at the relative proportion of Doctors and Professors in the department.
df <- data.frame(names = mongo.distinct(mongo, coll, 'from'))
df$isProf <- str_detect(toupper(df$names), "PROFESSOR")
ggplot(df, aes(x = factor(1), fill = isProf) ) + geom_bar() + coord_polar(theta = "y")+ theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.title.y=element_blank(),
axis.text.y=element_blank())
Publication
There are in total 9417 entries. The entries are of the following types:
## [1] "JOURNAL ARTICLE" "CONFERENCE PAPER" "BOOK CHAPTER"
## [4] "PATENT" "REPORT" "SOFTWARE"
## [7] "OTHER" "BOOK" "POSTER"
We can take a look at how many entries of each publication type are there, note the y-axis is on a log scale:
df <- mongo.find.all(mongo, coll, fields = list('publication-type' = 1, '_id' = 0), data.frame = T)
ggplot(df, aes(x = publication.type) ) + geom_bar() + geom_text(aes(label = ..count..), stat = "count", vjust = -1)+ theme(axis.text.x = element_text(angle = 90, hjust = 1)) + scale_y_log10(breaks = c(1,10,100,1000,10000))
It is most interesting to look at only journal & conference publications, which are the vast majority of entries anyways. There are 8953 such entries. Bear in mind there are duplicate entries, where IC chemistry staffs collaborated to produce a piece of work.
main <- mongo.find.all(mongo, coll, query=list("$or" = list(list("publication-type" = "JOURNAL ARTICLE"), list("publication-type" = "CONFERENCE PAPER"))), fields = list('publication-type' = 1, 'title' = 1, 'num_citations' = 1, 'journal' = 1, 'year' = 1, 'from' = 1, '_id' = 0))
# Only keeps data entries that have all the header fields defined.
list2df <- function(list, headers){
output = list()
for(i in 1:length(list)){
record = T
for(j in headers){
if(is.null(list[[i]][[j]])){
record = F
break
}
}
if(record == T){
for(j in headers)
output[[j]] <- c(output[[j]], list[[i]][[j]])
}
}
return(as.data.frame(output))
}
df <- list2df(main, c('journal', 'title'))
The 8116 journal submissions are delivered to 1102 different journals. A coarse look at the number of submissions to each of the journals. Have a guess what is the most popular journal(the tallest bar). Bear in mind multiple academics could be have drafted the same paper, this needs to be accounted for. I have listed the top 10 most and (some of the) least popular journals.
df <- df[!duplicated(df$title), ]
ggplot(df, aes(x = journal) ) + geom_bar() + theme(axis.text.x=element_blank(),axis.ticks=element_blank())
q <- aggregate(df$journal, by=list(df$journal), length)
colnames(q) <- c("Journal.Names", "Submission")
q <- q[order(q$Submission, decreasing = T),]
kable(tail(q, 10), row.names = F)
Journal.Names | Submission |
---|---|
VIBRATIONAL SPECTROSCOPY | 1 |
WATER ENVIRONMENT RESEARCH | 1 |
WATER SCIENCE AND TECHNOLOGY | 1 |
WEAR | 1 |
WILEY INTERDISCIPLINARY REVIEWS-COMPUTATIONAL MOLECULAR SCIENCE | 1 |
WILEY INTERDISCIPLINARY REVIEWS-ENERGY AND ENVIRONMENT | 1 |
ZEITSCHRIFT FUR ANORGANISCHE UND ALLGEMEINE CHEMIE | 1 |
ZEITSCHRIFT FUR PHYSIK B-CONDENSED MATTER | 1 |
Zeitschrift für Physikalische Chemie | 1 |
ZHURNAL EKSPERIMENTALNOI I TEORETICHESKOI FIZIKI | 1 |
kable(head(q, 10), row.names = F)
Journal.Names | Submission |
---|---|
CHEMICAL COMMUNICATIONS | 249 |
JOURNAL OF THE AMERICAN CHEMICAL SOCIETY | 231 |
JOURNAL OF THE CHEMICAL SOCIETY-CHEMICAL COMMUNICATIONS | 167 |
JOURNAL OF ORGANIC CHEMISTRY | 165 |
ABSTRACTS OF PAPERS OF THE AMERICAN CHEMICAL SOCIETY | 135 |
TETRAHEDRON LETTERS | 135 |
JOURNAL OF CHEMICAL PHYSICS | 123 |
PHYSICAL CHEMISTRY CHEMICAL PHYSICS | 117 |
JOURNAL OF THE CHEMICAL SOCIETY-PERKIN TRANSACTIONS 1 | 112 |
JOURNAL OF THE CHEMICAL SOCIETY-DALTON TRANSACTIONS | 106 |
Let’s look at the number of submissions to the top 20 most popular journals:
q$Journal.Names <- factor(q$Journal.Names, levels = q$Journal.Names)
ggplot(q[1:20, ], aes(x = Journal.Names, y = Submission)) + geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 90, hjust = 1))
There does not seem to be a significant bias towards any of the three areas of chemistry.
Citations
df <- list2df(main, c('title', 'num_citations'))
df$title <- str_trim(df$title)
df <- df[!duplicated(df$title), ]
df <- df[order(df$num_citations, decreasing = T), ]
ggplot(df, aes(x = num_citations)) + geom_bar() + xlim(0, 250) + ylim(0,300)
There are some ‘citation beasts’ - outliers not included in the above bar chart:
title | num_citations |
---|---|
A 2ND GENERATION FORCE-FIELD FOR THE SIMULATION OF PROTEINS,NUCLEIC-ACIDS,AND ORGANIC-MOLECULES | 7810 |
The path forward for biofuels and biomaterials | 2373 |
A strong regioregularity effect in self-organizing conjugated polymer films and high-efficiency polythiophene: fullerene solar cells | 1647 |
The search for new-generation olefin polymerization catalysts: Life beyond metallocenes | 1608 |
N-Heterocyclic Carbenes in Late Transition Metal Catalysis | 1438 |
Liquid-crystalline semiconducting polymers with high charge-carrier mobility | 1271 |
TETRAPROPYLAMMONIUM PERRUTHENATE,PR4N+RUO4-,TPAP - A CATALYTIC OXIDANT FOR ORGANIC-SYNTHESIS | 1243 |
FERROCENE-MEDIATED ENZYME ELECTRODE FOR AMPEROMETRIC DETERMINATION OF GLUCOSE | 1224 |
ORGANOMETALLIC COMPOUNDS FOR NONLINEAR OPTICS - THE SEARCH FOR EN-LIGHT-ENMENT | 1221 |
Synthesis and function of 3-phosphorylated inositol lipids | 1101 |
N-heterocyclic carbenes as organocatalysts | 966 |
Development of a dispersion process for carbon nanotubes in an epoxy matrix and the resulting electrical properties | 944 |
Ultra-low electrical percolation threshold in carbon-nanotube-epoxy composites | 904 |
Charge Photogeneration in Organic Solar Cells | 879 |
An improved experimental determination of external photoluminescence quantum efficiency | 856 |
Fabrication and characterization of carbon nanotube/poly(vinyl alcohol) composites | 856 |
STRUCTURE OF THE INVERTED HEXAGONAL (HII) PHASE,AND NON-LAMELLAR PHASE-TRANSITIONS OF LIPIDS | 837 |
PHASE-EQUILIBRIA BY SIMULATION IN THE GIBBS ENSEMBLE - ALTERNATIVE DERIVATION,GENERALIZATION AND APPLICATION TO MIXTURE AND MEMBRANE EQUILIBRIA | 825 |
Materials and Applications for Large Area Electronics: Solution-Based Approaches | 814 |
Novel olefin polymerization catalysts based on iron and cobalt | 809 |
Actually Prof Tom Welton’s ionic liquid review has a higher citation number than the force-field paper. However, he was no longer on the chemistry departmental staff list at the time when I crawled it.
Prof vs Doc
I would expect there is some correlation between the year of publication and the number of citations (the highest cited force-field paper removed):
df <- list2df(main, c('title', 'year', 'num_citations', 'from'))
ggplot(df, aes(year, num_citations)) + geom_point() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + ylim(0, 2500)
Nope, doesn’t seem like it. But maybe professors have in a general a higher citation rate than docs?
df$isProf <- str_detect(toupper(df$from), "PROFESSOR")
ggplot(data = df, mapping = aes(x = num_citations)) + facet_grid(isProf~., scale = 'free') + geom_histogram(data = subset(df, isProf == F), binwidth = 1, fill = "red", alpha = 0.5) + geom_histogram(data = subset(df, isProf == T), binwidth = 1 , fill = "blue", alpha = 0.5) + xlim(0, 250) + ylim(0, 300)
A clear indication of difference would be a shift of the bottom (professor num_citations) distribution to the right w.r.t the top distribution, which is not seen. However, there is a more visible tail for the bottom distribution than the top. This indicates professors tend to have more higher cited papers.
In addition, the bottom distribution is taller than the top one, mind you this is despite the fact that only about one third of the department are professors. This means an indicator of professors is that they tend to have published more than doctors. A clearer way to demonstrate this is to compare the distributions of the number of publications for professors and doctors.
q <- aggregate(df$from, by = list(df$from), length)
colnames(q) <- c('name', 'num_publications')
q$isProf <- str_detect(toupper(q$name), "PROFESSOR")
ggplot(data = q, mapping = aes(x = num_publications)) + facet_grid(isProf~., scale = 'free') + geom_histogram(data = subset(q, isProf == F), fill = "red", alpha = 0.5) + geom_histogram(data = subset(q, isProf == T), fill = "blue", alpha = 0.5) + ylab("Number of Profs / Docs")
Internal Collaborations
Is there a lot of internal collaborations between staffs? Who collaborates with whom within IC chem? One way to look at that is to tally which staffs have the same publications.
df <- list2df(main, c('from', 'title'))
df$title <- iconv(df$title,"WINDOWS-1252","UTF-8")
df$title <- toupper(df$title)
q <- unique(df$title)
q <-sapply(q, function(i){
length(unique(df$from[df$title == i]))
})
p <- names(q[q > 1])
s <- unique(df$from[df$title %in% p])
Actually, 1162 pieces of work were collaborative efforts. That is quite a large proportion. 100 staffs were involved in internal collaborations, almost the whole department.
A heatmap can show which researchers work together. Each row and column with a index represents a staff, where the staffs are sorted alphabetically by name. The table below the heatmap shows the index and name mapping. The heatmap is symmetrical w.r.t the diagonal. In the heatmap, the lighter the colour, the more collaboration between a pair of staffs, also the higher the number in the cell.
heatmap <- function(matrix, xlabel = NULL, ylabel = NULL, fill = NULL, color= NULL, text_size = 7){
matrix <- as.data.frame(matrix)
colors <- c("steelblue", "seagreen", "darkorchid", "maroon")
melt <- melt(matrix)
if(ncol(melt) == 2){
melt$variable1 <- melt$variable
melt$variable2 <- rep(rownames(matrix), nrow(matrix))
melt$variable1 <- factor(melt$variable1, levels = (unique(as.character(rownames(matrix)))))
melt$variable2 <- factor(melt$variable2, levels = rev(unique(as.character(rownames(matrix)))))
}
p <- ggplot(melt,aes(x = variable1, y = variable2)) + geom_tile(aes(fill = value), colour = "white") + scale_fill_gradient(high = "white", low = "black")
p <- p + geom_text(aes(fill = value, label = round(value,2)), size = text_size) + labs(x = xlabel, y = ylabel, fill = fill) + theme(axis.text.x = element_text(angle = 90)) + theme(legend.position = "none")
}
staffs <- as.character(sort(s))
matrix <- matrix(0, nrow = length(staffs), ncol = length(staffs))
colnames(matrix) <- rownames(matrix) <- staffs
collaborations <- lapply(p,function(i) df$from[(df$title == i)] )
for(i in collaborations){
for(j in 1:(length(i) - 1)){
for(k in (j + 1):length(i)){
r <- which(i[j] == staffs)
c <- which(i[k] == staffs)
if(r != c)
matrix[r, c] <- matrix[c, r] <- 1 + matrix[r, c]
}
}
}
#diag(matrix) <- 0
diagnol <- data.frame(x1 = 100, y1 = 1, x2 = 1, y2 = 100)
colnames(matrix) <- rownames(matrix) <- 1:length(staffs)
p <- heatmap(matrix, text_size = 3)
p + geom_segment(data = diagnol, aes(x = x1, y = y1, xend = x2, yend = y2, colour = "white"))
kable(data.frame(Name = staffs, Index = (1:length(staffs))), row.names = F)
Name | Index |
---|---|
Dr Abby Casey | 1 |
Dr Adam R J Clancy | 2 |
Dr Agostino Cilibrizzi | 3 |
Dr Aleksandar Ivanov | 4 |
Dr Alexander J W Thom | 5 |
Dr Alexandra Simperler FHEA | 6 |
Dr Andrew E Ashley | 7 |
Dr Artem Bakulin | 8 |
Dr Balaji Purushothaman | 9 |
Dr Barbara Montanari | 10 |
Dr Beinn Muir | 11 |
Dr Charles R E Romain | 12 |
Dr Colin R Crick | 13 |
Dr David A Widdowson | 14 |
Dr Duncan R Casey | 15 |
Dr Elena M Barreiro | 16 |
Dr Garry Rumbles | 17 |
Dr George J P Britovsek | 18 |
Dr Giuseppe Mallia | 19 |
Dr Guido Bolognesi | 20 |
Dr Hannah S Leese | 21 |
Dr Ian R Gould | 22 |
Dr James A Bull | 23 |
Dr James D Wilton-Ely | 24 |
Dr James H Bannock | 25 |
Dr Jennifer A Garden | 26 |
Dr Joao Pedro B C Malhado | 27 |
Dr Jordi Bures Amat | 28 |
Dr Joshua B Edel | 29 |
Dr Laura M C Barter | 30 |
Dr Leonora Velleman | 31 |
Dr Luke X Reynolds | 32 |
Dr Marina K Kuimova | 33 |
Dr Mark R Crimmin | 34 |
Dr Markus Ritzefeld | 35 |
Dr Matthew J Fuchter | 36 |
Dr Mimi Hii | 37 |
Dr Mustafa K Bayazit | 38 |
Dr Nicholas J Brooks | 39 |
Dr Olga Kuzmina | 40 |
Dr Oscar Ces | 41 |
Dr Patricia A Hunt | 42 |
Dr Paul Wilde | 43 |
Dr Philip Miller | 44 |
Dr Pierre Boufflet | 45 |
Dr Remigiusz A Serwa | 46 |
Dr Rob Davies | 47 |
Dr Robert Menzel | 48 |
Dr Robert V Law | 49 |
Dr Rudiger Woscholski | 50 |
Dr Saif A Haque | 51 |
Dr Sanjiv Sharma | 52 |
Dr Silvia Diez-Gonzalez | 53 |
Dr Stoichko Dimitrov | 54 |
Dr Thomas Lanyon-Hogg | 55 |
Dr Thomas Rath | 56 |
Dr Tim Albrecht | 57 |
Dr Xiaoe Li | 58 |
Dr Yuval Elani | 59 |
Dr Zhuping Fei | 60 |
Emeritus Professor Andrew B Holmes AM FRS FAA FTSE | 61 |
Emeritus Professor David Phillips | 62 |
Emeritus Professor Henry S Rzepa | 63 |
Emeritus Professor Michael Spiro | 64 |
Emeritus Professor William P Griffith | 65 |
Mr Yoni Weiner | 66 |
Ms Elisa Collado Fregoso | 67 |
Professor Alan Armstrong | 68 |
Professor Alan Spivey | 69 |
Professor Alexei A Kornyshev | 70 |
Professor Anthony G M Barrett FRS, FMedSci | 71 |
Professor Anthony R J Kucernak | 72 |
Professor Charlotte K Williams | 73 |
Professor Chris Braddock | 74 |
Professor David M L Goodgame | 75 |
Professor David R Klug | 76 |
Professor David Scheschkewitz | 77 |
Professor Ed Tate BSc PhD FRSC FRSB | 78 |
Professor Fernando Bresme | 79 |
Professor Iain McCulloch | 80 |
Professor James R Durrant | 81 |
Professor John C de Mello | 82 |
Professor John M Seddon | 83 |
Professor Keith Willison | 84 |
Professor Martin J Heeney | 85 |
Professor Michael J Bearpark | 86 |
Professor Mike Robb FRS | 87 |
Professor Milo S P Shaffer | 88 |
Professor Nicholas J Long | 89 |
Professor Nicholas M Harrison | 90 |
Professor Nick Quirke | 91 |
Professor Paul D Lickiss | 92 |
Professor Philip J Parsons | 93 |
Professor Ramon Vilar Compte | 94 |
Professor Richard Templer | 95 |
Professor Robin J Leatherbarrow | 96 |
Professor Sophia Yaliraki | 97 |
Professor Sue Gibson | 98 |
Professor Tony Cass | 99 |
Professor Tony Gee | 100 |
One can see Prof Heaney and Prof McCulloch has collaborated the most. Most of the collaborations were duos. In the most extreme case, 6 staffs collaborated on 5 different piece of work:
ggplot(data.frame(x = sapply(collaborations, function(i) length(i))), aes(x) ) + geom_bar() + geom_text(aes(y = ..count.. , label = ..count..), stat = "count", vjust = -1) + scale_y_log10() + scale_x_discrete() + xlab("Number of staffs contributed to the paper")
Another way to view the data is through a graph network, where nodes represents the staffs and edges indicates whether there is at least one collaboration between the two nodes. The network is obviously too dense to see the details, but a general rule is that peripheral nodes have less collaborators (e.g. node 64 and 65, whom, as the records suggest have only collaborated with each other and no other staffs). While central nodes represent staffs who collaborate with many different people.
adjMat <- matrix > 0
net <- graph_from_adjacency_matrix(adjMat)
plot(net, edge.arrow.size = 0, vertex.size = 5, vertex.label.cex = 0.5)
There is a lot a lot a lot more that can be done with the data, employing more sophisticated analysis, which I might venture into if one day I feel like it. The natural language processing done here is very crude and hence the numbers here are by no means absoluatly accurate. But I believe the general inferences are sound.
To see the complete Rmarkdown code for this analysis visit here.
To have this published on my blog the Rmarkdown needs to be translated into markdown. For this I used and modified Andrew Brooks’ R function.
Until next time…