Eric Lander’s recent piece in Cell, The Heroes of CRISPR, has sparked strong reactions, most of them critical and many arguing that the article is biased. I’m going to weigh in with my own thoughts at some point later, but I thought it would be interesting to try a word cloud-based text mining of the Lander piece. See above.
Word clouds show which words are used the most through the relative size of each word’s letters. Common words such as “the”, “experiment”, etc. are filtered out to reduce noise. In this case I also filtered out “CRISPR” and “Cas” because they were so big in the cloud that it was hard to see the rest.
The scientists whose names show up in the upper cloud are almost all men, including Mojica, Zhang, Marraffini, Siksnys, and Horvath. Dr. Charpentier is the only woman whose name makes it into this cloud, and with fewer mentions than most of the others.
Making the word cloud settings a bit less stringent (see the cloud at the bottom of the post) allows “Doudna” (Dr. Jennifer Doudna), “Church” (Dr. George Church), and “Barrangou” to appear, but as relatively smaller words, meaning they were used less often than the others. Note that in this second cloud I removed “genome” and “sequence” because they were so big that they disrupted the cloud.
The number of times a particular scientist’s name is used in a piece, relative to others, could be interpreted as reflecting how much importance the author gives to that scientist. Here are the numbers of times names were used in the piece according to the word cloud tool I used:
- Mojica 23
- Zhang 15
- Marraffini 13
- Charpentier 11
- Siksnys 10
- Horvath 9
- Doudna 8
- Church 8
- Barrangou 8
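For anyone who wants to try this kind of count without a word cloud tool, here is a minimal Python sketch of the idea. It is not the tool I used: the stop-word list, the function names, and the sample sentence are all made up for illustration, and a real tool may split words and filter them differently.

```python
import re
from collections import Counter

# Hypothetical stop list: a few common words plus the terms filtered out by hand.
STOP_WORDS = {"the", "experiment", "crispr", "cas", "genome", "sequence"}

def word_counts(text, stop_words=STOP_WORDS):
    """Count words case-insensitively, skipping anything in stop_words."""
    # Runs of letters only, so "Cas9" is read as "cas" and then filtered out.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(w for w in words if w not in stop_words)

def name_counts(text, names):
    """Return (name, count) pairs, most frequent first."""
    counts = word_counts(text)
    return sorted(((n, counts[n.lower()]) for n in names), key=lambda p: -p[1])

# Made-up sample text, not taken from the Lander piece.
sample = ("Mojica studied CRISPR. Zhang and Mojica used Cas9. "
          "The experiment by Zhang and Mojica worked.")
print(name_counts(sample, ["Zhang", "Mojica", "Doudna"]))
```

A word cloud is essentially a picture of these counts, with larger letters for larger numbers, so running something like this on the full text should give a list comparable to the one above.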
What do you think of the Lander piece? The strong reaction to it?
What about the word cloud and text mining approach to analyzing papers? Do these Heroes of CRISPR clouds reflect possible bias or just the random nature of words and names used in writing?
Does the inclusion of citations such as “Mojica et al” distort the count and over-represent Mojica? What about counting only the mentions within the main text, with the references removed?
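One rough way to check the citation question would be to strip “Surname et al” style citations before counting and compare the two numbers. The sketch below assumes citations look like “Mojica et al” with an optional period and year, which may not match how the article actually formats them; the pattern, function names, and sample sentence are hypothetical.

```python
import re

# Assumed citation shape: "Surname et al", optional period, optional ", 2005" style year.
CITATION = re.compile(r"\b[A-Z][\w-]+ et al\.?(?:,? \d{4})?")

def strip_citations(text):
    """Remove inline citations matching the assumed pattern."""
    return CITATION.sub("", text)

def count_name(text, name):
    """Count whole-word, case-sensitive occurrences of a name."""
    return len(re.findall(rf"\b{re.escape(name)}\b", text))

# Made-up sample text, not taken from the Lander piece.
sample = ("Mojica noticed the repeats (Mojica et al., 2005). "
          "Later, Mojica proposed an immune role.")

with_citations = count_name(sample, "Mojica")
without_citations = count_name(strip_citations(sample), "Mojica")
print(with_citations, without_citations)
```

If the two numbers are far apart for one scientist but not for the others, citations are probably inflating that name in the cloud.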