I'm returning to this blog to make some notes about a new project, a paper on 'Tweeting Science'. It began as a paper at 'Writing Research Across Borders II' at George Mason University (abstract here, see p. 33). That was almost a year ago, so you see how long it takes me to write something up for publication. Twitter was an afterthought in a proposed paper on blogs, but as I worked on the paper, the tweets came to seem like the more interesting data. Twitter also seems more relevant to what most scientists are doing now. Science blogs were big a few years ago; the science blogosphere began to fragment in June 2010, and many of the bloggers I was following are no longer posting regularly. Those that remain have become much more like straightforward professional science journalism. But all the tweeters I studied last year are still at it, regularly, and Twitter seems like a key part of their professional practice. So I am going back to those tweeters, adding more, and studying Twitter practices more centrally.
I will say more about my research questions next week. This week I was mainly working on getting together a corpus. The corpus I had for WRAB was, as with many rushed conference presentations, too small and too carelessly assembled. I had simply copied and pasted the feed of tweets. But there are several problems with this:
- I lost the conversations with other tweeters - probably irrelevant for the corpus as a database, but problematic when I later want to come back and look more closely at specific tweets.
- I included the words that are put there by Twitter, what I am calling the frame: for example Reply, Retweet, Favorite, dates. I thought I could just get away with ignoring these very frequent words, but of course they throw off the relative frequencies of other words.
- I hadn't been very systematic in my choice of Twitter feeds. As it turns out, they were good choices, but I didn't think enough about criteria for inclusion.
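The second problem is easy to see in a small example. What follows is only a minimal Python sketch, not the procedure actually used for the corpus: the list of frame words, the sample text, and the function names are all assumptions made up for illustration, and dates (which are also part of the frame) are not handled.

```python
import re
from collections import Counter

# Hypothetical list of words that the Twitter interface adds around each tweet.
FRAME_WORDS = {"reply", "retweet", "favorite"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9@#']+", text.lower())


def strip_frame(tokens, frame_words=FRAME_WORDS):
    """Drop tokens that belong to the interface frame, not to the tweets."""
    return [t for t in tokens if t not in frame_words]


def relative_frequencies(tokens):
    """Return each word's share of the total token count."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


# Invented sample of a copied-and-pasted feed, frame words included.
pasted = (
    "New paper on soil microbes out today Reply Retweet Favorite\n"
    "Marking lab reports all afternoon Reply Retweet Favorite\n"
)

raw = tokenize(pasted)
clean = strip_frame(raw)

with_frame = relative_frequencies(raw)
without_frame = relative_frequencies(clean)

# The same word gets a smaller share when the frame inflates the total.
print(with_frame["paper"], without_frame["paper"])
```

In this toy feed the frame accounts for a third of all tokens, so every real word's relative frequency is deflated by the same proportion until the frame is removed.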
So I went back to those feeds, all still very active, deleted one, and added more. I started with the 'science' categories in WeFollow and Science Pond by Sawhorse Media. My criteria were:
- must be a practicing researcher (not necessarily at a university of course); the science tweeters with the most followers are usually science journalists or communicators, and I needed an active research life for reasons that will become clear.
- must post regularly, so Twitter is clearly a part of their routine. This also makes it easier to collect enough words.
- more than 1000 followers and (usually) fewer than 5000. Tweeters with low numbers of followers are typically doing something different (for instance, social arrangements with friends), whereas those with over 5000 are usually associated with publishers and institutions, and are serving as a kind of popular science node.
- A range in terms of disciplines (Physics, Biology, Geology, Psychology).
- Men and women, at different stages of their careers from PhD student to FRS, in North America and the UK (because those were the areas where most of the science tweeters were found). All mainly in English, though some of the tweeters are clearly multilingual.
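The per-tweeter criteria above (researcher, regular poster, follower range) could be written down as a simple filter. This is only a sketch under stated assumptions: the `Candidate` record, the `meets_criteria` function, and the example handles are hypothetical, and the judgments about who counts as a researcher or a regular poster were made by hand, not computed. The criteria about range of disciplines, gender, career stage, and location apply to the sample as a whole, so they are not part of this per-person check.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record for one Twitter feed under consideration."""
    handle: str
    is_researcher: bool    # judged by hand from the profile
    posts_regularly: bool  # judged by hand from the feed
    followers: int


def meets_criteria(c, min_followers=1000, max_followers=5000):
    """Apply the per-tweeter criteria; the upper bound is a soft guideline."""
    return (
        c.is_researcher
        and c.posts_regularly
        and min_followers < c.followers < max_followers
    )


# Invented candidates, not real accounts.
candidates = [
    Candidate("@example_physicist", True, True, 2400),
    Candidate("@example_journalist", False, True, 3000),
    Candidate("@example_publisher", True, True, 80000),
    Candidate("@example_student", True, True, 150),
]

selected = [c.handle for c in candidates if meets_criteria(c)]
print(selected)
```

Because the upper limit was only a "usually", `max_followers` is a parameter that can be raised for an individual exception without touching the other checks.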
I came up with these (with summaries of their Twitter profiles):
@AtheneDonald - Athene Donald, Professor of Physics at Cambridge
@attilacsordas - Attila Csordas, bioinformatician at European Bioinformatics Institute
@deevybee - Dorothy Bishop, Professor of Experimental Psychology at Oxford
@DNLee5 - Danielle N. Lee, biologist, urban ecologist, and science outreach, Scientific American blogger, apparently at a university in Oklahoma
@highlyanne - Anne Jefferson, watershed hydrologist at University of North Carolina at Charlotte
@scicurious - anonymous but very well known blogger (currently Neurotic Physiology at http://scientopia.org/blogs/scicurious/). 'She is currently a post-doctoral researcher at a celebrated institution that is very fancy'
@clasticdetritus - Brian Romans, assistant professor of Geology at Virginia Tech (industry experience before that), another well-known blogger
@DoctorZen - Zen Faulkes, Associate professor at The University of Texas-Pan American, studies the neurophysiology of crustaceans and has several blogs (including an excellent one critiquing conference posters)
@sc_k - Sarah Kavassalis, 'Permanent student of mathematics, physics and, sometimes, the philosophy of their intersection' (Department of Physics, University of Toronto)
@aetiology - Tara C. Smith, Assistant Professor of Epidemiology, University of Iowa
@KamounLab - The group of Sophien Kamoun at The Sainsbury Lab, Norwich UK (Biology).
@systemsbiology - Steve Watterson, a research fellow at Edinburgh Uni working in systems biology.
I realised I was creating a kind of snowball sample, since I was following up people who posted interesting tweets on a page I was sampling. So each of these feeds is followed by at least one of the others, and in the case of @scicurious, most of the others. But I don't think that overlapping is a problem for what I am studying. I would like to add a wider range of disciplines - maybe chemistry, some medical imaging, other areas of physics - but I can do that when I scale up from 50K to 100K words.
Now the reference corpus.
Great to see you blogging on this topic, Greg.
I'm curious about how you deal with the ethical issues around using data from Twitter. Do you feel it's not an issue as these are public feeds (and clearly intended for a wide audience rather than friends)? I'm always concerned about what happens when someone deletes a tweet from Twitter, but it survives in a corpus/published paper.
Posted by: Johnnyunger | February 23, 2012 at 04:05 PM