The Teahouse corpus is a set of questions asked at the Wikipedia Teahouse, a peer support forum for new Wikipedia editors. This corpus contains data from its first two years of operation.
The Teahouse started as an [editor engagement] (https://en.wikipedia.org/wiki/Wikipedia:Editor_engagement) initiative and Fellowship project. It was launched in February 2012 by a small team working with the Wikimedia Foundation. Our intention was to pilot a new, scalable model for teaching Wikipedia newcomers the ropes of editing in a friendly and engaging environment.
The ultimate goal of the pilot project was to increase the retention of new Wikipedia editors (most of whom give up and leave within their first 24 hours post-registration) through early proactive outreach. The project was particularly focused on retaining female newcomers, who are woefully underrepresented among the regular contributors to the encyclopedia.
The Teahouse lives on as an vibrant, self-sustaining and community-driven project. All Teahouse participants are volunteers: no one is told when, how, or how much they must contribute.
See the README files associated with each datafile for a schema of the data fields in that file.
Read on for more info on potential applications, the provenance of these data, and links to related resources.
Potential Applications
or, what is it good for?
The Teahouse corpus consists of good quality data and rich metadata around social Q&A interactions in a particular setting: new user help requests in a large, collaborative online community.
More generally, this corpus is a valuable resource for research on conversational dynamics in online, asynchronous discussions.
Qualitative textual analysis could yield insights into the kinds of issues faced by newcomers in established online collaborations.
Linguisitc analysis could examine the impact of syntactic and semantic features related to politeness, sentiment, question framing, or other rhetorical strategies on discussion outcomes.
Response patterns (questioner replies and answers) within each thread could be used to map network relationships, or to investigate correlations between participation by the initiator of a thread, or the number of participants, on thread length or interactivity (the interval of time between posts).
The corpus is large and rich enough to provide training both training and test data for machine learning applications.
Finally, the data provide here can be extended and compared with other publicly-available datasets of Wikipedia, allowing researchers to examine relationships between editors' participation within the Teahouse Q&A forum and their previous, concurrent, and subsequent editing activities within millions of other articles, meta-content, and discussion spaces on Wikipedia.
Data hygiene
or, how the research sausage was made
Parsing wikitext presents many challenges: the mediawiki editing interface is deliberately underspecified in order to maximize flexibility for contributors. This can make it difficult to tell the difference between different types of contribution--say, fixing a typo or answering a question.
The Teahouse Q&A board was designed to provide a more structured workflow than normal wiki talk pages, and instrumented to identify certain kinds of contributions (questions and answers) and isolate them from the 'noisy' background datastream of incidental edits to the Q&A page. The post-processing of the data presented here favored precision over recall: to provide a good quality set of questions, rather than a complete one.
In cases where it wasn't easy to identify whether an edit contained a question or answer, these data have not been included. However, it is hard to account for all ambiguous or invalid cases: caveat quaesitor!
Our approach to data inclusion was conservative. The number of questioner replies and answers to any given question may be under-counted, but is unlikely to be over-counted. However, our spot checks and analysis of the data suggest that the majority of responses are accounted for, and that the distribution of "missed" responses is randomly distributed.
The Teahouse corpus only contains questions and answers by registered users of Wikipedia who were logged in when they participated. IP addresses can be linked to an individual's physical location. On Wikipedia, edits by logged out and unregistered users are identified by the user's current IP address. Although all edits to Wikipedia are legally public and free licenced, we have redacted IP edits from this dataset in deference to user privacy. Researchers interested in those data can find them in other public Wikipedia datasets.
Possible future additions
Additional data about these Q&A interactions has been collected, and other data are retrievable. Examples of data that could be included in future revisions of the corpus at low cost include:
- more metadata about the people asking questions:
- how many edits had they made before asking their (first) question?
- when did they join Wikipedia?
- were they explicitly invited to participate in the Teahouse, or did they locate the forum by other means?
- did the questioner also create a guest profile on the Teahouse introductions page?
- more metadata about the people answering the questions:
- were they a Teahouse host at the time they answered a question?
Examples of data that could be included in future revisions of the corpus at reasonable cost:
- full text of answers to questions, including replies by original questioner
- full text of profiles created by Teahouse guests and hosts (some privacy considerations here; contact corpus maintainer directly if interested in these data)
See also