WebCrowd25K Dataset

The dataset includes three related parts:

  • Crowd Relevance Judgments. 25,099 information retrieval relevance judgments collected on Amazon's Mechanical Turk platform. For each of the 50 search topics from the 2014 NIST TREC WebTrack, we selected 100 ClueWeb12 documents to be re-judged (without reference to the original TREC assessor judgment) by 5 MTurk workers each (50 topics x 100 documents x 5 workers = 25K crowd judgments). Individual worker IDs from the platform are hashed to new identifiers. We collect relevance judgments on a 4-point graded scale. (See SIGIR'18 and HCOMP'18 papers.)
  • Behavioral Data. For a subset of the judgments, we also collected behavioral data characterizing how workers performed the relevance judging. Behavioral data was recorded using MmmTurkey, which captures a variety of worker interaction behaviors while workers complete MTurk Human Intelligence Tasks. (See HCOMP'18 paper.)
  • Disagreement Analysis. We inspected 1,000 crowd judgments covering 200 documents (5 judgments per document) for which the aggregated crowd judgment differs from the original TREC assessor judgment, and we classified each disagreement according to our disagreement taxonomy. (See SIGIR'18 paper.)
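
As a rough illustration of how the crowd judgments relate to the disagreement analysis, the sketch below aggregates the 5 judgments per document and flags documents whose aggregate differs from the TREC label. It rests on assumptions: the aggregation rule (low median), the 0-3 integer encoding of the 4-point scale, the tuple layout, and the function names are all hypothetical and are not taken from the dataset or the papers. Consult the included README files for the actual file format and aggregation method.

```python
from collections import defaultdict
from statistics import median_low


def aggregate_by_document(judgments):
    """Group (doc_id, worker_id, label) tuples by document and take the
    low median of the labels. The low median is an assumed aggregation
    rule, chosen because it always returns one of the observed labels."""
    by_doc = defaultdict(list)
    for doc_id, _worker_id, label in judgments:
        by_doc[doc_id].append(label)
    return {doc_id: median_low(labels) for doc_id, labels in by_doc.items()}


def find_disagreements(aggregated, trec_labels):
    """Return the sorted IDs of documents whose aggregated crowd label
    differs from the TREC assessor label."""
    return sorted(
        doc_id
        for doc_id, label in aggregated.items()
        if doc_id in trec_labels and trec_labels[doc_id] != label
    )


# Toy data only: hypothetical document IDs, hashed-style worker IDs,
# and labels on an assumed 0-3 encoding of the 4-point scale.
judgments = [
    ("doc-A", "w1", 2), ("doc-A", "w2", 2), ("doc-A", "w3", 1),
    ("doc-A", "w4", 3), ("doc-A", "w5", 2),
    ("doc-B", "w1", 0), ("doc-B", "w2", 1), ("doc-B", "w3", 0),
    ("doc-B", "w4", 0), ("doc-B", "w5", 2),
]
trec_labels = {"doc-A": 2, "doc-B": 1}

aggregated = aggregate_by_document(judgments)
disagreements = find_disagreements(aggregated, trec_labels)
print(aggregated)
print(disagreements)
```

Other aggregation rules (majority vote, mean with rounding, or a worker-quality-weighted model) could be substituted in `aggregate_by_document` without changing the comparison step.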

You can download the entire dataset here. Please refer to the included README files and associated publications for further details.

