SQUARE: A Benchmark for Research on Computing Crowd Consensus

Datasets

All datasets

Binary Classification
BM - Positive/negative sentiment labels for tweets.
HCB - Relevance judgments for pairs of search queries and Web pages.
RTE - Judgments for textual entailment.
SpamCF - Judgments about whether or not an AMT HIT should be considered a "spam" task.
TEMP - Judgments for temporal ordering of events in text.
WB - Judgments indicating whether or not a waterbird image shows a duck.
WVSCM - Judgments distinguishing whether or not the face in an image is smiling.

Ordinal Regression
AC2 - Judgments giving ordinal ratings of websites.
HC - Graded relevance judgments assigning pairs of search queries and Web pages to ordinal categories.

Multiple Choice
WSD - Ternary judgments for selecting the right sense of a word for the given example usage.
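
As a rough illustration of what computing consensus over categorical judgments like those above can look like, here is a minimal majority-voting sketch. It is not the benchmark's own code. The (worker, item, label) triple layout, the function name majority_vote, and the toy data are all assumptions made for illustration and do not reflect the actual file format of any dataset listed here.

```python
from collections import Counter, defaultdict


def majority_vote(judgments):
    """Return {item: consensus label} from (worker, item, label) triples.

    The triple layout is a hypothetical input format, not the SQUARE one.
    """
    votes = defaultdict(Counter)
    for _worker, item, label in judgments:
        votes[item][label] += 1

    consensus = {}
    for item, counts in votes.items():
        top = max(counts.values())
        # Break ties deterministically by choosing the smallest label.
        consensus[item] = min(l for l, c in counts.items() if c == top)
    return consensus


# Hypothetical toy judgments, not taken from any dataset listed above.
toy = [
    ("w1", "tweet1", 1), ("w2", "tweet1", 1), ("w3", "tweet1", 0),
    ("w1", "tweet2", 0), ("w2", "tweet2", 0), ("w3", "tweet2", 1),
]
print(majority_vote(toy))
```

The same function works unchanged for binary and multiple-choice labels, since it only counts label occurrences per item. For ordinal labels a median or mean may be a better fit than the mode.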

The National Science Foundation

For questions and comments email