  • Congressional Floor-debates Data

    Positive/negative-labeled documents; agree/disagree classification output, etc. Useful for work on sentiment analysis.
  • Cornell Movie-Dialogs Corpus

    A metadata-rich collection of fictional conversations extracted from raw movie scripts.
  • Cornell Movie-Quotes Corpus

    A collection of movie lines together with memorability annotations.
  • Cornell Natural Language Visual Reasoning (NLVR) Dataset

    Cornell NLVR is a natural language grounding dataset with 92K examples. Each example shows an image and a sentence describing it, and is annotated with the truth-value of the sentence.
  • Cornell Newsroom

    Cornell Newsroom is a large dataset for training and evaluating summarization systems. It contains 1.3 million articles and summaries written by authors and editors in the newsrooms of 38 major publications. The summaries are obtained from search and social metadata between 1998 and 2017 and use a variety of summarization strategies combining extraction and abstraction.
  • Diplomacy Betrayal Dataset

    A collection of interactions between allies in online Diplomacy games, with betrayal labels.
  • English Verb-object Co-occurrences

    Drawn from newswire text. Useful for work on distributional similarity.
  • Intelligence Squared Debate Dataset

    A collection of transcripts and metadata for debates from the series "Intelligence Squared Debates". For each debate, the transcript of each turn is given, along with information such as voting results pre- and post-debate, and audience reaction markers.
  • Movie-review Sentiment-analysis Data

    Sometimes referred to as the "Cornell movie-review corpus". Positive/negative- and "number-of-stars"-labeled documents; positive/negative and subjective/objective-labeled sentences, etc. Useful for work on sentiment analysis.
  • NuPrl Verbalizations

    Multiple (multi-parallel) English versions of computer-generated proofs; induced paraphrase thesaurus, etc. Useful for work on data-driven generation and paraphrasing.
  • Politeness Web App, API and Data

    Check out how polite your requests are by using this web app. Includes link to code and data.
  • QUOTUS Data and Visualization

    A collection of quotes of Obama by news outlets, their location within the source White House speech, and information about the article in which the quotes were cited. A neat visualization is included.
  • Supreme Court Dialogs Corpus

    A collection of conversations from the U.S. Supreme Court Oral Arguments with metadata (including votes, case outcome and gender).
  • Tennis Transcript and Commentary Dataset

    Transcripts for tennis singles post-match press conferences for major tournaments.
  • Wikipedia Talk Page Conversations Corpus

    A collection of conversations from Wikipedia editors' Talk Pages with metadata (including status and gender).