The Cornell Language in Context (CLIC) Lab maintains its repositories here. Code links for specific publications are available in this list.

The Computational Linguistics Lab maintains its own list of software.

Mallet (maintained by David Mimno) is available here. Mallet is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.


  • Congressional Floor-debates Data
    Positive/negative-labeled documents; agree/disagree classification output, etc. Useful for work on sentiment analysis.
  • Cornell Movie-Dialogs Corpus
    A metadata-rich collection of fictional conversations extracted from raw movie scripts.
  • Cornell Movie-Quotes Corpus
    A collection of movie lines together with memorability annotations.
  • Cornell Natural Language Visual Reasoning (NLVR) Dataset
    Cornell NLVR is a natural language grounding dataset with 92K examples. Each example shows an image and a sentence describing it, and is annotated with the truth-value of the sentence.
  • Cornell Newsroom
    Cornell Newsroom is a large dataset for training and evaluating summarization systems. It contains 1.3 million articles and summaries written by authors and editors in the newsrooms of 38 major publications. The summaries are obtained from search and social metadata between 1998 and 2017 and use a variety of summarization strategies combining extraction and abstraction.
  • Diplomacy Betrayal Dataset
    A collection of interactions between allies in online Diplomacy games, with betrayal labels.
  • English Verb-object Co-occurrences
    Drawn from newswire text. Useful for work on distributional similarity.
  • Intelligence Squared Debate Dataset
    A collection of transcripts and metadata for debates from the series "Intelligence Squared Debates". For each debate, the transcript of each turn is given, along with information such as voting results pre- and post-debate, and audience reaction markers.
  • Movie-review Sentiment-analysis Data
    Sometimes referred to as the "Cornell movie-review corpus". Positive/negative- and "number-of-stars"-labeled documents; positive/negative- and subjective/objective-labeled sentences, etc. Useful for work on sentiment analysis.
  • NuPrl Verbalizations
    Multiple (multi-parallel) English versions of computer-generated proofs; induced paraphrase thesaurus, etc. Useful for work on data-driven generation and paraphrasing.
  • Politeness Web App, API and Data
    Check how polite your requests are by using this web app. Includes links to code and data.
  • QUOTUS Data and Visualization
    A collection of quotes from Obama's speeches as cited by news outlets, their location within the source White House speech, and information about the article in which each quote was cited. A neat visualization is included.
  • Supreme Court Dialogs Corpus
    A collection of conversations from the U.S. Supreme Court Oral Arguments with metadata (including votes, case outcome and gender).
  • Tennis Transcript and Commentary Dataset
    Transcripts of post-match press conferences for tennis singles matches at major tournaments.
  • Wikipedia Talk Page Conversations Corpus
    A collection of conversations from Wikipedia editors' Talk Pages with metadata (including status and gender).