Cornell NLP

Looking for software or data? The different groups and labs maintain up-to-date lists:

Yoav Artzi maintains software and data repositories here and here.

Cristian Danescu-Niculescu-Mizil maintains a list of software and data here.

Lillian Lee provides software and code for specific publications here.

David Mimno maintains Mallet and other resources here.

The Computational Linguistics Lab maintains a list of software here.

Sasha Rush's group maintains a list of released software here.

Other data releases are:

Congressional Floor-debates Data: Positive/negative-labeled documents; agree/disagree classification output, etc. Useful for work on sentiment analysis.

Cornell Movie-Dialogs Corpus: A metadata-rich collection of fictional conversations extracted from raw movie scripts.

Cornell Movie-Quotes Corpus: A collection of movie lines together with memorability annotations.

Diplomacy Betrayal Dataset: A collection of interactions between allies in online Diplomacy games, with betrayal labels.

English Verb-object Co-occurrences: Drawn from newswire text. Useful for work on distributional similarity.

Intelligence Squared Debate Dataset: A collection of transcripts and metadata for debates from the series "Intelligence Squared Debates". For each debate, the transcript of each turn is given, along with information such as voting results pre- and post-debate, and audience reaction markers.

Movie-review Sentiment-analysis Data: Sometimes referred to as the "Cornell movie-review corpus". Positive/negative- and "number-of-stars"-labeled documents; positive/negative- and subjective/objective-labeled sentences, etc. Useful for work on sentiment analysis.

NuPrl Verbalizations: Multiple (multi-parallel) English versions of computer-generated proofs; induced paraphrase thesaurus, etc. Useful for work on data-driven generation and paraphrasing.

Politeness Web App, API and Data: Check how polite your requests are by using this web app. Includes links to code and data.

QUOTUS Data and Visualization: A collection of news outlets' quotes of Obama, their location within the source White House speech, and information about the article in which each quote was cited. A neat visualization is included.

Supreme Court Dialogs Corpus: A collection of conversations from the U.S. Supreme Court Oral Arguments with metadata (including votes, case outcome and gender).

Tennis Transcript and Commentary Dataset: Transcripts of post-match press conferences for tennis singles matches at major tournaments.

Wikipedia Talk Page Conversations Corpus: A collection of conversations from Wikipedia editors' Talk Pages with metadata (including status and gender).