Neural Chat Datasets


This folder contains the datasets we use in the Neural Chat project.

Pre-processed Datasets

The pre-processed datasets reddit_casual_preprocessed (~10.3GB) (.zip, .tar.gz) and cornell_preprocessed (~3.6GB) (.zip, .tar.gz) contain the train, validation, and test sets of each dialog corpus. In addition, these files include:

Reddit Casual Conversations Dialogue Dataset

The reddit_casual.zip archive (~24MB) contains a JSON file holding a list of 108,933 conversations. In each conversation, the key 'lines' holds a list of line objects, and each line object includes the text ('text') and the author ('character'). Lines are split by sentence, and authors are anonymized as 0 or 1.
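
As a rough illustration of the structure described above, the following Python sketch reads the conversations straight from the archive and merges consecutive sentence-level lines by the same author back into turns. The helper names load_conversations and speaker_turns are hypothetical and not part of this project. The name of the JSON file inside the archive is not stated here, so the sketch assumes the archive holds a single JSON file and reads the first .json member it finds.

```python
import json
import zipfile


def load_conversations(zip_path):
    """Return the list of conversations from the first .json member of the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: the archive holds a single JSON file. Its name is not
        # documented, so take the first member ending in .json.
        json_names = [n for n in archive.namelist() if n.endswith(".json")]
        if not json_names:
            raise ValueError("no .json file found in " + str(zip_path))
        with archive.open(json_names[0]) as handle:
            return json.load(handle)


def speaker_turns(conversation):
    """Merge consecutive lines by the same author into (author, text) turns."""
    turns = []
    for line in conversation["lines"]:
        speaker, text = line["character"], line["text"]
        if turns and turns[-1][0] == speaker:
            # Same author as the previous line: extend the current turn.
            turns[-1] = (speaker, turns[-1][1] + " " + text)
        else:
            turns.append((speaker, text))
    return turns
```

With the archive downloaded, something like speaker_turns(load_conversations("reddit_casual.zip")[0]) should give the first conversation as alternating turns. Whether the author labels are stored as strings or integers is not specified above; the comparison works either way.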