Neural Chat Datasets
This folder contains the datasets we use in the Neural Chat project.
The pre-processed archives reddit_casual_preprocessed (~10.3GB) (.zip, .tar.gz) and cornell_preprocessed (~3.6GB) (.zip, .tar.gz) contain the train, validation, and test sets for each dialog corpus. In addition, these files include:
- The PCA weights for downsizing the InferSent embeddings (v1_PCA_model_0.95.pkl)
- The reduced version of the InferSent embeddings (sentence_embeddings_1_PCA_0.95.pkl)
- Emoji embeddings (sentence_emojis.pkl)
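The .pkl files listed above can be opened with Python's standard pickle module. The snippet below is a minimal sketch: the helper name load_pickle is hypothetical, and the type of object stored in each file is not documented here, so the example only loads the object and leaves inspection to the caller. If the PCA file holds an object from a third-party library (an assumption), that library must be importable when the file is loaded.

```python
import pickle


def load_pickle(path):
    """Load and return whatever object is stored in a pickle file.

    `path` is supplied by the caller, for example the location of
    sentence_emojis.pkl after the archive has been extracted.
    """
    with open(path, "rb") as f:
        return pickle.load(f)
```

Calling type(load_pickle(path)) on each file is a quick way to find out what it contains before writing code against it.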
Reddit Casual Conversations Dialogue Dataset
The reddit_casual.zip archive (~24MB) contains a JSON file holding a list of 108,933 conversations. In each conversation, the key 'lines' maps to a list of line objects, and each line object includes the text ('text') and the author ('character'). Lines are split by sentence, and authors are anonymized as 0 or 1.
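The layout described above can be read with the standard json module. The following is a minimal sketch, not the project's own loader: the function names are hypothetical, the path to the extracted JSON file must be supplied by the caller, and whether 'character' is stored as a number or a string is an assumption the code does not rely on. Because lines are split by sentence, the second helper merges consecutive lines by the same author into a single turn.

```python
import json


def load_conversations(path):
    """Read the JSON file and return the list of conversation objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def conversation_to_turns(conversation):
    """Merge consecutive sentence-level lines by the same author into turns.

    Returns a list of (author, text) tuples in conversation order.
    """
    turns = []
    for line in conversation["lines"]:
        author, text = line["character"], line["text"]
        if turns and turns[-1][0] == author:
            # Same author as the previous line: extend the current turn.
            turns[-1] = (author, turns[-1][1] + " " + text)
        else:
            turns.append((author, text))
    return turns


# A made-up conversation in the described format, for illustration only.
sample = {
    "lines": [
        {"character": "0", "text": "Hi there."},
        {"character": "0", "text": "How was your day?"},
        {"character": "1", "text": "Pretty good."},
    ]
}
sample_turns = conversation_to_turns(sample)
```

For the sample above, the two sentences from author 0 are joined into one turn, followed by one turn from author 1.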