From 724fb51e6171e6ab810552e0dd0157a617db7b09 Mon Sep 17 00:00:00 2001
From: David Bremner
Date: Mon, 19 Nov 2012 13:28:45 -0400
Subject: [PATCH] add a page describing the email corpus

---
 corpus.mdwn | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)
 create mode 100644 corpus.mdwn

diff --git a/corpus.mdwn b/corpus.mdwn
new file mode 100644
index 0000000..5d262bc
--- /dev/null
+++ b/corpus.mdwn
@@ -0,0 +1,30 @@
+## Notmuch Email Corpus
+
+A corpus of about 108k messages is available for performance testing of
+notmuch (or other uses).
+
+The contents are as follows
+
+Mail/notmuch-archive
+
+archive of the notmuch mailing list
+- last updated 2012-11-17
+- converted from mbox with mb2md 3.20.
+
+Mail/enron
+
+selected data from the EDRM v2 enron data set
+- CC Attribution: "ZL Technologies, Inc. (http://www.zlti.com)"
+- Downloaded via bittorrent
+  http://www.searchdaimon.com/community/dataset/
+- massaged with scripts/unpack-enron.sh
+
+Because of the size of the archive, it is not currently available from
+http://notmuchmail.org, but can be downloaded from:
+
+- http://tesseract.cs.unb.ca/notmuch/notmuch-email-corpus-0.1.tar.gz
+
+A signature from key "815B 6398 2A79 F8E7 C727 86C4 762B 57BB 7842 06AD"
+can be found in
+
+- http://tesseract.cs.unb.ca/notmuch/notmuch-email-corpus-0.1.tar.gz.asc
-- 
2.43.0
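
The page added by this patch gives a tarball URL, a detached signature URL, and a key fingerprint. Below is a minimal sketch of how that signature might be checked, assuming `gpg` is installed, the signing key has already been imported, and both files have already been downloaded. The function names are hypothetical and not part of notmuch. The `VALIDSIG` field layout follows my reading of GnuPG's status-line documentation and should be treated as an assumption.

```python
import subprocess

# Values copied from the corpus page in the patch above.
CORPUS_URL = "http://tesseract.cs.unb.ca/notmuch/notmuch-email-corpus-0.1.tar.gz"
SIGNATURE_URL = CORPUS_URL + ".asc"
EXPECTED_FINGERPRINT = "815B 6398 2A79 F8E7 C727 86C4 762B 57BB 7842 06AD"


def normalize_fingerprint(fingerprint):
    """Drop whitespace and upper-case, so spaced and unspaced forms compare equal."""
    return "".join(fingerprint.split()).upper()


def signer_fingerprints(status_output):
    """Collect fingerprints from VALIDSIG lines of `gpg --status-fd` output.

    Assumed layout: "[GNUPG:] VALIDSIG <fpr> ... [<primary-key-fpr>]".
    Both the signing-key and the trailing primary-key field are collected,
    since the published fingerprint may refer to either.
    """
    found = set()
    for line in status_output.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "[GNUPG:]" and fields[1] == "VALIDSIG":
            found.add(fields[2].upper())
            found.add(fields[-1].upper())
    return found


def verify(tarball, signature):
    """Return True only if gpg accepts the signature and the expected key made it."""
    result = subprocess.run(
        ["gpg", "--status-fd", "1", "--verify", signature, tarball],
        capture_output=True,
        text=True,
        check=False,
    )
    expected = normalize_fingerprint(EXPECTED_FINGERPRINT)
    return result.returncode == 0 and expected in signer_fingerprints(result.stdout)
```

With both files saved locally, `verify("notmuch-email-corpus-0.1.tar.gz", "notmuch-email-corpus-0.1.tar.gz.asc")` would run the check. Comparing the fingerprint, rather than relying on the exit status of `gpg` alone, guards against a valid signature made by some other key in the keyring.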