Dear David,
you may have solved this in the meantime, but if not:
- if the task is to deduplicate, have a look at onion
- if you only need to count, you can create a corpus in Sketch Engine. We use a suffix array to compute n-grams up to length 20 by default, following:
Yamamoto and Church: Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus
Computational Linguistics, Volume 27 Issue 1, March 2001, pp 1-30
http://www.aclweb.org/anthology/J01-1001
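
In case it helps to see the idea, here is a minimal Python sketch of counting n-grams via a sorted array of suffix positions. This is not the Sketch Engine code and it is much more naive than the algorithm in the paper: the function name ngram_counts and the max_n parameter are only illustrative, the sort is a plain comparison sort on token slices, and everything is held in memory. It relies on the fact that once suffixes are sorted, all suffixes starting with the same n-gram are adjacent, so each frequency is just the length of a run.

```python
from collections import Counter


def ngram_counts(tokens, max_n=20):
    """Count all n-grams of length 1..max_n using a naive suffix array.

    Illustrative sketch only: hypothetical helper, not Sketch Engine's code.
    """
    # Suffix array: start positions sorted by the first max_n tokens.
    suffixes = sorted(range(len(tokens)), key=lambda i: tokens[i:i + max_n])

    counts = Counter()
    for n in range(1, max_n + 1):
        prev, run = None, 0
        for i in suffixes:
            gram = tuple(tokens[i:i + n])
            if len(gram) < n:
                # Suffix too short to contain an n-gram of this length.
                continue
            if gram == prev:
                run += 1  # same n-gram as the previous suffix: extend the run
            else:
                if prev is not None:
                    counts[prev] = run  # close the previous run
                prev, run = gram, 1
        if prev is not None:
            counts[prev] = run  # close the last run
    return counts


if __name__ == "__main__":
    example = "a b a b a".split()
    for gram, freq in sorted(ngram_counts(example, max_n=3).items()):
        print(" ".join(gram), freq)
```

For a real corpus you would want a proper suffix array construction and the LCP-based grouping described by Yamamoto and Church rather than re-slicing tokens for every n.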
The interface displays n-grams up to length 6 (although they are computed up to length 20); let me know if you need longer ones displayed too.
Best regards,
Milos Jakubicek
CEO, Lexical Computing
Brno, CZ | Brighton, UK