>> For tweets, if you are interested in up to 10-grams, you could find the 11-grams and throw away tweets that share an identical 11-gram?
I use 11-grams to eliminate duplicate texts in the 17.5+ billion word NOW corpus from English-Corpora.org, which grows by about 6-8 million words (10,000+ texts) each day. This is done in SQL Server, which is the backbone for the corpora from English-Corpora.org. All of the processing of the texts (generating URLs, downloading texts, deleting duplicates via 11-grams, PoS tagging, inserting into the existing corpus, etc.) is done automatically every night using a customized pipeline that I've created.
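The idea can be sketched in a few lines of Python. This is a hypothetical toy re-implementation, not the actual SQL Server pipeline: it tokenizes each text, collects its 11-grams, and discards any text that shares even one 11-gram with a text already kept. The function names and the whitespace tokenizer are my own assumptions.

```python
# Toy sketch of duplicate removal via shared 11-grams.
# Hypothetical re-implementation; the real pipeline described
# above runs inside SQL Server, not Python.

def ngrams(tokens, n=11):
    """Yield every contiguous n-gram of a token list as a tuple."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

def deduplicate(texts, n=11):
    """Keep a text only if it shares no n-gram with any earlier text."""
    seen = set()   # all n-grams observed in kept texts so far
    kept = []
    for text in texts:
        tokens = text.lower().split()      # naive whitespace tokenizer
        grams = set(ngrams(tokens, n))
        if grams & seen:
            continue                       # shares an n-gram: treat as duplicate
        seen |= grams
        kept.append(text)
    return kept
```

Note that texts shorter than n tokens produce no n-grams at all and are always kept, so in practice (as suggested above for tweets) n has to be chosen with the typical text length in mind.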