Hi all, I'm doing n-gram analysis on a corpus of tweets from institutions. The corpus is unusual in that it contains many exact duplicates or near-duplicates of the same tweet, which produce long repeated strings, often 9, 10 or more words long. This makes accurate counting and classifying difficult, because every substring of such a string also turns up as a frequent n-gram. Does anyone know of any approaches or software that can count and classify n-grams in these circumstances? I am aware of the approaches outlined by Buerki (2017) and O'Donnell (2011), but they do not seem practical given the excessive length of the n-grams in this corpus. Are there any accessible methods or packages?
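To illustrate the problem, here is a minimal sketch in Python on toy data. The tweet text and the helper names (ngrams, count_ngrams) are invented for illustration; they are not from my corpus or from any package, and this is only a naive count, not one of the published approaches.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(tweets, min_n=2, max_n=10):
    """Naively count every n-gram of length min_n..max_n across all tweets."""
    counts = Counter()
    for tweet in tweets:
        tokens = tweet.lower().split()  # crude whitespace tokenisation (assumption)
        for n in range(min_n, max_n + 1):
            counts.update(ngrams(tokens, n))
    return counts

# Toy data (invented): one 9-word tweet posted three times.
tweets = ["we are open today from nine until five pm"] * 3
counts = count_ngrams(tweets)

# A single repeated tweet yields 36 distinct n-grams, all with the same count.
print(len(counts))
print(set(counts.values()))
```

Here one repeated 9-word tweet produces 36 "frequent" n-grams, each with a count of 3, even though only the full 9-word string is really a unit. That is the overlap I am trying to deal with.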
Any input much appreciated.