many repeated exact tweets, or very similar tweets, leading to long shared word sequences, often 9 or 10 or more words in a row.
One approach that came to mind was the Gopher paper, https://arxiv.org/abs/2112.11446, where they remove duplicate documents if the 13-gram Jaccard similarity is over 0.8. (The 13-grams exclude spaces and punctuation.)
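A minimal sketch of that comparison. The tokenization here (lowercasing and stripping everything but letters and digits) is my own assumption, not necessarily the paper's exact recipe:

```python
import re

def ngrams(text, n):
    # Lowercase, drop punctuation, split into words, then take word n-grams.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    # Jaccard similarity of two sets: |A & B| / |A | B|.
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

With this you would flag a pair of documents as duplicates when `jaccard(ngrams(doc1, 13), ngrams(doc2, 13)) > 0.8`. (Comparing all pairs directly is O(n^2); at scale you would want MinHash/LSH rather than exact Jaccard.)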
For tweets, if you are interested in overlaps of up to 10 words, you could extract the 11-grams and throw away any tweet that shares an identical 11-gram with a tweet you have already kept.
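That single pass could look something like this (again assuming simple word tokenization; a real pipeline would make the same normalization choices everywhere):

```python
import re

def word_ngrams(text, n):
    # Word-level n-grams after lowercasing and stripping punctuation.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def dedup_by_ngram(tweets, n=11):
    # Keep a tweet only if none of its n-grams has been seen before.
    seen = set()
    kept = []
    for tweet in tweets:
        grams = word_ngrams(tweet, n)
        if grams & seen:
            continue  # shares an n-gram with an earlier tweet: drop it
        seen |= grams
        kept.append(tweet)
    return kept
```

Note this keeps whichever duplicate arrives first, and a tweet shorter than n words has no n-grams at all, so it is always kept.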
If the size of the data set is what makes discovering and removing duplicate tweets hard, look into Bloom filters.
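For illustration, a toy Bloom filter: it answers "possibly seen" or "definitely not seen" in constant memory, at the cost of a tunable false-positive rate. For real workloads you would reach for a battle-tested library rather than rolling your own:

```python
import hashlib
from math import ceil, log

class BloomFilter:
    def __init__(self, capacity, error_rate=0.01):
        # Standard sizing: m bits and k hash functions for the target error rate.
        self.m = ceil(-capacity * log(error_rate) / (log(2) ** 2))
        self.k = max(1, round(self.m / capacity * log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(item))
```

To dedup, you would test each tweet (or each of its n-grams) against the filter before adding it: a miss guarantees it is new, while a hit means "probably a duplicate," which you can either accept or verify exactly.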
For a ready-made package, https://docs.dedupe.io/en/latest/ was the one that came up most in my search just now. (I don't know how it scales, though.)
HTH, Darren