Hi all, I'm analysing a corpus of tweets from institutions. For n-gram analysis it is quite unusual, in that there are many exact or near-duplicate tweets, producing long super-strings of often 9, 10, or more words. Naturally this makes accurate counting and classification difficult because of the overlapping substrings. Does anyone know of approaches or software that can count and classify n-grams under these circumstances? I am aware of the approaches outlined by Buerki (2017) and O'Donnell (2011), but these do not seem practical given the excessive length of the n-grams in this corpus. Does anyone know of any accessible methods or packages?
Any input much appreciated.
> many exact or near-duplicate tweets, producing long super-strings of often 9, 10, or more words.
One approach that came to mind is https://arxiv.org/abs/2112.11446 (the Gopher paper), where they remove a document as a duplicate if its 13-gram Jaccard similarity with another document is over 0.8. (The 13-grams exclude spaces and punctuation.)
For tweets, if you are interested in n-grams of up to 10 words, you could find the 11-grams, and throw away tweets that share an identical 11-gram?
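A rough Python sketch of both ideas (plain whitespace tokenisation and these exact parameters are my simplifications, not the paper's preprocessing):

    def ngrams(tokens, n):
        """All contiguous n-grams of a token list, as a set of tuples."""
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def jaccard(a, b):
        """Jaccard similarity of two sets (0.0 if both are empty)."""
        return len(a & b) / len(a | b) if (a or b) else 0.0

    def is_near_duplicate(text_a, text_b, n=13, threshold=0.8):
        """Gopher-style check: n-gram Jaccard similarity above a threshold.
        (The paper also strips punctuation before n-gramming; whitespace
        tokenisation here is a simplification.)"""
        return jaccard(ngrams(text_a.split(), n),
                       ngrams(text_b.split(), n)) > threshold

    def dedupe_by_shared_ngram(tweets, n=11):
        """Drop any tweet that repeats an n-gram already seen in a kept
        tweet. Tweets shorter than n tokens are always kept."""
        seen, kept = set(), []
        for tweet in tweets:
            grams = ngrams(tweet.split(), n)
            if grams & seen:
                continue      # shares an n-gram with an earlier kept tweet
            seen |= grams
            kept.append(tweet)
        return kept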
If data-set size is the problem when discovering and removing duplicate tweets, look into Bloom filters.
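For example, a minimal hand-rolled Bloom filter, so as not to assume any particular package; the sizes here are placeholders (in practice you would size m and k from the expected number of 11-grams and an acceptable false-positive rate), and note that a false positive means the occasional unique tweet gets wrongly discarded:

    import hashlib

    class BloomFilter:
        """Minimal Bloom filter: m bits, k salted blake2b hashes.
        May report false positives, never false negatives."""

        def __init__(self, m_bits=8_000_000, k=7):
            self.m, self.k = m_bits, k
            self.bits = bytearray(m_bits // 8 + 1)

        def _positions(self, item):
            for i in range(self.k):
                digest = hashlib.blake2b(item.encode("utf-8"),
                                         salt=i.to_bytes(8, "little")).digest()
                yield int.from_bytes(digest[:8], "little") % self.m

        def add(self, item):
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, item):
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(item))

    seen = BloomFilter()
    gram = "one example eleven gram joined into a single string for hashing"
    if gram not in seen:
        seen.add(gram)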
For a ready-made package, https://docs.dedupe.io/en/latest/ was the one that came up a lot in my search just now. (I don't know how it scales, though.)
HTH, Darren
> For tweets, if you are interested in n-grams of up to 10 words, you could find the 11-grams, and throw away tweets that share an identical 11-gram?
I use 11-grams to eliminate duplicate texts in the 17.5+ billion word NOW corpus (https://www.english-corpora.org/now/) from English-Corpora.org, which grows by about 6-8 million words (10,000+ texts) each day. This is done in SQL Server, which is the backbone of the corpora from English-Corpora.org (see https://www.english-corpora.org/help/architecture.pdf). All of the processing of the texts (generating URLs, downloading texts, deleting duplicates via 11-grams, PoS tagging, insertion into the existing corpus, etc.) is done automatically every night by a customized pipeline that I've created.
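For anyone who wants to try the same idea at small scale, here is a toy Python/SQLite sketch of the incremental 11-gram step. It is not the SQL Server pipeline described above; the table layout and names are invented for illustration:

    import sqlite3

    def nightly_dedupe(db_path, new_texts, n=11):
        """Keep only texts that share no n-gram with the corpus so far;
        seen n-grams persist in SQLite between runs."""
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS grams (g TEXT PRIMARY KEY)")
        kept = []
        for text in new_texts:
            toks = text.split()
            grams = {" ".join(toks[i:i + n])
                     for i in range(len(toks) - n + 1)}
            already = any(
                con.execute("SELECT 1 FROM grams WHERE g = ?", (g,)).fetchone()
                for g in grams)
            if already:
                continue
            con.executemany("INSERT OR IGNORE INTO grams VALUES (?)",
                            [(g,) for g in grams])
            kept.append(text)
        con.commit()
        con.close()
        return kept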
Mark Davies
Hi David,
You can look into the ROUGE package, which does n-gram-based matching: https://huggingface.co/spaces/evaluate-metric/rouge
You will find the actual n-gram computation code here: https://github.com/google-research/google-research/blob/master/rouge/rouge_s...
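For example, via the rouge-score pip package, which packages the Google code linked above:

    # pip install rouge-score
    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                      use_stemmer=False)
    scores = scorer.score("institution announces new funding round today",
                          "institution announces new funding round")
    print(scores["rouge2"].fmeasure)  # bigram-overlap F1 between two tweets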
Also, fuzzy string matching is another efficient approach for approximate substring matching: https://github.com/seatgeek/thefuzz
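A quick illustration (the example tweets are invented):

    # pip install thefuzz
    from thefuzz import fuzz, process

    a = "We are delighted to announce our new partnership."
    b = "We are delighted to announce our new partnership with ACME!"

    print(fuzz.ratio(a, b))            # Levenshtein-based similarity, 0-100
    print(fuzz.token_set_ratio(a, b))  # tolerant of extra/reordered tokens

    # Rank likely duplicates of one tweet within a collection:
    pool = [b, "Completely unrelated tweet about the weather."]
    print(process.extract(a, pool, limit=2))  # [(tweet, score), ...]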
Thanks, Mousumi
Hi David,
What is the reason/purpose of your n-gram analysis or classification task? What software have you been using? Why not just use exact string matching by characters? (Take the log if length issues get out of hand?)
Best Ada
On Mon, Jun 26, 2023 at 1:45 PM David Beauchamp via Corpora <corpora@list.elra.info> wrote:
> Thanks for the suggestion, will look into it.
Hello, we have used the Apriori algorithm to detect long identical text passages (https://link.springer.com/chapter/10.1007/978-3-030-86159-9_34). That works quite well. I am not sure whether Frieda Jsi published the code, but it is quite easy to implement, or I can send you the code.
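For illustration, a minimal sketch of the Apriori idea applied to n-grams (not the code from the paper; support is counted here as raw occurrences rather than document frequency): a (k+1)-gram can only be frequent if both of its k-gram sub-sequences are, so candidates are grown one level at a time and pruned early.

    from collections import Counter

    def frequent_ngrams(texts, min_support=2, max_n=30):
        """Level-wise (Apriori-style) frequent n-gram mining.
        Returns {n: {ngram_tuple: count}} for every n with survivors."""
        token_lists = [t.split() for t in texts]
        counts = Counter((w,) for toks in token_lists for w in toks)
        frequent = {g for g, c in counts.items() if c >= min_support}
        result, k = {}, 1
        while frequent and k <= max_n:
            result[k] = {g: counts[g] for g in frequent}
            k += 1
            if k > max_n:
                break
            counts = Counter()
            for toks in token_lists:
                for i in range(len(toks) - k + 1):
                    g = tuple(toks[i:i + k])
                    # Apriori pruning: both (k-1)-gram sub-sequences must
                    # themselves be frequent.
                    if g[:-1] in frequent and g[1:] in frequent:
                        counts[g] += 1
            frequent = {g for g, c in counts.items() if c >= min_support}
        return result

    tweets = ["new report out now read the full report here",
              "new report out now read the full report here",
              "totally different tweet"]
    print(frequent_ngrams(tweets, min_support=2)[9])
    # {('new', ..., 'here'): 2} -- the repeated 9-word super-string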
Best Christian
Dear David,
You might have solved it in the meantime, but if not:
- If the task is to deduplicate, have a look at onion (https://corpus.tools/wiki/Onion).
- If you only need to count, you can make a corpus in Sketch Engine to do the calculation; we use a suffix array to compute n-grams up to length 20 by default, following:
Yamamoto and Church: Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus. Computational Linguistics 27(1), March 2001, pp. 1-30. http://www.aclweb.org/anthology/J01-1001
The interface displays n-grams up to length 6 (though up to length 20 is computed); let me know if you need to display longer ones too.
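If you want to experiment outside Sketch Engine, here is a didactic Python sketch of the suffix-array idea from the paper (the naive sort makes it O(n^2 log n), nothing like a production implementation): in the sorted suffix array, each distinct n-gram corresponds to one run of adjacent suffixes sharing at least n leading tokens, and the length of that run is its frequency.

    def suffix_array(tokens):
        """Naive suffix array over word tokens (didactic only)."""
        return sorted(range(len(tokens)), key=lambda i: tokens[i:])

    def lcp_array(tokens, sa):
        """lcp[i] = number of leading tokens shared by the suffixes
        starting at sa[i-1] and sa[i]."""
        lcp = [0] * len(sa)
        for i in range(1, len(sa)):
            a, b, k = sa[i - 1], sa[i], 0
            while (a + k < len(tokens) and b + k < len(tokens)
                   and tokens[a + k] == tokens[b + k]):
                k += 1
            lcp[i] = k
        return lcp

    def ngram_frequencies(tokens, max_n=20):
        """Frequency of every distinct n-gram for n = 1..max_n."""
        sa = suffix_array(tokens)
        lcp = lcp_array(tokens, sa)
        freqs = {n: {} for n in range(1, max_n + 1)}
        for n in range(1, max_n + 1):
            i = 0
            while i < len(sa):
                if len(tokens) - sa[i] < n:  # suffix too short for an n-gram
                    i += 1
                    continue
                j = i + 1    # extend the run sharing the same first n tokens
                while j < len(sa) and lcp[j] >= n:
                    j += 1
                freqs[n][tuple(tokens[sa[i]:sa[i] + n])] = j - i
                i = j
        return freqs

    tokens = "we are hiring we are hiring apply now".split()
    print(ngram_frequencies(tokens, max_n=3)[3])
    # {('are','hiring','apply'): 1, ..., ('we','are','hiring'): 2}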
Best regards, Milos Jakubicek
CEO, Lexical Computing Brno, CZ | Brighton, UK http://www.lexicalcomputing.com http://www.sketchengine.eu