> For tweets, if you are interested in up to 10-grams, you could find the
> 11-grams, and throw away tweets that have an identical 11-gram?
I use 11-grams to eliminate duplicate texts in the 17.5+ billion-word NOW corpus (https://www.english-corpora.org/now/) from English-Corpora.org, which grows by about 6-8 million words (10,000+ texts) each day. This is done in SQL Server, which is the backbone (https://www.english-corpora.org/help/architecture.pdf) of the corpora from English-Corpora.org (http://english-corpora.org/). All of the processing of the texts (including generating URLs, downloading texts, deleting duplicates via 11-grams, PoS tagging, inserting into the existing corpus, etc.) is done automatically every night by a customized pipeline that I've created.
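The SQL Server implementation itself isn't shown here, but the core 11-gram check is easy to sketch. A minimal Python version, assuming simple whitespace tokenization and an in-memory set of 11-gram hashes standing in for the database table (the real pipeline's tokenization and storage will of course differ):

import hashlib

def ngrams(tokens, n=11):
    """Yield successive n-grams (as token tuples) from a token list."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

def is_duplicate(text, seen, n=11):
    """Return True if any n-gram of `text` has been seen before;
    otherwise register the text's n-grams in `seen` and return False."""
    tokens = text.lower().split()
    hashes = [hashlib.md5(" ".join(g).encode()).hexdigest()
              for g in ngrams(tokens, n)]
    if any(h in seen for h in hashes):
        return True
    seen.update(hashes)
    return False

# Nightly batch: keep only texts that share no 11-gram with earlier texts.
seen = set()
days_texts = [
    "the quick brown fox jumps over the lazy dog again and again today",
    "the quick brown fox jumps over the lazy dog again and again today too",
]
new_texts = [t for t in days_texts if not is_duplicate(t, seen)]
print(new_texts)  # only the first text survives

Note that texts shorter than 11 tokens produce no 11-grams and are always kept, which is the right behavior for this kind of check.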
Mark Davies
On Fri, Jun 23, 2023 at 8:55 AM Darren Cook via Corpora <corpora@list.elra.info> wrote:
> many repeated exact tweets, or very similar tweets, leading to long super-strings of often 9 or 10 or more words together.
One approach that came to mind was https://arxiv.org/abs/2112.11446, where they remove duplicate documents if the 13-gram Jaccard similarity is over 0.8. (The 13-grams exclude spaces and punctuation.)
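For concreteness, here is roughly what that criterion might look like. This is a sketch only: I haven't checked the paper's exact normalization, and at scale you would use something like MinHash/LSH rather than comparing every pair of documents:

import re

def word_ngrams(text, n=13):
    """Word 13-grams over lowercased text with punctuation stripped."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def near_duplicates(doc_a, doc_b, threshold=0.8):
    """True if the documents' 13-gram Jaccard similarity exceeds threshold."""
    return jaccard(word_ngrams(doc_a), word_ngrams(doc_b)) > threshold

# Two docs sharing almost all of their 13-grams count as duplicates.
a = " ".join(f"w{i}" for i in range(20))
b = a + " w20"
print(near_duplicates(a, b))  # True: 8 of 9 distinct 13-grams are shared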
For tweets, if you are interested in up to 10-grams, you could find the 11-grams, and throw away tweets that have an identical 11-gram?
If data-set size is the obstacle to discovering and removing duplicate tweets, look into Bloom filters.
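A Bloom filter gives a constant-memory "have I seen this?" test, with no false negatives and a small, tunable false-positive rate (occasionally a genuinely new tweet gets wrongly flagged as a duplicate). A self-contained Python sketch, with the bit-array size and hash count chosen purely for illustration:

import hashlib

class BloomFilter:
    def __init__(self, size_bits=8 * 1024 * 1024, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        """Derive num_hashes bit positions from salted SHA-256 digests."""
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

seen = BloomFilter()

def keep(tweet):
    """Keep a tweet only if its normalized form hasn't (probably) been seen."""
    key = " ".join(tweet.lower().split())
    if key in seen:
        return False  # probably a duplicate; tiny chance of a false positive
    seen.add(key)
    return True

For tens of millions of tweets you would size the filter from the expected item count and target error rate, or reach for an existing library rather than this toy.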
For a ready-made package, https://docs.dedupe.io/en/latest/ was the one that came up most often in my search just now. (I don't know how it scales, though.)
HTH,
Darren