Hi all, I'm analysing a corpus of tweets from institutions. For n-gram analysis it is quite unusual, in that there are many exact or near-duplicate tweets, producing long super-strings of often 9, 10, or more words. Naturally this makes accurate counting and classification difficult because of the overlapping substrings. Does anyone know of approaches or software that can count and classify n-grams under these circumstances? I am aware of the approaches outlined by Buerki (2017) and O'Donnell (2011), but these do not seem practical given the excessive length of the n-grams in this corpus. Does anyone know of any accessible methods or packages?
Any input much appreciated.
> many exact or near-duplicate tweets, producing long super-strings of often 9, 10, or more words.
One approach that came to mind is https://arxiv.org/abs/2112.11446 (the Gopher paper), where they remove a document as a duplicate if its 13-gram Jaccard similarity with another document is over 0.8. (The 13-grams exclude spaces and punctuation.)
For tweets, if you are interested in n-grams of up to 10 words, you could find the 11-grams, and throw away tweets that share an identical 11-gram?
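A rough Python sketch of both ideas (plain whitespace tokenisation and these exact parameters are my simplifications, not the paper's preprocessing):

    def ngrams(tokens, n):
        """All contiguous n-grams of a token list, as a set of tuples."""
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def jaccard(a, b):
        """Jaccard similarity of two sets (0.0 if both are empty)."""
        return len(a & b) / len(a | b) if (a or b) else 0.0

    def is_near_duplicate(text_a, text_b, n=13, threshold=0.8):
        """Gopher-style check: n-gram Jaccard similarity above a threshold.
        (The paper also strips punctuation before n-gramming; whitespace
        tokenisation here is a simplification.)"""
        return jaccard(ngrams(text_a.split(), n),
                       ngrams(text_b.split(), n)) > threshold

    def dedupe_by_shared_ngram(tweets, n=11):
        """Drop any tweet that repeats an n-gram already seen in a kept
        tweet. Tweets shorter than n tokens are always kept."""
        seen, kept = set(), []
        for tweet in tweets:
            grams = ngrams(tweet.split(), n)
            if grams & seen:
                continue      # shares an n-gram with an earlier kept tweet
            seen |= grams
            kept.append(tweet)
        return kept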
If data-set size is the problem when discovering and removing duplicate tweets, look into Bloom filters.
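For example, a minimal hand-rolled Bloom filter, so as not to assume any particular package; the sizes here are placeholders (in practice you would size m and k from the expected number of 11-grams and an acceptable false-positive rate), and note that a false positive means the occasional unique tweet gets wrongly discarded:

    import hashlib

    class BloomFilter:
        """Minimal Bloom filter: m bits, k salted blake2b hashes.
        May report false positives, never false negatives."""

        def __init__(self, m_bits=8_000_000, k=7):
            self.m, self.k = m_bits, k
            self.bits = bytearray(m_bits // 8 + 1)

        def _positions(self, item):
            for i in range(self.k):
                digest = hashlib.blake2b(item.encode("utf-8"),
                                         salt=i.to_bytes(8, "little")).digest()
                yield int.from_bytes(digest[:8], "little") % self.m

        def add(self, item):
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, item):
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(item))

    seen = BloomFilter()
    gram = "one example eleven gram joined into a single string for hashing"
    if gram not in seen:
        seen.add(gram)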
For a ready-made package, https://docs.dedupe.io/en/latest/ was the one that came up a lot in my search just now. (I don't know how it scales, though.)
HTH, Darren
> For tweets, if you are interested in n-grams of up to 10 words, you could find the 11-grams, and throw away tweets that share an identical 11-gram?
I use 11-grams to eliminate duplicate texts in the 17.5+ billion word NOW corpus (https://www.english-corpora.org/now/) from English-Corpora.org, which grows by about 6-8 million words (10,000+ texts) each day. This is done in SQL Server, which is the backbone of the corpora from English-Corpora.org (see https://www.english-corpora.org/help/architecture.pdf). All of the processing of the texts (generating URLs, downloading texts, deleting duplicates via 11-grams, PoS tagging, insertion into the existing corpus, etc.) is done automatically every night by a customized pipeline that I've created.
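For anyone who wants to try the same idea at small scale, here is a toy Python/SQLite sketch of the incremental 11-gram step. It is not the SQL Server pipeline described above; the table layout and names are invented for illustration:

    import sqlite3

    def nightly_dedupe(db_path, new_texts, n=11):
        """Keep only texts that share no n-gram with the corpus so far;
        seen n-grams persist in SQLite between runs."""
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS grams (g TEXT PRIMARY KEY)")
        kept = []
        for text in new_texts:
            toks = text.split()
            grams = {" ".join(toks[i:i + n])
                     for i in range(len(toks) - n + 1)}
            already = any(
                con.execute("SELECT 1 FROM grams WHERE g = ?", (g,)).fetchone()
                for g in grams)
            if already:
                continue
            con.executemany("INSERT OR IGNORE INTO grams VALUES (?)",
                            [(g,) for g in grams])
            kept.append(text)
        con.commit()
        con.close()
        return kept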
Mark Davies
Hi David,
You can look into the ROUGE package, which does n-gram-based matching: https://huggingface.co/spaces/evaluate-metric/rouge
You will find the actual n-gram computation code here: https://github.com/google-research/google-research/blob/master/rouge/rouge_s...
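For example, via the rouge-score pip package, which packages the Google code linked above:

    # pip install rouge-score
    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                      use_stemmer=False)
    scores = scorer.score("institution announces new funding round today",
                          "institution announces new funding round")
    print(scores["rouge2"].fmeasure)  # bigram-overlap F1 between two tweets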
Also, fuzzy string matching is another efficient approach for approximate substring matching: https://github.com/seatgeek/thefuzz
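A quick illustration (the example tweets are invented):

    # pip install thefuzz
    from thefuzz import fuzz, process

    a = "We are delighted to announce our new partnership."
    b = "We are delighted to announce our new partnership with ACME!"

    print(fuzz.ratio(a, b))            # Levenshtein-based similarity, 0-100
    print(fuzz.token_set_ratio(a, b))  # tolerant of extra/reordered tokens

    # Rank likely duplicates of one tweet within a collection:
    pool = [b, "Completely unrelated tweet about the weather."]
    print(process.extract(a, pool, limit=2))  # [(tweet, score), ...]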
Thanks, Mousumi
Hi David,
What is the reason/purpose of your n-gram analysis or classification task? What software have you been using? Why not just use exact string matching by characters? (Take the log if length issues get out of hand?)
Best Ada
On Mon, Jun 26, 2023 at 1:45 PM David Beauchamp via Corpora <corpora@list.elra.info> wrote:
> Thanks for the suggestion, will look into it.
Hello, we have used the Apriori algorithm to detect long identical text passages (https://link.springer.com/chapter/10.1007/978-3-030-86159-9_34). That works quite well. I am not sure whether Frieda Jsi published the code, but it is quite easy to implement, or I can send you the code.
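For illustration, a minimal sketch of the Apriori idea applied to n-grams (not the code from the paper; support is counted here as raw occurrences rather than document frequency): a (k+1)-gram can only be frequent if both of its k-gram sub-sequences are, so candidates are grown one level at a time and pruned early.

    from collections import Counter

    def frequent_ngrams(texts, min_support=2, max_n=30):
        """Level-wise (Apriori-style) frequent n-gram mining.
        Returns {n: {ngram_tuple: count}} for every n with survivors."""
        token_lists = [t.split() for t in texts]
        counts = Counter((w,) for toks in token_lists for w in toks)
        frequent = {g for g, c in counts.items() if c >= min_support}
        result, k = {}, 1
        while frequent and k <= max_n:
            result[k] = {g: counts[g] for g in frequent}
            k += 1
            if k > max_n:
                break
            counts = Counter()
            for toks in token_lists:
                for i in range(len(toks) - k + 1):
                    g = tuple(toks[i:i + k])
                    # Apriori pruning: both (k-1)-gram sub-sequences must
                    # themselves be frequent.
                    if g[:-1] in frequent and g[1:] in frequent:
                        counts[g] += 1
            frequent = {g for g, c in counts.items() if c >= min_support}
        return result

    tweets = ["new report out now read the full report here",
              "new report out now read the full report here",
              "totally different tweet"]
    print(frequent_ngrams(tweets, min_support=2)[9])
    # {('new', ..., 'here'): 2} -- the repeated 9-word super-string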
Best Christian
Dear David,
You might have solved it in the meantime, but if not:
- If the task is to deduplicate, have a look at onion (https://corpus.tools/wiki/Onion).
- If you only need to count, you can make a corpus in Sketch Engine to do the calculation; we use a suffix array to compute n-grams up to length 20 by default, following:
Yamamoto and Church: Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus. Computational Linguistics 27(1), March 2001, pp. 1-30. http://www.aclweb.org/anthology/J01-1001
The interface displays n-grams up to length 6 (though up to length 20 is computed); let me know if you need to display longer ones too.
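If you want to experiment outside Sketch Engine, here is a didactic Python sketch of the suffix-array idea from the paper (the naive sort makes it O(n^2 log n), nothing like a production implementation): in the sorted suffix array, each distinct n-gram corresponds to one run of adjacent suffixes sharing at least n leading tokens, and the length of that run is its frequency.

    def suffix_array(tokens):
        """Naive suffix array over word tokens (didactic only)."""
        return sorted(range(len(tokens)), key=lambda i: tokens[i:])

    def lcp_array(tokens, sa):
        """lcp[i] = number of leading tokens shared by the suffixes
        starting at sa[i-1] and sa[i]."""
        lcp = [0] * len(sa)
        for i in range(1, len(sa)):
            a, b, k = sa[i - 1], sa[i], 0
            while (a + k < len(tokens) and b + k < len(tokens)
                   and tokens[a + k] == tokens[b + k]):
                k += 1
            lcp[i] = k
        return lcp

    def ngram_frequencies(tokens, max_n=20):
        """Frequency of every distinct n-gram for n = 1..max_n."""
        sa = suffix_array(tokens)
        lcp = lcp_array(tokens, sa)
        freqs = {n: {} for n in range(1, max_n + 1)}
        for n in range(1, max_n + 1):
            i = 0
            while i < len(sa):
                if len(tokens) - sa[i] < n:  # suffix too short for an n-gram
                    i += 1
                    continue
                j = i + 1    # extend the run sharing the same first n tokens
                while j < len(sa) and lcp[j] >= n:
                    j += 1
                freqs[n][tuple(tokens[sa[i]:sa[i] + n])] = j - i
                i = j
        return freqs

    tokens = "we are hiring we are hiring apply now".split()
    print(ngram_frequencies(tokens, max_n=3)[3])
    # {('are','hiring','apply'): 1, ..., ('we','are','hiring'): 2}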
Best regards, Milos Jakubicek
CEO, Lexical Computing Brno, CZ | Brighton, UK http://www.lexicalcomputing.com http://www.sketchengine.eu