Hello,
I'm looking for any open source or cloud-hosted solution for complex word identification or word difficulty rating in French for a reading application.
As a backup plan we can use measures like corpus frequency, length, number of senses, but we're hoping someone has already made a tool available.
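(For reference, a minimal sketch of that fallback is included below. It assumes the third-party wordfreq package for French Zipf frequencies; the 0.7/0.3 weighting and the length cap are arbitrary illustrative choices rather than a validated difficulty model, and the sense-count feature is omitted.)

# Minimal fallback difficulty scorer for French tokens (illustrative only).
# Assumes: pip install wordfreq
from wordfreq import zipf_frequency

def difficulty(word: str) -> float:
    """Combine corpus frequency and length into a rough 0-1 difficulty score."""
    zipf = zipf_frequency(word, "fr")            # ~0 (unseen) to ~7+ (very common)
    rarity = 1.0 - min(zipf, 7.0) / 7.0          # 0 = very common, 1 = very rare
    length_penalty = min(len(word), 15) / 15.0   # longer words tend to be harder
    return 0.7 * rarity + 0.3 * length_penalty   # arbitrary weighting

if __name__ == "__main__":
    for w in ["chat", "maison", "paradoxalement", "anticonstitutionnellement"]:
        print(f"{w:28s} {difficulty(w):.2f}")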
We found this but that's it: https://github.com/sheffieldnlp/cwi
Would appreciate any tips!
Thanks,
Chris
Christopher Collins [he/him: https://medium.com/gender-inclusivit/why-i-put-pronouns-on-my-email-signature-and-linkedin-profile-and-you-should-too-d3dc942c8743]
Associate Professor - Faculty of Science
Canada Research Chair in Linguistic Information Visualization
Ontario Tech University
vialab.ca (http://vialab.ca/)
Hi Christopher,
It is in the best interest of the community to discontinue the use of "word". The term is not only shaky in its foundations (if it has any), but it can also produce disparities in performance in computational processing, and in robustness when human evaluation is involved. Although the term has been casually adopted by many in the past, like many un-PC terms with an inappropriate undertone, it should be discouraged and abandoned. Last but not least, I noticed that you are located in Canada; in the event that you work with any Indigenous communities, one MUST be careful with the use of such a term --- you could be imposing your own (EN-/FR-/dominant-language-centric) view onto another individual or community. There is an element of cultural and linguistic hegemony in the use of such a term (including, but not limited to, building applications with it). Please also consult recent work in this area: https://openreview.net/forum?id=-llS6TiOew.
Feel free to get in touch if you should have any questions.
Best, Ada
“We’re destroying words–scores of them, hundreds of them, every day. We’re cutting the language down to the bone.” […]
“It’s a beautiful thing, the destruction of words. Of course the great advantage is in the verbs and adjectives, but there are hundreds of nouns that can be got rid of as well. It isn’t only the synonyms; there are also the antonyms. After all, what justification is there for a word which is simply the opposite of some other words? A word contains its opposite in itself. Take ‘good,’ for instance. If you have a word like ‘good,’ what need is there for a word like ‘bad’? ‘Ungood’ will do just as well–better, because it’s an exact opposite, which the other is not. Or again, if you want a stronger version of ‘good,’ what sense is there in having a whole string of vague useless words like ‘excellent’ and ‘splendid’ and all the rest of them? ‘Plusgood’ covers the meaning, or ‘doubleplusgood’ if you want something stronger still. Of course we use those forms already, but in the final version of Newspeak there’ll be nothing else. In the end the whole notion of goodness and badness will be covered by only six words–in reality, only one word. Don’t you see the beauty of that, Ada?…”
George Orwell, 1984
(I also expanded on a point in a Twitter reply today regarding the granularity of one's thinking/processing. Please feel free to read that as well.)
One can think of it in a less binary manner --- not "good" vs. "bad", not "words" then "sentences", but as an utterance/sequence with all the finer connections in between... That is the beauty of language --- from a "philological" point of view.
I am not sure, though, whether you were speaking from a scientific perspective, because I have a paper to back up my argument in that regard.
Looks as if Linguistlist is in need of some scientific enlightenment as well:
http://linguistlist.org/issues/33/33-2063.html
"In the new, thoroughly revised second edition of W*rds of Wonder: Endangered Languages and What They Tell Us, Second Edition (formerly called Dying W*rds: Endangered Languages and What They Have to Tell Us), renowned scholar Nicholas Evans delivers an accessible and incisive text covering the impact of mass language endangerment. The distinguished author explores issues surrounding the preservation of indigenous languages, ..."
(ungood w*rds unw*rded to protect the faint of mind against ungood thinking/processing).
Best,
DH
@Daniel: Yeah, I think our whole field could benefit from a curriculum update...
Not to mention all these shamefully unscientific posts on Corporalist:
12th International Global W*rdnet Conference, Donostia / San Sebastian, Basque Country, 23-27, 2023. Global W*rdnet Association: www.globalw*rdnet.org. Conference website: https://hitz.eus/gwc2023
18th Workshop on Multiw*rd Expressions (MWE 2022), organized and sponsored by SIGLEX, the Special Interest Group on the Lexicon of the ACL
The 5th Workshop on Multi-w*rd Units in Machine Translation and Translation Technology (MUMTTT 2022), Malaga, 30th September 2022
...
Definitely time for some lexical/terminological restrictions/updates, for the sake of goodthink/processing, and science!
(actually "science" is heretical/redundant, "goodthink/processing" will do the job:
/"As we have already seen in the case of the word FREE, w*rds which had once borne a heretical meaning were sometimes retained for the sake of convenience, but only with the undesirable meanings purged out of them. Countless other w*rds such as HONOUR, JUSTICE, MORALITY, INTERNATIONALISM, DEMOCRACY, SCIENCE, and RELIGION had simply ceased to exist."/)
DH
Just to clarify my position, I don't actually think that the English lexeme "w*rd" is easy to define, precise, or theoretically well-founded (I prefer "lexeme" here, as Ada's earlier use of "term" is improper from a Wüsterian point of view, given that "w*rd" lacks distinctive traits due to its notorious ambiguity).
The situation is similar in mathematics, where "number" is used to denote a variety of concepts such as natural numbers, integers, fractions, real numbers, irrational numbers, imaginary numbers … which may be inclusive or exclusive of each other. There are thus numerous contexts in which colloquial use of the w*rd "number" would be imprecise, inappropriate and might even lead to confusion. Nonetheless, I'm not aware of any mathematicians who advocate censorship of the w*rd "number".
If “w*rd” lacks a clear definition and a clear theoretical foundation (which I actually agree with), then it can't really be used as a “term” until the concept has been given an adequate definition in relation to other terms within the relevant domain or theoretical framework.
On the other hand, though precise terminology is always preferable whenever and wherever precision is necessary, there's nothing ever to be gained scientifically through censorship (sorry to use an ungood w*rd, but, in all earnestness, when I see a spade I call it a “spade”).
DH
I used "term" bc it makes room for a little bit of (mental) shifting for some ppl... Everyone (non-specialists included) uses "w*rd". Nothing is 100% --- when it comes to "language" or abstract concepts (or everything in the empirical world?), but 99% is better than 98 or 60%. (E.g. we may have 99% of known lgs in character encoding down vs. a very shaky never-ending-story with w-segmentation, not even for one language.)
Yeah... I really don't know what to do with "the resistance", "the ignorance" (as in both the practice of intentionally ignoring my results, and otherwise), etc. Many of us are so used to both naming and processing at such granularity... it'd take the whole world for change to happen, and I hope we can work together instead of against each other.
If we sustain the w hackery, we are hurting ourselves and our students in the end. From the tech point of view, things are getting ever finer and our hardware can take on more; I don't know what people have in mind if/when they want our next generation to be quibbling about things like "is ... a w*rd?" or whether the w boundary should be placed exactly here or there....
I suggested in "Fairness in Representation https://openreview.net/pdf?id=-llS6TiOew" that we should use clearer nomenclature: "one can simply describe languages and their statistical profiles with respect to their representational granularity in characters or bytes (which are and/or can be exhaustively standardized in computing), or refer to sequences as longer/shorter or having a higher/lower vocabulary size when comparing them with each other, rather than “richer”/“poorer” based on concepts (e.g. “words”, “sentences”) that can be ambiguous, contested, and inaccessible to many."
And if our downstream-task results are already that good, there will always be room for content creation. That's also what I tried to advocate in the paper: quality multiway datasets for science, evaluation, and documentation --- I wish the world would give more worth to data. There is a lot of room to translate our engineering back into science, to re-educate students in the language sciences/technologies to become more statistics-savvy, and to re-educate the world about fairness in multilinguality...
WAR IS PEACE
FREEDOM IS SLAVERY
IGNORANCE IS STRENGTH
You are a flaw in the pattern, Winston. You are a stain that must be wiped out. Did I not tell you just now that we are different from the persecutors of the past? We are not content with negative obedience, nor even with the most abject submission. When finally you surrender to us, it must be of your own free will. We do not destroy the heretic because he resists us: so long as he resists us we never destroy him. We convert him, we capture his inner mind, we reshape him. We burn all evil and all illusion out of him; we bring him over to our side, not in appearance, but genuinely, heart and soul. We make him one of ourselves before we kill him. It is intolerable to us that an erroneous thought should exist anywhere in the world, however secret and powerless it may be.
(G. Orwell, 1984)
The notion of 'word' has its difficulties in linguistics, but not enough to warrant abandoning it.
The argument from the paper "Fairness in Representation for Multilingual NLP" is not convincing at all. Even if the early findings are correct for transformers, their applicability to the human language faculty is not yet supported.
On the other hand, it is not even needed. Developmental linguists noted long ago that babies acquire all natural languages at approximately the same rate (under some 'standard conditions'), despite vast morphological and other differences between languages. Thus, in some sense, all natural human languages are already deemed 'equal' vis-à-vis acquisition complexity.
For language learning later in life, if one's native language is morphologically rich, learning (some types of) morphologically rich languages as an adult is a bit easier than learning a language that is very different, etc.
The complexity of words in a language for non-native speakers/learners is actually a big issue and a field of research in EFL (and now in NLP as well).
Finally, word complexity is often defined within the same language (e.g. able/ability, function/dysfunctional), so a notion of cross-linguistic hegemony or malice is not even applicable here.
MF
On Tue, Jun 21, 2022 at 12:14 AM Flor, Michael <MFlor@ets.org> wrote:

*The notion of 'word' has difficulties in linguistics. But not enough for abandoning it.*

Except we don't need it at all --- for both human and machine processing.

*The argument from the paper "Fairness in Representation for Multilingual NLP" is not convincing at all. Even if the early findings are correct for transformers, applicability to the human language faculty is not yet supported.*

Right, this paper version has not yet addressed the whole story, which I have yet to continue with. But one can get the gist from conditional probability, context, and finer granularity.

*On the other hand, it is not even needed. Developmental linguists have noted long ago that babies acquire all natural languages at approximately the same rate (under some 'standard conditions'), despite vast morphological and other differences between languages. Thus, in some sense, all natural human languages are already deemed 'equal' vis-a-vis acquisition complexity.*

Well, talk to the NLP crowd, or the ones who expect that LM/MT results from different languages should have different performances, even if/when all else were equal. (I remember how hard and how many rounds I had to work for my rebuttals....)

*For language learning later in life, if one's native language is morphologically rich, learning (some types of) morphologically rich languages (as an adult) is a bit easier than learning a language that is very different, etc.*

That's the thing about this paper --- my personal take with L_n learning is that, no, it's actually also just a length and vocabulary thing wrt whatever one is used to (e.g. with L1), the environment/support available, and +/- personal propensity towards a new language.

*Complexity of words in a language for non-native speakers/learners is actually a big issue and a field of research in EFL (and now in NLP as well).*

See above.

*Finally, word complexity is often defined within the same language (e.g. able-ability, function-dysfunctional), and so a notion of cross-linguistic hegemony or malice is not even applicable here.*

What would it take for me to convince you that such "complexity" really boils down to just length and vocab (think of the examples you gave, viewed from, say, a character perspective)? E.g. is 'Xjfewijpiweoheymqaweopaf'h' more or less complex than 'multiple-dysfunction-prone' to you?
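A minimal sketch of that character perspective, for illustration only (the tiny reference list is a made-up placeholder; a real estimate would come from a large corpus): score each string by its length and by how surprising its character transitions are under the reference sample.

from collections import Counter
import math

# Stand-in reference sample; in practice, use a large corpus.
reference = ["multiple", "dysfunction", "prone", "function", "able", "ability",
             "reading", "application", "frequency", "length", "senses"]

bigrams = Counter()
contexts = Counter()
for w in reference:
    chars = ["<"] + list(w.lower()) + [">"]
    for a, b in zip(chars, chars[1:]):
        bigrams[(a, b)] += 1
        contexts[a] += 1

def bits_per_char(s, alpha=1.0, charset=30):
    """Mean add-alpha-smoothed character-bigram surprisal, in bits."""
    chars = ["<"] + list(s.lower()) + [">"]
    total = 0.0
    for a, b in zip(chars, chars[1:]):
        p = (bigrams[(a, b)] + alpha) / (contexts[a] + alpha * charset)
        total += -math.log2(p)
    return total / (len(chars) - 1)

for s in ["Xjfewijpiweoheymqaweopaf'h", "multiple-dysfunction-prone"]:
    print(f"{s!r}: length {len(s)}, {bits_per_char(s):.2f} bits/char")

On any ordinary EN reference sample, the random string should score markedly higher per character as well as being lexically alien, which is the sense in which "length and vocab" is meant here.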
Hi Ada,
In the English language, "words" are a thing. Children are taught to place spaces between "words". You're not going to undo a millennium's worth of English writing by discouraging the use of words.
Much of Latin was written without blank spaces to denote word boundaries. In Chinese writing, there are no blank spaces to denote word boundaries. There's assorted NLP software that attempts to guess where those blanks might be, so that Chinese can be segmented and passed on to other NLP pipeline stages.
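For illustration, a minimal example of such software, using the jieba segmenter (one tool among several; the sentence and the segmentation shown are my own example, and the output is only one defensible split):

# Guessing "word" boundaries in unspaced Chinese text with jieba.
# Requires: pip install jieba
import jieba

text = "我们在讨论词的概念"  # "We are discussing the concept of 'word'"
print(list(jieba.cut(text)))
# One plausible output: ['我们', '在', '讨论', '词', '的', '概念']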
When we speak, verbally, we don't put in "blanks" between words, although there are sometimes pauses. Realistic text-to-speech software NEVER vocalizes words individually; instead it ALWAYS vocalizes the transition between words and places the break within a single phoneme (I hope it's clear what I am saying here). Thus, from the point of view of text-to-speech software, words don't exist as separate units, because blending across their boundaries is a fundamental requirement for normal-sounding speech. (For English.)
Now that we live in the world of statistics and deep learning and whatnot, it's become clear that an audio stream of human speech has some parts that are "highly conserved" (require certain sounds to follow) and other regions which are flexible (just about any other sound can follow). And plenty of stuff in the middle between these two extremes. Surprisingly (or not surprisingly, depending on who you are) the highly variable regions are not word boundaries. Except when ... there are ... well, exceptions.
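A toy text analogue of that "conserved vs flexible" observation, purely illustrative (characters instead of audio, and a made-up sample sentence; real speech work would operate on acoustic representations): count how many distinct continuations each short left context has been seen with.

from collections import defaultdict

sample = "the cat sat on the mat and the cat ran to the man"
K = 3  # length of the left context, in characters
followers = defaultdict(set)
for i in range(K, len(sample)):
    followers[sample[i - K:i]].add(sample[i])

probe = "the cat sat"
for i in range(K, len(probe)):
    ctx = probe[i - K:i]
    print(f"after {ctx!r}: {len(followers[ctx])} distinct continuation(s) seen")
# Low counts mark "conserved" stretches, high counts "flexible" ones; the
# peaks need not line up with the spaces.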
However, right now, I am not communicating verbally, and so I am faced with the task of converting thoughts into sequences of (discrete) symbols. As I learned in first grade, I do this by placing typed spaces between words.
Sure, the concept of "word" may be quite inappropriate for some obscure languages. This is entirely plausible, as any "synthetic" language defies the concept of "word": Finnish and Lithuanian are full of "words" on the scale of "antidisestablishmentarianism", and it's a children's playground game to create the longest such expression possible. Creating new words in these languages is like creating new sentences in English. It's just something you do, and there are no "word boundaries" involved.
Great. So now what? I assume everything I wrote is 100% mainstream, known to any and every linguist, half of whom could amplify and correct all the mistakes I've made in the above. Sure, but so what? You can't get rid of the concept of "word". It's a thing. What, exactly, is being proposed here?
-- linas
Hi Linas,
As also a native EN speaker myself, I know "w*rds" is a very colloquial term that gets used often. It got "cemented" via computational implementation* and hence reinforced people's idea of what grammar is or ought to be. I am not "undoing" EN writing (that'd be nonsense --- writing is a thing, and that which has already been written is history; what is more negotiable is our action/reaction/attitude towards it), but rather trying to get people to be more open-minded and more flexibly-minded about the usage --- in terms of "w*rds" and "grammar", in the context of computing and beyond (as all these are connected with each other), and of input representation when modelling/processing.

*And a few decades ago, computing was more EN-dominant, and most of those working on text representation then were primarily EN monolinguals. Hence there could have been a bit of X-centrism (where X can be any language, but in this case, for the relevant historical context, it was EN), with less sensitivity towards other languages and towards how w*rd segmentation (carried out with a totally unintentional "I am just doing language processing, where language involves words" mindset) can effect linguistic hegemony in ways that have not been considered by some. But now our systems and processing power are there to give us that fine qualitative difference. So hopefully, instead of "so what?", we can get the community to be more sensitive towards the values of other communities --- in which the notion of w*rds may be different from that in EN, or in which there is no native concept of a "w*rd" (and we don't have to impose one on anyone).
I understand speech processing (both recognition and generation in terms of TTS) is more fine-grained than text processing. Any step towards finer granularity is better than none. I don't know if you are aware of the vocabulary hacking practice in text processing... that's primarily what I was getting at (also how this vocab hacking relates to structural linguistics in some ways).
Statistical methods have always been around. They are not "new" methods. In the tradition of lang sci/tech/eng, they've been somewhat "suppressed" bc ppl kept arguing about grammar and about what surface text representations ought to look like. The concept of a w*rd defined by whitespace tokenisation is also not satisfactory for EN --- think contractions, abbreviations, tons of stuff from intro NLP textbooks. :)
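A two-line illustration of that point, nothing more (the regex is a toy stand-in for a real tokenizer, and neither output is "the" segmentation):

import re

s = "Don't segment U.S. text naively, okay?"
print(s.split())
# ["Don't", 'segment', 'U.S.', 'text', 'naively,', 'okay?']
print(re.findall(r"\w+(?:'\w+)?|[^\w\s]", s))
# ["Don't", 'segment', 'U', '.', 'S', '.', 'text', 'naively', ',', 'okay', '?']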
Re "What, exactly, is being proposed here?": in case you have read the paper https://openreview.net/forum?id=-llS6TiOew already, then more empathy, more awareness and sensitivity with inter-cultural/personal values. Our downstream results are good enough. We can switch our concern to more qualitative matters.
Best, Ada
Hi Ada,
a very good paper (and a lot of work done - congrats!) and a very interesting thread.
Clearly linguistics as a field is terribly lacking a unified taxonomy (compared to biology, chemistry, whatever -- the difference is rather striking), and yes, this causes serious trouble in NLP and in general (with linguists spending time promoting "their" definition instead of promoting harmonization regardless of their own preferences and traditions). But "word" here is not a special case -- it's just one of the many linguistic terms that are ill-defined.
When you say:
*It is of the best interest of the community to discontinue the usage of "word".*
you must realize this is never going to happen -- it just results in getting an Orwellian hail, quite understandably; you'll not find a lot of support for prescriptive views on language on this list :) This is for both practical reasons ("word" is a word everyone is used to using; it is normal, and being normal is the strongest card you can play in language) and theoretical ones (the problem is not the string label but the concept itself, so changing the label is of no help here).
So, taking this practically:
- you may try to convince linguists to make some harmonized taxonomy (and I wish I knew how to be of any help here, but I don't think the majority sees it as a problem at all - they see it as "complexity/property of language" - so I don't believe this is going to happen anytime soon)
- you may promote awareness of how word-processing impacts the performance and evaluation of NLP tools
The latter, I think, is much more useful and more likely to at least partially succeed -- and I was really happy to see that your paper quantifies it for state-of-the-art techniques (you may also have a look at the 2016 PhD thesis of one of my ex-colleagues, https://is.muni.cz/th/en6ay/thesis.pdf, who investigated character-based language modelling). It is an omnipresent problem that actually starts with English and simple tasks like PoS tagging, in a slightly different way: the impact of not following exactly the same tokenization the tagger expects/was trained on (esp. for high-frequency items like "don't") is huge. In Sketch Engine, where we use tons of word-processing tools (mostly segmenters/stemmers/PoS taggers/lemmatizers), getting the input tokenization right is often the most difficult job; and of course more so for languages where the word is more of an artificial and arbitrary concept. From a non-academic perspective, the issue is that users expect something familiar like words, and many are aware of the level of arbitrariness of "words" in their focus language.
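To spell out the "don't" point with a toy (not Sketch Engine's actual pipeline; the lexicon and tags below are made up for the example): a tagger whose lexicon assumes Penn-Treebank-style tokens meets whitespace tokens it has never seen.

# Toy "tagger": looks tokens up in a lexicon built from PTB-style tokenization.
LEXICON = {"I": "PRP", "do": "VBP", "n't": "RB", "know": "VB"}

def tag(tokens):
    return [(t, LEXICON.get(t, "UNKNOWN")) for t in tokens]

print(tag("I don't know".split()))        # "don't" comes out as an unknown item
print(tag(["I", "do", "n't", "know"]))    # the expected tokenization: all known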
So, reading:
*Well, talk to the NLP crowd or the ones who expect LM/MT results from different languages should have different performances, even if/when all else were equal. (I remember how hard and how many rounds I had to work for my rebuttals....)*
Indeed, this is very much what everyone is used to :-/ From a purely technical perspective, switching to characters (or bytes -- which, however, are not a good level of abstraction in terms of interpretation, especially with variable-length encodings like UTF-8) is sometimes the right thing to do (though the figures easily get skewed -- esp. in a web corpus with plenty of long URLs). And sometimes the desired level of abstraction goes the other way, arriving at MWEs, which are even more of a nightmare than poor "words", whatever they represent. And sometimes the best way is to keep using "words" with lots of policing around what they are, which, btw, might very well be Christopher's case with French. Which way to go depends on the particular use case. So: is "word" a well-defined unit? Certainly not. Is it a useful one? Sometimes, yes.
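On the byte-vs-character point, a three-line illustration (the example strings are mine) of why raw UTF-8 bytes are awkward as a unit of interpretation: the byte count per character varies with the script.

for s in ["word", "mot-clé", "词"]:
    print(f"{s!r}: {len(s)} characters, {len(s.encode('utf-8'))} UTF-8 bytes")
# 'word': 4 and 4; 'mot-clé': 7 and 8; '词': 1 and 3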
Cheers and "*I wished the world would give more worth to data*"-too Milos
Milos Jakubicek
CEO, Lexical Computing Brno, CZ | Brighton, UK http://www.lexicalcomputing.com http://www.sketchengine.eu
Hi Milos,
Pardon my late reply. I actually took the time and found some joy in reading Vít's dissertation (thanks for pointing me to it) --- a "kindred spirit" to my paper in its various versions (FaiR short https://openreview.net/forum?id=-llS6TiOew, R&B https://openreview.net/forum?id=dKwmCtp6YI, FaiR long https://drive.google.com/file/d/1eKbhdZkPJ0HgU1RsGXGFBPGameWIVdt9/view). I wish I had read his work earlier so I could have cited it! I found it interesting, in an insightful way, how he relates structure to segmentation (p. 7). But overall, just the willingness to move on to the byte space for processing shows linguistic maturity, as opposed to those who insist on evaluating "language" based on "w*rds" and "sentences". (I wish people weren't so "hooked" on "w*rds"... it is just a different way of looking at things, no? When I first submitted my work to confs, I thought ppl would just get it as "so 'w*rds' are not so good, so let's switch", but instead I got totally beaten up (metaphorically) for it...) I think we could process in bytes and "back-interpret" to chars.
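A tiny sketch of that "process in bytes, back-interpret to chars" idea, purely illustrative (the example string is mine): a byte-level window can split a character in half, so interpretation back into characters has to tolerate or avoid partial sequences.

data = "词汇 vocabulary".encode("utf-8")
window = data[:4]                                  # cuts the second character in half
print(window.decode("utf-8", errors="replace"))    # '词' plus a replacement mark
print(data.decode("utf-8"))                        # full back-interpretation of all bytes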
I think I am on the same page with you on many things. E.g. re "w*rd": I absolutely agree that it is not the string label but the concept itself (de dicto vs de re --- fixing the former does not entail fixing the latter; the same goes for the prejudices that remain in humans behind all the "PC terms" etc. --- at the end of the day, it is our attitude towards everything that counts, not just what we say). As nomenclature for processing, however, I think we could use "token", where a token can be of any granularity, e.g. a "char token" as a character unigram.... I didn't mean to "police" ppl's colloquial use of "w*rd" --- I used this form to help raise awareness also. There is some implicit linguistic hegemony and colonial past embodied in the history of our field.
Re data: I wish we could have a data-centric statistical science for language! It would help update linguistics beyond the canonical structural-linguistics framework based on grammar, well-formedness, etc. As we are all data specialists in a way, and we are interested in data --- why not understand its statistical properties? Corpus linguistics has mostly been w*rd-based. Most practitioners at ML confs hold data constant to test algos; why can't there be an event where we hold algos constant and test data?
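To make the "hold the algorithm constant, vary the data representation" idea concrete in miniature (an add-one unigram toy of my own, nothing like the transformer experiments in the paper; the text is a made-up repetition): the same recipe is applied to three renderings of the same text, and the scores are normalised per byte so they stay comparable across granularities.

from collections import Counter
import math

text = "the cat sat on the mat " * 20

representations = {
    "whitespace tokens": text.split(),
    "characters": list(text),
    "bytes": list(text.encode("utf-8")),
}

n_bytes = len(text.encode("utf-8"))
for name, units in representations.items():
    counts = Counter(units)
    total = len(units)
    vocab = len(counts) + 1                      # +1 slot for unseen units
    bits = sum(-math.log2((counts[u] + 1) / (total + vocab)) for u in units)
    print(f"{name:17s}: {bits / n_bytes:.3f} bits per byte (smoothed unigram)")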
Thanks and best Ada
p.s. to the community: sorry that my initial comment to this thread turned the thread into a discussion of my paper --- or so it seems. But I continue to welcome feedback on my work, as I am still working on an extended version of it (and please lmk if I should start another thread instead --- the thread originator has been removed from this thread as per his request!). I think, in general, the ELRA/corpora-list folks may be more experienced and sophisticated than most of the "MLNLP" folks, e.g. on the twittersphere. I would appreciate your input.