Hi Ada, good to hear from you,
The project is called "The Bridge": https://bridge.haverford.edu/. I am not the PI; the project has existed for about 12 years. I was invited to become involved through my Drexel LEADING Fellowship. Here is a paper we published this summer: https://hughandbecky.us/Hugh-CV/publication/2023-bridging-corpora/4LR_pre_pr...
The Bridge is a linked data application supporting curriculum development. It was developed with Latin in mind, but has been extended to Greek as well. It quickly helps instructors and students find new vocabulary words in newly assigned texts, based on texts they have already encountered in their curriculum.
The current workflow takes a variety of texts from several sources and stores their lemmas for comparison across texts and for generating broad statistics. I see value in modeling the whole text, not just the lemmas, as this may enable future services. So, while NIF could model the whole text, current operations really only involve the lemmas, and to move forward with a linked data model we need to support those operations. More broadly, I see the lemmas as an "annotation" or abstraction layer, whereas the actual content of the texts is the "source data". Using linked data and lemmas allows the Bridge to connect, via lemmas, to LiLa data: https://lila-erc.eu/
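To make the lemma-comparison idea concrete, here is a minimal sketch (not actual Bridge code; the data and function are invented for illustration) of finding which lemmas in a newly assigned text are new relative to texts a class has already read:

```python
def new_vocabulary(new_text_lemmas, read_texts):
    """Return lemmas in the new text not seen in any previously read text."""
    seen = set()
    for lemmas in read_texts:
        seen.update(lemmas)
    return set(new_text_lemmas) - seen

# Toy Latin data: lemmas only, abstracted away from the running text.
caesar = {"Gallia", "sum", "omnis", "divido", "pars"}
cicero = {"sum", "quo", "usque", "tandem", "abutor"}
new_text = {"Gallia", "bellum", "gero", "sum", "pars"}

print(sorted(new_vocabulary(new_text, [caesar, cicero])))
# → ['bellum', 'gero']
```

Because comparison happens over lemma sets rather than surface forms, inflectional variation across texts does not inflate the "new vocabulary" list.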
Kind regards, Hugh
On Tue, Oct 10, 2023 at 3:39 AM Ada Wan adawan919@gmail.com wrote:
Dear Hugh
What project are you working on that still requires lemmatization? Would it not be a better approach to use (sub-)character n-grams (esp. if you are doing textual analysis/interpretation, vs. processing which can be byte-based) to decipher what segments would occur most frequently first and (re-)analyze from there? I understand there has been a habit in the "language space" to call certain segments "lemmata". I am curious to know what one can do as a community, though, to transition to more general methods (and interpretations on "language").
Thanks and best Ada
On Tue, Oct 10, 2023 at 12:15 AM Hugh Paterson III via Corpora <corpora@list.elra.info> wrote:
Greetings,
I am working on a project that uses lemmatization. I'm wondering how people have approached combining NIF and lemmatization. Are there any "blessed" extensions or ontologies? I'm not seeing nif:lemma as an option within the NIF ontology... though I am likely missing something.
Kind regards,
- Hugh
Corpora mailing list -- corpora@list.elra.info https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/ To unsubscribe send an email to corpora-leave@list.elra.info
Dear Hugh (if I may), I was just about to answer by mentioning LiLa, the project I've been involved with, but I see that you're already aware of it!
In LiLa, we faced the same problem: there doesn't seem to be a standard vocabulary for modeling lemmatization as a linking task, i.e. connecting a token to a canonical-form resource rather than to a simple string. We ended up defining a custom property in the LiLa ontology, called (without much imagination) lila:hasLemma: https://lila-erc.eu/lodview/ontologies/lila/hasLemma
(The property has an unspecified domain, but it is generally used with corpus tokens; its range is the Lemma class defined in the same ontology.)
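A minimal, library-free sketch of that linking pattern might look like the following. The lila: namespace is an assumption inferred from the lodview URL above, and the token and lemma URIs are invented for illustration:

```python
LILA = "http://lila-erc.eu/ontologies/lila/"  # assumed namespace, not verified

def triple(subj, pred, obj):
    """Render one RDF triple in N-Triples syntax (URI resources only)."""
    return f"<{subj}> <{pred}> <{obj}> ."

token = "http://example.org/corpus/ITTB/token/42"  # hypothetical token URI
lemma = "http://example.org/lemma/gallia"          # hypothetical lemma URI

# Lemmatization as a *linking* task: the token points at a canonical-form
# resource, not at a plain literal string.
print(triple(token, LILA + "hasLemma", lemma))
```

The key design choice is that the object is a dereferenceable resource, so two corpora lemmatized against the same lemma bank become queryable together.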
Here you can find an example of a corpus token that makes use of the property: https://lila-erc.eu/data/corpora/ITTB/id/token/005.SCG*LB1.CP--++1.N.-5.1-1.... (note that we use POWLA as ontology for corpus representation).
By the way, we also provide a text-linking service to lemmatize and link Latin texts; it can also be used programmatically via an API (or via the CLARIN Switchboard, though that integration is still in beta): https://lila-erc.eu/LiLaTextLinker/ https://lila-erc.eu/LiLaTextLinker/process?text=Gallia%20est%20omnis%20divis....
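A hedged sketch of calling that endpoint programmatically: only the /process path and the text parameter are taken from the example URL above, and since the response format is not documented here, this just builds the request URL rather than asserting anything about the reply.

```python
from urllib.parse import quote

BASE = "https://lila-erc.eu/LiLaTextLinker/process"  # from the example URL

def linker_url(text):
    """Build a Text Linker request URL, percent-encoding the input text."""
    return BASE + "?text=" + quote(text)

print(linker_url("Gallia est omnis divisa"))
# → https://lila-erc.eu/LiLaTextLinker/process?text=Gallia%20est%20omnis%20divisa
```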
A final note (perhaps to be continued in personal communication off list): we would be very interested in a connection between LiLa and the Bridge, so we would be more than happy to get in touch and help.
Best, Francesco
From: Hugh Paterson III via Corpora corpora@list.elra.info Date: Tuesday, 10 October 2023, 19:14 To: Ada Wan adawan919@gmail.com Cc: corpora corpora@list.elra.info Subject: [Corpora-List] Re: NIF: NLP Interchange Format
Related to "The Bridge" is my own Greek Learner Texts Project https://greek-learner-texts.org which relies heavily on lemmatization for building vocabulary lists.
At the Perseus Digital Library (https://scaife.perseus.org), we also make extensive use of lemmatization of texts to link to dictionaries, etc.
James
Dear all
Thanks for your replies. I understand that lemmatization has been a historical practice in many linguistic and literary studies. What can be done to translate and transition from such an analytical methodology/framework, and from its infrastructural settings, toward working with texts in a more general fashion (e.g. through simple character matching), while preserving all rights to, and availability of, the data?
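The "simple character matching" alternative could be sketched as follows: a character n-gram counter that finds frequent segments directly, with no lemmatizer or language-specific segmentation in the loop. This is an illustrative sketch of the general idea, not a claim about any particular project's pipeline.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count all contiguous character n-grams in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

counts = char_ngrams("gallia est omnis divisa in partes tres")
print(counts.most_common(3))
```

The trade-off is the one the thread circles around: n-gram counts are fully general and language-agnostic, but they do not by themselves recover the canonical-form groupings (lemmas) that curriculum tools and dictionary linking rely on.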
Thanks and best Ada