"mathematical purity" ... how you can use vector/tensor algebra with texts
I'd suggest using the search word "embeddings" instead of "tensor".
The concept is being used in other fields, even physics, but (sticking with linguistics) if you've not looked into Word2Vec yet, that is a good place to appreciate how human language and linear algebra come together.
It is normally introduced as a ready-made model of dim 300, trained on millions of words. Like you, I wanted to understand what it was actually doing, so a few years ago I did a presentation using just two dimensions and a handful of words and sentences, then plotted the embeddings found for each word. You can add or remove a sentence at a time to see what it is learning from each.
You can see how each dimension is being given some meaning, even if it is not the way a human linguist would have structured things.
It is also a good test bed for finding the limits, such as playing around with ambiguous words and proper nouns, increasing the amount of training data without increasing the number of dimensions, and so on.
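If you want to try something like this yourself, here is a rough sketch of the kind of toy setup I mean (not the exact code from my presentation), using gensim 4.x and matplotlib; the sentences are just placeholders:

from gensim.models import Word2Vec
import matplotlib.pyplot as plt

# A handful of short sentences, already split into tokens.
sentences = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "the cat chased the dog".split(),
    "the king ruled the land".split(),
    "the queen ruled the land".split(),
]

# Just two dimensions, so every word's embedding can be plotted directly.
model = Word2Vec(
    sentences,
    vector_size=2,   # dim 2 instead of the usual 300
    window=2,
    min_count=1,     # keep every word, however rare
    sg=1,            # skip-gram
    epochs=500,      # tiny corpus, so make lots of passes over it
    seed=1,
)

# Plot each word at the point given by its two-dimensional embedding.
for word in model.wv.index_to_key:
    x, y = model.wv[word]
    plt.scatter(x, y)
    plt.annotate(word, (x, y))
plt.show()

Add or remove a sentence, retrain, and watch how the points move; that is where most of the intuition comes from.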
Darren
P.S. The embedding layer is the first layer in transformers: it is where tokens ("words") are turned into numbers, typically of dim 512 or higher. But note that those embeddings start out randomly generated, not initialized from Word2Vec or similar. And any modification to that initial randomness during training is to please the layers above, not humans trying to peer inside the box.
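In case that feels abstract, here is a minimal sketch of what an embedding layer does, using PyTorch (my choice of framework for illustration; the sizes and token ids are just placeholders):

import torch
import torch.nn as nn

vocab_size = 50_000   # illustrative tokenizer vocabulary size
d_model = 512         # a typical embedding dimension

# The lookup table starts out as random numbers (PyTorch draws them from a
# standard normal by default); nothing here comes from Word2Vec.
embedding = nn.Embedding(vocab_size, d_model)

# After tokenization, a "sentence" is just a sequence of token ids.
token_ids = torch.tensor([[17, 2045, 991, 4]])   # batch of 1, four tokens

vectors = embedding(token_ids)
print(vectors.shape)   # torch.Size([1, 4, 512]): one 512-dim vector per token

# Training then nudges those random vectors towards whatever values suit the
# layers above, not towards values a human would find meaningful.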
P.P.S. I think you might also enjoy https://transformer-circuits.pub/2021/framework/index.html which explores how transformers work at a very low level.
The gap between their minimalist models and something like ChatGPT is huge, though, and reading their work isn't going to help you appreciate why ChatGPT says stupid things to you.