Dear community,
I am running a clustering experiment for my project, using various metrics to cluster fine art auction catalogue entries. Using domain knowledge, I have extracted certain features from the text to cluster on. Alongside these specialist features, I would like to include some form of TF-IDF metric that measures the vocabulary (and, to some extent, the semantics) of an entry relative to the vocabulary used across the rest of the entries, treated as a corpus.
The steps I have taken so far:
1. Text preprocessing and cleaning
2. Tokenization
3. TF-IDF vector processing with scikit-learn
4. From the TF-IDF scores of individual words, calculating a mean TF-IDF score across each document (auction entry)
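For concreteness, here is a minimal sketch of steps 3 and 4 as I understand them. The `entries` variable and the choice to average only over the terms that actually occur in each entry are my assumptions, not a fixed part of the method.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder for the cleaned auction catalogue entries (assumed variable name).
entries = [
    "oil on canvas signed lower right gilt frame",
    "rare bronze maquette foundry mark patinated",
    "watercolour on paper mounted framed and glazed",
]

vectorizer = TfidfVectorizer()             # defaults: built-in tokenisation, L2-normalised rows
tfidf = vectorizer.fit_transform(entries)  # sparse matrix: documents x vocabulary

# Mean TF-IDF per document, averaged over the terms present in that document
# (averaging over the full vocabulary would dilute short entries with zeros).
sums = np.asarray(tfidf.sum(axis=1)).ravel()
terms_per_doc = np.diff(tfidf.tocsr().indptr)       # distinct terms in each document
mean_tfidf = sums / np.maximum(terms_per_doc, 1)    # guard against empty documents

print(mean_tfidf)  # one "vocabulary rarity" score per auction entry
```

Note that TfidfVectorizer L2-normalises each row by default (`norm='l2'`); whether to keep that normalisation, or set `norm=None`, is one of the choices I am unsure about.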
My feeling is that this might have some weaknesses (skew from longer documents in the corpus, or inflated means from high TF-IDF scores in shorter documents), but it could capture some of the rarer vocabulary choices in an entry relative to the other entries.
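One way I thought of checking the length-skew worry, continuing from the sketch above (so `entries` and `mean_tfidf` are assumed to exist), is simply to correlate entry length with the mean score; a strong correlation would suggest the score is tracking length rather than vocabulary rarity.

```python
from scipy.stats import spearmanr

# Token count per entry as a crude length measure (assumption: whitespace tokenisation).
doc_lengths = [len(e.split()) for e in entries]

rho, p_value = spearmanr(doc_lengths, mean_tfidf)
print(f"Spearman correlation between length and mean TF-IDF: {rho:.2f} (p={p_value:.3f})")
```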
The idea behind the average TF-IDF score is to get a document-level measure of how unusual the vocabulary in a specific document is relative to the entire corpus of auction entries.
Does anyone in the group have experience with, or thoughts on, this kind of sentence/document-level information capture? Any feedback on the methodology would be very welcome.
Other word and sentence embeddings are clear alternatives, but I favour TF-IDF as it has the advantage of being specific to the vocabulary used in my domain-specialist dataset.
Best wishes, Mathew