[Corpora-List] towards a "pan document format" (pun intended) . . .

14 Apr 2023


When it comes to corpus research, the response time to queries such as:
 * what is the character at the nth offset of a file?
 * which characters precede and follow that one, within m offsets
   or up to a certain character or pattern?
 * what is the intra- and intertextuality of a given segment of
   characters?
 * ...
and many other related ones should be "zero comma nada" (they should
run instantly). I think this is virtually impossible, though, because
texts these days (say, PDF files) are basically visually appealing
containers of streams of data displayed by rendering engines, and HTML
files contain all that javascript cr@p, google goo, ads, insufferably
idiotic "we care about your privacy" road blocks, ...
I haven't found a convincing explanation as to why that is the case,
and I can't quite understand why it is that the MVC pattern is well
understood when it comes to software design, yet people apparently
can't fully separate the text from its presentation when it comes
to documents.
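
To make the kind of query I mean concrete, here is a minimal sketch in
Python. It assumes the text has already been separated from its
presentation and loaded as a plain Unicode string; the function names
(char_at, context, following_until) are made up for illustration and
come from no existing tool. Offsets here count Unicode code points,
not bytes or user-perceived characters.

```python
import re


def char_at(text: str, n: int) -> str:
    """Return the character (code point) at offset n."""
    return text[n]


def context(text: str, n: int, m: int) -> tuple[str, str]:
    """Return up to m characters before and up to m characters after offset n."""
    before = text[max(0, n - m):n]
    after = text[n + 1:n + 1 + m]
    return before, after


def following_until(text: str, n: int, pattern: str) -> str:
    """Return the characters after offset n, up to the first match of pattern."""
    match = re.compile(pattern).search(text, n + 1)
    end = match.start() if match else len(text)
    return text[n + 1:end]


if __name__ == "__main__":
    sample = "en las letras de rosa"
    print(char_at(sample, 3))
    print(context(sample, 3, 2))
    print(following_until(sample, 3, r"\s"))
```

On an in-memory string the first two lookups do not depend on the size
of the text, as far as I know, while the pattern search is linear in
the distance to the first match. The hard part is everything that has
to happen before you have such a string.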
"Web as corpus" folks:
https://www.researchgate.net/publication/276511711_Maristella_Gatto_Web_as_c...
don't even attempt to address those issues. At the end of the day, as
Borges said:
" ... the name is the archetype of the thing; in the letters of 'rose'
is the rose, and all the Nile in the word 'Nile'"
so let's first manage to get one character in a text after the
other ...
lbrtchx
