When it comes to corpus research, the response time to queries such as:
* what is the character at the nth offset of a file
* which characters precede and follow that one by m offsets, or up to a certain char or pattern
...
* what is the intra- and inter-textuality of a given segment of characters
. . . and many other related ones, should be "zero comma nada" (they should run instantly). But I think this is virtually impossible, because texts these days (say, PDF files) are basically visually appealing containers of streams of data displayed by rendering engines; HTML files contain all that javascript cr@p, google goo, ads, insufferably idiotic "we care about your privacy" road blocks, ...
I haven't found a convincing explanation as to why that is the case. I can't quite understand why the MVC pattern is well understood when it comes to software design, yet people apparently can't fully separate the text from its presentation when it comes to documents.
"Web as corpus" folks:
https://www.researchgate.net/publication/276511711_Maristella_Gatto_Web_as_c...
don't even attempt to address those issues. At the end of the day, as Borges said:
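" ... el nombre es arquetipo de la cosa, en las letras de 'rosa' está la rosa y todo el Nilo en la palabra 'Nilo'"
(" ... the name is the archetype of the thing, in the letters of 'rose' is the rose, and all the Nile in the word 'Nile'")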
So, let's first get down to managing to get one character in a text after the other ...
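For what it's worth, here is a minimal sketch of the queries listed at the top, assuming the text has already been pulled out of its container (PDF, HTML, ...) into a plain Unicode string. That extraction step is the hard part and is not shown. The function names (char_at, context, context_until, occurrences) are made up for illustration, and offsets are 0-based code-point offsets, not byte offsets in the original file.

```python
import re


def char_at(text: str, n: int) -> str:
    """Return the character at code-point offset n (0-based)."""
    return text[n]


def context(text: str, n: int, m: int) -> tuple[str, str]:
    """Return up to m characters before and up to m characters after offset n."""
    before = text[max(0, n - m):n]
    after = text[n + 1:n + 1 + m]
    return before, after


def context_until(text: str, n: int, pattern: str) -> str:
    """Return the characters after offset n, up to (not including) the
    first match of the regular expression `pattern`, or to the end of
    the text if the pattern never matches."""
    match = re.compile(pattern).search(text, n + 1)
    end = match.start() if match else len(text)
    return text[n + 1:end]


def occurrences(text: str, segment: str) -> list[int]:
    """Return the offsets of every occurrence of `segment` (overlaps
    included); a crude starting point for intra-textuality queries."""
    offsets = []
    i = text.find(segment)
    while i != -1:
        offsets.append(i)
        i = text.find(segment, i + 1)
    return offsets


if __name__ == "__main__":
    sample = "todo el Nilo en la palabra Nilo"
    print(char_at(sample, 8))
    print(context(sample, 8, 3))
    print(context_until(sample, 8, r"\s"))
    print(occurrences(sample, "Nilo"))
```

On an in-memory string the first two queries are plain indexing and slicing, so they are about as close to "zero comma nada" as one gets; the occurrences scan is linear in the length of the text, and anything faster over a whole corpus would need an index built ahead of time (a suffix array, for instance), which this sketch does not attempt.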
lbrtchx