Paolo Monella Post-doc scholarship in Digital Humanities Accademia dei Lincei, Rome 2012

Modelling: work in progress

In what follows I expand on the digital model to represent primary sources (such as manuscripts) prototyped by Prof. Tito Orlandi in his experimental digital edition of Machiavelli's De principatibus. These are my working notes, and therefore a work in progress.

See a 500-word abstract on this subject.

Any comments? Check my contact page.

Definitions

Notational text (or literary work): The abstract, ideal "text" of which a single witness's "text" is a representation. Examples of text witnesses include manuscripts, inscriptions, print editions, digital editions, etc.
Encoding system of a language: The system of symbols used to represent a text in that language on a support. Such "encoding systems" include 1) written systems, constituted by lists of graphemes meant for handwriting or ink print on physical supports ("alphabets" and paragraphematic signs like punctuation, quotes etc.), and the related graphical conventions (capitalisation, blank spaces, page outline etc.); 2) digital encoding systems like the one that a modern digital philologist may devise to build a digital edition of a text.
Representation (or model) of the text: A representation of the "notational text" (the literary work, e.g. the Iliad) encoded with an encoding system (in handwriting, print or digital). Handwritten or print primary sources bear text representations encoded by means of a system of graphic signs and conventions on paper or another physical support.
Written vs. digital sources/representations: By this wording I will refer to the representations (or the sources bearing them) encoded through handwriting and print techniques ("written"), as opposed to those encoded through digital technologies ("digital").
Witness/source/edition of the text: I make no actual distinction between these terms. A manuscript, for instance, is a textual witness (or source) of the text for the modern philologist, but was an edition of it for the medieval scribe (i. e. philologist) who wrote it. They are "editions" for those who produce or simply read them, and "witnesses" or "sources" for philologists who use them to reconstruct the notational text. Here I tend to say that witnesses bear (not are) representations of the text, though this is still an open issue for me.

Assumptions

The process of creating a scholarly edition of a literary work and its textual tradition is based upon a comparison (collatio) of the representations of the text in primary sources.

In order to do so, a digital scholarly edition must rely on digital representations (commonly called "digital transcriptions", but I'd rather say "digital models") of primary sources, formalised in a way that allows the computer to compare such representations with each other.
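
As a rough illustration of what "comparable by the computer" may mean, here is a minimal Python sketch that collates two token sequences with the standard library's difflib.SequenceMatcher. The readings and the function name "variants" are invented for illustration only. The sketch assumes that both witnesses are already expressed in the same digital encoding, which is exactly what cannot be taken for granted.

```python
from difflib import SequenceMatcher


def variants(witness_a, witness_b):
    """Return the (reading_a, reading_b) pairs where two token lists differ."""
    matcher = SequenceMatcher(None, witness_a, witness_b, autojunk=False)
    return [
        (witness_a[i1:i2], witness_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only the places where the witnesses disagree
    ]


# Invented readings, for illustration only (not taken from real witnesses).
witness_a = ["arma", "uirumque", "cano"]
witness_b = ["arma", "uirum", "cano"]
print(variants(witness_a, witness_b))
```

The comparison is only meaningful if equal strings stand for equal signs in both witnesses: the function compares symbols, not their interpretation.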

I will draw most of my examples from the textual transmission of Latin texts through medieval manuscripts and early print editions.

Manuscripts and alphabets

Thesis: Each textual witness encodes the text with a different encoding system.

I will list some discrepancies between different written encoding systems for Latin.

Ancient Romans used a written alphabet (graphical encoding system) based on capital letters, so with no distinction between upper case and lower case. They did not use punctuation to divide sentences, and in some cases did not use spaces to distinguish words. They did not sense a phonetic distinction between a /u/ and a /v/ sound, so it may be said that our /u/ and /v/ phonemes are not mappable to Latin. Therefore, they did not have separate [u] and [v] graphemes, but only a [V] grapheme that corresponded to the Latin sounds probably corresponding to IPA /u/ and /w/. We may say that that alphabet (as a part of their graphical encoding system) did not include a distinct grapheme [v].

A writing convention of the Modern Age distinguished between an [i] grapheme for vocalic Latin /i/ (as in "iter") and a [j] grapheme for semivocalic Latin /j/ (as in "jus"). That encoding system therefore included an extra [j] grapheme (as opposed to the one used by ancient Romans or by most contemporary print conventions).

Among contemporary Latin scholarly print editions, some are based on an alphabet with separate [u] and [v] graphemes ("votum"), some are based on an alphabet with only a [u] sign ("uotum").
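
The alphabet discrepancies listed so far can be sketched as data. What follows is a minimal Python sketch, not a description of any existing edition: the names ROMAN_CAPITALS, MODERN_UV, MODERN_U_ONLY, MODERN_AGE_IJ and encodable are hypothetical, and the grapheme inventories are deliberately simplified (no punctuation, no abbreviations, no spacing conventions).

```python
# Simplified, hypothetical grapheme inventories (assumptions for illustration).
ROMAN_CAPITALS = set("ABCDEFGHIKLMNOPQRSTVXYZ")  # capitals only, a single [V]
MODERN_UV = set("abcdefghiklmnopqrstuvxyz")      # separate [u] and [v]
MODERN_U_ONLY = MODERN_UV - {"v"}                # only [u], as in "uotum"
MODERN_AGE_IJ = MODERN_UV | {"j"}                # extra [j], as in "jus"


def encodable(word, alphabet):
    """Return True if every grapheme of `word` belongs to `alphabet`."""
    return all(grapheme in alphabet for grapheme in word)


print(encodable("votum", MODERN_UV))      # spelling with [v]
print(encodable("votum", MODERN_U_ONLY))  # same spelling, alphabet without [v]
print(encodable("jus", MODERN_AGE_IJ))    # spelling with [j]
```

A spelling that is well-formed in one inventory may simply not be expressible in another, which is the sense in which each witness has its own encoding system.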

Needless to say, the use of punctuation varies immensely across encoding systems, from ancient times, when it did not exist, through the different medieval and modern usages.

The usage of horizontal spaces to separate words is still problematic for Latin. One can mention a number of cases:

The "-que" enclitic. Does "Senatusque" constitute a word or two?
One may compare "Senatusque" with "quibusque". Morphologically, "quisque" works as a "quis" word (normally declined at the end) plus a "-que" enclitic. Whether to define "quisque" as one word or not is a purely linguistic question.
"Tantummodo", "dehinc", "quocum", "non vis" vs. "nolo" (keep in mind that ancient Romans probably did not pronounce the final /n/ of "non" distinctly, so the fusion in "nolo" may come from a /nôwolo/ pronunciation, where by /ô/ I mean a nasal /o/). The conventions for word distinction vary greatly over time, so it would not be surprising to find a spelling pattern like "dehinc" extended to cases like "de hoc" in a manuscript.

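The word-separation cases above can be made concrete with a small Python sketch. It is only a naive illustration: the function tokenise, the flag split_que and the list ONE_WORD_FORMS are hypothetical, and the rule for detaching "-que" is purely mechanical, so it would wrongly split words such as "atque" or "usque" unless they are listed as units.

```python
# Hypothetical list of forms that a given convention treats as a single unit.
ONE_WORD_FORMS = {"quisque", "quibusque", "dehinc", "tantummodo", "quocum", "nolo"}


def tokenise(text, split_que=False, lexical_units=ONE_WORD_FORMS):
    """Split on spaces; optionally detach a final "-que" from unlisted forms."""
    tokens = []
    for raw in text.lower().split():
        detachable = raw.endswith("que") and len(raw) > 3 and raw not in lexical_units
        if split_que and detachable:
            tokens.extend([raw[:-3], "-que"])  # e.g. "senatusque" -> "senatus", "-que"
        else:
            tokens.append(raw)
    return tokens


print(tokenise("Senatusque quibusque"))
print(tokenise("Senatusque quibusque", split_que=True))
```

The same written string yields a different number of "words" depending on the convention chosen, and the choice has to be declared rather than left implicit.
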
[To be continued...]

TEI

Thesis: In order to make the representations of the text of different primary sources comparable, the TEI module 11 Representation of Primary Sources implies a 'normalisation' of the encoding systems of textual sources that is not explicitly declared and documented.

The TEI module 11 Representation of Primary Sources seems to work for modern sources, such as contemporary writers' autographs or printed texts, for which it has been primarily devised. It assumes that the encoding system (starting from the "alphabet") upon which each primary source is built is comparable with that of other sources, and with the one that is used for our own edition.

[To be continued...]

To be continued:

(The wording must be refined)

We must create digital representations of the written representations of the text in the primary sources (often called "digital transcriptions of primary sources").

For each written source, we must create (and declare explicitly) a digital model mirroring its (peculiar) written encoding system.

Given that the written encoding systems of different textual sources do not overlap, our digital models of those systems will not overlap either.

This makes the digital representations of the written representations of the text not directly comparable with each other by the computer.

In order to make those digital representations comparable, we must add a level of representation, and map

the peculiar encoding system of each source to a uniform digital encoding system
the [A: digital representation of the written representation of each linguistic unit] ("word?" this is an open issue) to [B: a digital representation of that linguistic unit in our uniform digital encoding system].

The latter representation of that unit (B) will be computationally comparable (traditional "collatio") to the corresponding representations of that unit in the digital representations of the other witnesses.
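
The additional level of representation can be sketched in Python as one explicitly declared mapping table per source. This is only a sketch under simplifying assumptions: the names MAP_A, MAP_B and to_uniform are hypothetical, source A is imagined as written in capitals with a single [V], source B in lower case with separate [u] and [v], and the uniform encoding is assumed (arbitrarily) to be lower case with only [u].

```python
# One declared mapping per source: source-specific sign -> uniform sign.
MAP_A = {sign: sign.lower() for sign in "ABCDEFGHIKLMNOPQRSTVXYZ"}
MAP_A["V"] = "u"  # the single [V] of source A becomes uniform [u]

MAP_B = {sign: sign for sign in "abcdefghiklmnopqrstuvxyz"}
MAP_B["v"] = "u"  # the separate [v] of source B is merged into uniform [u]


def to_uniform(signs, mapping):
    """Map a sequence of source-specific signs (level A) to the uniform encoding (level B)."""
    return "".join(mapping[sign] for sign in signs)


# Level A: the same linguistic unit as encoded in each (hypothetical) source.
word_in_a = ["V", "O", "T", "V", "M"]
word_in_b = ["v", "o", "t", "u", "m"]

print(word_in_a == word_in_b)                                        # level A compared directly
print(to_uniform(word_in_a, MAP_A) == to_uniform(word_in_b, MAP_B))  # level B compared
```

The mapping tables are where the normalisation is declared and documented, instead of being hidden in the transcription itself.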

The passage from A to B above is another open issue:

is it a direct, computationally automatisable passage? E. g.: from A: &dns_tilde;, the digital representation of the written brachygraphy "dñs" for "dominus" -- to B: "dominus" (that is a string of ASCII characters representing "d", "o", "m" etc. in our digital encoding system)
or must the philologist go from A (&dns_tilde;) to C (the linguistic unit of "Latin" -- i. e. his model of the Latin language -- that he knows as nominative singular form of lemma "dominus, -i", masculine of the 2nd declension), and then from C to B (ASCII "d", "o", "m" etc.)?

If this is the case, the TEI <abbrev> code substantially fails in that it does not imply any distinction between

1. the digital representation of the written representation of the text (e. g. the encoding of brachygraphies in <abbrev> elements), and
2. the uniformised representation of the original spelling.

The use of the same ASCII (or Unicode) symbols for almost everything has the effect of hiding the underlying theoretical issue, and coding solutions such as the <abbrev> element fail to keep the two levels distinct.
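
The two possible routes from A to B, and the separation of levels that they call for, can be sketched in Python as follows. This is a tentative sketch, not a proposal for an actual schema: the names LinguisticUnit, Token, DIRECT_TABLE, INTERPRETATIONS and interpret are hypothetical, and the only entry is the "dñs" example discussed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinguisticUnit:
    """Level C: a unit of the philologist's model of the Latin language."""
    lemma: str
    analysis: str
    uniform_spelling: str  # level B: spelling in the uniform digital encoding


# Route 1 (direct, automatisable): A -> B through a plain lookup table.
DIRECT_TABLE = {"&dns_tilde;": "dominus"}

# Route 2 (interpretive): A -> C, and only then C -> B.
INTERPRETATIONS = {
    "&dns_tilde;": LinguisticUnit(
        lemma="dominus, -i",
        analysis="nominative singular, masculine, 2nd declension",
        uniform_spelling="dominus",
    ),
}


@dataclass(frozen=True)
class Token:
    """Keeps level A (what the source bears) apart from levels C and B."""
    source_sign: str      # level A: digital model of the written sign
    unit: LinguisticUnit  # level C: the interpretation

    @property
    def uniform(self):
        """Level B, derived from the interpretation rather than stored beside A."""
        return self.unit.uniform_spelling


def interpret(source_sign):
    """Follow route 2 for a source sign listed in INTERPRETATIONS."""
    return Token(source_sign, INTERPRETATIONS[source_sign])


token = interpret("&dns_tilde;")
print(token.source_sign, token.unit.lemma, token.uniform)
```

Both routes yield the same string for this example, but only the second records which interpretation produced it, and neither overwrites the sign found in the source.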

To be continued...