Paolo Monella Post-doc scholarship in Digital Humanities Accademia dei Lincei, Rome 2012

Vespa Project

The Iudicium coci et pistoris by Vespa, an experimental scholarly digital edition

Status of the project

Short version: The project is now discontinued (and the edition incomplete). I am now (2016) continuing my work on these methodological principles with the Ursus Project.

Longer version: I developed this project in the last months of a 12-month post-doc scholarship in 2012 at the Accademia dei Lincei. The goal was to provide a proof-of-concept prototype of a scholarly digital edition, but I could not complete the edition by the end of the scholarship (December 2012). After that date, development was very slow, though I presented the project's methodological principles at a number of venues (see, for example, this article in the proceedings of the first AIUCD conference). From 2015/16 on, I definitively stopped working on the Vespa Project and started applying the same methodological principles to a new edition: see the Ursus Project within the ALIM research framework.

By-products

I am working on a digital scholarly edition of the Iudicium coci et pistoris iudice Vulcano by Vespa (Anth. Lat. 199 Riese), a Latin text in verse from Late Antiquity, whose main MS is the Codex Salmasianus.

The full rationale of this edition, which tests a number of experimental features, is discussed in detail in the talk Many witnesses, many layers. The abstract and the slides of the talk, as well as the pre-print full text of the published article, are on this page. The files (CSV, XML and Python) that compose the edition are in the paolomonella/vespa GitHub repository. A concise description of its rationale is in this webpage.

I assume that three different layers (among many other possible ones) can be identified in the text: a "graphic" layer (graphemes, paragraphematic signs and other graphic signs), an "alphabetic" layer (alphabetic letters, or "alphabemes") and a "linguistic" layer (inflected words). Accordingly, I add to my digital edition, as a functional component, two "tables of signs", each listing the individual signs that correspond to one encoding symbol: one table for graphic signs and another for alphabetic letters.

The files I am working with are in the paolomonella/vespa GitHub repository (beware: work in progress!).

I am currently exploring different ways to linearise my "musical score" edition model (the relevant working files are linked and described below):

Linearisation A: a different XML/TEI file for each transcription layer. Alignment is done through <link> elements included in external XML/TEI files.
Linearisation B: different transcription layers are encoded using Menota 2.0 schema, all in one XML/TEI file. The alignment is still done through <link> elements, but in the same file.

At the bottom of this page, I am also publishing:

a roadmap
a to-do list

Linearisation A (many XML files)

Linearisation A: a different XML/TEI file for each transcription layer. Alignment is done through <link> elements included in external XML/TEI files.

The files listed below are the ones I am currently experimenting with. They are available in the paolomonella/vespa GitHub repository. Some of them are described in my talk Many witnesses, many layers. Some methodological issues are discussed in more depth in my talk In the Tower of Babel, and in more detail still in the article I derived from that talk.

align_alph_graph.xml This file aligns the alphabetic and the graphic transcriptions.
align_alph_ling.xml This file aligns the alphabetic and the linguistic transcriptions.
align_graph_ling.xml This file aligns the graphic and the linguistic transcriptions.
alphabetic.xml The alphabetic transcription.
graphic.xml The graphic transcription. It now includes the <charDecl> element.
input.py The Python script that processes the csv files and generates all xml files.
linguistic.xml The linguistic transcription.
table_graphemes.csv The table of signs for the graphic layer.
table_alphabemes.csv The table of signs for the alphabetic layer.
transcription.csv The file in which I key the transcription (I am currently keying the graphic and the linguistic transcriptions 'by hand').

My current workflow in a nutshell: I edit the two CSV files 'by hand'. The script input.py processes them and generates the XML files. The XML files whose names start with align_ are the alignment files. The script also transforms the "tables of signs" into a complex <charDecl> element (and prepends it to the relevant XML files).
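The step that turns a "table of signs" into a <charDecl> element can be sketched roughly as follows. This is not the code of input.py: the column names (id, description, display), the sample rows and the choice of <glyph>, <desc> and <mapping> as child elements are all assumptions made for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical table content: the real table_graphemes.csv may use
# different columns and different sign identifiers.
SAMPLE_TABLE = """id,description,display
grapheme_m,Letter m as written in the manuscript,m
grapheme_bar,Horizontal stroke used as an abbreviation mark,-
"""

def table_to_chardecl(csv_text):
    """Turn a CSV 'table of signs' into a TEI-like <charDecl> element."""
    char_decl = ET.Element("charDecl")
    for row in csv.DictReader(io.StringIO(csv_text)):
        # One <glyph> per sign, identified by xml:id so that
        # transcription files can point at it.
        glyph = ET.SubElement(char_decl, "glyph", {f"{{{XML_NS}}}id": row["id"]})
        ET.SubElement(glyph, "desc").text = row["description"]
        ET.SubElement(glyph, "mapping", {"type": "display"}).text = row["display"]
    return char_decl

char_decl = table_to_chardecl(SAMPLE_TABLE)
print(ET.tostring(char_decl, encoding="unicode"))
```

Keeping the table in CSV and generating the XML from it means that the sign definitions are edited in one place only.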

[Figure: the source code of the three transcription files, with arrows representing the alignment between them.]

[Figure: a snippet of the source code of one of the alignment files, namely align_alph_graph.xml.]
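As a rough illustration of what an external alignment file contains, the following sketch generates <link> elements whose target attribute points into two transcription files. The element IDs (a1, g1, ...), the <linkGrp> wrapper and the helper name build_link_group are hypothetical; the actual IDs and layout used in align_alph_graph.xml may differ.

```python
import xml.etree.ElementTree as ET

def build_link_group(pairs, file_a, file_b):
    """Build a <linkGrp> with one <link> per aligned pair of IDs."""
    link_grp = ET.Element("linkGrp", {"type": "alignment"})
    for id_a, id_b in pairs:
        # Each target holds two pointers, one into each external file.
        target = f"{file_a}#{id_a} {file_b}#{id_b}"
        ET.SubElement(link_grp, "link", {"target": target})
    return link_grp

# Hypothetical IDs: two alphabetic letters (a2, a3) are aligned with
# the same grapheme (g2), as happens with an abbreviation sign.
pairs = [("a1", "g1"), ("a2", "g2"), ("a3", "g2")]
link_grp = build_link_group(pairs, "alphabetic.xml", "graphic.xml")
print(ET.tostring(link_grp, encoding="unicode"))
```

Because the links live outside the transcription files, the same element can take part in several alignments without any change to the transcription itself.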

Linearisation B (Menota)

Chapter 3 of The Menota Handbook v 2.0 also allows a text to be encoded at three layers, which Menota calls "facsimile", "diplomatic" and "normalised" (roughly corresponding to my "graphic", "alphabetic" and "linguistic" layers). To do so, Menota adds three elements to XML/TEI, namely <me:facs>, <me:dipl> and <me:norm>.

The resulting code looks like this:


<w>
   <choice>
      <me:facs>&drot;<am>&osup;</am>ttin<am>&bar;</am></me:facs>
      <me:dipl>d<ex>ro</ex>ttin<ex>n</ex></me:dipl>
      <me:norm>Dróttinn</me:norm>
   </choice>
</w>
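To make the word-level granularity of this markup concrete, here is a minimal Python sketch that reads the three layers of one <w> element. It works under several assumptions: the namespace URI bound to the me: prefix is assumed, and the entity references of the original snippet (&drot;, &osup;, &bar;) are replaced by plain placeholder characters so that the sample parses without a DTD.

```python
import xml.etree.ElementTree as ET

ME = "http://www.menota.org/ns/1.0"  # assumed namespace URI for the me: prefix

# Simplified sample: placeholder characters stand in for the entities.
SAMPLE = f"""<w xmlns:me="{ME}">
  <choice>
    <me:facs>d<am>o</am>ttin<am>-</am></me:facs>
    <me:dipl>d<ex>ro</ex>ttin<ex>n</ex></me:dipl>
    <me:norm>Dróttinn</me:norm>
  </choice>
</w>"""

def read_layers(word_xml):
    """Return the text of each Menota layer of a single <w> element."""
    word = ET.fromstring(word_xml)
    layers = {}
    for name in ("facs", "dipl", "norm"):
        element = word.find(f"choice/{{{ME}}}{name}")
        # itertext() flattens child elements such as <am> and <ex>.
        layers[name] = "".join(element.itertext())
    return layers

layers = read_layers(SAMPLE)
print(layers)
```

The three strings are aligned only because they sit in the same <w>: nothing in the markup says which sign of one layer corresponds to which sign of another.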

The main differences between the current Menota encoding practice (as far as I know it) and the goals of the present edition are the following:

1. The finest granularity allowed by the Menota markup scheme is word-level granularity, while I need alignment at grapheme-level granularity.
2. All three Menota transcription layers share the same set of 'characters', while my "graphic" and "alphabetic" transcriptions each have a different set of elements, each completely described in a specific 'table of signs', and my "linguistic" transcription encodes inflected words not as sequences of letters but with unique IDs.
3. I want to have formal and explicit definitions of each element (grapheme and alphabetic letter) used in the transcription, while Menota relies on Unicode to define each encoded sign.

So I am trying to tweak the Menota markup to fit my own goals. The first result of this ongoing (as of 29/12/2012) experiment is the following file:

menota.xml

As alignment between word and graphemes and between word and alphabetic letters is guaranteed by the inclusion of the Menota <me:facs> and <me:dipl> elements in the <w> element, the only alignment that still requires <link> elements is the one between graphemes and alphabetic letters. These <link> elements are included in the menota.xml file. The XML file now also includes the <charDecl> element (but only for the Table of signs/graphemes; I don't know how to also include the Table of signs/alphabemes in the header of the same XML file).

The major tweaks I'm introducing or am about to introduce to address the three issues listed above are the following:

For issue 1 above: I'm inserting <g> and <c> elements (for graphemes and alphabetic letters respectively) as children of the <w> element, so
- alignment between word and graphemes and between word and alphabetic letters is automatic
- I can still align graphemes and alphabetic letters through <link> elements (as each <g> and <c> has an ID).
Whether <g> and <c> elements can be children of <w> in the Menota schema is something (that seems very plausible, but) I still have to check.
For issues 2 and 3 above: I am still using two 'tables of signs' (one for graphemes and one for alphabetic letters). All <g> and <c> elements have @ref attributes pointing to elements defined in the first and second table respectively. Note that @ref attributes for graphemes and alphabemes point to different things: <g id="1.1" ref="grapheme_m"> <c id="1.1.1" ref="alphabeme_m">

Also note that in Linearisation A above (the one with three different XML files and no Menota markup), there was no need to differentiate between grapheme_m and alphabeme_m, simply because each transcription (graphic, alphabetic) went into a different XML file, and each XML transcription file should include a different 'table of signs' in its TEI Header.
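The tweak described above for Linearisation B can be sketched in Python as follows: a <w> element receives <g> and <c> children, and each @ref is checked against the table of signs of its own layer. The contents of the two tables and the helper names build_word and unresolved_refs are hypothetical, and the sketch does not check validity against the Menota schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical contents of the two tables of signs.
GRAPHEMES = {"grapheme_m", "grapheme_bar"}
ALPHABEMES = {"alphabeme_m", "alphabeme_u"}

def build_word(graphemes, alphabemes):
    """Build a <w> element with <g> and <c> children carrying id and ref."""
    word = ET.Element("w")
    for g_id, ref in graphemes:
        ET.SubElement(word, "g", {"id": g_id, "ref": ref})
    for c_id, ref in alphabemes:
        ET.SubElement(word, "c", {"id": c_id, "ref": ref})
    return word

def unresolved_refs(word):
    """List (id, ref) pairs whose ref is missing from the table of its layer."""
    missing = []
    for tag, table in (("g", GRAPHEMES), ("c", ALPHABEMES)):
        for element in word.findall(tag):
            if element.get("ref") not in table:
                missing.append((element.get("id"), element.get("ref")))
    return missing

word = build_word([("1.1", "grapheme_m")], [("1.1.1", "alphabeme_m")])
print(unresolved_refs(word))

# A grapheme that wrongly points into the alphabetic table is reported.
bad_word = build_word([("1.2", "alphabeme_m")], [])
print(unresolved_refs(bad_word))
```

Using distinct prefixes (grapheme_, alphabeme_) is what lets a single file hold both layers without the two sets of identifiers clashing.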

Roadmap

Project outline (January-February 2012)
Paternity leave (February 2012)
Modelling and methodological reflection, including presentations at conferences (March-October 2012)
Realisation of the actual digital scholarly edition (November 2012-February 2013)
Update (November 2014): since the end of my scholarship at the Accademia dei Lincei (December 2012), my work on this prototype slowed down significantly and then, due to other work commitments, ceased. Since 2012, all I have been able to do is give a series of seminars and conference talks sketching out the methodological issues that have arisen (see TEI 2013, DiXiT 2014).

To-do list

Markup of other features (e. g. lines at alphabetic and linguistic layer; etc.)
Enforce resolution to a specific alphabetic sequence of 'non-systematic' abbreviations
Validate graphic.xml etc. against TEI and menota.xml against the Menota schema
~~Create table of signs/alphabemes~~
Change alphabemes' IDs to alph_a, alph_b etc.?
Also change graphemes' IDs to graph_a etc.?
Encode Codex Thuaneus (or Codex Pithoeanus, now Codex Parisinus 8071, late 9th century, today in the French National Library in Paris); IRHT
Prepend the TEI Header to the current XML files
~~Insert the Table of signs/graphemes and the Table of signs/alphabemes in the <charDecl> TEI Header for Linearisation A~~
Insert the Table of signs/graphemes and the Table of signs/alphabemes in the <charDecl> TEI Header for Linearisation B (Menota); solve problem with two tables of signs in the same TEI header [edit as of January 12, 2016: after discussing this issue at the 2013 Rome TEI Conference, I realized that XML/TEI offers no way to insert even one table of signs of this kind in the TEI header, let alone two, so for the Ursus Project I am now formatting this table as a simple CSV file, external to the XML/TEI source code but still parsed by the edition software]
~~Add 'display' column to both tables of signs~~
XSLT → HTML for display of each layer
Collate (CollateX? Juxta? other software?) alphabetic and linguistic levels
HTML/Javascript? for display of collation (XML/TEI Apparatus Criticus module?)
LMNL linearisation of three layers?
Standoff properties encoding of three layers?
Also encode scholarly print editions as witnesses?
Create 3rd linearisation, where alignment is realised through XPath
Complete GitHub publication of my work-in-progress project's code in paolomonella/vespa