Paolo Monella Markup workshop From ASCII to semantic annotation Ghislieri College, Pavia University 2022

Details

Title	Markup workshop: from ASCII to semantic annotation
Event	Digital Humanities in pratica. Il ruolo delle tecnologie digitali nello studio, nella ricostruzione e nella preservazione del mondo antico (programme, details)
When	Friday February 25, 2022, 11.30-13.15 CET
Where	Aula Barbara Rossi, Ghislieri College, Pavia, Italy
Organization	University of Pavia, Italy
Language	Italian

Abstract

A hands-on workshop for humanities students of the University of Pavia on textual markup, ranging from charsets to structural and semantic annotation in TEI XML.

Video streaming and recording

On the YouTube channel of the Ghislieri College

Slides by Monica Berti

Monica Berti, Linked Open Data per il mondo antico, on Zenodo

Tools and software

Workshop attendees will use their own computers
An internet connection will be needed
Software
1. For the first part of the workshop (letter A in the instructions below): the basic plain text editor pre-installed in your operating system (TextEdit on Mac, Notepad on Windows)
2. For the following parts: installing an XML editor is suggested (though not strictly necessary). Those who do not already have an XML editor (such as Oxygen) installed can download the free, open source XML Copy Editor
Account
- If we have time to use Recogito, each attendee will need a (free) Recogito account

Workshop plan

11:30	11:55	A. From charset to markup
11:55	12:15	B. Structural markup
12:15	12:20	C. Paste the code within a valid ‘frame’ TEI file
12:20	12:25	D. Check and visualize
12:25	12:40	E. Entirely manual semantic markup
12:40	12:50	F. Manual semantic markup with Recogito
–:–	–:–	(G. Automatic semantic markup with Recogito)
12:50	13:15	H. Manual linguistic markup

TEI XML files

From the incipit of Caesar, De bello Gallico (book 1, chapter 1, sections 1 and 2). Text from Perseus, whose source print edition is: C. Julius Caesar. C. Iuli Commentarii Rerum in Gallia Gestarum VII A. Hirti Commentarius VII. T. Rice Holmes. Oxonii. e Typographeo Clarendoniano. 1914. Scriptorum Classicorum Bibliotheca Oxoniensis.

To download each file, right-click on the link and choose “Save as” (or the corresponding command on your system):

M6 (frame)

M8 (Perseus)

M10

Workshop instructions

A. From charset to markup

A1. Open the standard basic plain text editor pre-installed in your operating system (TextEdit on Mac, Notepad on Windows)
A2. Follow the instructions during the workshop… and reinvent charsets and markup!

B. Structural markup

B1. Download file M1
- It includes the plain text of (part of) the incipit of the De bello Gallico
B2. Mark the book header and the paragraphs
- Mark (that is, wrap in TEI XML tags):
  - “COMMENTARIUS PRIMUS” as a header (tag <head>)
  - Each of the following three paragraphs with the paragraph tag <p>:
    - “Gallia est omnis divisa… Galli appellantur”
    - “Hi omnes lingua, institutis, legibus inter se differunt…”
    - “Apud Helvetios longe nobilissimus fuit et ditissimus Orgetorix…”
- The result should look like file M2
  - Our file now has one <head> and three <p> elements
B3. Add two <div> tags for chapters:
- One wrapping the “Gallia est…” and the “Hi omnes…” paragraphs
- A second one wrapping the “Apud Helvetios…” paragraph only
- The result should look like file M3
B4. Add one more <div> (for book 1) wrapping everything
- The result should look like file M4
  - This file has the <div> for book 1 as “root” element.
B5. Add attributes to <div> elements (example: <div type="chapter" n="99">) as follows:

Element	Attribute	Value
The `<div>` for book 1 (the one wrapping our whole text)	`type`	`book`
	`n`	`1`
The `<div>` for the first chapter (the one with “Gallia est…” and “Hi omnes…”)	`type`	`chapter`
	`n`	`1`
The `<div>` for the second chapter (the one with “Apud Helvetios…”)	`type`	`chapter`
	`n`	`2`

B6. The result should look like file M5
- This file has all the structural markup that we want
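
As a rough sketch of where steps B2 to B5 lead (file M5 remains the reference), the nesting might look like the following. The paragraph text is abbreviated with ellipses exactly as in the step descriptions above, and the indentation is only there for readability.

```xml
<!-- Sketch only: paragraph text is abbreviated, as in the instructions above -->
<div type="book" n="1">
  <head>COMMENTARIUS PRIMUS</head>
  <div type="chapter" n="1">
    <p>Gallia est omnis divisa… Galli appellantur</p>
    <p>Hi omnes lingua, institutis, legibus inter se differunt…</p>
  </div>
  <div type="chapter" n="2">
    <p>Apud Helvetios longe nobilissimus fuit et ditissimus Orgetorix…</p>
  </div>
</div>
```

Every opening tag has a matching closing tag, and the chapter <div> elements sit entirely inside the book <div>.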

C. Paste the code within a valid ‘frame’ TEI file

C1. Download file M6 as a separate file, and open it
- M6 has a <teiHeader> with the metadata (author, work…), as well as empty <text> and <body> elements
  - Those elements are needed to make our file valid TEI
  - M6 serves as a ‘frame’ to our code
C2. Paste your TEI code within the M6 ‘frame’ file
- Copy/paste your code (the text that you marked) in the M6 file, right after the comment 
- Save your edited M6 file (including your own code)
  - This is our working file now
- The result should look like file M7
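
For orientation, a minimal TEI ‘frame’ could look roughly like the sketch below. This is not the content of M6: the header statements are placeholders written for illustration, and the real M6 header is likely to carry fuller metadata. It only shows where the marked-up text sits relative to <teiHeader>, <text> and <body>.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative frame only; the header text below is a placeholder, not M6's metadata -->
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>De bello Gallico</title>
        <author>Caesar</author>
      </titleStmt>
      <publicationStmt>
        <p>Workshop exercise file (placeholder)</p>
      </publicationStmt>
      <sourceDesc>
        <p>Text from Perseus (placeholder)</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <!-- the marked-up text from part B is pasted here -->
      <div type="book" n="1">
        <head>COMMENTARIUS PRIMUS</head>
        <div type="chapter" n="1">
          <p>Gallia est omnis divisa… Galli appellantur</p>
        </div>
      </div>
    </body>
  </text>
</TEI>
```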

D. Check and visualize

D1. Check if your working document is valid TEI
- Go to the TBE Validation Service with your browser
- Upload your working document: “Sfoglia” (Browse) / “upload” (it may take a few seconds)
- Validate it: “Validate input as autonomous TEI fragment” / “Validate!”
- If the “Validation Result” box turns green, the document is valid
  - If it turns red, it is invalid
D2. If you are curious about “real-world” TEI encoding practices, take a peek at (large) file M8
- This is the actual TEI XML file of the De bello Gallico from Perseus
- It has a more fine-grained structural markup (more <div>s etc.), but still no semantic annotation
D3. Also see the Perseus Website HTML visualization of the incipit
- You can click on the “XML” orange button below the text to see the TEI XML source code for that portion of the text

E. Entirely manual semantic markup

E1. Let us start adding semantic markup manually. Go to Ancient Places in Pleiades to find the LOD URI of place name “Gallia”
- Click on “Advanced Search” (top right on the page)
- Type “Gallia” in the “Title” field (not in the “Search Text” field)
- Choose the best match and click on it
- Copy the value of the “Canonical URI for this page”
  - It is a URI starting with https://pleiades…
  - This is our LOD URI for Gaul
  - (Don’t close this page: we’ll need that URI)
E2. Link the word “Gallia” in your TEI working file to the LOD URI
- Mark the word “Gallia” as in the following example (edit as appropriate): <placeName ref="https://pleiades.stoa.org/places/462086">Agrigentum</placeName>
E3. Optional: if you have time, repeat the same procedure to mark up the names of populations, such as Belgae or Helvetii
- In addition to Ancient Places in Pleiades, you can also search other “triple stores” such as Wikidata
- In Wikidata, the LOD URI is the URL (web address) of each page and has this format: https://www.wikidata.org/wiki/Q...
E4. Link the person name “Orgetorix” to its LOD URI
- Search for “Orgetorix” in Wikidata
- Get its LOD URI
- Mark the word “Orgetorix” in your file as in the following code samples (edit as appropriate): <persName ref="https://www.wikidata.org/wiki/Q8833">Maecenatem</persName> or <persName ref="https://pir.bbaw.de/id/8672">Maecenatem</persName>
E5. The result should look like file M9
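
To see steps E2 and E4 in context, here is a hedged sketch of how the tags sit inside the running text. The identifiers in the `ref` values are deliberately left as placeholders (`PLEIADES-ID`, `Q...`): replace them with the URIs you copied from Pleiades and Wikidata. File M9 remains the reference.

```xml
<!-- Placeholders: PLEIADES-ID and Q... stand for the identifiers you looked up yourself -->
<p><placeName ref="https://pleiades.stoa.org/places/PLEIADES-ID">Gallia</placeName> est omnis divisa… Galli appellantur</p>

<p>Apud Helvetios longe nobilissimus fuit et ditissimus <persName ref="https://www.wikidata.org/wiki/Q...">Orgetorix</persName>…</p>
```

The tags wrap only the name itself, and the rest of the sentence stays as plain text inside <p>.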

F. Manual semantic markup with Recogito

F1. Log in to Recogito
- Create a free account if you don’t have one
F2. Upload file M7 to your Recogito workspace
- “New” (blue button, top left)
- “File upload” → choose and upload M7 from your computer
F3. Ask Recogito to assist you in marking up the place name “Gallia”
- Double-click on M7 from the Recogito file list
  - The M7 file opens up in Recogito (markup is hidden)
  - Keep the default settings:
    - “Annotation mode: Normale/Normal”
    - “Colori/Colors: Per stato di verifica” (by verification status)
- Highlight the word “Gallia” (from the sentence “Gallia est omnis divisa…”)
- Click on “Luogo/Place”
  - Recogito suggests a LOD URI (from Pleiades)
- Check if it is correct
- If it is, click on “OK” or “OK e successivo” (OK and next)
  - Recogito asks: “There is 1 more un-annotated occurrence of Gallia in the text. Do you want to re-apply this annotation?”
  - I suggest answering “Yes”
F4. Mark up population names (such as “Belgae”)
- (We are assuming that M7 is still open in Recogito)
- Again, click on “Luogo/Place”
- Issue: Recogito suggests linking “Celtae” and “Galli” (populations) to the LOD URI of the place “Gallia”. Is this what we want? We can talk about it…
F5. Mark up the person name “Orgetorix”
- (We are assuming that M7 is still open in Recogito)
- This time, click on “Persona/Person”
- …but it does not suggest a LOD URI
F6. Recogito data visualizations
- Click on the “Annotation Statistics” icon on top
- Click on the “Map View” icon on top
F7. Download the annotated file
- Click on the download icon on top
- Choose the format
  - We can discuss what those formats mean
  - But on this occasion, we’ll choose “Annotated Document → TEI → TEI/XML” (blue button on the right)

(G. Automatic semantic markup with Recogito)

(Not working as of 24.02.2022)

G1. Log in: same as F1
G2. File upload: same as F2
G3. Ask Recogito to automatically parse the whole file:
- Select M7 from the Recogito file list
- “Options” (top right)
- “Named Entity Recognition”
- “Herodotus Latin NER” in the pop up window
- “Start NER” (bottom of the pop up window)
G4. Check (manually) if the LOD URIs are correct
- By clicking on annotated words in Recogito
G5. File download: same as F7

H. Manual linguistic markup

H1. Mark the word “Gallia” with the tag <w> (for word) as in the following code sample (edit as appropriate):

<placeName ref="https://pleiades.stoa.org/places/462086">
    <w>Agrigentum</w>
</placeName>

H2. Find the LOD URI of lemma “Gallia, -ae” in the LiLa Query Interface:
- Click on “Lemma”
- Search “gallia” in the “Search lemma” field
- Click on the “Open data sheet” icon
- Copy the LOD URI of the lemma (LiLa LOD URIs have this format: http://lila-erc.eu/data/id/lemma/...)
H3. Give <w> the attribute lemma, with the LOD URI of the lemma as its value, as in the following code sample (edit as appropriate):

<placeName ref="https://pleiades.stoa.org/places/462086">
    <w lemma="http://lila-erc.eu/data/id/lemma/2027">Agrigentum</w>
</placeName>

H4. The result should look like file M10
H5. LiLa provides automatic linguistic annotation, but Marco Passarotti and Francesco Mambrini will cover this in their workshop tomorrow

