This data set was created in a project led by Tuomas Heikkilä and Teemu Roos at the University of Helsinki. You can use it freely in your work as long as you attribute the data to its source (presently this blog). We are working on an article which, once published, can be cited in published works.
We are pleased to (finally) announce our latest artificial manuscript tradition, “Julius Caesar”, based on Shakespeare’s play of the same name. The tradition includes 64 manuscripts with 1626 words each on average. It is therefore the most extensive artificial tradition available, and we hope that applying our methods to it will teach us a lot about how they work and how they can be improved.
The tradition was created by hand-copying a part of the play (mainly Act I, Scene II), and then repeatedly making new copies based on the earlier copies. The total number of manuscripts thus created was 95, out of which 31 were held back to simulate a more realistic scenario where not all of the manuscripts are extant. Furthermore, most of the remaining 64 manuscripts were partially deleted to make them appear as real fragmentary manuscripts.
We provide the data as pre-aligned plain text, as well as in Nexus format where each unique word per position is converted to a different letter.
Hence, the text format looks like this:
The Nexus-formatted file looks like this:
BEGIN DATA;
DIMENSIONS NTAX = 64 NCHAR = 4917;
FORMAT SYMBOLS = "acdefghiklmnpqrstwy" LABELS = LEFT;
MATRIX
Eh rrnrrrrnqnrqncdrnnrrr??????????????????
Hu rrnr?nr?qnrqncdnnnrrr????????????dr?rgc
Ye ????????qnrqncdnnnrrr????????????dr?rgc
Ad ????????qnrqncdnnnrrr????????????dr??qc
Zi ????????qnrqncdnnnrrr????????????dr?rgc
Vo ????????dnrdnhdnnnrrr????????????drrren
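To make the word-to-letter conversion concrete, here is a minimal Python sketch of how an aligned text could be turned into Nexus-style character rows. It is an illustration under assumptions, not the procedure used to produce the file: the function name encode_alignment, the toy manuscripts ms1 to ms3 and their words are hypothetical, an empty string is assumed to mark a missing word, and letters are handed out in first-seen order, which does not match the letter assignment in the actual file.

```python
# Symbol set copied from the FORMAT line of the Nexus file.
SYMBOLS = "acdefghiklmnpqrstwy"


def encode_alignment(aligned, missing=""):
    """Encode aligned words as letters, one letter per distinct word per position.

    `aligned` maps a manuscript label to a list of words, all lists having the
    same length. Missing words (assumed here to be empty strings) become '?'.
    """
    labels = list(aligned)
    n_positions = len(aligned[labels[0]])
    encoded = {label: [] for label in labels}
    for pos in range(n_positions):
        codes = {}  # word -> letter; starts afresh at every position
        for label in labels:
            word = aligned[label][pos]
            if word == missing:
                encoded[label].append("?")
                continue
            if word not in codes:
                if len(codes) >= len(SYMBOLS):
                    raise ValueError(f"too many distinct readings at position {pos}")
                # Assumption: letters are assigned in first-seen order.
                codes[word] = SYMBOLS[len(codes)]
            encoded[label].append(codes[word])
    return {label: "".join(chars) for label, chars in encoded.items()}


# Hypothetical toy alignment with three manuscripts and four positions.
toy = {
    "ms1": ["I", "was", "born", ""],
    "ms2": ["I", "was", "borne", "free"],
    "ms3": ["", "were", "born", "free"],
}
matrix = encode_alignment(toy)
for label, row in matrix.items():
    print(label, row)
```

Because the letters are reassigned at every position, the same letter in two different columns does not stand for the same word; only agreement within a column is meaningful.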
The correct stemma, according to which the manuscripts were copied, is below:
The nodes represent manuscripts (labelled with random labels), and the edges indicate exemplar-copy relations. Manuscripts held back from the data appear as points where two or more arrows touch without a node in between. (Note that the graph is slightly misleading just above node Ki, where the arrow overlaps with the arrow leading down to Oy, Id, Aq; there is no held-back manuscript at that point.) As you can see, there are many instances of contamination and multifurcation. The coloring scheme is chosen only for ease of interpretation, and we recommend using the same colors in estimated stemmata.
We hope you will find it interesting to apply your favorite stemmatological methods to the data, and would be very happy to hear your comments.
Note: To properly open the aligned text file in a spreadsheet application (such as Excel or OpenOffice), you may need to “import” the data instead of “opening” it. The import functionality can usually be found under the File or Data menu. Set the file type to “CSV” or “Text”. Select “Tab” as the field delimiter and “None” as the text qualifier. Do not select the option “Treat consecutive delimiters as one”.
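The same import settings can be reproduced in a script. The following Python sketch reads tab-delimited text with the standard csv module, with no text qualifier and with consecutive tabs kept as separate empty fields. The function name read_aligned, the sample content and the file name in the comment are hypothetical, and the UTF-8 encoding is an assumption about the file.

```python
import csv
import io


def read_aligned(handle):
    """Read tab-delimited rows, keeping empty fields and literal quote marks."""
    # delimiter="\t" corresponds to "Tab" as the field delimiter;
    # quoting=csv.QUOTE_NONE corresponds to "None" as the text qualifier.
    # The csv module never merges consecutive delimiters, so gaps stay in place.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader]


# Hypothetical sample standing in for the real file.
sample = io.StringIO('ms1\tms2\tms3\nI\t\tI\n"was\twas\t\n')
rows = read_aligned(sample)
for row in rows:
    print(row)

# With a real file (hypothetical name, UTF-8 assumed):
# with open("aligned.txt", newline="", encoding="utf-8") as handle:
#     rows = read_aligned(handle)
```

Keeping the empty fields matters because the text is pre-aligned: dropping a gap would shift every later word in that row out of its column.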
- Teemu Roos email@example.com
- Tuomas Heikkilä firstname.lastname@example.org