|dc.description||FullPlaysnoSCRB-- This level includes all 39 of Shakespeare’s plays in the literary canon. They are taken directly from Project Gutenberg, and the only change is that the Gutenberg boilerplate (its terms and conditions) has been removed. Everything else (punctuation, capitalization, character names, and stage directions) remains in these files.
FullPlays_SCRBnoCNnoSD-- This level still includes all 39 plays, but they have been scrubbed: digits are removed, all text is made lowercase, and punctuation is removed, except for hyphens and word-internal apostrophes. The character names and stage directions have also been removed manually, as Lexos does not have that capability yet. In the example below, the original text (with character names, punctuation, and capitalization) comes first, followed by the scrubbed, all-lowercase version:
ADAM. Yonder comes my master, your brother.
ORLANDO. Go apart, Adam, and thou shalt hear how he will shake me
up. [ADAM retires]
OLIVER. Now, sir! what make you here?
yonder comes my master your brother
go apart adam and thou shalt hear how he will shake me
up
now sir what make you here
Character names are removed because leaving them in, repeated before every speech, would skew the data in favor of the play in certain tests. However, if a character is referred to within the text itself, like Adam in the example above, the name is left in. Stage directions are removed because they are often not part of the original text, but were added by later editors.
Acts_SCRBnoCNnoSD-- This level breaks each of the 39 plays into its five acts. The acts are split from the FullPlays_SCRBnoCNnoSD corpus, so they are scrubbed as above and do not include character names or stage directions.
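The removal described above was done by hand. As a rough illustration only, the sketch below shows how it might be approximated in Python for text laid out like the example, assuming that every speech begins with an all-caps speaker label ending in a period and that stage directions sit in square brackets. The function name and both patterns are hypothetical and are not part of Lexos.

```python
import re

# Assumptions (not from Lexos): a speaker label is an all-caps name followed by
# a period at the start of a line, and stage directions are in square brackets.
SPEAKER_LABEL = re.compile(r"^\s*[A-Z][A-Z ]*\.\s+")
STAGE_DIRECTION = re.compile(r"\[[^\]]*\]")


def strip_names_and_directions(text: str) -> str:
    """Remove speaker labels and bracketed stage directions from a passage."""
    # Remove stage directions first; [^\]] also matches newlines, so a
    # direction that wraps onto the next line is removed as well.
    text = STAGE_DIRECTION.sub("", text)
    cleaned = []
    for line in text.splitlines():
        # Only a label at the very start of a line is removed, so a name
        # spoken inside the dialogue (like "Adam") is kept.
        line = SPEAKER_LABEL.sub("", line).strip()
        if line:
            cleaned.append(line)
    return "\n".join(cleaned)


passage = (
    "ADAM. Yonder comes my master, your brother.\n"
    "ORLANDO. Go apart, Adam, and thou shalt hear how he will shake me\n"
    "up. [ADAM retires]\n"
    "OLIVER. Now, sir! what make you here?"
)
print(strip_names_and_directions(passage))
```

A pattern like this would also remove a line of dialogue that happens to begin with a capitalized word followed by a period, so its output would still need checking by hand.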
Scenes_SCRB_noCNnoSD-- This level includes all of the scenes in all of the plays. The naming convention is as shown below:
Filename: PlaynameActNoSceneNo_NoCNnoSD, e.g., CymbelineActiiiSceneii_NoCNnoSD
These scenes are scrubbed by the same conventions as the acts and full plays, and likewise have the character names and stage directions removed.
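When working with the scene files programmatically, the naming convention can be split back into its parts. The sketch below is a minimal example, assuming that act and scene numbers are always lowercase Roman numerals as in the Cymbeline example and that the name is given without a file extension. The helper names are hypothetical.

```python
import re

# Pattern for names such as "CymbelineActiiiSceneii_NoCNnoSD".
# Assumption: act and scene numbers are lowercase Roman numerals (i, v, x).
SCENE_NAME = re.compile(
    r"^(?P<play>.+?)Act(?P<act>[ivx]+)Scene(?P<scene>[ivx]+)_NoCNnoSD$"
)
ROMAN_VALUES = {"i": 1, "v": 5, "x": 10}


def roman_to_int(numeral: str) -> int:
    """Convert a lowercase Roman numeral such as 'iv' to an integer."""
    total = 0
    for index, char in enumerate(numeral):
        value = ROMAN_VALUES[char]
        # A smaller value before a larger one is subtracted (iv = 4).
        if index + 1 < len(numeral) and ROMAN_VALUES[numeral[index + 1]] > value:
            total -= value
        else:
            total += value
    return total


def parse_scene_name(name: str) -> tuple[str, int, int]:
    """Return (play, act number, scene number) for a scene file name."""
    match = SCENE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a scene file name: {name!r}")
    return match["play"], roman_to_int(match["act"]), roman_to_int(match["scene"])


print(parse_scene_name("CymbelineActiiiSceneii_NoCNnoSD"))
```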
Scrubbing means preprocessing a text file to remove or normalize certain features. The scrubber tool in Lexos has many options, but the most commonly used are removing digits, removing punctuation (optionally keeping hyphens and word-internal apostrophes), and making the whole document lowercase.
Removing digits strips the numbers from the file, which is useful when lines are numbered or when there are extraneous numbers that are not part of the original document.
Removing punctuation prevents an analysis from treating “boy.” and “boy” as different words because of the trailing period.
Similarly, in an unscrubbed document the words “boy” and “Boy” would be counted as two different words because of the capital letter. Making the document all lowercase removes this inconsistency, which would otherwise skew the data.
In the context of the Shakespeare corpus, hyphens and word-internal apostrophes are very important, so they were not scrubbed. Words like “to-night” and “over-careful” are used, and if the hyphens were removed, Lexos would interpret each as two separate words instead of one compound word.
A similar phenomenon would happen if word-internal apostrophes were scrubbed from the document. A word that was originally “dog’s” would become “dog s” and the letter ‘s’ would be counted as its own word. Also, Shakespeare sometimes used apostrophes to stand in for letters, such as “confin’d” for confined, “sav’d” for saved, etc.
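The scrubbing steps described above can be sketched in a few lines of Python. This is not the Lexos implementation, only a minimal approximation under stated assumptions: the text uses ASCII letters, curly apostrophes are first converted to straight ones, and a hyphen or apostrophe is kept only when it sits between two letters. The function name `scrub` is hypothetical.

```python
import re


def scrub(text: str) -> str:
    """Lowercase text and drop digits and punctuation, keeping word-internal
    hyphens and apostrophes (a sketch, not the Lexos scrubber itself)."""
    # Assumption: treat the curly apostrophe the same as the straight one.
    text = text.replace("\u2019", "'")
    text = text.lower()
    # Remove digits.
    text = re.sub(r"\d", "", text)
    # Drop a hyphen or apostrophe unless it has a letter on both sides.
    text = re.sub(r"(?<![a-z])['-]|['-](?![a-z])", " ", text)
    # Replace every remaining character that is not a letter, hyphen,
    # apostrophe, or whitespace with a space.
    text = re.sub(r"[^a-z'\-\s]", " ", text)
    # Collapse runs of whitespace.
    return " ".join(text.split())


print(scrub("Now, sir! what make you here?"))
print(scrub("To-night the Boy\u2019s wound is sav\u2019d. 12"))
```

Because compound words and contractions stay in one piece, splitting the scrubbed text on whitespace counts “to-night” and “sav'd” as single words.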
The main function of the scrubber is to strip a document of its superfluous elements while keeping the main text of the file and its integrity intact.||