dc.contributor.author: Shakespeare, William.
dc.contributor.editor: Oliveira, Beth.
dc.creator: Oliveira, Beth.
dc.date.accessioned: 2017-10-05T14:22:54Z
dc.date.available: 2017-10-05T14:22:54Z
dc.date.issued: 2017-10-05
dc.identifier.uri: http://hdl.handle.net/11040/24451
dc.description: FullPlaysnoSCRB -- This level includes all 39 of Shakespeare’s plays in the literary canon. They are taken directly from Gutenberg, and the only manipulation is that the boilerplate (Gutenberg's terms and conditions) has been removed. Everything else, including punctuation, capitalization, character names, and stage directions, remains in these files.

FullPlays_SCRBnoCNnoSD -- This level still includes all 39 plays, but they have been scrubbed: digits are removed, the text is made all lowercase, and punctuation is removed, excluding hyphens and word-internal apostrophes. The character names and stage directions have also been removed manually, as Lexos does not have that capability yet. For example:

Before scrubbing:
ADAM. Yonder comes my master, your brother.
ORLANDO. Go apart, Adam, and thou shalt hear how he will shake me up. [ADAM retires]
OLIVER. Now, sir! what make you here?

After scrubbing:
yonder comes my master your brother go apart adam and thou shalt hear how he will shake me up now sir what make you here

Character names are removed because leaving them in would skew the data in favor of the play in certain tests. However, if a character is referred to within the text itself, like Adam in the example above, the name is left in. Stage directions are removed because they are often not part of the original text but were added by later editors.

Acts_SCRBnoCNnoSD -- This level breaks each of the 39 plays into five acts. It is split from the FullPlays_SCRBnoCNnoSD corpus, so the files are scrubbed as above and do not include character names or stage directions.

Scenes_SCRB_noCNnoSD -- This level includes all of the scenes in all of the plays. The naming convention is:
Filename: PlaynameActNoSceneNo_NoCNnoSD, e.g., CymbelineActiiiSceneii_NoCNnoSD
These scenes are scrubbed by the same conventions as the acts and full plays, and also have the character names and stage directions removed.

----

Scrubbing means preparing a text file for analysis by removing or normalizing selected features. The scrubber tool in Lexos has many options, but the most commonly used are removing digits, removing punctuation (optionally keeping hyphens and word-internal apostrophes), and making the whole document lowercase.

Removing digits strips the numbers from the file, which is useful when the lines are numbered or when there are extraneous numbers that are not an original part of the document. Removing punctuation prevents an analysis from treating “boy.” and “boy” as different words because of the period on the former. Likewise, in an original document “boy” and “Boy” would be counted as two different words because of the capitalization of the latter; making the document all lowercase removes this inconsistency, which would otherwise skew the data.

In the Shakespeare corpus, hyphens and word-internal apostrophes are very important, so they were not scrubbed. Words like “to-night” and “over-careful” are used, and if the hyphens were removed, Lexos would interpret each as two separate words instead of one compound word. Something similar would happen if word-internal apostrophes were scrubbed: a word that was originally “dog’s” would become “dog s”, and the letter ‘s’ would be counted as its own word. Shakespeare also sometimes used apostrophes to stand in for letters, such as “confin’d” for confined and “sav’d” for saved.

The main function of the scrubber is to strip a document of its superfluous elements while keeping the main text of the file and its integrity intact.
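The scrubbing rules described above (lowercase, remove digits, remove punctuation but keep hyphens and word-internal apostrophes) can be sketched in a few lines of Python. This is only an illustrative approximation, not the Lexos scrubber itself: the function name `scrub`, the choice to treat both straight and curly apostrophes as apostrophes, and the choice to replace removed punctuation with a space are all assumptions made for this example. It does not remove character names or stage directions, which the description says were removed manually.

```python
import re

# Assumption: both the straight (') and the curly (’) apostrophe are treated as apostrophes.
APOSTROPHES = "'\u2019"


def scrub(text: str) -> str:
    """Hypothetical sketch of the scrubbing described in this record."""
    # Make the whole document lowercase so "Boy" and "boy" match.
    text = text.lower()
    # Remove digits (e.g. line numbers).
    text = re.sub(r"\d", "", text)
    # Replace punctuation with a space, keeping hyphens and apostrophes for now.
    text = re.sub(rf"[^\w\s{APOSTROPHES}-]|_", " ", text)
    # Drop apostrophes that do not have a word character on both sides,
    # so only word-internal ones (dog's, confin’d) survive.
    text = re.sub(rf"(?<!\w)[{APOSTROPHES}]|[{APOSTROPHES}](?!\w)", "", text)
    # Collapse the runs of whitespace left behind by the removals.
    return " ".join(text.split())


if __name__ == "__main__":
    # Sample passage from the description, with names and stage directions already removed.
    sample = (
        "Yonder comes my master, your brother. "
        "Go apart, Adam, and thou shalt hear how he will shake me up. "
        "Now, sir! what make you here?"
    )
    print(scrub(sample))
    # prints: yonder comes my master your brother go apart adam and thou
    #         shalt hear how he will shake me up now sir what make you here
    #         (on a single line)
```

Replacing punctuation with a space rather than deleting it mirrors the “dog s” behavior the description warns about, which is why apostrophes are handled in a separate step.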
dc.description.abstract: Shakespeare Corpus organized first by Play, then by Level (Acts and Scenes).
dc.description.abstract: This zip file downloads a hierarchy of folders and content organized first by Play, then by parsed sections of Acts and Scenes. Each Play folder contains two text files of the play: one scrubbed, with character names and stage directions removed, and one unscrubbed, with them retained.
dc.language.iso: en_US (en_US)
dc.publisher: Wheaton College (Norton, Massachusetts). (en_US)
dc.relation.uri: http://lexos.wheatoncollege.edu
dc.subject: stylometry
dc.subject: computational thinking
dc.subject: text analysis
dc.subject: Lexomics
dc.subject.lcsh: Shakespeare, William, 1564-1616.
dc.title: Shakespeare Corpus by Play (en_US)
dc.type: Other (en_US)


This item appears in the following Collection(s)

  • Shakespeare Corpora
    This collection features all thirty-nine of Shakespeare's plays in the literary canon.
