Shakespeare Corpora

The full and parsed texts of the plays are available in two arrangements. The arrangements overlap, but we thought it would be most helpful for users to have both at their fingertips, so that they can download and organize the content in whichever way best suits their text analysis:

Shakespeare Corpus divided by Play, Act, and Scene Level -- a hierarchy of folders and content organized first by level: Acts (then play), Full Plays Scrubbed, Full Plays Not Scrubbed, and Scenes (then play).

Shakespeare Corpus divided by Play -- a hierarchy of folders and content organized first by Play, then by the parsed sections of Acts and Scenes. Each Play folder also contains two text files of the full play: one scrubbed, with character names and stage directions removed, and one not scrubbed.

Our folder and file naming conventions are as follows:

FullPlaysnoSCRB-- This level includes all 39 plays in the Shakespeare canon. They are taken directly from Project Gutenberg, and the only change is that the Gutenberg boilerplate (its terms and conditions) has been removed. Everything else remains in these files: punctuation, capitalization, character names, and stage directions.

FullPlays_SCRBnoCNnoSD-- This level also includes all 39 plays, but they have been scrubbed: digits are removed, all text is made lowercase, and punctuation is removed, except for hyphens and word-internal apostrophes. The character names and stage directions have also been removed, manually, because Lexos does not yet have that capability. For example:

Before scrubbing:

ADAM. Yonder comes my master, your brother.

ORLANDO. Go apart, Adam, and thou shalt hear how he will shake me up.

[ADAM retires]

OLIVER. Now, sir! what make you here?

After scrubbing:

yonder comes my master your brother

go apart adam and thou shalt hear how he will shake me up


now sir what make you here

Character names are removed because leaving them in would skew the data in favor of the play in certain tests. However, if a character is referred to within the dialogue itself, like Adam in the example above, the name is left in. Stage directions are removed because they are often not part of the original text but were added later by editors.
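
The removal described above was done by hand. Purely as an illustration, the sketch below shows how a first pass might be automated for lines formatted like the example (a speaker label in capitals followed by a period, and stage directions in square brackets on a single line). The function name and both patterns are assumptions made for this sketch; they are not part of Lexos or of the process used for this corpus, and real Gutenberg text contains other layouts that would still need manual review.

```python
import re

# Assumed format: a speaker label is a run of capital letters and spaces
# at the start of a line, ending in a period (e.g. "ADAM. ").
SPEAKER = re.compile(r"^[A-Z][A-Z ]*\.\s*")
# Assumed format: a stage direction is bracketed and fits on one line.
STAGE_DIRECTION = re.compile(r"\[[^\]]*\]")


def strip_names_and_directions(text):
    """Drop speaker labels and bracketed stage directions (hypothetical helper)."""
    kept = []
    for line in text.splitlines():
        line = STAGE_DIRECTION.sub("", line)
        line = SPEAKER.sub("", line)
        if line.strip():  # skip lines that are now empty
            kept.append(line.strip())
    return "\n".join(kept)


sample = (
    "ADAM. Yonder comes my master, your brother.\n"
    "\n"
    "[ADAM retires]\n"
    "\n"
    "OLIVER. Now, sir! what make you here?"
)
print(strip_names_and_directions(sample))
```

Names spoken inside the dialogue, such as "Adam" in Orlando's line, are untouched because only a label at the start of a line is matched.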

Acts_SCRBnoCNnoSD-- This level breaks each of the 39 plays into its five acts. The acts are split from the FullPlays_SCRBnoCNnoSD corpus, so they are scrubbed as above and do not include character names or stage directions.

Scenes_SCRB_noCNnoSD-- This level includes all of the scenes in all of the plays. The naming convention is as shown below:

Filename: PlaynameActNoSceneNo_NoCNnoSD, e.g., CymbelineActiiiSceneii_NoCNnoSD

These scenes are scrubbed by the same conventions as the acts and full plays, and likewise have the character names and stage directions removed.
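
For anyone sorting or grouping the scene files programmatically, the naming convention can be parsed with a regular expression. This is a minimal sketch that assumes every file follows the single example given (lowercase Roman numerals and the `_NoCNnoSD` suffix); the function names are hypothetical, and files that deviate from the pattern (for instance, a prologue or induction) would return `None`.

```python
import re

# Pattern inferred from the one documented example,
# "CymbelineActiiiSceneii_NoCNnoSD"; treat it as an assumption.
SCENE_NAME = re.compile(
    r"^(?P<play>.+?)Act(?P<act>[ivx]+)Scene(?P<scene>[ivx]+)_NoCNnoSD$"
)


def roman_to_int(numeral):
    """Convert a lowercase Roman numeral built from i, v, x to an integer."""
    values = {"i": 1, "v": 5, "x": 10}
    total = 0
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        value = values[ch]
        # A smaller value before a larger one is subtracted (iv = 4).
        total += -value if values.get(nxt, 0) > value else value
    return total


def parse_scene_filename(name):
    """Return (play, act, scene) or None if the name does not fit."""
    stem = name[:-4] if name.lower().endswith(".txt") else name
    match = SCENE_NAME.match(stem)
    if match is None:
        return None
    return (
        match.group("play"),
        roman_to_int(match.group("act")),
        roman_to_int(match.group("scene")),
    )


print(parse_scene_filename("CymbelineActiiiSceneii_NoCNnoSD"))
# ('Cymbeline', 3, 2)
```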


Scrubbing means preprocessing a text file to remove or normalize selected features. The scrubber tool in Lexos has many options, but the most commonly used are removing digits, removing punctuation (optionally keeping hyphens and word-internal apostrophes), and making the whole document lowercase.

Removing digits strips the numbers from the file, which is useful when the lines are numbered or when there are extraneous numbers that are not an original part of the document.

By ridding a document of its punctuation, Lexos prevents an analysis from considering “boy.” and “boy” as different words, due to the period of the former.

Also, in an original document, the words “boy” and “Boy” would be considered two different words, because of the capitalization of the latter. Making the document all lowercase gets rid of this inconsistency that would skew the data.
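
The effect of punctuation and capitalization on word counts can be seen in a small sketch. The sentence below is invented for illustration, and the counting uses Python's standard library rather than Lexos itself.

```python
from collections import Counter

# Invented sample text, not taken from the corpus.
raw = "Boy, come hither. The boy is here, boy."

# Splitting on whitespace alone treats "Boy,", "boy" and "boy." as
# three different words.
naive = Counter(raw.split())

# Stripping the surrounding punctuation and lowercasing merges them.
normalized = Counter(word.strip(".,").lower() for word in raw.split())

print(naive["boy"])       # 1
print(normalized["boy"])  # 3
```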

In the context of the Shakespeare corpus, hyphens and word-internal apostrophes are very important, so they were not scrubbed. Words like “to-night” and “over-careful” are used, and if the hyphens were removed, Lexos would interpret each of them as two separate words instead of one compound word.

A similar problem would arise if word-internal apostrophes were scrubbed from the document. A word that was originally “dog’s” would become “dog s”, and the letter “s” would be counted as its own word. Shakespeare also sometimes used apostrophes to stand in for letters, such as “confin’d” for confined and “sav’d” for saved; without the apostrophe, these would likewise be split into “confin d” and “sav d”.

The main function of the scrubber is to strip a document of its superfluous elements while still keeping the main text of the file, and keeping its integrity intact.
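
The scrubbing options described above can be approximated in a few lines of Python. This is a minimal sketch and not the Lexos implementation: the function name is hypothetical, "word-internal" is assumed to mean a hyphen or apostrophe with an unaccented letter (a-z) on both sides, and removed punctuation is replaced by a space.

```python
import re

DIGITS = re.compile(r"\d+")
# Everything except word characters, whitespace, apostrophes and hyphens.
PUNCTUATION = re.compile(r"[^\w\s'’-]")
# Hyphens or apostrophes that are NOT flanked by letters on both sides.
NOT_INTERNAL = re.compile(r"(?<![a-z])['’-]+|['’-]+(?![a-z])")


def scrub(text):
    """Lowercase, drop digits and punctuation, keep word-internal - and '."""
    text = text.lower()
    text = DIGITS.sub("", text)
    text = PUNCTUATION.sub(" ", text)
    text = text.replace("_", " ")  # "_" counts as a word character in \w
    text = NOT_INTERNAL.sub(" ", text)
    return " ".join(text.split())  # collapse runs of whitespace


print(scrub("OLIVER. Now, sir! what make you here?"))
# oliver now sir what make you here
print(scrub("To-night the dog’s confin’d -- 12 boy."))
# to-night the dog’s confin’d boy
```

Note that this sketch does not remove character names or stage directions, which is why "oliver" survives in the first line of output.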


Additional resources and websites:

Shakespeare Query on Project Gutenberg

Project Gutenberg is an online library that seeks to digitize and archive cultural works, and it is a great resource for finding texts in the public domain. For this corpus, we used The Complete Works of William Shakespeare, which omits only The Two Noble Kinsmen and Edward III; those two plays were pulled from other sources on Gutenberg.

Another website that includes all 39 plays in the Shakespeare canon. A useful feature of this website is that one can jump directly to a specific scene in a play without having to scroll through the whole text to reach it.

OpenSource Shakespeare

A website that includes 37 out of the 39 plays, excluding The Two Noble Kinsmen and Edward III. This is one of the more user-friendly websites, with many helpful features, such as seeing only an individual character’s lines/speeches and cues, or sorting the plays by genre, alphabetically, or chronologically.


Recent Submissions

  • Item
    Shakespeare Corpus by Play
    (Wheaton College (Norton, Massachusetts)., 2017-10-05) Shakespeare, William.; Oliveira, Beth.; Oliveira, Beth.
    Shakespeare Corpus organized first by Play, then by Level (Acts and Scenes).
  • Item
    Shakespeare Corpus divided by Play, Act, and Scene Level
    (Wheaton College (Norton, Massachusetts)., 2017-09-29) Shakespeare, William.; Oliveira, Beth.; Oliveira, Beth.
    Shakespeare Corpus organized and grouped by Level: Acts, Scenes, and Full Plays.