A number of versions have been widely used for research on the history of English. ARCHER 1 dates from 1993, ARCHER 2 from 2004-5. ARCHER 3.1 was completed in summer 2006. The current version is known as ARCHER 3.2 and was completed in 2013.
ARCHER 3.1 was coordinated at Heidelberg from 2004 to 2008 by Marianne Hundt (now in Zurich) and Nadja Nesselhauf. It incorporated a partial clean-up and correction of many files and deletion of duplicate files. The materials of ARCHER 3.1 became accessible to the consortium in summer 2006 with
- a new labelling system for filenames (see below),
- convenient distribution formats, e.g. separate folders for each genre-period-variety combination (955 files in 80 folders) or all in one folder,
- revised word counts,
- a suggested protocol for future compilation of texts.
The corpus was made up as follows.
Text categories and filenames
- Eight genres: d = drama, f = fiction, h = sermons, j = journal or diaries, m = medicine, n = news, s = science, x = letters.
- Seven periods: 2 = 1650-99, 3 = 1700-49, 4 = 1750-99, 5 = 1800-49, 6 = 1850-99, 7 = 1900-49, 8 = 1950-99.
- Two varieties: b = British, a = American.
NB. There will be eleven genres and eight periods in ARCHER 3.2.
Filenames were consistently 8+dot+3 characters (always lower case), according to the formula nnnnabcd.gpv, where
- nnnn = year
- abcd = abbreviation of author’s surname (usually first four letters)
- g = genre (abbreviations as shown)
- p = period
- v = variety
For example, the file 1650penn.j2b is a text written in 1650 by William Penn, is a journal, belongs to the period 1650-99, and is British. (The naming convention has been transparently altered in version 3.2 so that the extension becomes part of the filename, e.g. 1650penn_j2b.) All files in version 3.1 were date-stamped as 31 July 2006 03:10.
Word count (British English texts)
Word count (American English texts)
For more information on ARCHER 3.1, see the CoRD entry for ARCHER.
The current phase of the project, ARCHER 3.2, is intended to enhance the usefulness of the corpus in a number of ways over the period 2008-11 and continuing 2011-13. For the first time, searches can be carried out over the internet. David Denison and Nuria Yáñez-Bouza are coordinating the work being carried out in the international consortium. We had a British Academy Small Research Grant from 1 June 2010 to 31 December 2011 (extended to 1 June 2012) to help fund some of the work at Manchester. For a graphic summary of improvements, see Nuria Yáñez-Bouza’s poster (PDF, 195 KB) (based on that presented at ICAME 32 and now updated to October 2013).
Text categories and filenames
- Twelve genres: a = advertising, d = drama, f = fiction, h = sermons, j = journals, l = legal, m = medicine, n = news, p = early prose, s = science, x = letters, y = diaries.
- Eight periods: 1 = 1600-49, 2 = 1650-99, 3 = 1700-49, 4 = 1750-99, 5 = 1800-49, 6 = 1850-99, 7 = 1900-49, 8 = 1950-99.
- Two varieties: b = British, a = American.
The four extra genres in this version result from the separation of diaries and (travel/political) journals, a special classification ‘early prose’, plus the addition of advertising and legal texts.
Filenames are consistently 8+underscore+3 characters (always lower case), according to the formula nnnnabcd_gpv, where
nnnn = year
abcd = abbreviation of author’s surname (usually first four letters)
g = genre (abbreviations as shown)
p = period
v = variety
For example, the file 1650penn_j2b is a text written in 1650 by William Penn, is a journal, belongs to the period 1650-99, and is British.
- Two tables of British and American English texts classified by genre and period (JPEG, 122 KB)
- Total is 3.3m (3,298,080 words = 1,957,499 British + 1,340,581 American)
- For distribution of both files and words in spreadsheet form see under Documentation
The word counts tabulated above refer to the XML and plain text versions available at consortium universities. Counts in online versions can be up to 15% higher because they measure tokens rather than words: thus the single words don’t, she’s, we’ve, they’d and John’s each count as two tokens, since the contracted negative, contracted verbs and the possessive suffix ‘s are counted as tokens for tagging purposes. Punctuation marks are tokens too. Note that the online software search engines makes it easy to see normalised frequencies overall and by genre.
The master-copy of the ARCHER corpus is now fully XML-compliant (Zurich) and with TEI-compliant headers (Zurich/Manchester). The following versions are available at consortium universities: XML (UTF-8 character set) and plain text (Latin-1). The version of the XML files currently online at Lancaster can be searched with CQPweb software. Coming soon is a CQPweb version tagged with the CLAWS-6 tagset (Lancaster/Leicester) after normalisation of spelling with Alistair Baron’s VARD 2 program (Lancaster/Freiburg). Another online version, VARDed and tagged with the Treebank tagset and parsed with a dependency parser (Zurich), is expected to follow shortly afterwards.
ARCHER is essentially an original-spelling corpus, albeit the spelling of published editions. All versions share the same text and bibliographic and non-linguistic mark-up, but future online versions at Lancaster and at Zurich will differ in their linguistic mark-up.
There has been a great deal of work to improve the accuracy of the corpus text, the bibliographic documentation (Manchester) and the consistency of file structure and mark-up (Manchester/Zurich). Some files up to version 3.1 had sporadically normalised spellings, with the spelling as it appeared in the edition inserted in a comment; in version 3.2 it is always the edition’s spelling which appears in the main text. We have attempted to mark speakers in fiction and characters in drama consistently (Zurich).
Compilation of new texts
Some text from versions 1 and 2 which had to be excluded from version 3.1 have now been restored, and consortium members have contributed new texts to help fill remaining gaps in coverage in national varieties, genres and periods. Note that some letters are included in 3.2 from writers who are not published authors. Team locations specify where the bulk of the work was carried out, even if members have since moved.
|British||early prose||1600-49||Freiburg (v2)|
|British||letters||1650-99, 1700-49, 1750-99||Uppsala/Helsinki/Freiburg|
|American||drama||1800-49, 1900-49||Uppsala/Helsinki (v2)|
|American||fiction||1800-49, 1900-49||Uppsala/Helsinki (v2)|
|American||legal||1750-99||NAU & USC (v1)|
|American||news||1800-49, 1900-49||Uppsala/Helsinki (v2)|
|American||science||1800-49, 1900-49, 1950-99||Michigan|
|American||sermons||1750-99, 1850-99, 1950-99||Zurich|
The following consortium teams have made further contributions:
- Freiburg: VARDing
- Lancaster: VARDing, POS-tagging, CQPweb
- Leicester: POS-tagging
- Manchester: database management; documentation; text revision and annotation; design of TEI headers
- Trier: word count and word list scripts
- Zurich: XML-compliant format in text and headers; consistency of mark-up; tagged and parsed version
The ARCHER team would like to gratefully acknowledge the contribution of former research members: Marianne Hundt (Zurich, until November 2013), Arja Nurmi (Helsinki, until July 2012), Anna Rosen (Bamberg, until October 2011), Richard W. Bailey (Michigan, †2011), Chris Palmer (Michigan, until 2010). We are also very grateful to various research assistants at the consortium departments: Ole Schützler at Bamberg; Michael Percillier at Freiburg; Tuula Chezek at Helsinki; Lauri Hiltunen and Marije van Hattum at Manchester; Taryn Hakala at Michigan; assistants in the VLCG research group at Santiago de Compostela under the coordination of Paula Rodríguez-Puente; Moira Kindlimann and Pius Meyer at Zurich; Melanie Röthlisberger at Manchester and Zurich.
The project is continuing with the aim of further improving the corpus.
- Fiction texts will have mark-up for individual speakers and to distinguish direct speech from narration (Santiago).
- We are exploring the possibility of a parallel-text version in which users could switch between original spelling and normalised, present-day spelling.
There are some natural limitations to material available in some genres in early periods, but two gaps in American coverage should be filled in version 3.3, as shown:
|British||all except d, p, l||1600-49||diaries and journals at Manchester|
Manchester and Bamberg
Genres: a = advertising, d = drama, f = fiction, h = sermons, j = journals, l = legal, m = medicine, n = news, p = early prose, s = science, x = letters, y = diaries.
We are compiling a list of errors in version 3.2 for correction in ARCHER 3.3. Please notify us of possible items by an email to firstname.lastname@example.org with the subject header ‘ARCHER text correction’.