Poe in Cyberspace

Edgar Allan Poe Review: Fall 2000
Nuclear arms and Poe e-texts: The matter of verification

The re-issue in electronic form of standard Poe collections is always welcome. The first desideratum is that the electronic text correspond page for page to the printed original so that the scholar can cite the reliable printed text while working with the versatile electronic text. Furthermore, the printed and electronic editions should correspond word for word to such an extent that operations with the electronic text will pertain with equal validity to the printed original. In looking at three recent electronic Poe texts based on printed editions we will encounter important differences in the degree to which these aims have been met.

Image scanning of historical Poe editions from the pre-copyright era is no longer a rarity. Electronic editors have worked with the Harrison, "Raven," Whitty, and other editions in the public domain. Regrettably, the Mabbott-Pollin edition, still in copyright, is not available in electronic form. But one limitation of image scans, which are pictures, is that they are not easily converted into electronic texts, which are made of words. Despite improvements in optical character recognition (OCR) and spell checking software, only 98% to 99% of such page images can be converted reliably into text by computer methods. The remaining 1% or 2% requires laborious human checking and editing to achieve true verification, the area in which electronic texts vary most in quality.

Three new Poe e-texts:

Three significant new Poe e-texts based on historical editions have appeared recently. In alphabetical order, they are:

  1. The Early American Fiction archive of the University of Virginia and Chadwyck-Healey, containing works of many American authors before 1850, including electronic versions of all of Poe's fiction published in book form during his lifetime. Although the full series is available only by subscription, one sample work has been posted on the internet, Tales of the Grotesque and Arabesque (2 vols., 1840).

  2. The Geodesic Library CD-ROM of the Virginia edition of James A. Harrison (17 vols., 1902).

  3. Project Gutenberg's electronic version of the "Raven" edition (5 vols., 1903).

For testing, I selected a text common to all three electronic editions, "The Fall of the House of Usher," a standard and perhaps representative Poe work. Two of these versions offer page images and page numbers. Because each of these three electronic texts was based on a different printed edition (1840, 1902, and 1903), and each text was produced in a different way, we should be prepared to encounter such accidentals as differences in spelling, punctuation, capitalization, and italics. Unfortunately, several formerly standard ingredients in the typesetter's case now cause trouble for computers, particularly the 3em dash, the curved quotation and possessive marks, and diphthongs. In addition, accented characters in foreign languages and English verse may be handled differently in the ASCII alphabet for word processors and the HTML alphabet for web browsers.

Early American Fiction

1. The Early American Fiction edition of "The Fall of the House of Usher" (1840) is visually sumptuous. The images will take a while to download, but the result is worth it: the pages are of such high quality you almost feel the paper texture (faint ink-through is indeed visible from the reverse of each sheet). The miniatures of these exquisite images also serve as "page-turners" to guide the reader through the accompanying electronic texts displayed alongside.

One of the first editorial decisions to be made in the preparation of any electronic text from a printed version is how to handle the hyphens introduced by the typesetter at end-of-line word breaks. Ordinarily the electronic editor will drop the typesetter's hyphens and restore all full words for the sake of simplicity in searching. In this edition, however, the typesetter's end-of-line hyphens have been retained to match the e-texts to the accompanying page images not word-for-word but rather line-for-line. The odd result, to take one example, is that the search for "melancholy" in this exact 1840 text will skip over the famous instance in the first sentence of the tale because it is divided there by an end-of-line hyphen into two separate fragmented words, "melan" and "choly."
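
The search problem described above can be illustrated with a minimal Python sketch. The two sample lines are an invented line break for the end of the tale's first sentence, not the actual 1840 lineation, and the helper name rejoin_hyphenated is hypothetical. The sketch assumes that every line-final hyphen marks a typesetter's break, so it would wrongly merge a genuine compound that happens to break at the end of a line.

```python
# Hypothetical sample: the line break is illustrative, not the 1840 lineation.
lines = [
    "within view of the melan-",
    "choly House of Usher.",
]

def rejoin_hyphenated(text_lines):
    """Join lines into running text, merging words split by a line-final hyphen."""
    out = ""
    for line in text_lines:
        line = line.strip()
        if out.endswith("-"):
            out = out[:-1] + line      # drop the hyphen and join the two fragments
        elif out:
            out += " " + line          # ordinary line break becomes a space
        else:
            out = line
    return out

line_for_line = "\n".join(lines)          # text kept line-for-line, hyphens retained
word_for_word = rejoin_hyphenated(lines)  # text with full words restored

print("melancholy" in line_for_line)   # False
print("melancholy" in word_for_word)   # True
```

A plain search succeeds only on the second form, which is why most electronic editors restore the full words.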

To begin my study I extracted the electronic text of this edition, ran a spell check, and edited the output, revealing about 80 end-of-line word fragments:

ab, ac, adapta, ap, appella, ar, atmos, atti, binations, ceeding, ceived, choly, chondriac, clusion, col, com, condi, contempla, deavoured, dence, denly, di, dif, dis, discolora, disso, dom, dured, ed, ence, ener, ennuy, evi, ex, ferable, ficient, ficulty, fluence, getic, gipans, gled, gles, horrence, hypo, ick, im, influ, ing, inhe, insuf, irre, lute, lution, meanor, melan, ment, nately, ness, panion, parently, pearances, peated, phere, por, pre, prietor, rangement, sels, sociates, stitions, strug, suf, sur, templation, tention, tinacity, tenuit, tion, tres, treuse, tude, turbed, un, undis, viously, vulge, wordly.

This is hardly a definitive scholarly procedure, and I leave it to others to match up all the word fragment pairs, such as "ab/horrence," "ac/tion," and "adapta/tion." But "turbed" appears in this list without its "per" prefix, because the spell checker will not catch fragments that seem to be ordinary words. Elsewhere, we should be pleased to learn, the EAF text shows painstaking verification: no scanner "glitch" and no error words survive. We must conclude that the hyphen-made word fragments were left there deliberately.
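
The spell-check procedure can be approximated in a few lines of Python. This is only a sketch under stated assumptions: KNOWN_WORDS is a hypothetical miniature word list standing in for a real spell checker's dictionary, the sample text is invented, and the function name unknown_tokens is not from any of the editions reviewed. It shows why "turbed" is reported while its partner "per" is not.

```python
import re

# Hypothetical miniature word list standing in for a spell checker's dictionary.
KNOWN_WORDS = {"the", "of", "house", "usher", "per", "was", "by", "i"}

def unknown_tokens(text, known=KNOWN_WORDS):
    """Return alphabetic tokens not found in the word list, sorted and de-duplicated."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return sorted({t.lower() for t in tokens} - known)

# Invented sample with two end-of-line hyphenations retained.
sample = "the melan-\ncholy House of Usher\nI was per-\nturbed by the House"

flagged = unknown_tokens(sample)
print(flagged)  # ['choly', 'melan', 'turbed']
```

Because "per" is an ordinary word, it passes the check, and only the visual inspection of line ends would reveal it as a fragment.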

Geodesic Library

2. The most promising of the three new electronic editions is the Geodesic Library CD-ROM, which contains the complete 17-volume Virginia edition of James A. Harrison, still widely available in libraries in the AMS reissue and still frequently cited by scholars. The CD-ROM edition contains page images divided into 17 files, each of which corresponds to one volume of the AMS reissue from which the set was made. The pages are viewed with Adobe Acrobat Reader, which produced the electronic text by an internal process of optical character recognition that links the words to the images. (This edition is not available on the internet.)

Using the same procedure as above, I discovered about 150 end-of-line word fragments:

Afri, aban, abstrac, ac, acter, agita, al, ap, arities, asso, atten, bility, ble, bling, ceeded, ceive, cessive, cessory, choly, ciates, cident, cient, cision, com, condi, consid, cordingly, counte, cumbed, dan, dation, deavour, dence, diately, dis, driac, ence, enced, encoun, erable, ered, ery, evi, ex, experi, fre, ger, gish, grat, hu, hypochon, il, im, imme, improvisa, inex, ing, ingly, insta, instinc, ish, itorum, le, lence, lieve, lor, luth, ly, mances, manity, meanour, melan, ment, ments, multuous, nance, narra, nat, ness, nestly, ney, ning, ob, obtru, ofhrs, ofthe, ognisable, ous, parently, pearances, peculi, phan, posititious, pre, pref, pres, pressible, pression, prietor, prog, quent, rec, reria, ress, rhap, riously, ris, ro, sels, servation, si, sipid, sive, sodies, sor, stroyer, suc, suffi, sumed, supersti, tained, tal, tasmagoric, tention, ter, teration, tered, terior, th, til, tion, tions, tious, tive, tre, trem, trepi, tres, tu, tween, ually, un, ure, ures, usu, viously.

In visually inspecting the Harrison text of "Usher," I found about 30 more end-of-line word fragments which no spell checker will detect because they appear to be normal words:

apart, as, be, boy, cases, child, day, enchant, every, feat, fee, feel, grad, ill, mad, man, men, mock, more, over, pro, profound, retreat, rock, slug, stair, sub, sup, titled.

(In two cases the end-of-line hyphenations fell on the normal break in word compounds: "blood-red" and "hollow-sounding.") But the Geodesic Library edition of the tale contains about 40 errors. It leaves uncorrected several accented or foreign words, "~nnuye’, cozur, Erclesi~, LandafX, Maguntind, Vigili&, welLtun6d," and it retains these instances of scanner "glitch":

&riously, 1 [I], B, Chorus [Chorum], cuhar, e, famiIy, familv, fbund, ffoor, g, H, hi de- signs, hqui~, j [;], lefi, Lking, lzim, m, octave [octavo], ofhrs, ofthe, pitiabIe, pecuhar, quart0, T [I], tenderlv, th e, trickIed, ttte, uplified, v, wheti, wiil, and y.

Project Gutenberg

3. Although the Project Gutenberg "Raven" edition has no page images and no page numbers, its electronic text of "The Fall of the House of Usher" is excellent, with almost no substantive errors. The accompanying explanation states that the e-text follows the computer alphabet in the "Windows ISO-8891 or Latin-1 character set," but in fact the accented letters and diphthongs in this e-text require not a Windows word processor but rather a Web browser which can handle its unadvertised HTML encoding.
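
The difference between the two encodings can be shown with a short Python sketch. The sample string is illustrative and is not copied from the Project Gutenberg file; it assumes that an accented letter is stored as a named HTML entity, which a Web browser resolves but a plain word processor displays literally.

```python
import html

html_encoded = "r&eacute;sonne"            # an accented word as HTML might store it
decoded = html.unescape(html_encoded)      # what a Web browser would display
latin1_bytes = decoded.encode("latin-1")   # the same word as Latin-1 bytes

print(decoded)                                # résonne
print(latin1_bytes)                           # b'r\xe9sonne'
print(len(html_encoded), len(latin1_bytes))   # 14 7
```

A reader who opens the HTML-encoded form as if it were Latin-1 text sees the entity spelled out rather than the accented letter.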

In attempting to compare the e-text of the "Raven" edition to its printed original, I could not recall ever seeing a scholarly citation of the "Raven" edition, nor even a copy in common use in the libraries I usually visit. I soon realized that the relation between the "Raven" edition and this e-text was hard to define. This Project Gutenberg "Raven" e-text never quite claims that it actually derives from the printed "Raven" edition:

Project Gutenberg Etexts are usually created from multiple editions, all of which are in the Public Domain in the United States, unless a copyright notice is included. Therefore, we usually do NOT keep any of these books in compliance with any particular paper edition.

I did not pursue my search for the "Raven" edition but wondered from which source or sources, then, this fine electronic edition came. The Project Gutenberg e-text of the tale has two oddities which link it to the Poe Society of Baltimore e-text: in the motto, the French word résonne with an acute accent is misspelled with the grave accent as rèsonne, an error which I find only in these two places, and the text concludes with "~~~ End of Text ~~~," the signature marker of the Baltimore editions. Can any reader of this column clarify this matter further?

Conclusions:

Each of these new texts is useful in Poe study. Ironically, the best verified of these new electronic texts, from the "Raven" edition, is the one which Poe scholars will be least likely to use. The careful preparation of the Early American Fiction edition makes it highly suitable for those who require the 1840 edition (subscribers will have access to additional Poe works in the complete set). Although the Geodesic Library has less than perfect textual verification, its visual representation of the 1902 texts makes it highly useful. I have focussed on the fact that in two e-texts of one Poe tale we find about 100 instances of hyphen-divided words which no normal search can find -- broken into about 200 word fragments which no one knows are there. These fragments inflate the word count by about 3% in the EAF and Geodesic editions (about 7400 words), compared to the "Raven" edition (about 7200 words), a difference significant in certain kinds of research.
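
The "about 3%" figure follows from the approximate word counts given above, as this small Python check shows. The variable names are hypothetical, and the counts are the rounded figures from this column rather than fresh measurements.

```python
raven_words = 7200   # approximate count reported for the "Raven" e-text
split_words = 7400   # approximate count reported for the EAF and Geodesic e-texts

extra = split_words - raven_words        # additional tokens in the hyphen-divided texts
inflation = extra / raven_words * 100    # percentage increase over the "Raven" count

print(extra)                 # 200
print(round(inflation, 1))   # 2.8
```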

These Early American Fiction, Geodesic Library, and Project Gutenberg electronic editions join several historically based Poe e-texts already on the internet: the Tales (1845) at UNC in A Digitized Library of Southern Literature at <http://metalab.unc.edu/docsouth/poe/poe.html>; the Borzoi edition (1946) at Stefan Gmoser's site <http://bau2.uibk.ac.at/sg/poe/>, Concordance.com at <http://www.concordance.com.poe.htm>, and elsewhere; and the extensive collection of Poe e-texts previously mentioned at The Poe Society of Baltimore <http://www.eapoe.org/>.

Interesting theoretical questions are being raised by differences between printed texts and electronic texts. End-of-line hyphenation, never an element in the author's manuscript, is an outstanding example of how a printed text reflects the social conditions of its production. Potentially, an electronic text is able to stand closer to the ur-text, the author's presumed intention. But SGML and XML, the state-of-the-art forms of text encoding, are attempting now to add more and more detailed descriptions to electronic texts, first of the physical appearance of the paper texts from which they are derived, and, second, of the semantic content and organizational structure of those electronic texts, carrying virtuality to higher and higher levels.

Poe E-texts reviewed:

Early American Fiction at http://etext.lib.virginia.edu/eaf/pubbrowse/.

Geodesic Library: Publisher, Mr. Claudie D. Holstein, 23471 Wilshire Court, New Caney, Texas 77357 USA, telephone: (281) 689-1103.

Project Gutenberg general index at http://promo.net/pg/.

Other Poe E-texts cited:

Tales (1845) at UNC in A Digitized Library of Southern Literature at http://metalab.unc.edu/docsouth/poe/poe.html

The Borzoi edition (1946) in Stefan Gmoser's site at http://bau2.uibk.ac.at/sg/poe/, in Concordance.com at http://www.concordance.com.poe.htm , and elsewhere.

The extensive collection of Poe e-texts in The Poe Society of Baltimore at http://www.eapoe.org/

References:

Textual variants: three e-texts of the Harrison edition: (Poe Studies, 1997)
http://andromeda.rutgers.edu/~ehrlich/poevars.html

A new Census of Poe E-Texts on the Internet (EAP Review, Spring 2000)
http://andromeda.rutgers.edu/~ehrlich/psa/psa_s00.html

A Poe Webliography (Poe Studies, 1997; rev. May 2000)
http://andromeda.rutgers.edu/~ehrlich/poesites.html


Previous "Poe in Cyberspace" columns are online at
<http://andromeda.rutgers.edu/~ehrlich/psa>
Heyward Ehrlich
Rutgers University
e-mail: ehrlich@andromeda.rutgers.edu