Because digital pages don't turn yellow.
November 17, 2004 12:35 PM
The National Endowment for the Humanities and the Library of Congress are putting 30 million newspaper pages online. The National Digital Newspaper Program "will create a national, digital resource of historically significant newspapers from all the states and U.S. territories published between 1836 and 1922." The goal is to have it done in 20 years; the LOC has a sample up now: The Stars and Stripes from 1918-1919.
Very cool - the research implications alone are fantastic.
depressingrealityhat
Considering how fast our education supports are failing, though, in 20 years our kids will be staring slack-jawed at the pretty pictures while we read and reminisce.
/depressingrealityhat
posted by FormlessOne at 1:37 PM on November 17, 2004
Awesome use of the public domain. I wonder how many great stories are locked away in all those papers.
posted by mathowie at 2:05 PM on November 17, 2004
I'm surprised that OCR works on old newsprint. I wonder how accurate it is? While it's great to have the imaged pages like this, I wish they were also available in HTML.
posted by _sirmissalot_ at 2:18 PM on November 17, 2004
_sirmissalot_, provided the original was printed clearly enough, OCR should work just fine on newsprint. Obviously some pages will be worse than others, and some display type may be difficult to read, especially in ads, but the vast majority should be easily managed.
posted by me3dia at 2:26 PM on November 17, 2004
They say they're primarily scanning microfilm. So I don't think they have to deal with decrepit old yellowed papers.
posted by smackfu at 2:37 PM on November 17, 2004
I was mainly thinking that the type appears pretty inconsistent, plus all of the visual pops and hisses. But obviously the filtering technology is pretty good nowadays. I haven't played around with OCR since the late 90's, and everything had to be *just right* to get a decent result.
posted by _sirmissalot_ at 2:43 PM on November 17, 2004
Now THIS is what government is for.
Expect it to be sold off to a pay-only database soon. Bill Gates, are you paying attention?
posted by rushmc at 4:03 PM on November 17, 2004
This seems to be an outgrowth of the American Newspaper Project -- which oddly does not include the American Newspaper Repository, the famous collection acquired by Nicholson Baker and stored in an old mill, which has been donated to Duke. [gallery, prior discussion]. Let's hope this is a temporary omission.
posted by dhartung at 7:41 PM on November 17, 2004
Very very nice.
Still, the URLs could be improved. For example, I wish the soldier's poems on this page were more directly linkable. I was able to link directly to a poem by peeking at the URL of the image, but this should be more user-friendly.
Also it's yucky to have cgi-bin/np_item.pl? appear in a URL. I hope memory.loc.gov and its contents will be around long after np_item.pl has been junked. Not that it doesn't seem to be a great little program, of course, but it should be more modest and hide behind an appropriate server URL rewrite instruction. Hmm, I guess I should be writing this to them... OK, I just did.
posted by Turtle at 3:30 AM on November 18, 2004
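To make Turtle's URL rewrite suggestion concrete, here is a minimal sketch of the idea: a stable public path is mapped onto the internal script, so the script can later be replaced without breaking links. Only the script path cgi-bin/np_item.pl comes from the thread. The /newspapers/... path layout and the query parameter names (title, date, page) are hypothetical, invented for illustration, and are not the Library of Congress's actual scheme.

```python
import re
from urllib.parse import urlencode

# Hypothetical public URL layout: /newspapers/<title>/<yyyy-mm-dd>/page/<n>
CLEAN_PATH = re.compile(
    r"^/newspapers/(?P<title>[a-z0-9-]+)"
    r"/(?P<date>\d{4}-\d{2}-\d{2})"
    r"/page/(?P<page>\d+)$"
)

# Script path as mentioned in the thread; the query parameters are assumptions.
BACKEND_SCRIPT = "/cgi-bin/np_item.pl"


def rewrite(path):
    """Map a public path to the internal script URL, or None if it does not match."""
    match = CLEAN_PATH.match(path)
    if match is None:
        return None
    # Named groups become query parameters in the order they appear in the pattern.
    return BACKEND_SCRIPT + "?" + urlencode(match.groupdict())


if __name__ == "__main__":
    print(rewrite("/newspapers/stars-and-stripes/1918-02-08/page/1"))
```

In practice this mapping would more likely live in the web server's own rewrite configuration than in application code. The point is only that readers link to the public path, and the script name stays an internal detail.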
For a historian working in flyover country miles and miles from anything like a research library, this will be a godsend!
posted by LarryC at 6:08 AM on November 18, 2004
I hope they hurry up and get to the WW II Stars and Stripes so I can see all the Bill Mauldin cartoons.
posted by alumshubby at 1:45 PM on November 18, 2004
For those of you creaming your panties about doing this on the Internet, this isn't a new idea, except the not charging users for it part. Newspaper Archive. Paper of Record. Proquest Historical Newspapers.
Microfilm isn't necessarily any better. The original source of the film has to be good, the film itself has to be good quality, the developing has to be good, the preservation and care of the film have to be good, the scanning of the film has to be good, and the OCR has to be good.
I spend much of each day in digital archives, and let me tell you: while I know this is a lot of work, it's very, very easy to screw it up, too.
Regarding the Olive software: it blows. I hope they don't use it. It just crashed my browser, which is not unusual with it. The short list of problems basically revolves around its overdependence on JavaScript, meaning that it behaves oddly in some browsers and that it requires an unusual amount of client-side processor power. Also, the Olive software is not well-designed for the researcher: it seems more useful to the person who wants one page than to the person who wants 1000.
Also, the two sites I use regularly that use the Olive software--the University of Missouri archive and The Brooklyn Eagle--do not offer the resulting pages in PDF form, much less as searchable PDFs (the ideal product of this sort of digital archive).
There is a single great role model for all this, and I hope they emulate it: The Making of America at the Univ. of Michigan, NOT the Cornell one. It offers good sources, the ability to view results chronologically, and different results formats (including the actual OCR text, PDF, and graphics at multiple sizes). Probably next on my list would be Proquest Historical Newspapers (including the APS set), second mainly because its PDFs are not searchable.
posted by Mo Nickels at 2:39 PM on November 18, 2004
It's too bad we don't have the original copyright laws. Everything up to 1979 would be public domain and could be fodder for this kind of project.
posted by Mitheral at 2:43 PM on November 18, 2004
posted by das_2099 at 12:52 PM on November 17, 2004