
Data Mining Historical Newspaper Metadata

Old news teaches history


Newspapers from the collections of European digital libraries are part of the dataset processed with OLR (Optical Layout Recognition) by the Europeana Newspapers project (www.europeana-newspapers.eu). The OLR refinement (performed by CCS) describes the structure of each issue and of its articles (spatial extent, title and subtitle, classification of content types) in METS/ALTO format.

A set of bibliographic, descriptive and quantitative metadata relating to content and layout (date of publication, number of pages, articles, words, illustrations, etc.) is derived from each digital document. XSLT or Perl scripts extract these metadata from the METS manifest and the OCR files.
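To illustrate this kind of extraction, here is a minimal Python sketch. It is not the project's code (the project uses XSLT or Perl); it only shows how per-page counts could be read from one ALTO OCR file. It relies on the standard ALTO element names Page, TextBlock, String and Illustration. The function name page_metadata and the sample document are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix so any ALTO schema version matches."""
    return tag.rsplit("}", 1)[-1]


def page_metadata(alto_xml: str) -> dict:
    """Count words, text blocks and illustrations in one ALTO page.

    Hypothetical helper: each ALTO <String> is counted as one word,
    which is an approximation (hyphenated words span two Strings).
    """
    root = ET.fromstring(alto_xml)
    result = {"words": 0, "text_blocks": 0, "illustrations": 0,
              "width": None, "height": None}
    for elem in root.iter():
        name = local_name(elem.tag)
        if name == "String":
            result["words"] += 1
        elif name == "TextBlock":
            result["text_blocks"] += 1
        elif name == "Illustration":
            result["illustrations"] += 1
        elif name == "Page":
            # WIDTH and HEIGHT are optional attributes in ALTO.
            if elem.get("WIDTH") is not None:
                result["width"] = float(elem.get("WIDTH"))
            if elem.get("HEIGHT") is not None:
                result["height"] = float(elem.get("HEIGHT"))
    return result


# Tiny made-up ALTO page, for demonstration only.
SAMPLE = """
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Old"/><SP/>
            <String CONTENT="news"/><SP/>
            <String CONTENT="teaches"/>
          </TextLine>
        </TextBlock>
        <Illustration ID="IL1"/>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""

if __name__ == "__main__":
    print(page_metadata(SAMPLE))
```

Article-level figures would additionally require the logical structure stored in the METS manifest, which this sketch does not read.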

The BaseX XML database and the XQuery language are then used to query the datasets and produce the charts.
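The kind of aggregation behind a chart such as "average number of pages per issue" can be sketched as follows. This is written in Python rather than XQuery, and the record layout (a dict with "date" and "pages" keys) is a hypothetical simplification, not the actual dataset schema. The sample values are invented for the example.

```python
from collections import defaultdict


def average_pages_per_year(issues):
    """Return {year: mean number of pages per issue}, sorted by year.

    Each issue is assumed (hypothetically) to be a dict with an ISO
    'date' string (YYYY-MM-DD) and an integer 'pages' count.
    """
    totals = defaultdict(lambda: [0, 0])  # year -> [page sum, issue count]
    for issue in issues:
        year = int(issue["date"][:4])
        totals[year][0] += issue["pages"]
        totals[year][1] += 1
    return {year: pages / count
            for year, (pages, count) in sorted(totals.items())}


# Invented records, for demonstration only (not taken from the dataset).
SAMPLE_ISSUES = [
    {"date": "1814-01-01", "pages": 4},
    {"date": "1814-01-02", "pages": 4},
    {"date": "1900-05-01", "pages": 4},
    {"date": "1900-05-02", "pages": 6},
]

if __name__ == "__main__":
    for year, mean in average_pages_per_year(SAMPLE_ISSUES).items():
        print(year, mean)
```

The resulting year-to-average mapping is the shape of series that a charting library expects for a timeline.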


Articles, blogs

Dataset

The complete set of derived data contains about 5,500,000 atomic metadata values from six national and regional French newspapers (1814-1945, 880,000 pages, 150,000 issues) held in the BnF press collections (Gallica, www.gallica.fr):

Download Datasets (147,978 issues) :

Note : the OCRed text of the Europeana Newspapers corpus is also available.

Datasets with illustration caption text

API

Charts

Made with Highcharts and Google Charts.

Page dimensions

Journal des débats politiques et littéraires : Page format (complete dataset, interactive timeline)

Ouest-Eclair (Ed. Nantes) : Page format (complete dataset, interactive timeline)

Number of pages

Average number of pages per issue (timeline)

Average number of pages per issue per title (timeline)

Articles

Average number of articles per issue (timeline)

Average number of articles per page (timeline)

Le Matin : Average number of articles per issue (interactive timeline)

Illustrations

Average number of illustrations per 1,000 pages

Average number of illustrations per page (timeline)

Average number of illustrations per page per title (timeline)

Journal des débats politiques et littéraires : Number of illustrations per issue (complete dataset, interactive timeline)

Front page

Average number of front page illustrations (timeline)

Average number of front page illustrations per title (timeline)

Le Petit Journal illustré :

Le Petit Parisien : Average number of illustrations per page (interactive timeline)

Ouest-Eclair : Number of illustrations on the front page (complete dataset, interactive timeline)

Words

Average number of words per page

Journal des débats politiques et littéraires : Number of words per page (complete dataset, interactive timeline)

Tables

Average number of tables per issue (timeline)

Layout and form factors

Average number of articles, illustrations and illustrations on front page (per page)

Page format; and words, illustrations, ads density (per page)

Number of pages; words, illustrations, ads density (per surface)

Content types

Average number of blocks per issue (timeline):

Data Quality

Issues per year: Whole dataset

Missing issues: Journal des débats politiques et littéraires (calendar)

Timeline

Showcase timeline for the Journal des débats politiques et littéraires

Author

2015, @altomator

Contact : jean-philippe.moreux@bnf.fr

This work has been partly funded through the EU Competitiveness and Innovation Framework Programme grant Europeana Newspapers (Ref. 297380).
