Newspapers from the collections of European digital libraries are part of the data set processed with OLR (Optical Layout Recognition) by the Europeana Newspapers project (www.europeana-newspapers.eu). The OLR refinement, performed by CCS, describes the structure of each issue and of its articles (spatial extent, title and subtitle, classification of content types) in the METS/ALTO format.
A set of bibliographical, descriptive and quantitative metadata relating to content and layout (date of publication, number of pages, articles, words, illustrations, etc.) is derived from each digital document. XSLT or Perl scripts extract these metadata from the METS manifest and the OCR files.
The BaseX XML database and the XQuery language are then used to query the datasets and produce the graphs.
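The extraction step described above can be illustrated with a minimal sketch. It is written in Python rather than the XSLT or Perl actually used by the project, and it only assumes the standard ALTO element names (`Page`, `TextBlock`, `String`, `Illustration`). The function name `page_metadata`, the helper `local_name` and the sample page are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def page_metadata(alto_xml: str) -> dict:
    """Count words, text blocks and illustrations in one ALTO page.

    Hypothetical helper: the project's own scripts are XSLT or Perl.
    """
    root = ET.fromstring(alto_xml)
    counts = {"words": 0, "text_blocks": 0, "illustrations": 0}
    width = height = None
    for element in root.iter():
        name = local_name(element.tag)
        if name == "String":          # one String element per OCRed word
            counts["words"] += 1
        elif name == "TextBlock":
            counts["text_blocks"] += 1
        elif name == "Illustration":
            counts["illustrations"] += 1
        elif name == "Page":          # page dimensions, kept as raw strings
            width = element.get("WIDTH")
            height = element.get("HEIGHT")
    return {**counts, "width": width, "height": height}


# Invented sample page, not taken from the corpus.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="B1">
          <TextLine>
            <String CONTENT="Le"/><SP/><String CONTENT="Matin"/>
          </TextLine>
        </TextBlock>
        <Illustration ID="I1"/>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""

print(page_metadata(SAMPLE))
```

Matching on the local tag name keeps the sketch independent of the ALTO schema version, since the namespace URI differs between versions.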
Articles, blogs
- “Mining, Visualising and Analysing Historical Newspaper Data: the French National Library Experience” (presentation). Digital Approach towards serial publications, Ghent Centre for Digital Humanities (GhentCDH) (Brussels, September 2017)
- “Innovative Approaches of Historical Newspapers: Data Mining, Data Visualization, Semantic Enrichment” (article), presentation. IFLA News Media section (Lexington, August 2016)
- “Data Mining Historical Newspapers Metadata” (article), presentation. IFLA News Media section 2016 (Hamburg, April 2016)
- “Data Mining Historical Newspapers Metadata” (poster). Document Analysis Systems (Santorini, April 2016)
- Blog posts (in French): 1, 2
Dataset
The complete set of derived data contains about 5,500,000 atomic metadata values from six national and regional French newspaper titles, one of them in two local editions (1814-1945, 880,000 pages, 150,000 issues), from the BnF press collections (Gallica, www.gallica.fr):
- Le Matin: see on Gallica
- Le Gaulois: see on Gallica
- Le Petit journal illustré: see on Gallica
- Le Journal des débats politiques et littéraires: see on Gallica
- Le Petit Parisien: see on Gallica
- L’Ouest-Eclair (Rennes): see on Gallica
- L’Ouest-Eclair (Nantes): see on Gallica
Download datasets (147,978 issues):
- Le Matin (1884-1942, 21,846 issues): CSV / XML / JSON
- Le Gaulois (1868-1929, 21,241 issues): CSV / XML / JSON
- Le Petit journal illustré, supplément du dimanche (1884-1920, 1,899 issues): CSV / XML / JSON
- Le Journal des débats politiques et littéraires (1814-1944, 45,334 issues): CSV / XML / JSON
- Le Petit Parisien (1876-1944, 23,168 issues): CSV / XML / JSON
- L’Ouest-Eclair, Rennes (1899-1942, 25,108 issues): CSV / XML / JSON
- L’Ouest-Eclair, Nantes (1915-1942, 9,382 issues): CSV / XML / JSON
Note: the OCRed text of the Europeana Newspapers corpus is also available.
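As a rough illustration of how the per-issue CSV files can feed the timelines listed further down, the following sketch computes the average number of pages per issue for each year. The column names `date` and `pages` are assumptions made for this example, as is the function name `average_pages_per_year`; check the header of the downloaded file and adapt them. The sample rows are invented.

```python
import csv
import io
from collections import defaultdict


def average_pages_per_year(csv_file) -> dict:
    """Return {year: average page count per issue}.

    Assumes one row per issue with hypothetical columns
    'date' (YYYY-MM-DD) and 'pages' (integer).
    """
    totals = defaultdict(lambda: [0, 0])  # year -> [page total, issue count]
    for row in csv.DictReader(csv_file):
        year = int(row["date"][:4])
        totals[year][0] += int(row["pages"])
        totals[year][1] += 1
    return {year: pages / issues for year, (pages, issues) in sorted(totals.items())}


# Invented sample data standing in for a downloaded CSV file.
SAMPLE = io.StringIO(
    "date,pages\n"
    "1900-01-01,4\n"
    "1900-01-02,6\n"
    "1901-01-01,8\n"
)
print(average_pages_per_year(SAMPLE))
```

With a real file, pass an open file object instead of the in-memory sample, for example `open("issues.csv", newline="", encoding="utf-8")`, where the file name is again only a placeholder.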
Datasets with illustration caption text
- Le Matin: XML
- Le Gaulois: XML
- Le Petit journal illustré, supplément du dimanche: XML
- Le Journal des débats politiques et littéraires: XML
- Le Petit Parisien: XML
- L’Ouest-Eclair, Rennes: XML
API
- Illustration search in the datasets: see the GitHub repository to try the XQuery HTTP APIs, which run on BaseX (XML database engine and XPath/XQuery processor)
Charts
Made with Highcharts and Google Charts.
Page dimensions
Journal des débats politiques et littéraires: Page format (complete dataset, interactive timeline)
Ouest-Eclair (Ed. Nantes): Page format (complete dataset, interactive timeline)
Number of pages
Average number of pages per issue (timeline)
Average number of pages per issue per title (timeline)
Articles
Average number of articles per issue (timeline)
Average number of articles per page (timeline)
Le Matin: Average number of articles per issue (interactive timeline)
Illustrations
Average number of illustrations per 1,000 pages
Average number of illustrations per page (timeline)
Average number of illustrations per page per title (timeline)
Journal des débats politiques et littéraires: Number of illustrations per issue (complete dataset, interactive timeline)
Front page
Average number of front page illustrations (timeline)
Average number of front page illustrations per title (timeline)
Le Petit Journal illustré:
- Average number of illustrations on the front page (interactive timeline)
- Number of illustrations on the front page (complete dataset, interactive timeline)
Le Petit Parisien: Average number of illustrations per page (interactive timeline)
Ouest-Eclair: Number of illustrations on the front page (complete dataset, interactive timeline)
Words
Average number of words per page
Journal des débats politiques et littéraires: Number of words per page (complete dataset, interactive timeline)
Tables
Average number of tables per issue (timeline)
Layout and form factors
Average number of articles, illustrations and front page illustrations (per page)
Page format; word, illustration and ad density (per page)
Number of pages; word, illustration and ad density (per unit of surface)
Content types
Average number of blocks per issue (timeline):
- Le Matin
- Le Gaulois
- Le Petit journal illustré
- Le Journal des débats politiques et littéraires
- Le Petit Parisien
- L’Ouest-Eclair (Rennes)
Data Quality
Issues per year: Whole dataset
Missing issues: Journal des débats politiques et littéraires (calendar)
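A missing-issue check of this kind can be sketched as follows. The sketch assumes a title that appears every day, which is a simplification: the actual publication rhythm of each newspaper is not stated here, and this is not necessarily how the calendar above was built. The function name `missing_dates` and the sample dates are hypothetical.

```python
from datetime import date, timedelta


def missing_dates(issue_dates, start, end):
    """List the days between start and end (inclusive) that have no issue.

    Assumes daily publication; titles with rest days would need a
    calendar of expected publication dates instead.
    """
    published = set(issue_dates)
    day_count = (end - start).days + 1
    all_days = (start + timedelta(days=offset) for offset in range(day_count))
    return [day for day in all_days if day not in published]


# Invented sample: issues exist for 1, 2 and 4 January 1900.
issues = [date(1900, 1, 1), date(1900, 1, 2), date(1900, 1, 4)]
print(missing_dates(issues, date(1900, 1, 1), date(1900, 1, 5)))
```

Counting the returned dates per year gives a simple completeness indicator that can be read alongside the number of issues per year.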
Timeline
Showcase timeline for the Journal des débats politiques et littéraires
Author
2015, @altomator
Contact: jean-philippe.moreux@bnf.fr
This work has been part-funded through the EU Competitiveness and Innovation Framework Programme grant Europeana Newspapers (Ref. 297380).