Persée Triplestore



Persée is a national program for the digitization and dissemination of scholarly document collections. Its portal (www.persee.fr) offers more than 200 journal collections published from 1820 to the present day. The portal supports searching part of the metadata and the text of more than 600,000 documents. The search interfaces rely on three indexes (documents, illustrations and authors) and offer sorting, facets and other features, but they do not allow complex queries, cross-referencing of the results of separate queries, etc.

However, the requests received by the Persée team show that a growing number of researchers no longer consider Persée only as a library to retrieve documents and build a bibliography, but as a research corpus in its own right.

To be viable, this way of looking at Persée content requires more elaborate browsing and query tools than those available on the portal. Semantic web concepts and technologies provide an answer to this need.

How is the Persée triplestore built?

What data?

What distinguishes Persée from other sites providing scholarly documents is the nature of its content (mainly exhaustive collections of journals) and its structure (collection, volume, issue). It is therefore not a warehouse of reprints, but an organized library containing collections selected according to their subjects, scientific value, etc. The other particularity of Persée is the historical depth of its contents: some of our collections began to appear in the middle of the 19th century. To enable the reader to interpret each document correctly, it is essential to situate it in the historical context of its initial publication: the scientific knowledge of the time, the vocabulary used to describe it, and the practices and means of exchange between scientists, all of which have changed over time (for example, evolution is not discussed in the same way today as at the end of the 19th century, when Darwinism appeared). Finally, the scientific context of publication is also essential: many Persée collections offer thematic issues whose preparation was entrusted to a scientific editor. Thus, the same historian, a specialist in the Canuts revolt in Lyon, may produce totally different documents depending on the publication concerned: a journal of general history, industrial history or the history of trade unionism, or a journal of sociology, economics or political science. Without information on this initial “perspective”, the reader will not be able to approach the document in an informed way.

Before its contexts of dissemination (library) and publication, each document is the result of a given scientific context of production. The author, who has, at a given moment, knowledge in a particular field, supported by his readings, has taken up a research subject on which he presents his own thoughts. Thus, the state of knowledge on the research object at the time of writing, the author’s own bibliography, his scientific collaborations, the bibliographical references he highlights, etc., are all pieces of information that enable the reader to understand each document correctly.

In the triplestore proposed by Persée, these different elements of knowledge have been described and added to the traditional bibliographic description (title, author, abstract, etc.) of each document. As a result, they can also be queried by the user.

How to represent the data, and with which tools

The information that can be described in the Persée triplestore is therefore very varied in nature. To express it in RDF, we used several pre-existing vocabularies, each of which allows us to express one type of data precisely, in a form that can be used within our triplestore and reused by others.

To represent the contents of a library hosting digital reproductions of documents initially published (mainly) in paper form, the formalism proposed by the FRBR (Functional Requirements for Bibliographic Records) model seemed appropriate. It distinguishes between the notions (and their specific characteristics) of abstract creation (work), intellectual content (expression), publication (manifestation) and copy (item). It thus makes it possible to describe precisely the status of the objects offered by Persée; for example, an online article is an electronic manifestation, produced by Persée in 2015 by digitising a printed manifestation, published in 1891, in a particular journal, by a particular publisher, etc.
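The FRBR levels mentioned above can be pictured with a small data structure. This is only an illustrative sketch, not the Persée data model: the class and field names (`carrier`, `producer`, `digitised_from`) and the sample values are assumptions made for the example, and the item level is left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Work:
    """Abstract creation (FRBR 'work')."""
    title: str


@dataclass(frozen=True)
class Expression:
    """Intellectual content realising a work (FRBR 'expression')."""
    work: Work
    language: str


@dataclass(frozen=True)
class Manifestation:
    """A publication embodying an expression (FRBR 'manifestation')."""
    expression: Expression
    carrier: str      # hypothetical field: "print" or "electronic"
    year: int
    producer: str
    # Hypothetical link from a digital reproduction to its printed source.
    digitised_from: Optional["Manifestation"] = None


# Sample values are invented; only the two dates come from the text above.
work = Work("An example article")
text = Expression(work, language="fr")
printed = Manifestation(text, "print", 1891, "original publisher")
online = Manifestation(text, "electronic", 2015, "Persée",
                       digitised_from=printed)

# Both manifestations carry the same intellectual content.
print(online.expression is printed.expression)
```

The point of the separation is that the printed and online versions share one expression while keeping their own dates and producers.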

The vocabulary proposed by the DCMI (Dublin Core Metadata Initiative) is the most widely used model for describing documents, which is why we chose it.

To describe individuals and the relationships between them, we chose the FOAF (Friend of a Friend) vocabulary, which is also widely used.

To describe the links between documents (citation, analysis or review, text sequence, response…), we used the CiTO vocabulary (Citation Typing Ontology).

To describe the nature of the documents, the BIBO (Bibliographic Ontology) vocabulary was used, supplemented by a “homemade” vocabulary that makes it possible to describe accurately the typology of the documents offered by Persée (a documentation page about persee-ontology.owl will be available soon).

Finally, to express the concepts discussed in the documents, the SKOS (Simple Knowledge Organization System) vocabulary was used. When these concepts are described elsewhere, we have reused the models proposed by specialists in the field and have established alignments between Persée content and the reference repositories used in each community, in order to allow users of our triplestore to “bounce” to other sources of information. Some of these alignments have been the subject of specific procedures, which are described in the following section.
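To make the combination of vocabularies more concrete, here is a minimal sketch in plain Python that describes one document with terms from the vocabularies listed above and serialises the result as N-Triples. The namespace URIs are the ones published for each vocabulary; the resource identifiers under `http://example.org/`, the sample values, and the choice of properties are assumptions for illustration and do not reproduce the actual Persée model.

```python
# Namespace URIs of the vocabularies named in the text.
NAMESPACES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "cito": "http://purl.org/spar/cito/",
    "bibo": "http://purl.org/ontology/bibo/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}


def expand(curie):
    """Turn a prefixed name such as 'foaf:name' into a full URI."""
    prefix, local = curie.split(":", 1)
    return NAMESPACES[prefix] + local


def uri(value):
    return ("uri", value)


def lit(value):
    return ("literal", value)


# Hypothetical identifiers, not real Persée URIs.
BASE = "http://example.org/"
article = BASE + "doc/article-1"
cited = BASE + "doc/article-2"
author = BASE + "person/author-1"
concept = BASE + "concept/silk-industry"

TRIPLES = [
    (article, "rdf:type", uri(expand("bibo:Article"))),    # nature (BIBO)
    (article, "dcterms:title", lit("An example article")),  # description (DCMI)
    (article, "dcterms:creator", uri(author)),
    (article, "dcterms:subject", uri(concept)),
    (article, "cito:cites", uri(cited)),                    # document link (CiTO)
    (author, "rdf:type", uri(expand("foaf:Person"))),       # individual (FOAF)
    (author, "foaf:name", lit("Jane Example")),
    (concept, "rdf:type", uri(expand("skos:Concept"))),     # concept (SKOS)
    (concept, "skos:prefLabel", lit("Silk industry")),
]


def to_ntriples(triples):
    """Serialise (subject, predicate, object) tuples as N-Triples lines."""
    lines = []
    for subject, predicate, (kind, value) in triples:
        if kind == "uri":
            obj = f"<{value}>"
        else:
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"'
        lines.append(f"<{subject}> <{expand(predicate)}> {obj} .")
    return "\n".join(lines)


print(to_ntriples(TRIPLES))
```

Each vocabulary contributes only the kind of statement it is designed for, which is what allows the same description to be reused by others.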

A first model of our data, based on a detailed use of these different vocabularies, was implemented and presented to a panel of users (mostly the same researchers who had expressed the need to perform complex extractions of Persée data). Although this representation was rich, it was very (too) dense and, to be usable, presupposed a solid understanding of RDF, of its SPARQL query language, and of each of the vocabularies used.

These requirements are an obstacle for many potential users of the triplestore. To overcome this complexity, three strategies have been developed in parallel to simplify data modelling and the tools available to use the data.

  • The data model we are proposing has been streamlined and refocused on describing the actual resources on the Persée portal (for example, there is no longer any notion of work, expression, etc.).
  • The Sparklis tool, developed by Sébastien Ferré of IRISA, was selected to enable natural-language querying of the triplestore, and numerous tutorials were created to further facilitate its use.
  • Several tools to visualize the extracted data have also been selected to facilitate the exploitation of query results.

From data to linked data

Persée contents reflect a state of knowledge at a given moment, sometimes a distant one. But knowledge is not fixed: it is therefore necessary to be able to bounce from a Persée resource to current knowledge and to other information on the same subject or by the same author. For this reason, the Persée team strives to open up its content and establish links between the documents it has been entrusted with and other information systems. To date, three data sets have been processed to establish such links:

  • the authors,
  • the species names for the SVT (life and earth sciences) collections,
  • the monuments of Cairo for the collection of the Bulletins of the Comité de Conservation des Monuments de l’Art Arabe.

The SVT collections and the collection of the Bulletins of the Comité de Conservation des Monuments de l’Art Arabe will soon be added to the triplestore and the portal, in the form of “mash-up” pages that will compile information on these resources, harvested both from the Persée triplestore and from the triplestores of partner organisations.

Alignment with IdRef – A mutual enhancement

The author is a fairly natural starting point for bouncing from a document to other resources. At the global level, several international repositories exist (VIAF, ISNI, ORCID, etc.). They are maintained by different entities, meet different objectives and offer complementary services, and most of them are synchronized with one another. In France, two national repositories coexist and synchronize the data they share. One is maintained by the Bibliothèque nationale de France, the other, IdRef, by Abes (Agence bibliographique de l’enseignement supérieur). We preferred the second because its coverage is closer to the population of Persée authors.

Two alignment procedures have been implemented:

  • The first is manual: each time a new author appears during the processing of a collection, IdRef is queried. In response, it provides a list of candidates described by their surname, first name, dates of birth and death, and references of the works attributed to this author (in the Sudoc national catalogue). When the librarian can identify the author with certainty, he or she selects the right candidate, thus establishing a link.
  • The second is automatic: as part of its Qualinca project (Quality and interoperability of major documentary catalogues), Abes has set up tools to compute alignments. An extract of the data produced by Persée (authors, lists of contributions) is thus compared with the content managed by Abes to produce new alignments.
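The comparison of an author with a list of candidates can be sketched as follows. This is a deliberately simple illustration under stated assumptions: the `Person` structure, the scoring weights and the sample data are all hypothetical, the code does not call IdRef, and it is not the algorithm used by Qualinca. It only orders candidates; the decision to create a link stays with the librarian.

```python
import unicodedata
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Person:
    """Hypothetical record holding the fields listed in the text."""
    surname: str
    first_name: str
    birth_year: Optional[int] = None
    death_year: Optional[int] = None
    works: Set[str] = field(default_factory=set)


def normalize(text):
    """Strip accents, case and surrounding spaces before comparing."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_accents.casefold().strip()


def score(author, candidate):
    """Count agreeing fields; the weights are arbitrary assumptions."""
    points = 0
    if normalize(author.surname) == normalize(candidate.surname):
        points += 2
    if normalize(author.first_name) == normalize(candidate.first_name):
        points += 1
    for attr in ("birth_year", "death_year"):
        known = getattr(author, attr)
        if known is not None and known == getattr(candidate, attr):
            points += 1
    shared = ({normalize(w) for w in author.works}
              & {normalize(w) for w in candidate.works})
    return points + len(shared)


def rank(author, candidates):
    """Return candidates from most to least similar to the author."""
    return sorted(candidates, key=lambda c: score(author, c), reverse=True)


# Invented sample data.
author = Person("Dupont", "Émile", 1850, None, {"Histoire de la soie"})
candidates = [
    Person("Dupond", "Émile", 1850),
    Person("Dupont", "Emile", 1850, 1920, {"Histoire de la soie"}),
]

for candidate in rank(author, candidates):
    print(score(author, candidate), candidate.surname, candidate.first_name)
```

A real workflow would also have to handle homonyms and missing dates, which is precisely why the text keeps a human in the loop.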

Beyond linking the records that describe people in each of our information systems, the partnership with Abes aims at mutual data improvement. Whether the alignment is manual or automatic, any inconsistency or conflict is reported and analysed. The data in one or the other of the two reference files are then completed and/or corrected.

For users of our “classic” websites, this alignment work is put to use immediately: starting from an author record retrieved on the Persée portal, the user can access the works authored by an individual, locate them in a library, identify the theses that he or she has supervised or evaluated, etc.

Conversely, from the Abes sites, the user will be able to retrieve the list of articles written by each of the individuals whose name is included in the alignment.

Thanks to federated search (querying several information systems at the same time), this alignment makes it possible for users of the Persée triplestore to query Persée data and data available on other sites in a single query.
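As an illustration of federated search, the sketch below builds a SPARQL 1.1 query that uses the standard `SERVICE` keyword to delegate part of the pattern to a second endpoint. The endpoint URL, the use of `owl:sameAs` as the alignment property and the properties queried on the remote side are assumptions for the example; the actual properties exposed by Persée and its partners may differ.

```python
from string import Template

# Query skeleton; $endpoint and $limit are filled in by the function below.
QUERY = Template("""\
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT ?document ?title ?remoteName WHERE {
  # Local part: documents and their authors in the queried triplestore.
  ?document dcterms:creator ?author ;
            dcterms:title ?title .
  # Alignment link to the partner record (owl:sameAs is an assumption).
  ?author owl:sameAs ?remoteAuthor .
  # Remote part: evaluated by the partner's SPARQL endpoint.
  SERVICE <$endpoint> {
    ?remoteAuthor foaf:name ?remoteName .
  }
}
LIMIT $limit
""")

# Characters that are not allowed inside an IRI reference in SPARQL.
FORBIDDEN = set('<>"{}|\\^` ')


def build_federated_query(endpoint, limit=10):
    """Return the query text for a given remote endpoint (hypothetical URL)."""
    if FORBIDDEN & set(endpoint):
        raise ValueError("endpoint is not a valid IRI")
    return QUERY.substitute(endpoint=endpoint, limit=int(limit))


print(build_federated_query("http://example.org/sparql"))
```

The resulting text would be sent to the local SPARQL endpoint, which contacts the remote one itself; the user writes and runs one query only.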

This first Persée/IdRef alignment is also used by the Persée team to establish new links to other repositories (BnF, ISNI, VIAF, Wikipedia, etc.), thus enhancing the Persée information system and improving its consistency since, here again, anomalies are analysed, reported to the institutions maintaining these repositories, and corrected.

Quality management (exhaustiveness vs. confidence/reliability?)

The three pilot projects (the alignment of Persée authors with IdRef, the alignment of the SVT collections with GBIF, and the monuments of Cairo) are very different in scope and objectives:

  • The ATHAR alignment (monuments of Cairo) has a limited scope, but in-depth work is carried out, mainly by the scientific partner responsible for the collection.
  • The alignment with IdRef has a very broad scope but is based on an active partnership and on each team’s in-depth understanding of the data to be processed (personal authority records).
  • The alignment with GBIF also has a very broad scope but, in its initial phase, it was not based on an established partnership or on specific life-science skills within the team.

For the links established between Persée resources (citations, text sequences, responses, etc.), here again, algorithms look for candidate links that are then validated (or not) by the Persée team.

In each of these cases, although the linking decision is prepared with computer-assisted tools, it remains the responsibility of a human operator (a scientist in the field or a specialist in scientific and technical information, IST).

There is a strong temptation to offer links in every direction as soon as one resource looks similar to another. However, in a context where data are widely exposed, repeated, cross-referenced, enriched and redistributed, posting unreliable information online can become a major source of noise and undermine the trust placed in its source. It was therefore decided that Persée would use verified links only.
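The policy described above (algorithms propose, a human validates, only validated links are used) can be expressed as a small filter. The `Link` structure, the status names and the sample identifiers are assumptions for illustration, not the Persée implementation.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Status(Enum):
    CANDIDATE = "candidate"   # proposed by an algorithm, not yet reviewed
    VALIDATED = "validated"   # confirmed by a human operator
    REJECTED = "rejected"     # discarded by a human operator


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    relation: str
    status: Status = Status.CANDIDATE


def review(link, accepted):
    """Record the human decision on a candidate link."""
    new_status = Status.VALIDATED if accepted else Status.REJECTED
    return replace(link, status=new_status)


def publishable(links):
    """Keep only links that a human operator has validated."""
    return [link for link in links if link.status is Status.VALIDATED]


# Hypothetical candidate links produced by an algorithm.
candidates = [
    Link("doc:1", "doc:2", "cites"),
    Link("doc:1", "doc:3", "cites"),
    Link("doc:4", "doc:5", "continues"),
]

# Only the first two have been reviewed; the third is still pending.
reviewed = [
    review(candidates[0], accepted=True),
    review(candidates[1], accepted=False),
    candidates[2],
]

for link in publishable(reviewed):
    print(link.source, link.relation, link.target)
```

Unreviewed candidates are treated like rejected ones for publication purposes, so nothing reaches users without a human decision.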