What is a triplestore?


Data.persee.fr is a triplestore. All right, so what?

A triplestore is a database that contains only RDF triples. Since the word “database” usually suggests a set of tables made of rows and columns, a triplestore, in which all the data are linked together, is more commonly described as a “graph”. RDF (Resource Description Framework) is a graph model designed to formally describe Web resources and their metadata, so that these descriptions can be processed automatically.

What is a triple?

A triple is a descriptive statement, as basic as a description can be. It is composed of three parts: a subject, a predicate (a verb, a relationship) and an object.

A triple is used to describe the world. Any object, material or conceptual, can be described in triples. Try it with what is around you:

The tea bottle is on my desk

The mouse is placed on my desk

The headset is on my desk

The eraser is on my desk

The glasses cleaner cloth is placed on my desk

The tea bottle is blue.

The mouse is black

The headset is black

The eraser is white

The glasses cleaner cloth is blue

The tea bottle weighs 213 g

The mouse weighs 186 g

The headset weighs 248 g

The eraser weighs 19 g

The glasses cleaner cloth weighs 3 g

The tea bottle is made of polycarbonate material

The mouse is made of styrenic copolymer material

The headset is made of styrenic copolymer material

The eraser is made of rubber material

The glasses cleaner cloth is made of polyamide material

The headset contains electronic components

The mouse contains electronic components

The tea bottle contains tea

We would go on like this for a very long time, adding information on the origin, date of manufacture, owners of these objects, etc.
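As a rough illustration, a few of the statements above can be written down as (subject, predicate, object) tuples. This is only a sketch: the predicate names (`is_on`, `color`, `weight_g`, `contains`) and the `describe` helper are invented for the example, and a real triplestore would use URIs rather than short labels.

```python
# Each statement about the desk becomes one (subject, predicate, object) tuple.
# Predicate names are hypothetical labels chosen for this sketch.
triples = {
    ("tea_bottle", "is_on", "desk"),
    ("mouse", "is_on", "desk"),
    ("tea_bottle", "color", "blue"),
    ("mouse", "color", "black"),
    ("tea_bottle", "weight_g", 213),
    ("mouse", "weight_g", 186),
    ("tea_bottle", "contains", "tea"),
    ("mouse", "contains", "electronic_components"),
}


def describe(subject):
    """Return every (predicate, object) pair known about one subject."""
    pairs = [(p, o) for s, p, o in triples if s == subject]
    return sorted(pairs, key=lambda pair: pair[0])  # order by predicate name


for predicate, obj in describe("mouse"):
    print("mouse", predicate, obj)
```

Adding a new fact about an object only means adding one more tuple; nothing else in the structure has to change.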

Once our database of triples has been created, we can query it:

Give me everything that’s blue in color:

→ The tea bottle, the glasses cleaner cloth.

Give me everything that contains something, along with what it contains, ranked by decreasing weight:

→ The headset >> Electronic components

→ The tea bottle >> Tea

→ The mouse >> Electronic components

Give me everything that contains something, limited to blue objects:

→ The tea bottle
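The three questions above can be sketched as pattern matching over a list of triples. This is a toy model and not how a triplestore is implemented: the `match` helper and the predicate names are assumptions made for the example, and `None` plays the role that a variable plays in a real query.

```python
# Toy triples taken from the desk description (predicate names are invented).
triples = [
    ("tea_bottle", "color", "blue"),
    ("mouse", "color", "black"),
    ("headset", "color", "black"),
    ("eraser", "color", "white"),
    ("cleaner_cloth", "color", "blue"),
    ("tea_bottle", "weight_g", 213),
    ("mouse", "weight_g", 186),
    ("headset", "weight_g", 248),
    ("eraser", "weight_g", 19),
    ("cleaner_cloth", "weight_g", 3),
    ("headset", "contains", "electronic_components"),
    ("mouse", "contains", "electronic_components"),
    ("tea_bottle", "contains", "tea"),
]


def match(s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# 1. Everything that is blue.
blue = {s for s, _, _ in match(p="color", o="blue")}

# 2. Everything that contains something, ranked by decreasing weight.
weight = {s: o for s, _, o in match(p="weight_g")}
containers = sorted(match(p="contains"), key=lambda t: weight[t[0]], reverse=True)
ranked = [(s, o) for s, _, o in containers]

# 3. Everything that contains something, limited to blue objects.
blue_containers = [s for s, _, _ in match(p="contains") if s in blue]

print(blue, ranked, blue_containers)
```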

We can now link our database with another one, which for example defines plastics, and ask for everything that is made from a petroleum derivative. This information does not appear in our database, but since the second database defines the materials polycarbonate, styrenic copolymer and polyamide, in the form of triples, as petroleum derivatives, we will obtain:

→ The tea bottle, the mouse, the headset, the glasses cleaner cloth
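Linking two sets of triples can be sketched in the same toy model: once both lists are put together, a question can follow a chain that starts in one and ends in the other. The predicate names `made_of` and `derived_from` are invented for the illustration.

```python
# First dataset: what the objects on the desk are made of.
desk = [
    ("tea_bottle", "made_of", "polycarbonate"),
    ("mouse", "made_of", "styrenic_copolymer"),
    ("headset", "made_of", "styrenic_copolymer"),
    ("eraser", "made_of", "rubber"),
    ("cleaner_cloth", "made_of", "polyamide"),
]

# Second dataset: a separate description of plastics.
plastics = [
    ("polycarbonate", "derived_from", "petroleum"),
    ("styrenic_copolymer", "derived_from", "petroleum"),
    ("polyamide", "derived_from", "petroleum"),
]

# Linking the two datasets gives a single graph to search.
graph = desk + plastics

# Step 1: materials that the second dataset declares as petroleum derivatives.
petroleum_materials = {
    s for s, p, o in graph if p == "derived_from" and o == "petroleum"
}

# Step 2: objects from the first dataset made of one of these materials.
petroleum_objects = [
    s for s, p, o in graph if p == "made_of" and o in petroleum_materials
]

print(petroleum_objects)
```

The eraser is left out because nothing in the second dataset says that rubber is a petroleum derivative.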

What can be found in the Persée triplestore?

In Persée, we have articles, illustrations, collections, authors, etc., all of which have been described in triples. All the information about each of these “objects” is kept in the Persée triplestore and is now searchable. The following diagrams show what information is available for each type of entity and how these entities are related to each other.

The predicates (relationships) used are mostly taken from existing, recommended vocabularies that are also used by other triplestores processing the same kind of data.

The Persée triplestore uses, among others, the vocabularies “foaf”, which defines relationships between people, “bibo”, which defines relationships between bibliographic entities, “dcterms” (Dublin Core), which defines a set of classic metadata, and “cito”, which defines links between documents.

All this information about Persée content is “metadata”, i.e. data that describes data. A triplestore is therefore a database (or rather a graph, once again) of metadata. When you query a triplestore, you are not looking for the same information as in a “classic” database. What matters here is not the content itself but everything that qualifies it. In this case, you will not look for a particular article but for all the articles corresponding to this or that criterion, or for the number of articles meeting these criteria, broken down by year, by author or by collection. A particularity of Persée is that, if your results include a list of article identifiers, you can enter these URIs (see below) in a browser to access the content they point to (this operation is called “dereferencing” URIs) and retrieve the full text of these resources.
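To give an idea of what “the number of articles distributed by year” means in terms of triples, here is a small sketch on invented data. The identifiers, the predicate names `type` and `year`, and the years themselves are all hypothetical; they do not come from the Persée triplestore.

```python
from collections import Counter

# Toy metadata triples; identifiers, predicates and years are invented.
triples = [
    ("article_1", "type", "Article"),
    ("article_1", "year", 1998),
    ("article_2", "type", "Article"),
    ("article_2", "year", 1998),
    ("article_3", "type", "Article"),
    ("article_3", "year", 2004),
    ("image_1", "type", "Illustration"),
    ("image_1", "year", 1998),
]

# Keep only the resources described as articles...
articles = {s for s, p, o in triples if p == "type" and o == "Article"}

# ...then count them per publication year.
per_year = Counter(o for s, p, o in triples if p == "year" and s in articles)

for year, count in sorted(per_year.items()):
    print(year, count)
```

The query never touches the text of the articles: it works only on the statements made about them.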

Since each item is described through a large number of properties and relationships, the database is large: 30,327,681 triples!

URI, predicates and vocabularies

A resource is identified by a URI (Uniform Resource Identifier), a unique identifier that resembles a URL but does not necessarily point to a file: a URI does not identify a file, but the resource itself as an abstract entity.

URLs are URIs whose prefix is “http” and whose particularity is to identify a resource mainly by the mechanism that provides access to it (for example, its location on a server, the address of a link resolver attached to access parameters, etc.). Such a URI is also called a “dereferenceable URI”.

In a triple, the subject and the predicate are expressed as URIs; the object can be a URI or a literal value such as a string.

Vocabularies, as understood here, are lists of predicates related to a given field. Predicates are URIs: they consist of a prefix that refers to a domain, followed by the property itself, which is defined on the domain’s site. Thus, all schemas that use the predicate “dcterms:publisher” use the URI “http://purl.org/dc/terms/publisher” and agree on the definition given by the Dublin Core Metadata Initiative: “An entity responsible for making the resource available.” Using widely adopted vocabularies ensures that different triplestores call the same concept by the same name.

Triples are used to “describe the world”. Vocabularies are lists of predicates (properties) that are used to describe a particular world. The vocabulary foaf (Friend of a friend) describes people and the relationships between them; the vocabulary RDA applies to library collections. The Persée triplestore combines a number of these vocabularies to describe all its resources. The list of predicates used can be consulted on the data schema mind maps. You can also access a list of the predicates attached to a resource through this form.
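The relation between a prefixed name and its full URI can be sketched as a simple lookup followed by a concatenation. The `expand` function is a hypothetical helper written for this illustration; the two namespace addresses are the ones quoted in this page.

```python
# Prefix -> namespace table (both addresses are quoted in this page).
PREFIXES = {
    "dcterms": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


def expand(prefixed_name):
    """Turn a name such as 'dcterms:publisher' into the full predicate URI."""
    prefix, _, local_name = prefixed_name.partition(":")
    return PREFIXES[prefix] + local_name


print(expand("dcterms:publisher"))
print(expand("foaf:name"))
```

Two triplestores that both write “dcterms:publisher” are therefore pointing at exactly the same URI, which is what lets their data be compared.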

The basic principles of SPARQL / How to query a triplestore?

You search a triplestore by submitting queries to it in a dedicated language: SPARQL.

This language has some similarities with SQL, for those who know that language, which is used to query “classic” databases.

A basic query consists of:

one or more declarations of the vocabularies used (PREFIX lines),

a SELECT clause in which you name the information you are looking for. This name is a “variable”: it can be called ?x, ?thatstuff or a meaningful word like ?name. A SPARQL variable starts with a question mark.

a WHERE clause in which you define which items of information you target according to their relationships with others. In the example below, the ?x you are looking for is the object of the predicate cited.

You can then limit, sort and cross-reference the results.

Examples:

# You will use the foaf vocabulary: writing "foaf:" means looking up the predicate that follows in the namespace at this address.
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
# You will also use the rdf-syntax vocabulary, through the prefix "rdf:".
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
# ?x is what you are looking for.
SELECT ?x
WHERE {
  # Search for the resources of type foaf:Person and call them ?y.
  ?y rdf:type foaf:Person .
  # The name, as defined by the foaf vocabulary, attached to these Person entities is the information you are looking for.
  ?y foaf:name ?x .
}
# Limit the results to 200.
LIMIT 200

Two hundred persons’ names available in the database will be displayed.
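For readers who would rather send this query from a script than from a web form, here is a hedged sketch using only the Python standard library. It assumes that the triplestore exposes a standard SPARQL endpoint accepting the query in a `query` URL parameter and able to return JSON results; the endpoint address below is an assumption and should be checked on the site. The functions `build_url` and `run_query` are illustrative helpers, not part of any Persée tool.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint address: verify it on data.persee.fr before use.
ENDPOINT = "http://data.persee.fr/sparql"

QUERY = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?x
WHERE {
  ?y rdf:type foaf:Person .
  ?y foaf:name ?x .
}
LIMIT 200
"""


def build_url(endpoint, query):
    """Encode the query text as the 'query' parameter of a GET request."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})


def run_query(endpoint, query):
    """Send the query and return the values bound to ?x (needs network access)."""
    request = urllib.request.Request(
        build_url(endpoint, query),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return [row["x"]["value"] for row in data["results"]["bindings"]]


if __name__ == "__main__":
    # Only the URL is printed here; call run_query(ENDPOINT, QUERY) to execute.
    print(build_url(ENDPOINT, QUERY))
```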

A number of SPARQL tutorials are available online. Here are some links, in French and English:

http://web-semantique.developpez.com/tutoriels/jena/arq/introduction-sparql/

https://fr.wikiversity.org/wiki/SPARQL_Protocol_and_RDF_Query_Language/Requ%C3%AAtes_de_lecture

https://jena.apache.org/tutorials/sparql.html

http://www.linkeddatatools.com/querying-semantic-data

http://www.cambridgesemantics.com/semantic-university/sparql-by-example

But since not everyone has the courage to learn a new language, we offer you an innovative solution to formulate queries on the Persée triplestore: Sparklis.

What are the advantages of a triplestore?

In a triplestore, you can search everywhere, even where you do not expect answers. When you search in a library, you look on a shelf, in a section, in a room, and then on another shelf, in another section, in another room. You ask for the help of a librarian, someone who knows the collection, and you even ask what you may have forgotten. You run additional searches and cross-reference them with other keywords, or with those of colleagues… In a triplestore, a query applies to all collections at once: since the data are linked to one another, they form a single corpus on which you can make exhaustive searches.

But this may produce “noise”. With a SPARQL query, you can be extremely precise. Using a regular expression, you can search for the words rock, rocks, rocker, rockers, rocket, rockets, rock’n roll, rockabilly, rocky, rockies without getting the words crock, crocked, crocking, crocker, brocket, frock, skyrocket or bedrock. You can limit your search to a specific time period, or to a publisher. Some results can be excluded from the outset: ask for the word table but not “table of contents” or “round table”.

When you do a search in a library, real or virtual, you get a book, magazine or article, or many books, magazines or articles. All these references must then be recorded, noted or copied and pasted. In a triplestore, you obtain a results table, which can be downloaded. All the information is there, ordered, in a storable, reusable, “rich” format. Once it is imported into a spreadsheet on your desktop computer, you can add as many columns as necessary to your table and enter your own annotations. If your query includes the identifiers of the articles, you will be able to click on them and access the full text on Persee.fr, while keeping the list of data you are interested in (titles, dates, authors, etc.). So you get the advantages of storage, durability and customization.

A triplestore is not an isolated thing! The term “linked data” means that the data in a triplestore are linked even beyond its own database. We have aligned our data with other repositories. That is, by means of specific triples, we have identified some of our resources as strictly the same “objects” as those present in other databases. Authors in Persée are aligned with those of the BnF, our living species with those of the Global Biodiversity Information Facility (GBIF), our monuments with the Athar database. By means of a federated query, users can search from our entry point for information held in other triplestores.
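The “rock” example can be checked with a short script. The pattern below is one possible way to express “words that begin with rock”, written here with Python's `re` module; it is an assumption about how such a filter could be phrased, and in a query the same kind of pattern would be passed to the regex filter of the query language.

```python
import re

# "rock" at the very start of the text, or right after a non-word character.
STARTS_WITH_ROCK = re.compile(r"(^|\W)rock", re.IGNORECASE)

wanted = [
    "rock", "rocks", "rocker", "rockers", "rocket", "rockets",
    "rock'n roll", "rockabilly", "rocky", "rockies",
]
unwanted = [
    "crock", "crocked", "crocking", "crocker",
    "brocket", "frock", "skyrocket", "bedrock",
]


def keep(text):
    """True when the text contains a word beginning with 'rock'."""
    return STARTS_WITH_ROCK.search(text) is not None


print([w for w in wanted + unwanted if keep(w)])
```

A plain substring search for “rock” would have returned all eighteen words; anchoring the pattern at the start of a word is what removes the noise.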
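The idea of alignment can be sketched on toy data: one triple states that a local resource is the same “object” as a record kept elsewhere, and a search can then follow that link. Everything below is hypothetical (the identifiers, the `same_as` predicate name, the facts themselves); the triples actually used for alignment may be expressed differently.

```python
# Local triples: one author, aligned with an external record (invented ids).
local = [
    ("persee:author_42", "name", "Jane Doe"),
    ("persee:author_42", "same_as", "other:record_7"),
]

# Triples held by another, independent store (also invented).
external = [
    ("other:record_7", "birth_year", 1901),
]


def facts_about(resource):
    """Collect facts on a resource, following same_as links into the other store."""
    aliases = {resource} | {
        o for s, p, o in local if s == resource and p == "same_as"
    }
    return {
        (p, o)
        for s, p, o in local + external
        if s in aliases and p != "same_as"
    }


print(facts_about("persee:author_42"))
```

The birth year is never stored locally: it becomes reachable only because the alignment triple says that both identifiers designate the same person.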

And a triplestore can be downloaded. By downloading a data set, you can install all or part of our data on your desktop computer and reuse the data as you wish (within the limits of our general conditions of use, of course!).