6. Vocabularies and Ontologies

In earlier sections we looked at Semantic Web technologies concerned with how metadata can be syntactically attached to information. For the meaning intended by the data publisher to match the meaning understood by data consumers, it is essential to use vocabularies with well-defined semantics. The use of reference vocabularies, that is, vocabularies in common use, greatly facilitates this. In this document, the terms vocabulary and ontology are used interchangeably. There is no clear separation between the two concepts in the literature, and, in many cases, the term "vocabulary" is used for simpler ontologies.

Making an analogy with everyday communication, each group of people uses a specific vocabulary in its conversations and message exchanges. People group together for many reasons: geographic location, family, professional and social relationships, and countless other circumstances. Two doctors talking about a particular surgical procedure will use terms from a specific vocabulary. Likewise, the description of the nutritional values of a package of snack food uses its own vocabulary. Different vocabularies have been created for conversations in social networks, including graphic vocabularies, such as emoticons.

To make the Semantic Web scenario complete, a set of reference vocabularies needs to be established, in order to facilitate the communication of metadata. Readers need to be aware that for each specific publication, a search should be made regarding existing vocabularies that can be used. There are some catalogs that can help users find ontologies, such as LOV [75], BioPortal [72] and JoinUp [76]. In cases where no vocabulary meets the expressiveness needs of the metadata, a new ontology can be created, seeking to reuse as many elements of already existing ontologies as possible, to avoid duplication of references for the same concepts.

Every vocabulary is described by a document identified by a URI. For example, the FOAF vocabulary has the URI “http://xmlns.com/foaf/0.1/”. The reference URI of each class and property of a vocabulary is built by concatenating the URI of the vocabulary with the name of the respective class or property. For example, the FOAF property “name” has the URI “http://xmlns.com/foaf/0.1/name”. The following sections present some of the most popular and historically important reference vocabularies.
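
The concatenation rule described above can be sketched in a few lines of Python. This is only an illustration: the helper name term_uri is hypothetical and not part of any library; the FOAF URI is the one given in the text.

```python
# URI of the FOAF vocabulary, as given in the text.
FOAF = "http://xmlns.com/foaf/0.1/"


def term_uri(vocabulary_uri: str, local_name: str) -> str:
    """Build the URI of a class or property by appending its name
    to the URI of the vocabulary (hypothetical helper)."""
    return vocabulary_uri + local_name


name_uri = term_uri(FOAF, "name")      # a property
person_uri = term_uri(FOAF, "Person")  # a class
print(name_uri)
print(person_uri)
```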

6.1 Dublin Core

The Dublin Core Metadata Initiative [77] (DCMI), whose first event was held in 1995, was one of the first initiatives to define a vocabulary for describing metadata, on the premise that descriptions should be independent of syntax and have well-defined semantics. The vocabulary describes resources (documents) and started with a basic set of 15 properties, similar to library cataloging elements:

• title – name of the resource.

• creator – name of the creator of the resource.

• subject – topic of the resource.

• description – description of the resource, which may be a synopsis, a summary, etc.

• publisher – entity responsible for making the resource available.

• contributor – name of the contributor to the resource.

• date – date associated with the resource.

• type – type of resource.

• format – file format, physical means of storage or size of the resource.

• identifier – a unique reference to the resource within a certain context.

• source – source that originated the resource, for example, the result of a service.

• language – language of the resource.

• relation – relation between two resources.

• coverage – temporal or spatial coverage of the resource, such as a jurisdiction.

• rights – rights associated with the resource.

One of the uses of this vocabulary is the documentation of web pages. Figure 6.1 presents an example of the use of the vocabulary utilizing the HTML tag <meta>. The properties of Dublin Core are frequently reused in other ontologies.

<head>
<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">
<meta name="DC.title" content="Guia de Web Semântica">
<meta name="DC.creator" content="Carlos Laufer">
<meta name="DC.subject" content="Web Semântica">
<meta name="DC.subject" content="Dados Conectados">
<meta name="DC.description" content="Introdução ao Ecossistema e às Tecnologias de Web Semântica e Dados Conectados">
<meta name="DC.publisher" content="W3C Brasil">
<meta name="DC.date" content="21/01/2015">
</head>
Figure 6.1 Example of the use of Dublin Core Vocabulary
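
As a rough sketch of how tags like those in Figure 6.1 could be produced programmatically, the Python fragment below renders a list of property/value pairs as <meta> elements. The function name dublin_core_head is an assumption made for this illustration; only the "DC." naming convention and the namespace URI come from the figure.

```python
from html import escape

# Namespace of the 15 basic Dublin Core elements, as used in Figure 6.1.
DC_NAMESPACE = "http://purl.org/dc/elements/1.1/"


def dublin_core_head(properties):
    """Render (property, value) pairs as Dublin Core <meta> tags.

    Hypothetical helper: a property may appear more than once,
    as "subject" does in Figure 6.1.
    """
    lines = ['<link rel="schema.DC" href="%s">' % DC_NAMESPACE]
    for prop, value in properties:
        # Escape the value so quotes and ampersands stay valid HTML.
        lines.append('<meta name="DC.%s" content="%s">'
                     % (prop, escape(value, quote=True)))
    return "\n".join(lines)


tags = dublin_core_head([
    ("title", "Guia de Web Semântica"),
    ("creator", "Carlos Laufer"),
    ("subject", "Web Semântica"),
    ("subject", "Dados Conectados"),
])
print(tags)
```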

As the project evolved, the vocabulary was extended through the creation of a larger set of properties and classes [78]. Among the properties added are: “abstract,” “audience,” “conformsTo,” “hasVersion,” “replaces,” “requires,” etc. Among the classes created are: “FileFormat,” “Jurisdiction,” “Location,” “ProvenanceStatement,” and “MediaType.”

6.2 FOAF

The Friend of a Friend vocabulary [79] (FOAF) emerged in early 2000 and is suited to defining metadata about people and their interests, relationships and activities. The vocabulary [13] has a core set of classes (first letter capitalized) and properties (first letter in lower case):

• Agent – things that carry out something and can be people, organizations, robots, etc. “Person,” “Organization” and “Group” are subclasses.

• Person – core entity of the vocabulary: represents people.

• name – string of characters with a name.

• title – form used to address people, such as "Mr.," and "Mrs."

• img – an image that represents a person.

• familyName – describes part of the name of a person (surname).

• givenName – describes part of the name of a person (first name).

• knows – relates two people.

• based_near – spatial relationship between two things.

• age – the age of a person.

• made (maker) – something made by someone.

• primaryTopic (primaryTopicOf) – main topic of a document.

• Project – a project.

• Organization – an organization.

• Group – a group.

• member – a member of a group.

• Document – a document.

• Image – an image.

In addition to this central core, there is also an extension of classes and properties related to social characteristics of the Web, such as, “nick,” “mbox” (e-mail), “homepage,” “publications,” and “account.”

One of the ideas introduced with the FOAF vocabulary was that each person could publish a file, called a FOAF file, with their personal information, such as the FOAF file of Tim Berners-Lee [80]. Figure 6.2 reproduces part of the information contained in this file.

@prefix dc11: <http://purl.org/dc/elements/1.1/> .
@prefix cc: <http://creativecommons.org/ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf>
  cc:license <http://creativecommons.org/licenses/by-nc/3.0/> ;
  dc11:title "Tim Berners-Lee’s FOAF file" ;
  a foaf:PersonalProfileDocument ;
  foaf:maker <http://www.w3.org/People/Berners-Lee/card#i> ;
  foaf:primaryTopic <http://www.w3.org/People/Berners-Lee/card#i> .
<http://www.w3.org/People/Berners-Lee/card#i>
  foaf:img <http://www.w3.org/Press/Stock/Berners-Lee/2001-europaeum-eighth.jpg>;
  foaf:knows
   <http://bblfish.net/people/henry/card#me> ,
   <http://danbri.org/foaf#danbri> ,
   <http://dbpedia.org/resource/John_Gage> ,
   <http://dbpedia.org/resource/John_Klensin> ,
   <http://dbpedia.org/resource/John_Markoff> ,
   <http://dbpedia.org/resource/John_Seely_Brown> ,
   <http://dbpedia.org/resource/Tim_Bray> ,
   <http://dig.csail.mit.edu/2007/wiki/people/JoeLambda#JL> ,
   <http://dig.csail.mit.edu/2007/wiki/people/RobertHoffmann#RMH> ,
   <http://norman.walsh.name/knows/who#norman-walsh> ,
   <http://www.cs.umd.edu/~hendler/2003/foaf.rdf#jhendler> ,
   <http://www.mindswap.org/2004/owl/mindswappers#Jennifer.Golbeck> ,
   <http://www.w3.org/People/Jacobs/contact.rdf#IanJacobs> .
Figure 6.2 Information from the FOAF file of Tim Berners-Lee (Turtle)
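
To show how an application might use such a file, the sketch below holds a few of the statements from Figure 6.2 as plain (subject, predicate, object) tuples and looks up the people connected through “knows”. It deliberately avoids any RDF library; the function name objects and the tuple representation are assumptions made for this illustration.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
TIMBL = "http://www.w3.org/People/Berners-Lee/card#i"

# A small subset of the statements in Figure 6.2,
# written as (subject, predicate, object) tuples.
triples = [
    (TIMBL, FOAF + "img",
     "http://www.w3.org/Press/Stock/Berners-Lee/2001-europaeum-eighth.jpg"),
    (TIMBL, FOAF + "knows", "http://danbri.org/foaf#danbri"),
    (TIMBL, FOAF + "knows", "http://dbpedia.org/resource/Tim_Bray"),
]


def objects(triples, subject, predicate):
    """Return the objects of all triples with the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


acquaintances = objects(triples, TIMBL, FOAF + "knows")
for person in acquaintances:
    print(person)
```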

6.3 SKOS

Simple Knowledge Organization System (SKOS) [81], a W3C Recommendation, provides a way to represent classification schemes, such as controlled vocabularies, taxonomies and thesauri. Many of these systems share a similar structure and are used in similar applications. SKOS captures many of these similarities and makes them explicit, in order to permit data sharing among these different applications. In SKOS, concepts are identified by URIs, labeled in one or more languages, and documented with different types of notes. Concepts can be semantically related to one another in informal hierarchies and association networks, and grouped into concept schemes. The following are some of the main elements of SKOS [82].

• Concept
This is the fundamental element of SKOS. It is a class that defines a certain resource as a concept.

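@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex: <http://www.example.com/> .
ex:animals rdf:type skos:Concept .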

• prefLabel, altLabel
These are labels that make reference to concepts in natural language: “prefLabel” is the preferred label to be displayed and “altLabel” is an alternative label, used, for example, for synonyms.

ex:animals rdf:type skos:Concept ;
  skos:prefLabel "animals" ;
  skos:altLabel "creatures" .

• broader, narrower
The meaning of a concept is not only defined by words in the natural language of its labels, but also by its links with other concepts from the vocabulary: “broader” indicates that one concept is broader than another (one concept encompasses the other concept); “narrower” is the opposite of “broader”.

ex:animals rdf:type skos:Concept ;
  skos:prefLabel "animals" ;
  skos:narrower ex:mammals .
ex:mammals rdf:type skos:Concept ;
  skos:prefLabel "mammals" ;
  skos:broader ex:animals .

• related
Indicates an associative relationship between two concepts.

ex:ornithology rdf:type skos:Concept ;
  skos:prefLabel "ornithology" .
ex:birds rdf:type skos:Concept ;
  skos:prefLabel "birds" ;
  skos:related ex:ornithology .

• note
Indicates a note about the concept: “note” is a generic note, and more specific kinds of notes can be expressed using “scopeNote,” “historyNote,” “editorialNote” and “changeNote,” which relate to scope, history, editorial issues and changes made, respectively.

ex:microwaveFrequencies
  skos:scopeNote "Used for frequencies between 1GHz to 300Ghz"@en .
ex:tomato
  skos:changeNote "Moved from 'fruits' to 'vegetables'" .

• definition
Provides a definition of the concept.

ex:documentation rdf:type skos:Concept ;
  skos:definition "the process of storing and retrieving information in all fields of knowledge" .

• ConceptScheme
Concepts can be created and used as independent entities. However, concepts generally come in carefully organized vocabularies, such as thesauruses or classification schemes. These schemes can be represented in SKOS through the “ConceptScheme” class.

ex:animalThesaurus rdf:type skos:ConceptScheme ;
  dct:title "Simple animal thesaurus" ;
  dct:creator ex:antoineIsaac .
ex:mammals rdf:type skos:Concept ;
  skos:inScheme ex:animalThesaurus .
ex:cows rdf:type skos:Concept;
  skos:broader ex:mammals ;
  skos:inScheme ex:animalThesaurus .
ex:fish rdf:type skos:Concept ;
  skos:inScheme ex:animalThesaurus .
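
The hierarchy built with “broader” in the examples above can be walked programmatically. The sketch below is a minimal illustration using a plain dictionary rather than an RDF library; the function name ancestors is hypothetical. Note that in SKOS “broader” itself is not declared transitive (the vocabulary provides “broaderTransitive” for that purpose), so the traversal is a choice made by the application.

```python
# "broader" links taken from the examples above:
# cows -> mammals -> animals.
broader = {
    "ex:cows": ["ex:mammals"],
    "ex:mammals": ["ex:animals"],
}


def ancestors(concept, broader):
    """Collect every concept reachable by following "broader" links."""
    seen = []
    pending = list(broader.get(concept, []))
    while pending:
        current = pending.pop()
        if current not in seen:  # guard against cycles
            seen.append(current)
            pending.extend(broader.get(current, []))
    return seen


cow_ancestors = ancestors("ex:cows", broader)
print(cow_ancestors)
```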

SKOS also provides ways to map between different concept schemes, based on the relations between the concepts in these schemes. The mapping properties (“exactMatch,” “closeMatch,” “broadMatch,” “narrowMatch” and “relatedMatch”) qualify the degree of proximity of these relations.

6.4 Schema.org

Schema.org provides a collection of vocabularies that can be used to embed metadata in web pages, and are understood by the main search engines: Google, Microsoft, Yandex and Yahoo!. The metadata can be embedded using microdata, RDFa or JSON-LD.

There are currently over 100 vocabularies defined in Schema.org, organized in a hierarchical structure. Each vocabulary defines a type. “Thing” is the root type and represents any generic item. An item of type “Thing” accepts properties such as “name,” “description,” “image” and “url”. On the second level of the hierarchy, there are ten specialized types, each with its own vocabulary:

• Action

• BroadcastService

• CreativeWork

• Event

• Intangible

• MedicalEntity

• Organization

• Person

• Place

• Product

Each of these types has its own specializations, such as:

• Organization

  • Airline

  • Corporation

  • EducationalOrganization

  • GovernmentOrganization

  • LocalBusiness

  • NGO

  • PerformingGroup

  • SportsOrganization

The Schema.org website [83] presents each of the vocabularies in detail, along with examples and encodings in microdata, RDFa and JSON-LD. Figures 6.3 and 6.4 present an example using the vocabularies for people and addresses.

Jane Doe
<img src="janedoe.jpg" alt="Photo of Jane Doe"/>
Professor
20341 Whitworth Institute
405 Whitworth
Seattle WA 98052
(425) 123-4567
<a href="mailto:jane-doe@xyz.edu">jane-doe@xyz.edu</a>
Jane’s home page:
<a href="http://www.janedoe.com">janedoe.com</a>
Graduate students:
<a href="http://www.xyz.edu/students/alicejones.html">Alice Jones</a>
<a href="http://www.xyz.edu/students/bobsmith.html">Bob Smith</a>
Figure 6.3 Example of use of vocabularies from Schema.org (without metadata)

<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Jane Doe</span>
  <img src="janedoe.jpg" itemprop="image" alt="Photo of Jane Doe"/>
  <span itemprop="jobTitle">Professor</span>
  <div itemprop="address" itemscope
     itemtype="http://schema.org/PostalAddress">
   <span itemprop="streetAddress">
     20341 Whitworth Institute
     405 Whitworth
   </span>
   <span itemprop="addressLocality">Seattle</span>,
   <span itemprop="addressRegion">WA</span>
   <span itemprop="postalCode">98052</span>
  </div>
  <span itemprop="telephone">(425) 123-4567</span>
  <a href="mailto:jane-doe@xyz.edu"
     itemprop="email">jane-doe@xyz.edu</a>
  Jane’s home page:
  <a href="http://www.janedoe.com"
     itemprop="url">janedoe.com</a>
  Graduate students:
  <a href="http://www.xyz.edu/students/alicejones.html"
     itemprop="colleague">Alice Jones</a>
  <a href="http://www.xyz.edu/students/bobsmith.html"
     itemprop="colleague">Bob Smith</a>
</div>
Figure 6.4 Example of use of vocabularies from Schema.org (microdata)
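
Since the same metadata can also be embedded as JSON-LD, the following sketch builds a JSON-LD description equivalent to the microdata in Figure 6.4 using only the Python standard library. It is an illustrative rendering, not an excerpt from the Schema.org website; the variable names are arbitrary, and the property names are the ones used in the figure.

```python
import json

# The same person as in Figure 6.4, expressed as a JSON-LD object.
person = {
    "@context": "https://schema.org",
    "@type": "Person",
    "name": "Jane Doe",
    "image": "janedoe.jpg",
    "jobTitle": "Professor",
    "address": {
        "@type": "PostalAddress",
        "streetAddress": "20341 Whitworth Institute 405 Whitworth",
        "addressLocality": "Seattle",
        "addressRegion": "WA",
        "postalCode": "98052",
    },
    "telephone": "(425) 123-4567",
    "email": "mailto:jane-doe@xyz.edu",
    "url": "http://www.janedoe.com",
    "colleague": [
        "http://www.xyz.edu/students/alicejones.html",
        "http://www.xyz.edu/students/bobsmith.html",
    ],
}

json_ld = json.dumps(person, indent=2)

# JSON-LD is embedded in a page inside a <script> element of this type.
script = '<script type="application/ld+json">\n' + json_ld + "\n</script>"
print(script)
```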

6.5 PROV

When data is published, some of the accompanying metadata concerns aspects that are indirect to the data itself, such as its license and rights. Strictly speaking, this metadata does not describe the structure or semantics of the information. An important type of such information is the provenance of the data: who generated it, how it was generated, and what the sources were.

Provenance is information about the entities, activities and people involved in producing a certain thing, which can be used to assess its quality, reliability or trustworthiness. The PROV family of documents [84] defines a model, corresponding serializations and other supporting definitions to enable the interchange of provenance information on the Web.

The provenance model defined by PROV takes into account three basic elements: entities, activities and agents. These three elements are connected through a set of relationships. For example, “an entity (a web page, file, etc.) was generated by an activity associated with a particular agent.” Figure 6.5 presents a diagram with the basic relationships between the three elements. The diagram uses a graphical notation defined by PROV to represent the ontology.

Figure 6.5 Entity, Activity and Agent (PROV)

As a way of illustrating the basic principles of the ontology, let us assume the following generation of data [85]:

Derek, who works in the company “Chart Generators Inc”, made a composition of information from a set of data and a list of geographical regions, and based on the results of this composition he generated an illustration in the form of a graph.

This description may be represented by the diagram in Figure 6.6 and the representation in Turtle in Figure 6.7.

Figure 6.6 Example of PROV (diagram)

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix exc: <http://examplec.com/> .
@prefix exg: <http://exampleg.com/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
exg:dataset1       a prov:Entity .
exc:regionList1    a prov:Entity .
exc:composition1   a prov:Entity .
exc:chart1         a prov:Entity .
exc:compose1       a prov:Activity .
exc:illustrate1    a prov:Activity .
exc:compose1         prov:used exg:dataset1 ;
    prov:used exc:regionList1 .
exc:composition1    prov:wasGeneratedBy exc:compose1 .
exc:illustrate1    prov:used exc:composition1 .
exc:chart1    prov:wasGeneratedBy exc:illustrate1 .
exc:compose1    prov:wasAssociatedWith exc:derek .
exc:illustrate1    prov:wasAssociatedWith exc:derek .
exc:chart1    prov:wasAttributedTo exc:derek .
exc:derek   a prov:Agent ,
    prov:Person ;
    foaf:givenName "Derek"^^xsd:string ;
    foaf:mbox <mailto:derek@example.org> .
exc:chartgen   a prov:Agent ,
    prov:Organization ;
    foaf:name "Chart Generators Inc" .
exc:derek   prov:actedOnBehalfOf exc:chartgen .
Figure 6.7 Example of PROV (Turtle)
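
One use of such statements is tracing what a published item was ultimately derived from. The sketch below is a simplified illustration and not part of PROV: it stores the “used” and “wasGeneratedBy” statements of the example as tuples and follows them backwards from the chart. The function name sources is hypothetical.

```python
# "used" and "wasGeneratedBy" statements from the example above.
statements = [
    ("exc:compose1", "prov:used", "exg:dataset1"),
    ("exc:compose1", "prov:used", "exc:regionList1"),
    ("exc:composition1", "prov:wasGeneratedBy", "exc:compose1"),
    ("exc:illustrate1", "prov:used", "exc:composition1"),
    ("exc:chart1", "prov:wasGeneratedBy", "exc:illustrate1"),
]


def sources(entity, statements):
    """Return every entity that the given entity depends on,
    directly or through intermediate activities."""
    found = set()
    pending = [entity]
    while pending:
        current = pending.pop()
        # Activities that generated the current entity.
        activities = [o for s, p, o in statements
                      if s == current and p == "prov:wasGeneratedBy"]
        for activity in activities:
            # Entities used by each of those activities.
            for s, p, o in statements:
                if s == activity and p == "prov:used" and o not in found:
                    found.add(o)
                    pending.append(o)
    return found


chart_sources = sources("exc:chart1", statements)
print(sorted(chart_sources))
```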

6.6 DCAT

Search engines were among the first applications to arise from the Web. In an environment where a great deal of information is published, it is natural to need a way to catalog it and enable searches to be performed. Likewise, the growing promotion of openness and data publishing requires a way to catalog datasets and distribute them.

DCAT [86], a recommendation from W3C, permits the creation of catalogs with descriptions of datasets. Using a standard form for description of catalogs enhances discovery capability and enables applications to find metadata distributed in different catalogs. It also allows for decentralized publishing of catalogs and facilitates federated searches of datasets published on different websites.

DCAT has three main classes:

• dcat:Catalog – represents the catalog.

• dcat:Dataset – represents a dataset in the catalog.

• dcat:Distribution – represents a way to access datasets, such as a web page, downloadable file, web service, web API, SPARQL endpoint, etc.

Figure 6.8 Diagram of the DCAT classes

Figure 6.9 presents a catalog example with two datasets represented in DCAT.

@base <http://example.org/> .
@prefix : <http://example.org/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
:catalog a dcat:Catalog ;
  dct:title "Imaginary Catalog" ;
  rdfs:label "Imaginary Catalog" ;
  foaf:homepage <http://example.org/catalog> ;
  dct:publisher :transparency-office ;
  dct:language <http://id.loc.gov/vocabulary/iso639-1/en> ;
  dcat:dataset :dataset-001 , :dataset-002 .
:transparency-office a foaf:Organization ;
  rdfs:label "Transparency Office" .
:dataset-001 a dcat:Dataset ;
  dct:title "Imaginary dataset" ;
  dcat:keyword "accountability" , "transparency" , "payments" ;
  dct:issued "2011-12-05"^^xsd:date ;
  dct:modified "2011-12-05"^^xsd:date ;
  dcat:contactPoint <http://example.org/transparency-office/contact> ;
  dct:temporal <http://reference.data.gov.uk/id/quarter/2006-Q1> ;
  dct:spatial <http://www.geonames.org/6695072> ;
  dct:publisher :finance-ministry ;
  dct:language <http://id.loc.gov/vocabulary/iso639-1/en> ;
  dct:accrualPeriodicity <http://purl.org/linked-data/sdmx/2009/code#freq-W> ;
  dcat:distribution :dataset-001-csv .
:dataset-001-csv a dcat:Distribution ;
  dcat:downloadURL <http://www.example.org/files/001.csv> ;
  dct:title "CSV distribution of imaginary dataset 001" ;
  dcat:mediaType "text/csv" ;
  dcat:byteSize "5120"^^xsd:decimal .
:dataset-002 a dcat:Dataset ;
  dcat:landingPage <http://example.org/dataset-002.html> ;
  dcat:distribution :dataset-002-csv , :dataset-002-xml .
:dataset-002-csv a dcat:Distribution ;
  dcat:downloadURL <http://example.org/files/dataset-002.csv> ;
  dcat:mediaType "text/csv" .
:dataset-002-xml a dcat:Distribution ;
  dcat:downloadURL <http://example.org/files/dataset-002.xml> ;
  dcat:mediaType "text/xml" .
:catalog dcat:themeTaxonomy :themes .
:themes a skos:ConceptScheme ;
  skos:prefLabel "A set of domains to classify documents" .
:dataset-001 dcat:theme :accountability .
:accountability a skos:Concept ;
  skos:inScheme :themes ;
  skos:prefLabel "Accountability" .
Figure 6.9 Example of a catalog
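
To give an idea of how an application might consume a catalog like this one, the sketch below mirrors the datasets and distributions of the example in a plain Python dictionary and selects download URLs by media type. The dictionary layout and the function name download_urls are assumptions made for this illustration; a real application would obtain the data by parsing the catalog's RDF.

```python
# Datasets and distributions of the example catalog, as a plain dictionary.
catalog = {
    "dataset-001": {
        "keywords": ["accountability", "transparency", "payments"],
        "distributions": [
            {"downloadURL": "http://www.example.org/files/001.csv",
             "mediaType": "text/csv"},
        ],
    },
    "dataset-002": {
        "keywords": [],
        "distributions": [
            {"downloadURL": "http://example.org/files/dataset-002.csv",
             "mediaType": "text/csv"},
            {"downloadURL": "http://example.org/files/dataset-002.xml",
             "mediaType": "text/xml"},
        ],
    },
}


def download_urls(catalog, media_type):
    """List the download URLs of all distributions with the given media type."""
    return [
        distribution["downloadURL"]
        for dataset in catalog.values()
        for distribution in dataset["distributions"]
        if distribution["mediaType"] == media_type
    ]


csv_urls = download_urls(catalog, "text/csv")
print(csv_urls)
```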

The Open Data Institute website [87] presents an example [88] of how to create a catalog in a simple way, by creating a web page with links for downloading files. In this case, the metadata with information from the catalog is embedded in the HTML code, using RDFa.

The “Show me the money” project [89] provides information about the lending market in the United Kingdom. Figures 6.10 and 6.11 present the project page from which the data can be downloaded [90], and an excerpt of the HTML code of this page, with the DCAT catalog information embedded in RDFa.

Figure 6.10 “Show me the money” project (file download page)

<!DOCTYPE html>
<html prefix="dct: http://purl.org/dc/terms/
     rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
     dcat: http://www.w3.org/ns/dcat#
     odrs: http://schema.theodi.org/odrs#">
...
<div resource='http://p2p.labs.theodi.org/download'
     typeof='dcat:Catalog'>
  <p>Want to get at the data for this project? As well as the full data,
     we've also got data files for specific regions:</p>
  <span content='http://id.loc.gov/vocabulary/iso639-1/en'
     property='dct:language'></span>
   <div property='dcat:dataset'
     resource='http://p2p.labs.theodi.org/download/#full'
     typeof='dcat:Dataset'>
    <h3 property='dct:title'>Full data</h3>
    <div property='dcat:distribution' typeof='dcat:Distribution'><ul>
     <li><strong>Format</strong><span
         content='text/csv'
         property='dcat:mediaType'>CSV</span></li>
     <li><strong>Size</strong><span
         content='240585277' datatype='xsd:decimal'
         property='dcat:byteSize'>229MB</span></li>
     <li><strong>Coverage</strong><span
         content='http://dbpedia.org/resource/United_Kingdom'
         property='dct:spatial'>UK</span></li></ul>
    <p><a class='btn btn-primary'
     href='http://4feb814f800c80231150-8876dec7442c825b72049e4e2a169344.r56.cf3.rackcdn.com/complete.no.postcodes.csv.zip'
     property='dcat:accessURL'>Download the full dataset</a></p>
   </div>
  </div>
  ...
Figure 6.11 Example of catalog embedded in a web page (RDFa)