Nowadays, for a large percentage of the world’s population, the World Wide Web is as common as television. The Web is a place where people consult information, make purchases, work, talk, establish relationships, and much more. The Web is something everyday, obvious. Is there any need to explain what the Web is to a 10-year-old born in a big city? For those born in the 21st century, the Web is the Web, just as light is light. Despite the enormous changes wrought in people’s lives by use of the Web, it is still very recent, and in constant and intense transformation. But what is the Web?
2.1 Web of Documents
In technical terms, it is a system invented in 1989 by Tim Berners-Lee and Robert Cailliau to use the Internet for querying and updating items of information (documents) organized in a hypertext structure. Hypertext is a structured text composed of an interlinked set of information units (nodes) containing, for example, texts, images, videos, and other media. The architecture of this system is based on the client-server concept, where an application (a Web client) requests a document (a resource) from another application (a Web server) by providing the identification of the document.
This system is composed of three basic elements, defined according to international standards:
• URL (Uniform Resource Locator) is the identifier of documents, or nodes, of the structure. For example, "http://pt.wikipedia.org/wiki/World_Wide_Web" identifies the document that describes the Web on the Wikipedia website in Portuguese.
• HTML (Hypertext Markup Language) is the markup language for describing documents. It was initially designed as a language to describe scientific documents, but adaptations over time enabled its use to describe different types of documents.
• HTTP (Hypertext Transfer Protocol) is the communication protocol for accessing documents. This protocol establishes communication rules between two applications—a client and a server—that enable specific content to be requested and the result to be returned, such as an HTML page.
The system works as follows: a client application, such as a web browser, requests a document from the server, identifying it through a URL and noting that the document should preferably be described in HTML. This URL also contains the identification of the server responsible for managing the document request. Communication between the client and server is established via the HTTP protocol. Once a document is requested by the browser, the server identified by the URL returns the desired document in the form of a string formatted in HTML. Once the HTML document is received, the browser is responsible for interpreting the HTML code and displaying the information on the screen. The original idea behind this system was that, besides viewing the information contained in a given document, users would also be able to change it, similar to what occurs today on websites such as Wikipedia.
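The request/response cycle just described can be sketched in a few lines of Python. This is a minimal illustration, not part of the original text: a tiny local server stands in for a web server, and a client requests a document from it by URL. The page content and addresses are illustrative assumptions.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"<html><body><h1>Web of Documents</h1></body></html>"

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The server maps the requested URL path to a document and
        # returns it to the client as a string formatted in HTML.
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, *args):
        pass  # silence request logging for the example

# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client identifies the document (and the server) through a URL
# and communicates with the server via HTTP.
url = f"http://127.0.0.1:{server.server_port}/index.html"
with urllib.request.urlopen(url) as response:
    html = response.read().decode()

server.shutdown()
print(html)  # a browser would now interpret and display this HTML
```

A real browser performs the same exchange, then renders the returned HTML instead of printing it.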
In addition to these three elements, another important element in this system is the link that can be created between different documents, a hyperlink in a hypertext. Links establish interconnections between different nodes of the structure. A link can be included in the code of a page using a specific HTML tag (<a href="url-of-the-link">), where it is possible to specify a URL that points to another document. As a result, we have a network, a web of interconnected documents, a hypertext. A document, when viewed by people through a browser, is called a web page. In short, every web document can be accessed via a URL, and can be connected to other documents through links: a set of documents connected by links. This is the Web of Documents.
The content of a page displayed by a browser can contain a variety of information. We could have a page with information about different products on sale in a certain store that sells appliances. This information is described in such a way as to be understood by people. What is understood by machines on a web page?
The browser, or other applications that run on the Internet, cannot identify what is described in these pages in the same way that people are able to understand it. The browser does not understand the meaning of the text contained in the pages. What a browser is able to understand is the semantics of HTML, which provides ways to indicate, for example, that a particular phrase is a title, through a specific HTML tag such as <h1>. Therefore, we can say that a particular phrase, "Web of Documents", is a title: "<h1>Web of Documents</h1>". What the browser understands is the relationship between the text "Web of Documents" and the <h1> tag, and from there it has a way to display this title in a given style. It makes no difference to the browser what text is contained between the <h1> and </h1> tags. It does not understand the text. It could be any text. The browser does not extract any information from the text itself.
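This point can be shown mechanically. A sketch using only Python's standard-library HTML parser: the program can tell that some text sits between <h1> and </h1>, but it attaches no meaning whatsoever to the text itself. The class and snippet are illustrative assumptions.

```python
from html.parser import HTMLParser

class H1Extractor(HTMLParser):
    """Collects the text found inside <h1> tags."""
    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        # The parser sees only the structural relationship between
        # the text and the <h1> tag; the text could be anything.
        if self.in_h1:
            self.titles.append(data)

parser = H1Extractor()
parser.feed("<h1>Web of Documents</h1><p>Any text at all.</p>")
print(parser.titles)  # ['Web of Documents']
```

The machine extracts "a title exists and contains this string" and nothing more, which is exactly the syntactic-only understanding described above.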
2.2 Programmable Web
The original web environment was basically composed of a universe of static documents kept in archives on web servers, which were requested by browsers to be displayed to users. Although a URL can simply point to a file, a web server can do more than identify a file and send it back to the client. It can run a program code and, from this computation, return content dynamically generated by the program. In this case, the HTML page returned by the server is the output of the program run according to the request identified by the URL.
Each web server runs specific software to handle HTTP requests. Usually, an HTTP server has a directory or folder, which is designated as a collection of documents or files that can be sent in response to requests from browsers. These files are identified according to the URLs connected to the requests.
Common Gateway Interface (CGI) is a standard method used to generate dynamic content on web pages and applications. It provides an interface between the web server and programs that generate dynamic web content. These programs are usually written in a script language, but can be written in any programming language.
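In the CGI spirit, a sketch of dynamic content generation: instead of sending back a static file, the server runs a program that receives the query string from the URL and builds the HTML page at request time. The function and parameter names are illustrative assumptions.

```python
from urllib.parse import parse_qs

def generate_page(query_string: str) -> str:
    """Build an HTML page dynamically from a URL query string."""
    params = parse_qs(query_string)
    name = params.get("name", ["world"])[0]
    # The HTML returned to the client is the program's output,
    # not the contents of a file stored on disk.
    return f"<html><body><h1>Hello, {name}!</h1></body></html>"

# e.g. a request for /greet?name=Tim would produce:
print(generate_page("name=Tim"))
```

A CGI program does essentially this: the web server passes the request data to it and relays the generated page back to the client.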
The Web has evolved from a simple venue for displaying pages contained in static documents to a place where different types of applications use browsers as platforms for running programs. Nowadays, it is possible to shop, perform banking procedures, send messages, and use a multitude of other applications through a browser. Various programming environments have emerged to facilitate the creation and running of these applications, such as ASP, ASP.NET, JSP (Java), PHP, Perl, Python, and Ruby on Rails.
Applications are generally structured in logical blocks called tiers, where each tier is assigned a role. The most common structure in web applications is based on three tiers: presentation, business logic, and storage. A web browser is the first tier (presentation). An engine using some kind of dynamic web content technology (such as JSP (Java) or Python) is the middle tier (business logic). A database is the third tier (storage). The browser sends requests to the middle tier, which runs services making queries and updates in the database, and then generates a response in the form of a user interface.
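The three-tier structure can be sketched as follows. An in-memory dictionary stands in for the storage tier, and a function plays the middle tier: it queries the "database" and generates the interface that the browser (first tier) would display. All product names, identifiers, and values are illustrative assumptions.

```python
# Storage tier: a real application would use a database here.
PRODUCTS = {
    "tv-42": {"name": "42-inch TV", "price": 1899.90},
    "fridge": {"name": "Frost-free fridge", "price": 2499.00},
}

def product_page(product_id: str) -> str:
    """Middle tier: query the storage tier, generate the presentation."""
    product = PRODUCTS.get(product_id)
    if product is None:
        return "<h1>Product not found</h1>"
    # The response is a user interface for the browser to display.
    return f"<h1>{product['name']}</h1><p>R$ {product['price']:.2f}</p>"

print(product_page("tv-42"))
```

Separating the tiers this way lets each one change independently: the storage can move to a different database, or the presentation can be restyled, without rewriting the business logic.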
The spread of web applications created the need for communication between applications in order to enable the exchange of data and services, which gave rise to the idea of web services and web APIs, as implementations of the concept of software components in the web environment. Web services provide a standard format for interoperation between different software applications. A web service is a software system designed to support interoperable machine-to-machine interaction over a network. It has an interface described in a machine-processable format (WSDL). Other systems interact with the web service using messages described in its interface, through a specific protocol (SOAP). Web APIs have a less restrictive definition regarding the formatting of data exchanged between applications, and typically use the HTTP protocol directly, following the REST architectural style.
The general idea behind these two technologies is to enable applications to provide services to be consumed by other applications, which results in another layer within the web environment. In this layer, data traffic is being exchanged by applications, which can be manipulated, combined and transformed, depending on the tasks offered by each web application, and then presented to users.
In order to illustrate the idea of components, let us consider, for example, a store that sells over the Internet and needs to give users the shipping costs for their products. In general, shipping is done by an outsourced company. So that the online store can provide this cost, the website's application can obtain this data through a service (web service or web API) provided by the company that will do the shipping and display the cost in the interface presented to users. This data traffic between the online store's application and the shipping company's web service is invisible to the buyers of the products. The buyers only notice the communication between themselves and the online store.
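A sketch of this shipping-cost scenario: the store's application consumes a JSON response from the shipping company's web API and merges the cost into its own page. The API call is simulated with a canned response here, and all field names and values are illustrative assumptions.

```python
import json

def fetch_shipping_quote(postal_code: str, weight_kg: float) -> str:
    """Stand-in for an HTTP call to the shipping company's web API."""
    # A real implementation would request this JSON over HTTP.
    return json.dumps({
        "postal_code": postal_code,
        "weight_kg": weight_kg,
        "cost": 34.50,
        "days": 5,
    })

def render_checkout(postal_code: str, weight_kg: float) -> str:
    """The store's middle tier combines the quote into its own page."""
    quote = json.loads(fetch_shipping_quote(postal_code, weight_kg))
    # The buyer sees only the store's interface; the call to the
    # shipping company's service is invisible to them.
    return (f"Shipping to {quote['postal_code']}: "
            f"R$ {quote['cost']:.2f} in {quote['days']} days")

print(render_checkout("20031-170", 1.2))
```

Note that the store must know in advance what "cost" and "days" mean in this particular API's response, which is precisely the interoperability problem discussed next.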
One problem that arises in this web component architecture is that the meaning and format of the data do not adhere to any standard, but are specified by each service in the way it deems most convenient. Thus, an application that seeks to combine data from different components needs to know each of the definitions, and if it wants to integrate this data somehow, it must know how to interpret the various definitions in order to identify the similarities and differences between the data returned by the multiple components. For each new service to be used, its semantics and the way they are described need to be understood. The semantic interoperability of different services and their data has to be established manually. The website of The New York Times provides a set of over 10 different APIs to access its data, each with its own specification and data format. The ProgrammableWeb website contains a catalog with thousands of highly varied and heterogeneous applications, clearly illustrating the diversity in the world of the programmable web.
2.3 Web of Data
Web pages displayed by a browser contain a set of information that is consumed by individuals. They include texts, photos, videos, etc., arranged on a page, so that a person can extract meaning from this information. This information generally groups together a set of data that has a certain interrelationship and where, for some reason, it makes sense to present them on a single page or document. According to the request from a URL, the web server identifies which data will be returned to the browser.
In his presentation "Open, Linked Data for a Global Community," Tim Berners-Lee uses a bag of chips as an example of the diversity of information that the package of a product provides: nutrition facts, chemical composition, quality seals, bar code identification, ways to contact the manufacturer, etc. This information is described using specific vocabularies that require prior knowledge to be understood. For example, you need to know how to read a nutrition facts label, understand that the label refers to a specific portion, and that percentages are listed in relation to the recommended daily needs for human consumption. The package contains different data grouped into a single document. In fact, there may be a lot more information that could be provided about the product, but based on various criteria, such as available space or the degree of importance of the information, the decision was made to list only certain information. There is often a web address printed on the package to indicate where further information can be obtained.
Taking an example from the Web: on a store's page, a variety of information can be displayed, such as the store's address and a customer service phone number. All this data is displayed in a single document presented to users, using text and graphic features (color, size, and type of font, etc.) in a spatial arrangement within the page in order to communicate information. It is a communication process, one that assumes the receivers of the message are human beings. As we saw earlier, the initial model of the Web understands this network as a set of interconnected documents in a hypertext structure. When a request is sent to a web server, the server identifies which set of data will be grouped on a given page. Direct individual access to the data is not permitted.
And if, instead of considering these documents as separate blocks of data, we were to think of a web that enabled individual access to all the data grouped in these pages, then in addition to the Web of Documents we would have access to a Web of Data, where each node of the Web is no longer necessarily a document, but can be a specific piece of data or a particular resource. This would give us access to a finer layer of granularity of the Web, and different developers would be able to create applications that group this data in different ways. In the example of the bag of chips, a particular application would be able to list another set of data related to the product, according to a specific criterion deemed more useful from another perspective, for example, as food for people with hypertension.
Search engines are one of the most popular applications of the Web. Google has become one of the largest companies on the planet because of the need to connect published data with potential consumers of such data, serving as an intermediary in the communication process. To provide this service, web pages are scanned and analyzed by robots so that Google can assemble a database that is able to respond as precisely as possible to search requests. Search engines have a limited understanding of what is being discussed on these web pages. Since each page contains different data, it is necessary to use algorithms to extract this data from information formatted for human beings. How do you identify all this data without a specific indication that can be understood by machines? How do you format information that might be useful so that machines will understand the meaning of information contained on a web page? How do you create new ways to distribute such data?
The following sections explain the concepts of semantics and metadata that are used to achieve the idea of a Web of Data understood by both humans and machines.
In this guide, we want to shed light on the idea of adding the definition of a semantic layer to the initial model of the Web of Documents. As we saw earlier, the initial idea of the Web was to serve as a way to navigate among documents arranged in a hypertext structure. These documents are displayed to users by applications that interpret HTML language. The content of the pages is seen by the machines in a purely syntactic form. The information itself is interpreted by the people who view the pages. What are the semantics of the information displayed? What is the meaning of this information?
In linguistics, semantics is the study of meaning used to understand human expression through language. We are able to understand the meaning of this sentence by understanding the meaning of each word and the relationship between these words within the sentence. We also understand punctuation marks, such as commas, periods, etc. and their function in the text. In addition, we can understand images and color-coding and a variety of coded signs in different ways. A piece of information can, for example, be displayed in large red letters to indicate the need to pay attention to that text. It all depends on various factors, including the culture of people receiving the information. Red, for example, has a different meaning in Asian cultures.
Communication models are conceptual models used to explain the process of human communication (Figure 2.1). The first major communication model was designed in 1949 by Claude Shannon and Warren Weaver from Bell Laboratories. Communication is the process of transferring information from one party (transmitter) to another (recipient). Shannon and Weaver’s original model consisted of three main parts: transmitter, channel, and receiver. In a simple model, information (e.g., a message in natural language) is sent in some form (such as spoken language) from a transmitter/sender/encoder to a receiver/recipient/decoder through a channel. This concept of regular communication sees communication as a means of sending and receiving information.
Figures 2.2 and 2.3 illustrate how a piece of information embedded within the mental model of a person is encoded, to then be transmitted via a channel and then decoded and mapped to the mental model of another person.
In this guide we are introducing semantics for machines that access information. How can a machine interpret the contents of a web page? How can information be encoded to ensure correct mapping between intended meaning and understood meaning?
Let us take the example of price comparison websites. As soon as online stores emerged on the Web, offering products on their pages with information that included product descriptions, images, prices, etc., the idea arose of building websites that could provide users with price comparisons of the same product offered in different online stores. Price comparison websites can be considered early precursors of the Semantic Web, but the first systems developed for these websites used specific software (scrapers) to extract structured information about products from web pages. The information was extracted through the identification of syntactic patterns within the HTML pages.
HTML is a text markup language that basically defines the structure of a text, while another language (CSS) defines its style. The browser combines structure and style and displays the information to people. In order for a scraper to "understand" the meaning of a text, it must find patterns that somehow indicate a meaning. Take an example from a Brazilian shopping website, where a <type-price> pattern can be identified: a string in the form "xxx,xx" next to the characters "R$". In HTML there is no tag that explicitly indicates that this string is a price. Human beings can easily recognize that this information is a price. A slightly more sophisticated example contains the following text: "Google Chromecast HDMI Streaming: R$249,00 for R$192,72", next to an image. We can write code that looks for the word "for" next to numbers matching the <type-price> pattern and understand that this is the sale price of a product. A <type-promotion> pattern is thus defined. And so on. It is easy to see that changes in the arrangement and grouping of information may require reprogramming the scraper. Building a scraper requires extensive programming, and it is a very unstable system, since reprogramming may be needed every time an online store changes the structure of its information.
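The <type-price> and <type-promotion> patterns described above can be sketched as regular expressions. The HTML snippet below is an illustrative assumption based on the example in the text, using the Brazilian "xxx,xx" price format.

```python
import re

# Illustrative page fragment, assumed for this sketch.
html = 'Google Chromecast HDMI Streaming: <s>R$249,00</s> for R$192,72'

# <type-price>: "R$" followed by digits in the "xxx,xx" form
PRICE = re.compile(r"R\$\s*(\d{1,3}(?:\.\d{3})*,\d{2})")
# <type-promotion>: a price preceded by the word "for" is a sale price
PROMOTION = re.compile(r"for\s+R\$\s*(\d{1,3}(?:\.\d{3})*,\d{2})")

prices = PRICE.findall(html)
sale = PROMOTION.search(html)
print(prices)         # ['249,00', '192,72']
print(sale.group(1))  # '192,72'
```

The fragility is visible immediately: if the store rewrites the page as "from R$249,00 to R$192,72", the <type-promotion> expression stops matching and the scraper must be reprogrammed.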
As we saw in Section 2.3 (Web of Data), the idea is to be able to identify, in an individual and machine-readable way, each piece of data grouped on web pages. For this purpose, we need to put extra information about the data into the HTML code, and this information will be consumed by machines. This information about data is called metadata.
Metadata is data about data. It provides additional information about data to help application developers and end users better understand the meaning of the published data and its content and structure. Metadata is also used to provide information on other issues related to the dataset, such as the license, the company/organization that generated the data, data quality, provenance, how to access, update frequency of the set of information, etc. The purpose of metadata is to assist in the communication process between data publishers and consumers, so that the latter can understand all the issues related to the use of such data.
Metadata can be used to help with tasks such as the discovery and reuse of datasets, and can be attributed at different granularities, ranging from a single property of a resource (a column of a table) to a complete dataset or all the datasets of a particular company/organization. A simple example involving the ordinary use of metadata is the names of the columns of a table placed in the first line of a file in CSV format. The function of this metadata is to enable a reader of the data from this CSV file to understand the meaning of each field in each line.
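The CSV example can be made concrete: the header line is metadata that gives meaning to each field of every following line, and a reader can use it to access values by name rather than by position. The column names and data values below are illustrative assumptions.

```python
import csv
import io

# The first line is metadata: it names the columns.
raw = """city,population,area_km2
Rio de Janeiro,6748000,1200
São Paulo,12330000,1521
"""

# DictReader uses the header line to label each field of each row.
reader = csv.DictReader(io.StringIO(raw))
rows = list(reader)
print(rows[0]["population"])  # 6748000
```

Without the header line, a consumer of this file would see only unlabeled strings and would have to guess what each column means, which is exactly the ambiguity discussed next.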
Up to the present, the Web has developed more rapidly as a means of transmitting documents to people, rather than data and information that can be processed automatically. Metadata needs to be available in readable forms for both humans and machines. It is important to provide both forms of metadata, in order to reach humans and applications. In the case of machine-readable metadata, the use of reference vocabularies should be encouraged, as a way to strengthen a common semantics.
In the example of the CSV table, it is possible to see that difficulties may occur when a common reference vocabulary to describe metadata is not used. Each organization and person might use a different term for the name of the columns in a table, which can be understood within a particular company, but may have an ambiguous meaning for different people in different companies, and often within the same company. Some of the most popular reference vocabularies are presented in Section 6. For example, provenance of data could be described using PROV-O, a W3C recommendation which provides a set of classes, properties and restrictions that can be used to represent and exchange provenance information generated in different systems and in different contexts.
Metadata can be of different types. These types can be classified into different taxonomies, grouped by different criteria. For example, a specific taxonomy could define metadata according to descriptive, structural and administrative characteristics. Descriptive metadata serves to identify a dataset, structural metadata is used to understand the format in which the dataset is distributed, and administrative metadata supplies information about the version and update frequency, etc. Another taxonomy might define other types of metadata with a schema based on the tasks for which the metadata is used, such as discovery and reuse of data.
Metadata can be embedded in web pages, merged (within the HTML code) with the information to be displayed to users. This way, part of the HTML code is for human consumption and the other part for machine consumption. Section 4.5 introduces technologies used to embed metadata in web pages. In addition, metadata can be stored in catalogs that keep information about data published on the Web. Metadata can also be consumed through implementations that use technologies related to the Semantic Web and Linked Data presented in sections 4 and 5.
One of the first ways metadata was included on a web page was through the HTML <meta> tag. This tag has two attributes that enable a name and content to be defined. One of the first uses of the <meta> tag was as a form of communication between web page publishers and the search engine robots that scan those pages. These robots read the pages in order to generate an index that will serve as a basis for responses to search requests from users. One of the various kinds of information that the <meta> tag can communicate to robots is whether a page should or should not be included in search engine indexes. This allows a publisher to inform robots that it does not want the page to appear in search results:
<meta name="robots" content="noindex" />
The <meta> tag was also intensively used to provide search engines with a way to index pages according to a set of keywords defined by the publisher. For example:
<meta name="keywords" content="guide, semantic web" />
As humans who understand the English language, we can infer that there is a set of keywords associated with the page where this tag was included. However, the semantics of the <meta> tag do not define any specific interpretation for the attributes "name" and "content". This information is interpreted by applications that read pages according to semantics established through use. In the example, a <meta> tag with name="keywords" is interpreted by search engine robots as indicating that the "content" attribute holds a list of keywords to be associated with the page content. For such conventions, like "keywords" meaning a set of keywords, it is necessary to use reference vocabularies whose meaning is understood by the publisher, so that the meaning understood by the application that consumes the page will be the same as the intended meaning.
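A robot's side of this convention can be sketched with the standard-library HTML parser: it looks for a <meta> tag whose name attribute is "keywords" and, by convention rather than by any HTML-defined semantics, treats the content attribute as a comma-separated keyword list. The class name is an illustrative assumption.

```python
from html.parser import HTMLParser

class MetaKeywordsParser(HTMLParser):
    """Extracts the keyword list from <meta name="keywords" ...> tags."""
    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # The interpretation of name="keywords" is a convention
        # established by use, not part of HTML's own semantics.
        if attributes.get("name") == "keywords":
            content = attributes.get("content", "")
            self.keywords = [k.strip() for k in content.split(",")]

parser = MetaKeywordsParser()
parser.feed('<meta name="keywords" content="guide, semantic web" />')
print(parser.keywords)  # ['guide', 'semantic web']
```

Nothing in the tag itself tells the machine that these strings are keywords; the meaning lives entirely in the shared convention between publisher and consumer.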
This way of including metadata has very limited expressiveness and has been extended through the creation of standards to define metadata and by a set of reference vocabularies, which will be described in the following sections of this guide.