3. Data Ecosystem on the Web

3.1 Actors and Roles

The Web of Documents is essentially used by two types of users: publishers and consumers. The first browser created by Tim Berners-Lee not only enabled access to a document by specifying a URL, but also allowed users to edit and save information, along the lines of Wikipedia. Thus, the same person, the same actor in this ecosystem, could play two roles: data consumer and publisher. As a general rule, however, these two roles are played by different groups of people: one whose job is to publish documents and another interested in consuming them. These two roles, in turn, can be divided into more specific sets of roles.

To illustrate the diversity of roles in a data publishing environment, we will use an analogy between the Web and the world of books. In the latter, there are two basic roles: writers and readers. In order for a reader to have access to a text created by a writer, the book needs to be published. This gives rise to a number of new roles, all grouped under one major role, which we shall call publishers. Besides the actual writer, there are people responsible for ensuring that a text created by a person can be printed and distributed in order to reach readers. People are also needed to handle the publication infrastructure, promotion, and distribution, as well as designers to decide on the cover, type of paper, font, publishing formats, etc. Only then can the book arrive in bookstores to be sold to readers, the consumers of the books. Throughout this process, different actors with different skills play different roles. Among readers, or consumers, there are also two types: those interested only in consuming the book's information and those who could also be publishers. The latter would be interested in generating a new publication based on the information consumed in the book.

In the world of the Web, these two large groups also exist: data publishers and consumers. As for publishers, we can list various actors with various roles related to the publication of information, often defined by rules or procedures that are beyond the scope of the functions of the actual creator or publisher of the data. This set of roles can involve various actors within a company or a government agency, where different sectors may be responsible for defining various types of information and processes, such as: defining licenses; creating rules to define the format of the URIs (a broader concept than URLs); choosing data formats and platforms for information distribution; determining the required set of metadata and documents; and defining strategies to ensure data continuity, preservation and archiving.

In terms of data consumers, we can identify people interested in direct consumption of the data (end users) and people interested in transforming the data and adding value to it in order to publish another dataset. This second group comprises developers of applications that sit between end users and this new dataset. End users often combine different datasets manually to achieve a desired result, in a task not performed by any specific application. The number of possible data combinations is unlimited, and depending on the demand for a specific type of combination, it may be useful to develop a particular application that automates this process and facilitates the work of end users.
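
As a minimal sketch of what such automation might look like, the Python snippet below joins two small datasets on a shared key. The datasets, the field names (`city`, `population`, `hospitals`), the values and the function name `combine_by_key` are all hypothetical, chosen only for illustration.

```python
from typing import Any, Dict, Iterable, List


def combine_by_key(
    left: Iterable[Dict[str, Any]],
    right: Iterable[Dict[str, Any]],
    key: str,
) -> List[Dict[str, Any]]:
    """Merge records from two datasets that share the same value for `key`.

    Records of `left` without a match in `right` are dropped (an inner join).
    """
    right_index = {record[key]: record for record in right}
    combined = []
    for record in left:
        match = right_index.get(record[key])
        if match is not None:
            combined.append({**record, **match})
    return combined


# Made-up datasets, as if published by two different sources.
population = [
    {"city": "Springfield", "population": 1000},
    {"city": "Shelbyville", "population": 800},
]
hospitals = [
    {"city": "Springfield", "hospitals": 2},
]

result = combine_by_key(population, hospitals, key="city")
```

Here `result` holds a single combined record for Springfield; an end user doing the same work by hand would have to look up each city in both sources.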

The next section – the life cycle of data on the Web – will address the various roles and competencies needed to execute the tasks related to data publishing.

3.2 Life cycle

The data life cycle is the process of managing information, extending from the selection of the data until its archiving. This process involves the participation of various professionals with specific skills related to each of the stages. As we saw in the previous section, there are various specialized roles within the group of data publishers and consumers.

There are different models that seek to encompass all the stages of the life cycle and the relationships between them. Some of the stages taken from these models are shown below.

• Planning: The initial stage, where publishing the data is planned, which includes selecting the data to be published, establishing the teams and bodies that will participate in the process, and choosing collection options and publication platforms.

• Structuring: The stage where the structure of the data to be distributed is defined; in the case of tables, that would be which fields will be displayed and the characteristics of those fields. In the case of the Semantic Web and Linked Data, that also includes ontologies that will be used for the instances of the data (Section 5). In this stage, a new ontology will also be defined, if necessary.

• Creation and Collection: The stage of acquiring the data, which includes data to be created as well as existing data obtained from spreadsheets, databases, APIs, web services, etc.

• Refinement and Enrichment (Transformation and Integration): The stage where the data is worked on to improve its quality by filtering out possible inaccuracies, adding or removing information and making any links with data from other bases.

• Formatting: The stage where the data to be published is formatted according to the platform chosen for publication.

• Description (Metadata, Documentation): The stage for defining, creating and collecting the metadata and documents that need to be added to the data, in order to facilitate understanding of the information.

• Publication: The stage in which the data is actually posted on the Web, on the platform chosen for publication.

• Consumption: The stage where data is consumed by end users or application developers.

• Feedback: The stage where information related to use of the data is collected, from sources such as data consumers or distribution platforms.

• Archiving: The stage where the data is removed from the Web.
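
The stages above can be sketched as an ordered enumeration. This is only an illustration: it assumes, for simplicity, that the stages run strictly in the order listed, which the life cycle models themselves do not necessarily require. The names `LifeCycleStage` and `next_stage` are hypothetical.

```python
from enum import IntEnum
from typing import Optional


class LifeCycleStage(IntEnum):
    """Life cycle stages, numbered in the order they are listed in the text."""

    PLANNING = 1
    STRUCTURING = 2
    CREATION_AND_COLLECTION = 3
    REFINEMENT_AND_ENRICHMENT = 4
    FORMATTING = 5
    DESCRIPTION = 6
    PUBLICATION = 7
    CONSUMPTION = 8
    FEEDBACK = 9
    ARCHIVING = 10


def next_stage(stage: LifeCycleStage) -> Optional[LifeCycleStage]:
    """Return the stage that follows `stage`, or None once the data is archived."""
    if stage is LifeCycleStage.ARCHIVING:
        return None
    return LifeCycleStage(stage + 1)
```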

3.3 Architecture

In the early stages of the Web, the basic task requested of a web server was to obtain a static page (identified in the string of the URL) that was stored in a file and coded in HTML. As the development of applications in the Programmable Web expanded, the URL started being used to request the execution of other types of tasks, such as requests for information on the prices of a product in various online stores. This information is built dynamically by running the application on the web server, which can compile data on the prices of the product through the various means of collecting data listed above. In another example, the task requested could be the transfer of a sum of money from one bank account to another. In this case, the requested task would not be to obtain data from a page; instead, the response would indicate whether the transfer of funds was successful.
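
The three kinds of request described above (a static page, a dynamically built price list, and a funds transfer) can be sketched as a single function that interprets the URL as a task. Everything here is hypothetical: the paths `/prices` and `/transfer`, the parameter names, and the in-memory dictionaries that stand in for files, online stores and bank accounts. A real service would also require authentication and would not expose a transfer through a plain GET-style URL.

```python
from urllib.parse import parse_qs, urlparse

# Hypothetical in-memory data standing in for files, stores and accounts.
STATIC_PAGES = {"/index.html": "<html><body>Welcome</body></html>"}
STORE_PRICES = {"book": {"store-a": 30.0, "store-b": 27.5}}
ACCOUNTS = {"alice": 100.0, "bob": 20.0}


def handle_request(url: str) -> str:
    """Interpret the URL as a task and return the result of executing it."""
    parsed = urlparse(url)
    params = {name: values[0] for name, values in parse_qs(parsed.query).items()}

    # Task 1: return a stored static page as-is.
    if parsed.path in STATIC_PAGES:
        return STATIC_PAGES[parsed.path]

    # Task 2: build an HTML page dynamically from collected price data.
    if parsed.path == "/prices":
        prices = STORE_PRICES.get(params.get("product", ""), {})
        rows = "".join(
            f"<li>{store}: {price:.2f}</li>" for store, price in sorted(prices.items())
        )
        return f"<html><body><ul>{rows}</ul></body></html>"

    # Task 3: perform an action; the response only reports success or failure.
    if parsed.path == "/transfer":
        source = params.get("from", "")
        target = params.get("to", "")
        amount = float(params.get("amount", "0"))
        if source in ACCOUNTS and target in ACCOUNTS and 0 < amount <= ACCOUNTS[source]:
            ACCOUNTS[source] -= amount
            ACCOUNTS[target] += amount
            return "transfer: ok"
        return "transfer: failed"

    return "404 Not Found"
```

For example, `handle_request("http://example.org/prices?product=book")` returns a page assembled at request time, while a `/transfer` URL changes the account balances and returns only a status message.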

The Web operates on the client-server model, where one application asks another application to perform a task. Server-side access to data is always mediated by an application. Even a request for a static HTML page is handled by a server-side application that responds to the request made by the client-side application. Therefore, any data consumed by a client-side application, such as a browser, passes through an intermediate application that has access to the data and returns it to the requester.
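
To make this mediation concrete, the sketch below uses Python's standard library to start a small server-side application that serves a static HTML file, and a client that requests it. The file content, the use of a temporary directory and the function name `fetch_static_page` are assumptions made for the example; it is an illustration of the client-server exchange, not a recommended production setup.

```python
import http.client
import http.server
import tempfile
import threading
from functools import partial
from pathlib import Path

PAGE = "<html><body>Hello, Web</body></html>"


def fetch_static_page() -> str:
    """Serve one static file from a temporary directory and fetch it over HTTP."""
    with tempfile.TemporaryDirectory() as directory:
        Path(directory, "index.html").write_text(PAGE, encoding="utf-8")

        # Server side: an application that maps URL paths to files on disk.
        handler = partial(http.server.SimpleHTTPRequestHandler, directory=directory)
        server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
        thread = threading.Thread(target=server.serve_forever, daemon=True)
        thread.start()
        try:
            # Client side: an application that asks the server to perform the task.
            port = server.server_address[1]
            connection = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
            try:
                connection.request("GET", "/index.html")
                return connection.getresponse().read().decode("utf-8")
            finally:
                connection.close()
        finally:
            server.shutdown()
            server.server_close()
```

Even though the page is static, the client never touches the file directly: the bytes it receives are whatever the server-side application chooses to return.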

Among the many advances and technologies that have been added to the way the Web operates, three are worth noting in the history of data manipulation: JavaScript, XMLHttpRequest and Ajax. As already mentioned, at the beginning of the Web, all the data contained on a page displayed by a browser was returned in HTML code, resulting from the processing performed by the web server after receiving a request from a URL. No programming was performed client-side by the browser except interpretation of the HTML code to display the page to the user. Once a page had been loaded and displayed, any data processing that was needed had to be done by the server, through a new request. For example, an error in a form field was only detected after the data had been sent to the server and a new full page had been returned, often practically the same as the previous page, except for an additional message noting the errors detected.

New page requests caused a time delay due to the need to use the network for a new communication and data transfer. Once code could be executed client-side by means of a language originally called LiveScript, then renamed JavaScript, a number of manipulations could be resolved without a request to the server. Procedures such as form field validation started being executed without the need for new requests. As new HTML specifications became more sophisticated, page design also underwent a change, with richer presentation, based on the idea of dynamic pages in HTML (DHTML), where the JavaScript code embedded in web pages began to manipulate the structure of the data presented.

Even so, the programming executed client-side could only manipulate data returned by the server after a request from a URL. If more data was needed, it was necessary to request another web page, since the JavaScript code executed inside a page could not communicate with the outside world. XMLHttpRequest overcame this limitation by allowing JavaScript code contained in the web pages of a browser to access more data from the server when necessary.

Google quickly realized the potential of these new technologies, and applications like Gmail and Google Maps took advantage of them to build richer user interfaces similar to an application – a web application rather than a web page. With Gmail, the application is continuously checking with the server for new e-mails. If there are any, the page is updated without needing to load a new page. Similarly, Google Maps allows a person to inspect a map, and only the necessary parts are requested from the server.

This new technique was called Ajax (Asynchronous JavaScript and XML) (Figure 3.1) in an article by Jesse James Garrett [8], and the term was immediately adopted. The technique started being used extensively and several JavaScript toolkits emerged that made its use even easier and more intuitive.

This model for building an application by combining data can be viewed as an application that requests data by specifying URLs, where the response is no longer HTML code but a dataset that will be manipulated by the application and appropriately presented to the user, depending on the task requested by the user. This model can be replicated for building applications within the environment of data published on the Web, including data published according to the concepts of the Semantic Web.
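
The flow of this model can be sketched in a few lines. In a browser the client side would be JavaScript using XMLHttpRequest; Python is used here only to mirror the logic. The URL `/api/messages`, its `since` parameter, the message data and the names `server_endpoint` and `InboxView` are all hypothetical, and the "server" is a plain function rather than a networked service.

```python
import json
from urllib.parse import parse_qs, urlparse

# Hypothetical data held on the server side.
SERVER_MESSAGES = [
    {"id": 1, "subject": "Welcome"},
    {"id": 2, "subject": "Invoice"},
    {"id": 3, "subject": "Meeting notes"},
]


def server_endpoint(url: str) -> str:
    """Stand-in for a server: returns a JSON dataset instead of an HTML page."""
    query = parse_qs(urlparse(url).query)
    since = int(query.get("since", ["0"])[0])
    return json.dumps([m for m in SERVER_MESSAGES if m["id"] > since])


class InboxView:
    """Client-side application state that is updated without reloading the page."""

    def __init__(self, fetch):
        self.fetch = fetch  # function that takes a URL and returns the response body
        self.messages = []

    def refresh(self) -> int:
        """Request only the messages not yet held and append them to the view."""
        last_id = max((m["id"] for m in self.messages), default=0)
        new_messages = json.loads(self.fetch(f"/api/messages?since={last_id}"))
        self.messages.extend(new_messages)
        return len(new_messages)

    def render(self) -> str:
        """Present the current dataset to the user (here, as plain text)."""
        return "\n".join(f"[{m['id']}] {m['subject']}" for m in self.messages)
```

Calling `refresh()` repeatedly transfers only the data that changed; how that data is presented is decided entirely by the client-side application.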

Figure 3.1 ‒ Application model using Ajax

The Web can be seen as a set of data layers and application layers run on the basis of requests identified by URLs. A request to a server can be considered a request to perform a task, where a URL is passed along as input information to be interpreted for execution of this task and return of the results. The data resulting from a request is always delivered by an application that can manipulate data from files, databases, data extracted from web pages, or data resulting from web service calls or web API calls. In the simplest case, which gave rise to the Web, the server-side application is only a server for pages stored in a directory structure. In a more dynamic configuration, this application can execute code that accesses data in a database and returns it formatted in HTML. In a more sophisticated setting, this application can behave as a client for other applications, and return data resulting from specific manipulation of the data obtained from requests made to other applications (web services and web APIs).

From the perspective of machines, the Web of Data can be viewed as a data layer consumed by an application layer. The data generated by these applications, in turn, can be used as data sources for other applications. We will then have a new data layer (generated by applications) to be consumed by a new application layer. And so on. So, we have a universe of data and application layers that can infinitely build new layers, with each application providing a specific set of tasks. Interpretations of the data and its possible meanings are the responsibility of the applications, resulting from the relationships they make through consumption of the universe of data organized in various layers.
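
A rough sketch of this layering, under the assumption that each data source can be modeled as a function returning a list of records: each application consumes one data layer and publishes a new one, which the next application can consume in turn. The price data and the names `raw_prices`, `cheapest_offer_app` and `budget_app` are hypothetical.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
DataSource = Callable[[], List[Record]]


def raw_prices() -> List[Record]:
    """Base data layer: made-up prices as published by individual stores."""
    return [
        {"product": "book", "store": "store-a", "price": 30.0},
        {"product": "book", "store": "store-b", "price": 27.5},
        {"product": "pen", "store": "store-a", "price": 2.0},
        {"product": "pen", "store": "store-b", "price": 2.5},
    ]


def cheapest_offer_app(source: DataSource) -> DataSource:
    """Application that consumes prices and publishes the cheapest offer per product."""

    def published() -> List[Record]:
        best: Dict[str, Record] = {}
        for record in source():
            current = best.get(record["product"])
            if current is None or record["price"] < current["price"]:
                best[record["product"]] = record
        return [best[product] for product in sorted(best)]

    return published


def budget_app(source: DataSource, budget: float) -> DataSource:
    """Application that consumes another application's output as its data layer."""

    def published() -> List[Record]:
        return [record for record in source() if record["price"] <= budget]

    return published


layer_1 = cheapest_offer_app(raw_prices)  # new data layer built on the base layer
layer_2 = budget_app(layer_1, budget=20.0)  # another layer built on top of layer 1
```

Neither application knows how the layer below it was produced; each one only interprets the records it receives, which is where the meaning of the data is decided.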

From the perspective of people, applications must provide interfaces where the data resulting from the requested tasks is presented in a way that makes it easy to understand, which is the job of interface designers. For example, in the case of large datasets, features such as visual presentation of graphs and tables in various formats may facilitate such understanding. It is not within the scope of this guide to explore the various ways that data can be displayed in user interfaces.

Thus, we have an architecture where data is published in various ways and there is a set of applications that provide a set of tasks and manipulate this dataset in order to generate their results. The distribution of this data is heterogeneous and enables access to data contained in web pages, downloads of files stored in directory structures on servers, and data returned by web services and web APIs. There is no standardization in relation to the structure and semantics of the data distributed. These issues are resolved by the applications themselves. The understanding of the structure and semantics of the data is embedded in the applications.
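
The snippet below illustrates what it means for structure and semantics to be embedded in the application. Two hypothetical sources publish the same kind of information, one as CSV and one as JSON, with different field names; the knowledge that `inhabitants` and `pop` both mean "population" exists only in the application code. All field names and values are made up for the example.

```python
import csv
import io
import json
from typing import Any, Dict, List

# Two made-up sources describing the same kind of thing in different ways.
CSV_SOURCE = "municipality,inhabitants\nSpringfield,30000\n"
JSON_SOURCE = '[{"cityName": "Shelbyville", "pop": 25000}]'


def from_csv(text: str) -> List[Dict[str, Any]]:
    """The application 'knows' that municipality/inhabitants mean city/population."""
    return [
        {"city": row["municipality"], "population": int(row["inhabitants"])}
        for row in csv.DictReader(io.StringIO(text))
    ]


def from_json(text: str) -> List[Dict[str, Any]]:
    """A second, source-specific mapping for the same concepts."""
    return [
        {"city": item["cityName"], "population": int(item["pop"])}
        for item in json.loads(text)
    ]


unified = from_csv(CSV_SOURCE) + from_json(JSON_SOURCE)
```

Every new source requires another hand-written mapping of this kind, since nothing in the data itself states what each field means.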

The idea of the Semantic Web is to define a standard model for representing data, combined with a set of commonly used vocabularies, that enables the semantics of the data to be appended to the data, in order to build a more homogeneous environment in the data layers and to facilitate the work of application developers.

As we will see in the next section, the Semantic Web (Linked Data) adds two new forms of data distribution to this architecture:

• Resource Dereferencing: Data, in the form of resources represented in the Semantic Web data model, is requested directly through a URI (Section 4.1.2).

• SPARQL endpoint: An access protocol and query language that enables a set of resources to be requested through specification of a query against a database.
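
As a hedged sketch of how a client might use these two forms of access, the code below builds (but does not send) the corresponding HTTP requests with Python's standard library. The resource URI, the endpoint URL and the query are hypothetical examples under `example.org`; the `Accept` media types and the `query` URL parameter follow common Linked Data and SPARQL protocol practice, and a given server may support different formats.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical identifiers used only for illustration.
RESOURCE_URI = "http://example.org/resource/city/springfield"
ENDPOINT_URL = "http://example.org/sparql"
QUERY = (
    "SELECT ?p ?o WHERE { <http://example.org/resource/city/springfield> ?p ?o } "
    "LIMIT 10"
)


def dereference_request(resource_uri: str) -> Request:
    """Build a GET request asking for an RDF (Turtle) description of the resource."""
    return Request(resource_uri, headers={"Accept": "text/turtle"})


def sparql_request(endpoint_url: str, query: str) -> Request:
    """Build a GET request that carries the query in the 'query' URL parameter."""
    url = f"{endpoint_url}?{urlencode({'query': query})}"
    return Request(url, headers={"Accept": "application/sparql-results+json"})


deref = dereference_request(RESOURCE_URI)
sparql = sparql_request(ENDPOINT_URL, QUERY)
```

Passing either object to `urllib.request.urlopen` would send the request. In the first case the URI identifies a single resource, while in the second the query selects a set of resources from the database behind the endpoint.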

Currently, the intrinsic nature of the web environment is to have a heterogeneous structure in relation to data publishing and distribution. The most important aspect with regard to open data is that data be published, even in proprietary formats and even if more work is required on the part of developers and data consumers. What the creators of the Semantic Web desire is that these various forms converge over time in order to build a Web of Data, a worldwide database of linked information. On top of this Web of Data, applications will be able to build their set of tasks to meet the demands of users. Section 5.2 presents five stages in this path toward open data.