Wednesday, February 22, 2017

2017-02-22: Archive Now (archivenow): A Python Library to Integrate On-Demand Archives

Examples: Archive Now (archivenow) CLI
A small part of my research is to ensure that certain web pages are preserved in public web archives, to hopefully be available and retrievable whenever needed at any time in the future. Because archivists believe that "lots of copies keep stuff safe," I have created a Python library (Archive Now) to push web resources into several on-demand archives, such as the Internet Archive,,, and WebCite. If, for any reason, one archive stops serving temporarily or permanently, it is likely that copies can still be fetched from the other archives. With Archive Now, one command like:
$ archivenow --all

is sufficient for the current CNN homepage ( to be captured and preserved by all archives configured in this Python library.

Archive Now allows you to accomplish the following major tasks:
  • Push a web page into one archive
  • Push a web page into multiple archives
  • Push a web page into all configured archives
  • Add new archives
  • Remove existing archives
Install Archive Now from PyPI:
    $ pip install archivenow

To install from the source code:
    $ git clone
    $ cd archivenow
    $ pip install -r requirements.txt
    $ pip install ./

"pip", "archivenow", and "docker" may require "sudo"

Archive Now can be used through:

   1. The CLI

Usage of sub-commands in archivenow can be accessed by providing the -h or --help flag:
   $ archivenow -h
   usage: archivenow [-h] [--cc] [--cc_api_key [CC_API_KEY]] [--ia] [--is]
                     [--wc] [-v] [--all] [--server] [--host [HOST]]
                     [--port [PORT]] [URI]
   positional arguments:
     URI                   URI of a web resource
   optional arguments:
     -h, --help            show this help message and exit
     --cc                  Use The Archive
     --cc_api_key [CC_API_KEY]
                           An API KEY is required by The Archive

     --ia                  Use The Internet Archive
     --is                  Use The Archive
     --wc                  Use The WebCite Archive
     -v, --version         Report the version of archivenow
     --all                 Use all possible archives
     --server              Run archiveNow as a Web Service
     --host [HOST]         A server address
     --port [PORT]         A port number to run a Web Service

To archive the web page ( in the Internet Archive:

$ archivenow --ia

By default, the web page (e.g., will be saved in the Internet Archive if no optional arguments are provided:

$ archivenow

To save the web page ( in the Internet Archive and The Archive:

$ archivenow --ia --is

To save the web page ( in all configured web archives:

$ archivenow --all --cc_api_key $Your-Perma-CC-API-Key

Run it as a Docker container (you need to run "docker pull" first):

$ docker pull maturban/archivenow

$ docker run -it --rm maturban/archivenow -h
$ docker run -p 80:12345 -it --rm maturban/archivenow --server
$ docker run -p 80:11111 -it --rm maturban/archivenow --server --port 11111
$ docker run -it --rm maturban/archivenow --ia

   2. A Web Service

You can run archivenow as a web service. You can specify the server address and/or the port number (e.g., --host localhost --port 11111)

$ archivenow --server
  * Running on (Press CTRL+C to quit)

To save the web page ( in the Internet Archive through the web service:

$ curl -i

     HTTP/1.0 200 OK
     Content-Type: application/json
     Content-Length: 95
     Server: Werkzeug/0.11.15 Python/2.7.10
     Date: Thu, 09 Feb 2017 14:29:23 GMT

      "results": [

To save the web page ( in all configured archives through the web service:

$ curl -i

    HTTP/1.0 200 OK
    Content-Type: application/json
    Content-Length: 172
    Server: Werkzeug/0.11.15 Python/2.7.10
    Date: Thu, 09 Feb 2017 14:33:47 GMT

      "results": [
        "Error (The Archive): An API KEY is required"

You may supply the API key as follows:

$ curl -i$Your-Perma-CC-API-Key

   3. Python Usage

>>> from archivenow import archivenow

To save the web page ( in The WebCite Archive:

>>> archivenow.push("","wc")

To save the web page ( in all configured archives:

>>> archivenow.push("","all")
['','','','Error (The Archive): An API KEY is required']

To save the web page ( in The Archive:

>>> archivenow.push("","cc","cc_api_key=$Your-Perma-cc-API-KEY")

To start the server from Python, do the following. The host and port number can be passed (e.g., start(port=1111, host='localhost')):

>>> archivenow.start()

* Running on (Press CTRL+C to quit)

Configuring a new archive or removing an existing one

Adding a new archive is as simple as adding a handler file in the folder "handlers". For example, to add a new archive named "My Archive", I would create the file "" and store it in the folder "handlers". The "ma" will be the archive identifier, so to push a web page (e.g., into this archive through the Python code, I would write ">>> archivenow.push("","ma")". In the file "", the name of the class must be "MA_handler". This class must have at least one function called "push", which has one argument. It might be helpful to see how the other "*" handlers are organized.
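The handler contract described above can be sketched as a minimal, illustrative "": the class name and the push() method follow the text above, while the api_required attribute and the returned memento URI are assumptions for this sketch.

```python
# Hypothetical handler for "My Archive" ("ma" identifier).
# The class name MA_handler and the push() method are required,
# per the description above; everything else here is illustrative.

class MA_handler(object):
    def __init__(self):
        self.enabled = True        # set to False to disable this archive
        self.api_required = False  # assumption: no API key needed

    def push(self, uri_org, p_args=None):
        # A real handler would POST uri_org to the archive's save
        # endpoint and return the URI of the created memento.
        # Here we just echo a placeholder memento URI.
        return "" + uri_org
```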

Removing an archive can be done by one of the following options:
  • Remove the archive handler file from the folder "handlers"
  • Rename the archive handler file to a name that does not end with ""
  • Inside the handler file, set the variable "enabled" to "False"


The Internet Archive (IA) sets a time gap of at least two minutes between creating different copies of the 'same' resource. For example, if you send a request to the IA to capture the CNN homepage at 10:00 PM, the IA will create a new memento (let's call it M1) of the page. The IA will then return M1 for all requests to archive the CNN homepage received before 10:02 PM. The archive sets this time gap to five minutes.
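This time-gap behavior can be sketched as a small cache keyed by URI; the function and variable names below are illustrative, not the IA's actual implementation.

```python
import time

GAP_SECONDS = 120  # the IA's two-minute gap;'s would be 300

_last_capture = {}  # uri -> (timestamp, memento_uri)

def capture(uri, make_memento):
    """Return the cached memento if the previous capture of this URI
    is newer than GAP_SECONDS; otherwise create a fresh one."""
    now = time.time()
    if uri in _last_capture:
        ts, memento = _last_capture[uri]
        if now - ts < GAP_SECONDS:
            return memento  # the same memento (e.g., M1) is returned
    memento = make_memento(uri)
    _last_capture[uri] = (now, memento)
    return memento
```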

Updates and pull requests are welcome:

--Mohamed Aturban

Monday, February 13, 2017

2017-02-13: Electric WAILs and Ham

Mat Kelly recently posted Lipstick or Ham: Next Steps For WAIL in which he spoke about the past, present, and potential future for WAIL. Web Archiving Integration Layer (WAIL) is a tool that seeks to address the disparity between institutional and individual archiving tools by providing one-click configuration and utilization of both Heritrix and Wayback from a user's personal computer. I am here to speak on the realization of WAIL's future by introducing WAIL-Electron.


WAIL has been completely rewritten from a Python application into an Electron application using modern Web technologies. Electron combines the Chromium (Chrome) browser with Node.js, allowing native desktop applications to be created using only HTML, CSS, and JavaScript.

The move to Electron has brought with it many improvements, the most important of which is the ability to update and package WAIL for the three major operating systems: Linux, macOS, and Windows. Support for these operating systems is easily achieved by the packaging utility used (electron-packager), which allows one to produce a binary for a specific system. Also thanks to this move, the directory structure issue mentioned by Mat in his post has been resolved: Electron applications have their own directory structure inside the OS-specific application directory path, accessible via the Electron API. Here the packager will place the tools WAIL makes available for use.

Electric Ham

The meat of this revision is adding new functionality to WAIL beyond the tools already made available through it, namely Heritrix and Wayback. This new functionality comes in two parts. First, WAIL is now collection-centric. The previous revision, WAIL-Python, added the WARC files created through WAIL to a single archive. This archive was an ambiguous collection of sorts, where users had to create their own means of associating the WARCs with each other. Initially, this was a beneficial feature: it allowed users to archive what they saw at any given instant and replay the preserved page immediately. But updates to WAIL could not be justified if they did not build upon the existing functionality, which is why the concept of personal collection-based archiving was introduced.


WAIL now provides users with the ability to curate personalized web archive collections, akin to the subscription service Archive-It, except on their local machines. By default, WAIL comes with an initial collection and allows for the creation of additional collections.

The Collections screen displays the collections created through WAIL. This view displays the collection name along with some summary information about it.

  • Seeds: how many seeds are associated with the collection
  • Last Updated: the last time (date and time) the collection was updated
  • Size: how large the collection is on the file system

Creation of a collection is as simple as clicking the New Collection button available on the Collections (home) screen of WAIL. After doing so, a dialog will appear from which users can specify the name, title, and description for the collection. Once these fields have been filled in, WAIL will create the collection that users can access from the Collections View.

The Collection View displays the information about each seed contained in the collection

  • Seed URL: The URL
  • Added: The date time it was added to the Collection
  • Last Archived: The last time it was archived through WAIL
  • Mementos: The number of Mementos for the seed in the collection

along with a link for viewing the seed in Wayback.

Seeds can be added to a collection from either the live web or from WARC files present on the filesystem. To aid in the process of adding a seed from the live web, WAIL provides the user with the ability to "check" the seed before archiving.

The check provides summary information about the seed, including the HTTP status code and a report on the embedded resources contained in the page. This lets users choose an archive configuration before starting WAIL's archival process, which configures and launches a Heritrix crawl.

To add a seed from the filesystem all the user has to do is drag and drop the (W)ARC file into the corresponding interface for that functionality. WAIL will process the (W)ARC file and display a list of potential seeds discovered.

WAIL cannot automatically determine the seed due to the nature of (W)ARC files. Rather, WAIL uses heuristics on the contents of the (W)ARC file to determine which entries are valid candidates for the seed URL. From this display, the user chooses the correct one. WAIL will then add the seed to the collection, and it will be available for replay from the Collection View.
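WAIL's actual heuristics are not spelled out here, but the general idea can be sketched as a filter over parsed (W)ARC response records; the record fields and rules below are assumptions for illustration only.

```python
def candidate_seeds(records):
    """Sketch of a seed-candidate filter.
    records: iterable of dicts with 'url', 'status', and 'mime' keys,
    a stand-in for parsed (W)ARC response records."""
    seeds = []
    for r in records:
        if r.get("status") != 200:
            continue  # redirects and errors are not landing pages
        if not r.get("mime", "").startswith("text/html"):
            continue  # embedded resources (images, CSS, JS) are ruled out
        seeds.append(r["url"])
    return seeds
```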

Twitter Archiving

The second added functionality is the ability to monitor and archive Twitter content automatically. This was made possible thanks to the scholarship I received for preserving online news. There are two options for the Twitter archival feature implemented in WAIL. The first is monitoring a user's timeline for tweets posted after the monitoring has started, with the option of selecting only the tweets containing hashtags specified during configuration. The second, a slight variation of the first, will only archive tweets that have specific keywords in the tweet's body as specified during configuration.

What makes this unique is how WAIL preserves this content. Before this addition, WAIL utilized Heritrix as the primary preservation means. Heritrix executes HTTP GET requests to retrieve the target web page and archives the HTTP response headers and the content returned from the server. The embedded JavaScript of the web page is not executed, potentially decreasing the fidelity of the capture. This is problematic when archiving Twitter content, since the rendering of tweets is done exclusively through client-side JavaScript.

To address this, WAIL utilizes the native Chromium browser provided by Electron in conjunction with WARCreate. Modifications were made to WARCreate to integrate it with WAIL: to eliminate the need for human intervention in deciding when to generate the WARC, and to work inside of Electron. By integrating WARCreate into WAIL, the archival process for Twitter content has been simplified to loading the URL of the tweet into the browser and waiting until the browser indicates that the page has been rendered in its entirety. Then the archival process through WARCreate is initiated. Once the WARC has been generated, it is added to the collection specified by the user.

Putting on Lipstick

As mentioned in Mat's blog post, the UI for WAIL-Python needed an update not only for its maintainability but also for a cohesive user experience across supported platforms. At the time of starting this revision of WAIL, the choices available for the front-end framework as seen on Github were plentiful. It simply boiled down to choosing the one that had the "least" painful setup and deployment process with a learning curve such that any person taking over the project could be brought up to speed with minimal effort.

With this in mind, React was chosen for WAIL's UI library; it is unopinionated about other technologies which may be used alongside it and features a large, production-tested ecosystem with an active developer community. React is only a view library, which is why WAIL uses Redux and Immutable.js to complete the traditional MVC package. This React, Redux, and Immutable.js stack provides WAIL with a consistent user experience across supported platforms and a much more manageable codebase. On the tools side of making WAIL look and perform beautifully, WAIL is now using Ilya Kreymer's pywb. Pywb is used by WAIL for both replay and to aid in the heavy lifting of managing the collections.

WAIL is now available from the project's release page on GitHub. For more information about how to use WAIL, be sure to visit the wiki.

- John Berlin

Monday, January 23, 2017

2017-01-23: Finding URLs on Twitter - A simple recommendation

A prompt from Twitter indicating no search results
As part of a research experiment, I needed to find URLs embedded in tweets through Twitter's web search service. Most of the URLs were much older than 7 days, so using the Twitter search API was not an option, since the API searches only a sample of tweets published in the past 7 days; therefore, I used the web search service.
I began the experiment by pasting URLs from tweets into the search box on
Searching Twitter for a URL by pasting the URL into the search box
I noticed I was able to find some URLs embedded in tweets, but this was not always the case. Based on my observations, finding the URLs was not correlated with the age of the tweet. I discussed this observation with Ed Summers, and he recommended adding a "url:" prefix to the URL before searching; that is, instead of searching for the URL itself, search for the same URL prepended with "url:".
I observed that prepending search URLs with the "url:" prefix improved my search success rate. For example, one search URL was not found except with the "url:" prefix.
Example of a URL that was not found except with the "url:" parameter
Example of a URL that was not found with the "url:" parameter, but found without
Based on these observations, and considering that there was no apparent protocol switching, or URL canonicalization, I scaled the experiment to gain a better insight about this search behavior. I wanted to know the proportion of URLs that are:
  1. found exclusively with the "url:" prefix
  2. found exclusively without the "url:" prefix
  3. found with and without the "url:" prefix (both 1 and 2).
I issued 3,923 URL queries to Twitter and observed the following proportions:
  1. Count of URLs found exclusively with the "url:" prefix: 1,519
  2. Count of URLs found exclusively without the "url:" prefix: 129
  3. Count of URLs found with and without the "url:" prefix (both 1 and 2): 853
  4. Count of URLs not found: 1,422
My initial non-automated tests gave the false impression that the "url:" prefix was the only consistent method to find all URLs embedded in tweets, but these test results show that even though the "url:" prefix search method exhibits a higher hit rate, it is not self-sufficient.
Consequently, to find a URL "U" via Twitter web search, I recommend beginning the search with "url:U". If "U" is not found, search for U alone, because this combination promises a higher hit ratio.

Friday, January 20, 2017

2017-01-20: has been unarchivable since November 1st, 2016 has been unarchivable since 2016-11-01T15:01:31, at least by the common web archiving systems employed by the Internet Archive,, and WebCite. The last known correctly archived page in the Internet Archive's Wayback Machine is from 2016-11-01T13:15:40, with all versions since then producing some kind of error (including today's, 2017-01-20T09:16:50). This means that the most popular web archives have no record of from the time immediately before the presidential election through at least today's presidential inauguration.
Given the political controversy surrounding the election, one might conclude this is a part of some grand conspiracy equivalent to those found in the TV series The X-Files. But rest assured, this is not the case; the page was archived as is, and the reasons behind the archival failure are not as fantastical as those found in the show. As we will explain below, other archival systems have successfully archived during this period (e.g.,

To begin the explanation of this anomaly, let's consider the raw HTML of the memento on 2016-11-01T15:01:31. At first glance, the HTML appears normal, with few apparent differences (disregarding the Wayback-injected tags) from the live web when comparing the two using only the browser's view-source feature. Only by looking closely at the body tag will you notice something out of place: the body tag has several CSS classes applied to it, one of which seems oddly suspicious.

<body class="pg pg-hidden pg-homepage pg-section domestic t-light">

The class that should jump out is pg-hidden which is defined in the external style sheet page.css. Its definition seen below can be found on lines 28625-28631.
.pg-hidden { display: none }
As the definition is extremely simple, a quick fix would be to remove it. So let's remove it.
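In the browser this is a one-line edit in the developer tools; over a saved copy of the HTML, the removal can also be scripted, for example with a rough regex-based sketch like this one.

```python
import re

def remove_pg_hidden(html):
    """Drop the pg-hidden class token from any class="..." attribute."""
    def strip_token(m):
        classes = m.group(1).split()
        kept = [c for c in classes if c != "pg-hidden"]
        return 'class="%s"' % " ".join(kept)
    return re.sub(r'class="([^"]*)"', strip_token, html)
```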

What is revealed after removing the pg-hidden class is a skeleton page i.e. a template page sent by the server that relies on the client-side JavaScript to do the bulk of the rendering. A hint to confirm this can be found in the number of errors thrown when loading the archived page.

The first error occurs when JavaScript attempts to change the domain property of the document.

Uncaught DOMException: Failed to set the 'domain' property on 'Document':
'' is not a suffix of ''. at (anonymous) @ (index):8

This is commonly done to allow a page on a subdomain to load resources from another page on the superdomain (or vice versa) in order to avoid cross-origin restrictions. In the case of, it is apparent that this is done in order to communicate with their CDN (content delivery network) and several embedded iframes in the page (more on this later). To better understand this consider the following excerpt about Same-origin policy from the MDN (Mozilla Developer Network):
A page may change its own origin with some limitations. A script can set the value of document.domain to a suffix of the current domain. If it does so, the shorter domain is used for subsequent origin checks. For example, assume a script in the document at executes the following statement:
document.domain = "";
After that statement executes, the page would pass the origin check with However, by the same reasoning, could not set document.domain to
There are four other exceptions displayed in the console from three JavaScript files (brought in from the CDN)
  • cnn-header.e4a512e…-first-bundle.js
  • cnn-header-second.min.js
  • cnn-footer-lib.min.js 
that further indicate that JavaScript is loading and rendering the remaining portions of the page.

Seen below is the relevant portion of JavaScript that does not get executed after the document.domain exception.

This portion of code sets up the global CNN object with the necessary information on how to load the assets for each section (zone) of the page and the manner by which to load them. What was not shown is the configuration for the sections, i.e., the explicit definition of the content contained in them. This is important because these definitions are not added to the global CNN object due to the exception thrown above (at window.document.domain), which causes the execution of the remaining portion of the script tag to halt before reaching them. Shown below is another inline script, further in the document, that does a similar setup.
This tag defines how the content model (the news stories contained in the sections) is to be loaded, along with further assets to be loaded. This code block does get executed in its entirety, which is important to note because the "lazy loading" definitions seen in the previous code block are consulted here. Because the content is declared to be lazily loaded (loadAllZonesLazy), the portion of JavaScript responsible for revealing the page will not execute, since the previous code block's definitions were never added to the global CNN object. The section of code (from cnn-footer-lib.min.js) that does the reveal is seen below.

As you can see, the reveal code depends on two things: the zone configuration defined in the section of code that was not executed, and information added to the global CNN object in the cnn-header files responsible for the construction of the page. These files (along with the other cnn-*.js files) were un-minified and their assignments to the global CNN object reconstructed to make this determination. For those interested, the results of this process can be viewed in this gist.

At this point, you must be wondering what changed between the time when the CNN archives could be viewed via the Wayback Machine and now. These changes can be summarized by considering the relevant code sections from the last correctly archived memento on 2016-11-01T13:15:40, seen below.

When considering the non-whiteout archives, CNN did not require all zones to be lazily loaded, and intelligent loading was not enabled. From this, we can assume they did not wait for the more dynamic sections of the page to begin loading, or to finish loading, before showing the page.

As you can see in the above image of the memento on 2016-11-01T13:15:40, the headline of the page and the first image from the top stories section of the page are visible. The remaining sections of the page are missing, as they are the lazily loaded content. Now compare this to the first incorrectly archived memento on 2016-11-01T15:01:31. The headline and the first image from the top stories are part of the sections lazily loaded (loadAllZonesLazy); thus, they contain dynamic content. This is confirmed when the pg-hidden CSS class is removed from the body tag to reveal that only the footer of the page is rendered, without any of the contents.

Even today the archival failure is happening, as seen in the memento on 2017-01-20T16:00:45 below.

In short, the archival failure is caused by changes CNN made to their CDN; these changes are reflected in the JavaScript used to render the homepage. The Internet Archive is not the only archive experiencing the failure: and WebCitation are also affected. Viewing a capture from on 2016-11-29T23:09:40, the preserved page once again appears to be an about:blank page.
Removing the pg-hidden definition reveals that only the footer is visible, which is the same result as the memento from the Internet Archive on 2016-11-01T15:01:31.
But unlike the Internet Archive's capture, the capture contains only the body of CNN's homepage with the CSS styles inlined (style="...") on each tag. This happens because does not preserve any of the JavaScript associated with the page and performs this transformation in order to archive it. This means that's JavaScript will never be executed when the capture is replayed; thus, the dynamic contents will not be displayed.
WebCitation, on the other hand, does preserve some of the page's JavaScript, but this is not immediately apparent due to how pages are replayed. When viewing a capture from WebCitation on 2016-11-13T33:51:09, the page appears to be rendered "properly", albeit without any CSS styling.
This happens because WebCitation replays the page using PHP and a frame. The replay frame's PHP script loads the preserved page into the browser; then, any of the preserved CSS and JavaScript is served from another PHP script. However, using this process of serving the preserved contents may not work successfully as seen below.
WebCitation sent the CSS style sheets with the MIME type text/html instead of text/css, which would explain why the page looks as it does. But's JavaScript was executed, with the same errors occurring that were present when replaying the Internet Archive's capture. This begs the question: "How can we preserve if it is unarchivable, at least by the most commonly used means?"
The solution is not as simple as one may hope, but a preliminary (albeit band-aid) solution would be to archive the page using tools such as WARCreate, Webrecorder, or These tools are effective since they preserve a fully rendered page along with all network requests made when rendering the page. This ensures that the JavaScript-requested content and rendered sections of the page are replayable. Replaying the page without the effects of the offending document.domain line is possible but requires the page to be replayed in an iframe. This method of replay is employed by Ilya Kreymer's pywb (a Python implementation of the Wayback Machine) and is used by Webrecorder and
This is a fairly old hack used to avoid cross-origin restrictions. The guest page, brought in through the iframe, is free to set document.domain, thus allowing the offending line of code to execute without issue. A more detailed explanation can be found in this blog post, but the proof is in the pudding: preservation and replay. I have created an example collection through Webrecorder that contains two captures of
The first is named "Using WARCreate", which used WARCreate for preservation on 2017-01-18T22:59:43, and the second is named "Using Webrecorder", which used Webrecorder's recording feature as the preservation means on 2017-01-13T04:12:34.

A capture of on 2017-01-19T16:57:05, using for preservation, is also available for replay here.

All three captures are replayed using pywb, and when bringing up the console, the document.domain exception will no longer be seen.
The CNN archival failure highlights some of the issues faced when preserving online news and was a topic addressed at Dodging The Memory Hole 2016. The whiteout, a function of the page itself and not of the archives, raises two questions: "Is using web browsers for archiving the only viable option?" and "How much modification of the page is required in order to make replay feasible?"

- John Berlin

Sunday, January 15, 2017

2017-01-15: Summary of "Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linked Data"

Example: original URI vs. trusty URI
Based on the paper:

Kuhn, T., Dumontier, M.: Trusty URIs: Verifiable, immutable, and permanent digital artifacts for linked data. Proceedings of the European Semantic Web Conference (ESWC) pp. 395–410 (2014).

A trusty URI is a URI that contains a cryptographic hash value of the content it identifies. The authors introduced this technique of using trusty URIs to make digital artifacts, especially those related to scholarly publications, immutable, verifiable, and permanent. With the assumption that a trusty URI, once created, is linked from other resources or stored by a third party, it becomes possible to detect whether the content that the trusty URI identifies has been tampered with or manipulated in transit (e.g., trusty URIs can prevent man-in-the-middle attacks). In addition, trusty URIs allow verifying the content even if it is no longer found at the original URI but can still be retrieved from other locations, such as Google's cache or web archives (e.g., the Internet Archive).

The core contribution of this paper is the ability to create trusty URIs for different kinds of content. Two modules are proposed: in module F, the hash is calculated on the byte-level file content, while in the second module, R, the hash is calculated on RDF graphs. The paper introduced an algorithm to generate the hash value on RDF graphs independent of any serialization syntax (e.g., N-Quads or TriX). Moreover, the authors investigated how trusty URIs work on structured documents (nanopublications). Nanopublications are small RDF graphs (named graphs, one of the main concepts of the Semantic Web) that describe information about scientific statements. The nanopublication, a named graph itself, consists of multiple named graphs: the "assertion" has the actual scientific statement, like "malaria is transmitted by mosquitos" in the example below; the "provenance" has information about how the statement in the "assertion" was originally derived; and the "publication information" has information like who created the nanopublication and when.

A nanopublication: basic elements from
Nanopublications may cite other nanopublications, resulting in a complex citation tree. Trusty URIs are designed not only to validate nanopublications individually but also to validate the whole citation tree. The nanopublication example shown below, which is about the statement "malaria is transmitted by mosquitos", is from the paper "The anatomy of a nanopublication" and is in TriG format:

@prefix swan: <> .
@prefix cw: <>.
@prefix swp: <>.
@prefix : <> .

:G1 = { cw:malaria cw:isTransmittedBy cw:mosquitoes }
:G2 = { :G1 swan:importedBy cw:TextExtractor,
:G1 swan:createdOn "2009-09-03"^^xsd:date,
:G1 swan:authoredBy cw:BobSmith }
:G3 = { :G2 swp:assertedBy cw:SomeOrganization }

In addition to the two modules, they are planning to define new modules for more types of content (e.g., hypertext/HTML) in the future.

The example below illustrates the general structure of trusty URIs:

The artifact code, everything after r1, is the part that makes this URI a trusty URI. The first character in this code (R) identifies the module; in the example, R indicates that this trusty URI was generated on an RDF graph. The second character (A) specifies the version of this module. The remaining characters (5..0) represent the hash value of the content. All hash values are generated by the SHA-256 algorithm. I think it would be more useful to allow users to select any preferred cryptographic hash function instead of enforcing a single one. This might result in adding more characters to the artifact code to represent the selected hash function. The InterPlanetary File System (IPFS), for example, uses Multihash as a mechanism to prefix the resulting hash value with an id that maps to a particular hash function. Similar to trusty URIs, resources in the IPFS network are addressed based on hash values calculated on the content. For instance, the first two characters "Qm" in the IPFS address "/ipfs/QmZTR5bcpQD7cFgTorqxZDYaew1Wqgfbd2ud9QqGPAkK2V" indicate that SHA-256 is the hash function used to generate the hash value "ZTR5bcpQD7cFgTorqxZDYaew1Wqgfbd2ud9QqGPAkK2V".
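As an illustration of the simpler module-F case (a hash over raw bytes), an artifact code can be sketched as a module character, a version character, and a base64url-encoded SHA-256 digest. Note this is only an approximation: the actual trusty URI specification defines its own character alphabet and encoding details.

```python
import base64
import hashlib

def artifact_code_f(content_bytes, module="F", version="A"):
    """Sketch of a module-F-style artifact code: SHA-256 over the
    raw bytes, base64url-encoded with padding stripped, prefixed
    with the module and version characters."""
    digest = hashlib.sha256(content_bytes).digest()
    b64 = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return module + version + b64
```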

Here are some differences between the approach of using trusty URIs and other related ideas as mentioned in the paper:

  • Trusty URIs can be used to identify and verify resources on the web while in systems like Git version control system, hash values are there to verify "commits" in Git repositories only. The same applies to IPFS where hashes in addresses (e.g., /ipfs/QmZTR5bcpQD7cFgTorqxZDYaew1Wqgfbd2ud9QqGPAkK2V) are used to verify files within the IPFS network only.
  • Hashes in trusty URIs can be generated on different kinds of content while in Git or ni-URI, hash values are computed based on the byte level of the content.
  • Trusty URIs support self-references (i.e., when trusty URIs are included in the content).

The same authors published a follow-up to their ESWC paper ("Making digital artifacts on the web verifiable and reliable") in which they describe in more detail how to generate trusty URIs on content of type RA (multiple RDF graphs) and RB (a single RDF graph; RB was not included in the original paper). In this more recent version, they also describe the structure of trusty URIs graphically.

While calculating the hash value on content of type F (byte-level file content) is straightforward, multiple steps are required to calculate the hash value on content of type R (RDF graphs): converting any serialization (e.g., N-Quads or TriG) into RDF triples, sorting the triples lexicographically, serializing the graph into a single string, replacing newline characters with "\n", and handling self-references and blank nodes.
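The core canonicalization steps can be sketched in Python. This is a simplified, stdlib-only illustration that assumes the graph has already been converted to N-Triples lines, and it ignores the self-reference and blank-node handling; the function name is ours, not part of the trustyuri library:

```python
import base64
import hashlib

def rdf_artifact_code(ntriple_lines):
    """Simplified module-R hashing sketch: sort the serialized triples
    lexicographically, join them into one string separated by newlines,
    and hash the result with SHA-256. The real algorithm additionally
    handles self-references and blank nodes."""
    canonical = "\n".join(sorted(ntriple_lines)) + "\n"
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    return "RA" + base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
```

The point of sorting first is that the artifact code depends only on the set of triples, not on the order in which a particular serialization happens to list them.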

To evaluate their approach, the authors used the Java implementation to create trusty URIs for 156,026 small structured data files (nanopublications) in different serialization formats (N-Quads and TriX). When these files were tested, again using the Java implementation, all were successfully verified as matching their trusty URIs. In addition, they tested modified copies of these nanopublications; the results are shown in the figure below:

Examples of using trusty URIs:

[1] Trusty URI for byte-level content

Let's say that I have published my paper on the web, and somebody links to it or saves the link somewhere. Now, if I intentionally (or not) change the content of the paper, for example by modifying some statistics, adding a chart, correcting a typo, or even replacing the PDF with something completely different (read about content drift), anyone who downloads the paper after these changes by dereferencing the original URI will not be able to tell that the original content has been tampered with. Trusty URIs may solve this problem. For testing, I used Trustyuri-python, the Python implementation, to generate the artifact code for the PDF file "tpdl-2015.pdf":

%python tpdl-2015.pdf

The file (tpdl-2015.pdf) is renamed to (tpdl-2015.FAofcNax1YMDFakhRQvGm1vTOcCqrsWLKeeICh9gqFVao.pdf), containing the artifact code (FAofcNax1YMDFakhRQvGm1vTOcCqrsWLKeeICh9gqFVao) as part of its name; in the paper, they call this file a trusty file. Finally, I published this trusty file on the web at the trusty URI ( Anyone with this trusty URI can verify the original content using the Trustyuri-python library, for example:

Correct hash: FAofcNax1YMDFakhRQvGm1vTOcCqrsWLKeeICh9gqFVao

As you can see, the output "Correct hash: FAofcNax1YMDFakhRQvGm1vTOcCqrsWLKeeICh9gqFVao" indicates that the hash value in the trusty URI is identical to the hash value of the content, which means that this resource contains the correct, intended content.

To see how the library detects changes in the original content, I replaced all occurrences of the number "61" with the number "71" in the content. Here are the commands I used to apply these changes:

%pdftk tpdl-2015.FAofcNax1YMDFakhRQvGm1vTOcCqrsWLKeeICh9gqFVao.pdf output tmp.pdf uncompress
%sed -i 's/61/71/g' tmp.pdf
%pdftk tmp.pdf output tpdl-2015.FAofcNax1YMDFakhRQvGm1vTOcCqrsWLKeeICh9gqFVao.pdf compress

The figures below show the document before and after the changes:

Before changes
After changes
The library detected that the original resource has been changed:


[2] Trusty URIs for RDF content

I downloaded this nanopublication serialized in XML from "":

This nanopublication (RDF file) can be transformed into a trusty file using:

$python nanopub1-pre.xml

The Python script "" performed multiple steps to transform this RDF file into the trusty file "nanopub1.RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg.xml". As mentioned above, these steps include generating RDF triples, sorting those triples, handling self-references, etc. The library used the second argument "", considered the original URI, to manage self-references by replacing all occurrences of "" with " " in the original XML file; you may have noticed that this replacement ends with '.' and a blank space. Once the artifact code is generated, the new trusty file is created, in which all occurrences of " " are replaced with "". The trusty file is shown below:

To verify this trusty file, we can use the following command, which results in "Correct hash" -- the content is verified to be correct. Again, to handle self-references, the Python library replaces all occurrences of "" with " " before recomputing the hash.

%python nanopub1.RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg.xml
Correct hash: RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg

Or by the following command if the trusty file is published on the web:

Correct hash: RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg
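The self-reference handling described above can be sketched as follows. This is a stdlib-only illustration, not the Trustyuri-python implementation: the placeholder value and the function name are our own assumptions, and the RDF canonicalization steps are omitted so the idea stands out:

```python
import base64
import hashlib

PLACEHOLDER = " "  # illustrative stand-in substituted while hashing

def code_with_self_reference(content: str, self_uri: str) -> str:
    """Replace every occurrence of the document's own URI with a fixed
    placeholder before hashing, so the resulting artifact code can later
    be embedded in the document without invalidating its own hash."""
    normalized = content.replace(self_uri, PLACEHOLDER)
    digest = hashlib.sha256(normalized.encode("utf-8")).digest()
    # 'R' = RDF module, 'A' = module version, then the URL-safe digest
    return "RA" + base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
```

Because both the pre-trusty document (containing the plain URI) and the published document (containing the trusty URI with the embedded code) normalize to the same string, verification recomputes exactly the same artifact code.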

What we are trying to do with trusty URIs:

We are working on a project, funded by the Andrew W. Mellon Foundation, to automatically capture and archive the scholarly record on the web. One part of this project is to come up with a mechanism through which we can verify the fixity of archived resources, to ensure that these resources have not been tampered with or corrupted. In general, we collect information about the archived resources and generate a manifest file. This file will then be pushed into multiple archives so it can be used later. Herbert Van de Sompel, from Los Alamos National Laboratory, pointed to this idea of using trusty URIs to identify and verify web resources. In this way, we have manifest files to verify archived resources, and trusty URIs to verify these manifests.


    --Mohamed Aturban

    Sunday, January 8, 2017

    2017-01-08: Review of WS-DL's 2016

    Sawood and Mat show off the InterPlanetary Wayback poster at JCDL 2016

    The Web Science and Digital Libraries Research Group had a productive 2016, with two Ph.D. and one M.S. students graduating, one large research grant awarded ($830k), 16 publications, and 15 trips to conferences, workshops, hackathons, etc.

    For student graduations, we had:
    Other student advancements:
    We had 16 publications in 2016:

    In late April, we had Herbert, Harish Shankar, and Shawn Jones visit from LANL.  Herbert has been here many times, but this was the first visit to Norfolk for Harish.  It was also on this visit that Shawn took his breadth exam.

    In addition to the fun road trip to JCDL 2016 in New Jersey (which included beers on the Cape May-Lewes Ferry!), our group traveled to:
    WS-DL at JCDL 2016 Reception in Newark, NJ
    Alex shows off his poster at JCDL 2016
    Although we did not travel to San Francisco for the 20th Anniversary of the Internet Archive, we did celebrate locally with tacos, DJ Spooky CDs, and a series of tweets & blog posts about the cultural impact and importance of web archiving.  This was in solidarity with the Internet Archive's gala which featured taco trucks and a lecture & commissioned piece by Paul Miller (aka DJ Spooky). We write plenty of papers, blog posts, etc. about technical issues and the mechanics of web archiving, but I'm especially proud of how we were able to assemble a wide array of personal stories about the social impact of web archiving.  I encourage you to take the time to go through these posts:

    We had only one popular press story about our research this year, with Tech.Co's "You Can’t Trust the Internet to Continue Existing" citing Hany SalahEldeen's 2012 TPDL paper about the rate of loss of resources shared via Twitter.

    We released several software packages and data sets in 2016:
    In April we were extremely fortunate to receive a major research award, along with Herbert Van de Sompel at LANL, from the Andrew Mellon Foundation:
    This project will address a number of areas, including: Signposting, automated assessment of web archiving quality, verification of archival integrity, and automating the archiving of non-journal scholarly output.  We will soon be releasing several research outputs as a result of this grant.

    WS-DL reviews are also available for 2015, 2014, and 2013.  We're happy to have graduated Greg, Yasmin, and Justin; and we're hoping that we can get Erika back for a PhD after her MS is completed.  I'll close with celebratory images of me (one dignified, one less so...) with Dr. AlNoamany and Dr. Brunelle; may 2017 bring similarly joyous and proud moments.