Thursday, August 25, 2016

2016-08-25: Documenting the Now Advisory Board Meeting Trip Report

On August 21-23, 2016, I attended the Advisory Board Meeting for the Documenting the Now (DocNow) project at Washington University in St. Louis.  The DocNow project, funded by the Andrew W. Mellon Foundation, "aims to collect, archive, and provide access to social media feeds chronicling historically significant events, particularly concerning social justice."   In practice, this means providing a friendly interface for interacting with trending events on Twitter (e.g., #BlackLivesMatter and affiliated hashtags).  This is significant because existing tools like twarc, a widely used Twitter archiving command-line tool created by Ed Summers (the technical lead for DocNow), are beyond the reach of non-expert users. 

The DocNow project has a strong team and a diverse advisory board, of which I am honored to be a member.  The team has been pretty active on GitHub, Slack, Twitter, etc., but those are no substitute for an extended f2f meeting.

The day began on the 22nd with a welcome and a contextualization for DocNow by the first panel (Jessica Johnson, Mark Anthony Neal, Sarah Jackson).  The sessions were recorded and will be released within the next week or two, so I won't try to completely reconstruct the discussion here, but some of the highlights I noted include: 1) archives are necessary to create the context in which to evaluate content (the example was #FreeWakaFlocka being mistaken for a Sesame Street reference), 2) real-time self-reflection / self-awareness of Twitter being both a communications channel and an archival record, and 3) a preview of the ethics involved in processing personal redaction / take down requests.  Some of the resources I noted were: Research Ethics for Students and Teachers: Social Media in the Classroom, Hijacking #myNYPD: Social Media Dissent and Networked Counterpublics, and African American celebrity dissent and a tale of two public spheres: a critical and comparative analysis of the mainstream and black press, 1949-2005.

Panel #2 featured the personal reflections of activists Reuben Riggs, Kayla Reed, Alexis Templeton, and Rasheen Aldridge, expertly moderated by Jonathan Fenderson.  I'm certainly not going to try to summarize their compelling contributions -- you really need to watch the video.  One resource I noted was the story of the Palestinian woman giving notes to Ferguson protesters about how to deal with tear gas.  I also noted that the activists' use of social media was, at least initially, not entirely focused on Twitter.  This has implications because as researchers, we tend to focus on Twitter exclusively, largely because it's the easiest to interact with.




Panel 3 (Yvonne Ng, Stacie Williams, Alexandra Dolan-Mescal, Dexter Thomas) resumed the ethics discussion from the end of Panel 1.   Yvonne worked through a set of examples about archivists / reporters including videos (e.g., from YouTube) that contain PII (see: Ethical Guidelines for Using Video in Human Rights Reporting).  The mood in the room at the time was definitely trending toward protecting / anonymizing.  I asked how to reconcile this level of editing with the guidance from Panel 2, which included (in so many words) "be sure to document everything, including the ugly".  I don't think we fully resolved this question.  Stacie covered the story of aggregating various #WhatIWasWearing tweets and getting consent from the authors.  Dexter echoed the issue of consent, drawing from his experience at the LA Times.  Alexandra even went as far as saying "it's a surveillance tool", and questioned the archiving process in general.

I was on Panel 4, along with Brooke Foucault-Welles and Deen Freelon.  I went last and was so focused on my upcoming presentation that my notes on my co-panelists are uneven.  Deen discussed some of his open source tools, and briefly mentioned the problem of disappearing tweets.  I did write down Brooke's closing three points: 1) "data storage is cheap, data usability is expensive" (with some stories of her "data wrangling"), 2) the "tradeoff between parsimony and inclusivity", which she summarized nicely as the "stegosaurus problem" -- apparently stegosauruses were relatively rare but preserved well, and 3) "diversifying data", including the context of the larger platform itself and the observation that the Twitter of 2009 is not the same as the Twitter of 2014.

 I talked about why we need multiple, independent web archives:




Panel 5 had Brian Deitz, Jarrett Drake, Natalie Baur, and Samantha Abrams, discussing documenting a community.  Samantha discussed her work as a "guerilla archivist", quasi-officially archiving #theRealUW (see her blog post "On establishing a web archiving platform").  Brian echoed some of the same points, and contrasted #ChapelHillShooting vs. #Our3Winners.  Natalie discussed creating an archive around the time the US normalized relations with Cuba, and Jarrett discussed #OccupyNassau.

The final panel of the day featured Sylvie Rollason-Cass, Ilya Kreymer, Matt Phillips, and Nicholas Taylor.  Ilya gave a demo of webrecorder.io, and I believe everyone else had slides even though I can't find them: Sylvie covered the range of services and projects from Archive-It, Matt reviewed Perma.cc and other projects at LIL, and Nicholas talked about the WASAPI project.

The second day was a half day, and wasn't recorded.  Alexandra led us in a User Story Map exercise in an effort to further flesh out user requirements.  She had four user types defined (I didn't write them down), but there was discussion about adding a fifth: the "authority" persona that would use the archive to expose and punish the participants.




We concluded the day with Dan Chudnov giving a short demo of the current tool.  I won't really go into details since it is likely to change significantly (they were adamant about it being an early discussion piece), but it is far ahead of tools like twarc for supporting guided exploration.


I think the meeting was very successful, and I'm grateful to the organizers (Desiree Jones-Smith, Bergis Jules, et al.) for including me on the Advisory Board and inviting me to St. Louis.  I'll add the video links when they're uploaded, and in the meantime you can rewind the #docnowcommunity hashtag to get a feel for the many things I missed (Samantha is keeping a list of resources shared over #docnowcommunity).

--Michael

2016-08-25: Two WS-DL Classes Offered for Fall 2016


Two Web Science & Digital Library (WS-DL) courses will be offered in Fall 2016:
Obviously there is demand for CS 418/518, but if you're considering CS 734/834 you might be interested in this student's quote from a recent exit exam:
[another course], in addition to Dr. Nelson’s Information Retrieval course are the two which I feel have prepared me most for job interviews and work in the working world of computer science.
We're not yet sure what WS-DL courses will be offered in Spring 2017, so take advantage of these offerings in the Fall.

--Michael

Monday, August 15, 2016

2016-08-15: Mementos In the Raw, Take Two


In a previous post, we discussed a way to use the existing Memento protocol combined with link headers to access unaltered (raw) archived web content. Interest in unaltered content has grown as more use cases arise for web archives.
Ilya Kreymer and David Rosenthal had previously suggested that a new dimension of content negotiation would be necessary to allow clients to access unaltered content. That idea was not originally pursued because it would have required the standardization of new HTTP headers. At the time, none of us were aware of the standard Prefer header from RFC7240. Prefer can solve this problem in an intuitive way, much like their original suggestion of content negotiation.
To recap, most web archives augment mementos when presenting them to the user, often for usability or legal purposes. The figures below show examples of these augmentations.

Figure 1: The PRONI web archive augments mementos for user experience; augmentations outlined in red

Figure 2: The UK National Archives adds additional text and a banner to differentiate their mementos from their live counterparts, because their mementos appear in Google search results
Additionally, some archives rewrite links to allow navigation within an archive. This way the end user can visit other pages within the same archive from the same time period. Smaller archives, because of the size of their collections, do not benefit as much from these rewritten links. Of course, for Memento users, these rewritten links are not really required.
In many cases, access to the original, unaltered content is needed. This is, for example, the case for some research studies that require the original HTTP response headers and the original unaltered content. Unaltered content is also needed to replay the original web content in projects like oldweb.today and the TimeTravel's Reconstruct feature.
The previously proposed solution was based on the use of two TimeGates, one to access augmented content (which is the current default) and an additional one to access unaltered content. In this post, we discuss a less complex method of acquiring raw mementos. This solution provides a standard way to request raw mementos, regardless of web archive software or configuration, and eliminates the need for archive-specific or software-specific heuristics.
The raw-ness of a memento exists in several dimensions, and the level of raw-ness that is required depends on the nature of the application:
  1. No augmented content - The memento should contain no additional HTML, JavaScript, CSS, or text added for usability or any other purpose. Its content should exist as it did on the web at the moment it was captured by the web archive.
  2. No rewritten links - The links should not be rewritten. The links within the memento content should exist as they did on the web at the moment the memento was captured by the web archive.
  3. Original headers - The original HTTP response headers should be available, expressed as X-Archive-Orig-*, like X-Archive-Orig-Content-Type: text/html. Their values should be the same as those of the corresponding headers without the X-Archive-Orig- prefix (e.g. Content-Type) at the moment of capture by the web archive.
We propose a solution that uses the Prefer HTTP request header and the Preference-Applied response header from RFC7240.
Consider a client that prefers a true, raw memento for http://www.cnn.com. Using the Prefer HTTP request header, this client can provide the following request headers when issuing an HTTP HEAD/GET to a memento.
GET /web/20160721152544/http://www.cnn.com/ HTTP/1.1
Host: web.archive.org
Prefer: original-content, original-links, original-headers
Connection: close
As we see above, the client specifies which level of raw-ness it prefers in the memento. In this case, the client prefers a memento with the following features:
  1. original-content - The client prefers that the memento returned contain the same HTML, JavaScript, CSS, and/or text that existed in the original resource at the time of capture.
  2. original-links - The client prefers that the memento returned contain the links that existed in the original resource at the time of capture.
  3. original-headers - The client prefers that the memento response uses X-Archive-Orig-* to express the values of the original HTTP response headers from the moment of capture.
The memento then responds with the headers below.
HTTP/1.1 200 OK
Server: Tengine/2.1.0
Date: Thu, 21 Jul 2016 17:34:15 GMT
Content-Type: text/html;charset=utf-8
Content-Length: 109672
Connection: keep-alive
set-cookie: wayback_server=60; Domain=archive.org; Path=/; Expires=Sat, 20-Aug-16 17:34:15 GMT;
Memento-Datetime: Thu, 21 Jul 2016 15:25:44 GMT
Content-Location: /web/20160721152544im_/http://www.cnn.com/
Vary: prefer
Preference-Applied: original-content, original-links, original-headers
Link: <http://www.cnn.com/>; rel="original", <http://web.archive.org/web/timemap/link/http://www.cnn.com/>; rel="timemap"; type="application/link-format", <http://web.archive.org/web/http://www.cnn.com/>; rel="timegate", <http://web.archive.org/web/20160721152544/http://www.cnn.com/>; rel="last memento"; datetime="Thu, 21 Jul 2016 15:25:44 GMT", <http://web.archive.org/web/20160120080735/http://www.cnn.com/>; rel="first memento"; datetime="Wed, 20 Jan 2016 08:07:35 GMT", <http://web.archive.org/web/20160721143259/http://www.cnn.com/>; rel="prev memento"; datetime="Thu, 21 Jul 2016 14:32:59 GMT"
X-Archive-Orig-x-served-by: cache-iad2120-IAD, cache-sjc3632-SJC
X-Archive-Orig-x-cache-hits: 1, 13
X-Archive-Orig-cache-control: max-age=60
X-Archive-Orig-x-xss-protection: 1; mode=block
X-Archive-Orig-content-type: text/html; charset=utf-8
X-Archive-Orig-age: 184
X-Archive-Orig-x-timer: S1469114744.153501,VS0,VE0
X-Archive-Orig-set-cookie: countryCode=US; Domain=.cnn.com
X-Archive-Orig-access-control-allow-origin: *
X-Archive-Orig-content-security-policy: default-src 'self' http://*.cnn.com:* https://*.cnn.com:* *.cnn.net:* *.turner.com:* *.ugdturner.com:* *.vgtf.net:*; script-src 'unsafe-inline' 'unsafe-eval' 'self' *; style-src 'unsafe-inline' 'self' *; frame-src 'self' *; object-src 'self' *; img-src 'self' * data: blob:; media-src 'self' *; font-src 'self' *; connect-src 'self' *;
X-Archive-Orig-accept-ranges: bytes
X-Archive-Orig-vary: Accept-Encoding
X-Archive-Orig-connection: close
X-Archive-Orig-x-servedbyhost: prd-10-60-168-38.nodes.56m.dmtio.net
X-Archive-Orig-date: Thu, 21 Jul 2016 15:25:44 GMT
X-Archive-Orig-via: 1.1 varnish
X-Archive-Orig-content-length: 109672
X-Archive-Orig-x-cache: HIT, HIT
X-Archive-Orig-fastly-debug-digest: 1e206303e0672a50569b0c0a29903ca81f3ef5033de74682ce90ec9d13686981
The response also uses the Preference-Applied header to indicate that it is providing the original-headers and the content has its original-links and original-content. It is possible, of course, for a system to satisfy only some of these preferences, and the Preference-Applied header allows the server to indicate which ones.
The Vary header also contains prefer, indicating that clients can influence the memento's response by using this header. The response can then be cached for requests that have the same options in the request headers.
Based on these preferences, the content of the response has been altered from the default. The Content-Location header informs clients of the exact URI-M that meets these preferences for this memento, in this case http://web.archive.org/web/20160721152544im_/http://www.cnn.com/.
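As a client-side illustration, here is a minimal sketch using the Python requests library (an assumption; any HTTP client will do) against an archive that supports the proposed Prefer handling:

import requests

URI_M = "http://web.archive.org/web/20160721152544/http://www.cnn.com/"

# Ask for the raw memento: no added content, no rewritten links,
# and the original response headers echoed back.
resp = requests.get(
    URI_M,
    headers={"Prefer": "original-content, original-links, original-headers"},
)

# Which preferences did the archive actually honor, and where does the
# matching memento live?
print(resp.headers.get("Preference-Applied"))
print(resp.headers.get("Content-Location"))

# The original headers, if provided, carry the X-Archive-Orig- prefix.
original_headers = {name: value for name, value in resp.headers.items()
                    if name.lower().startswith("x-archive-orig-")}

A client can fall back to the default (augmented) memento if Preference-Applied is missing or does not list the preferences it needs.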
The memento returned contains the original content and the original links, as seen in the figure below, and the original headers provided as X-Archive-Orig-* as shown in the above response.
Figure 3: A memento with original-content (no banner added) and original-links, as seen in the magnified inspector output from Firefox.

If the client issues no Prefer header in the request, then the server can still use the Preference-Applied header to indicate which preferences are met by default. Again, the Vary header indicates that clients can influence the response via the use of the Prefer request header. The Content-Location header indicates the URI-M of the memento. The response headers for such a default memento from the Internet Archive are shown below, with its original headers expressed in the form of X-Archive-Orig-* and bolded for emphasis.
HTTP/1.1 200 OK
Server: Tengine/2.1.0
Date: Thu, 21 Jul 2016 16:17:09 GMT
Content-Type: text/html;charset=utf-8
Content-Length: 127383
Connection: keep-alive
set-cookie: wayback_server=60; Domain=archive.org; Path=/; Expires=Sat, 20-Aug-16 16:17:07 GMT;
Memento-Datetime: Thu, 21 Jul 2016 15:25:44 GMT
Content-Location: /web/20160721152544/http://www.cnn.com/
Vary: prefer
Preference-Applied: original-headers
Link: <http://www.cnn.com/>; rel="original", <http://web.archive.org/web/timemap/link/http://www.cnn.com/>; rel="timemap"; type="application/link-format", <http://web.archive.org/web/http://www.cnn.com/>; rel="timegate", <http://web.archive.org/web/20160721152544/http://www.cnn.com/>; rel="last memento"; datetime="Thu, 21 Jul 2016 15:25:44 GMT", <http://web.archive.org/web/20000620180259/http://www.cnn.com/>; rel="first memento"; datetime="Tue, 20 Jun 2000 18:02:59 GMT", <http://web.archive.org/web/20160721143259/http://www.cnn.com/>; rel="prev memento"; datetime="Thu, 21 Jul 2016 14:32:59 GMT"
Set-Cookie: JSESSIONID=3652A3AF37E6AF4FB5C7DEF16CC8084E; Path=/; HttpOnly
X-Archive-Orig-x-served-by: cache-iad2120-IAD, cache-sjc3632-SJC
X-Archive-Orig-x-cache-hits: 1, 13
X-Archive-Guessed-Charset: utf-8
X-Archive-Orig-cache-control: max-age=60
X-Archive-Orig-x-xss-protection: 1; mode=block
X-Archive-Orig-content-type: text/html; charset=utf-8
X-Archive-Orig-age: 184
X-Archive-Orig-x-timer: S1469114744.153501,VS0,VE0
X-Archive-Orig-set-cookie: countryCode=US; Domain=.cnn.com
X-Archive-Orig-access-control-allow-origin: *
X-Archive-Orig-content-security-policy: default-src 'self' http://*.cnn.com:* https://*.cnn.com:* *.cnn.net:* *.turner.com:* *.ugdturner.com:* *.vgtf.net:*; script-src 'unsafe-inline' 'unsafe-eval' 'self' *; style-src 'unsafe-inline' 'self' *; frame-src 'self' *; object-src 'self' *; img-src 'self' * data: blob:; media-src 'self' *; font-src 'self' *; connect-src 'self' *;
X-Archive-Orig-accept-ranges: bytes
X-Archive-Orig-vary: Accept-Encoding
X-Archive-Orig-connection: close
X-Archive-Orig-x-servedbyhost: prd-10-60-168-38.nodes.56m.dmtio.net
X-Archive-Orig-date: Thu, 21 Jul 2016 15:25:44 GMT
X-Archive-Orig-via: 1.1 varnish
X-Archive-Orig-content-length: 109672
X-Archive-Orig-x-cache: HIT, HIT
X-Archive-Orig-fastly-debug-digest: 1e206303e0672a50569b0c0a29903ca81f3ef5033de74682ce90ec9d13686981
For this default memento, shown in the figure below, the links are rewritten and the presence of the Wayback banner indicates that additional content has been added.
Figure 4: This default memento contains added content in the form of a banner outlined in red on top as well as rewritten links, shown using Firefox's inspector and magnified on the bottom.
We are confident that it is legitimate to use the Prefer header in this way. Even though the original RFC contains examples requesting different representations using only the PATCH, PUT, and POST methods, a draft RFC for the "safe" HTTP preference mentions its use with GET in order to modify the content of the requested page. This draft RFC has already been implemented in Mozilla Firefox and Internet Explorer. It is also used in the W3C Open Annotation Protocol to indicate the extent to which a resource should include annotations in its representation.
Compared to our previously described approach, this solution is more elegant in its simplicity and intuitiveness. This approach also allows the introduction of other client preferences over time, should such a need emerge. These preferences can and should be registered in accordance with RFC7240. The client specifies which features of a memento it prefers, and the server indicates which of those preferred features its response actually satisfies.
We seek feedback on this solution, including what additional dimensions clients may prefer beyond the three we have specified.
--
Herbert Van de Sompel
- and -
Michael L. Nelson
- and -
Lyudmila Balakireva
- and -
Martin Klein
- and -
Harihar Shankar

Sunday, July 24, 2016

2016-07-24: Improve research code with static type checking

The Pain of Late Bug Detection

[The web] is big. Really big. You just won't believe how vastly, hugely, mindbogglingly big it is... [1]

When it comes to quick implementation, Python is an efficient language used by many web archiving projects. Indeed, a quick search of github for WARC and Python yields a list of 80 projects and forks. Python is also the language used for my research into the temporal coherence of existing web archive holdings.

The sheer size of the Web means lots of variation and lots of low-frequency edge cases. These variations and edge cases are naturally reflected in web archive holdings. Code used to research the Web and web archives naturally contains many, many code branches.

Python struggles under these conditions. It struggles because minor changes can easily introduce bugs that go undetected until much later, and later for Python means at run time. Indeed, the sheer number of edge cases introduces code branches that are exercised so infrequently that code rot creeps in. Of course, all research code dealing with the web should create checkpoints and be restartable as a matter of self-defense (and mine does defend itself). Still, detecting as many of these errors as possible up front, before run time, is much better than dealing with a mid-experiment crash.

[1] Douglas Adams may have actually written something a little different.

Static Typing to the Rescue

Static typing allows detection of many types of errors before code is executed. Consider the function definitions in figure 1 below. Heuristic is an abstract base class for memento selection heuristics. In my early research, memento selection heuristics required only Memento-Datetime. Subsequent work introduced selection based on both Memento-Datetime and Last-Modified. When the last_modified parameter was added, the cost functions were updated accordingly -- or so I thought. Notice that the last_modified parameter is missing from the PastPreferred cost function. Testing did not catch this oversight (see "Testing?" below). The addition of static type checking did.

class Heuristic(object):
     ...
class MinDist(Heuristic):
    def cost(self, memento_datetime, last_modified=None):

class Bracket(Heuristic):
    def cost(self, memento_datetime, last_modified):

class PastPreferred(Heuristic):
    def cost(self, memento_datetime):

Figure 1. Original Code
Static type checking is available for Python through the use of type hinting. Type hinting is specified in PEP 484 and is implemented in mypy. Type hints do not change Python execution; they simply allow mypy to programmatically check expectations set by the programmer. Figure 2 shows the heuristics code with type hints added. Note the addition of the cost function to the Heuristic class. Although not implemented, it allows the type checker to ensure that all cost functions conform to expectations. (This is the addition that led to finding the PastPreferred.cost bug.)

from datetime import datetime
from typing import Optional, Tuple

class Heuristic(object):
    def cost(self, memento_datetime: datetime,
             last_modified: Optional[datetime]) \
             -> Tuple[int,datetime]:
        raise NotImplementedError

class MinDist(Heuristic):
    def cost(self, memento_datetime: datetime,
             last_modified: Optional[datetime] = None) \
             -> Tuple[int,datetime]:

class Bracket(Heuristic):
    def cost(self, memento_datetime: datetime,
             last_modified: Optional[datetime]) \
             -> Tuple[int,datetime]:

class PastPreferred(Heuristic):
    def cost(self, memento_datetime: datetime,
             last_modified: Optional[datetime] = None) \
             -> Tuple[int,datetime]:

Figure 2. Type Hinted Code
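
For reference, checking the hinted module is a single command; the file name, line number, and exact wording below are illustrative (mypy's output format varies by version), but the incompatible PastPreferred.cost override is the kind of error it reports:

$ mypy heuristics.py
heuristics.py:17: error: Signature of "cost" incompatible with supertype "Heuristic"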

Testing?

Many have argued that if code is well tested, the extra work introduced by static type checking outweighs the benefits. But what about bugs in the tests? (After all, tests are code too, and not immune from programmer error.) The code shown in Figure 1 had a complete set of tests (i.e. 100% coverage). However, when Last-Modified was added, the PastPreferred tests were not updated and continued to pass. The addition of static type checking revealed the PastPreferred test bug, three research code bugs missed by the tests, and over a dozen other test bugs. Remember, "Test coverage is of little use as a numeric statement of how good your tests are."

— Scott G. Ainsworth

Thursday, July 21, 2016

2016-07-21: Dockerizing ArchiveSpark - A Tale of Pair Hacking


"Some doctors prescribe application of sandalwood paste to remedy headache, but making the paste and applying it is no less of a headache." -- an Urdu proverb
This is the translation of a couplet from an Urdu poem that is often used as a proverb. This couplet nicely reflects my feeling when Vinay Goel from the Internet Archive was demonstrating how suitable ArchiveSpark was for our IMLS Museums data analysis during the Archives Unleashed 2.0 Datathon at the Library of Congress, Washington, DC, on June 14, 2016. ArchiveSpark allows easy data extraction, derivation, and analysis from standard web archive files (such as CDX and WARC). In the back of my head I was thinking: it seems nice, cool, and awesome to use ArchiveSpark (or Warcbase) for the task, and certainly a good idea for serious archive data analysis, but perhaps overkill for a two-day hackathon event. Installing and configuring these tools would have required us to set up a Hadoop cluster, Jupyter notebook, Spark, and a bunch of configurations for ArchiveSpark itself. After doing all that, we would have had to set up HDFS storage and import a few terabytes of archived data (CDX and WARC files) into it. It would have easily taken a whole day for someone new to these tools, leaving almost no time for the real data analysis. That is why we decided to use standard Unix text processing tools for CDX analysis.



Pair Hacking


Fast-forward to the next week: we were attending JCDL 2016 at Rutgers University, New Jersey. On June 22, during a half-hour coffee break, I asked Helge Holzmann, the developer of ArchiveSpark, to help me understand the requirements and steps involved in a basic ArchiveSpark setup on a Linux machine so that I could create a Docker image to eliminate some friction for new users. We sat down together and discussed the minimal configuration that would make the tool work on a regular file system on a single machine without the complexities of a Hadoop cluster and HDFS. Based on his instructions, I wrote a Dockerfile that can be used to build a self-contained, pre-configured, and ready-to-spin Docker image. After some tests and polish, I published the ArchiveSpark Docker image publicly. This means that running an ArchiveSpark instance is now as simple as running the following command (assuming Docker is installed on the machine):

$ docker run -p 8888:8888 ibnesayeed/archivespark

This command essentially means: run a Docker container from the ibnesayeed/archivespark image and map the internal container port 8888 to the host port 8888 (to make it accessible from outside the container). This will automatically download the image from Docker Hub if it is not in the local cache (which will be the case for the first run). Once the service is up and running (which will take a few minutes the first time, depending on the download speed, but subsequent runs will take a couple of seconds), the notebook will be accessible from a web browser at http://localhost:8888/. The default image is pre-loaded with some example files, including a CDX file, a corresponding WARC file, and a notebook file to get started with the system. To work on your own data set, please follow the instructions to mount host directories of CDX, WARC, and notebook files inside the container.
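
For illustration only, a run with host directories mounted might look like the following; the host paths and container-side mount points here are placeholders rather than the image's documented paths, so check the image's instructions for the actual locations:

$ docker run -p 8888:8888 \
    -v /path/to/my/cdx:/data/cdx \
    -v /path/to/my/warc:/data/warc \
    -v /path/to/my/notebooks:/data/notebooks \
    ibnesayeed/archivespark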

As I tweeted about this new development, I got immediate encouraging responses from different people using Linux, Mac, and Windows machines.


Under the hood


For those who are interested in knowing what is happening under the hood of this Docker image, I will walk through the Dockerfile itself to explain how it is built.

We have used the official jupyter/notebook image as the base image. This means we are starting with a Docker image that includes all the necessary libraries and binaries to run the Jupyter notebook. Next, I added my name and email address as the maintainer of the image. Then we installed a JRE using the standard apt-get command. Next, we downloaded the Spark binary (bundled with Hadoop) from a mirror and extracted it into a specific directory (this location is later used in a configuration file). Then we downloaded the ArchiveSpark kernel and extracted it into the location where the kernels are expected to reside. Next, we overwrote the configuration file of the ArchiveSpark kernel with a customized kernel.json file. This custom configuration file overwrites some placeholders of the default config file, specifies the Spark directory (where Spark was extracted), and modifies it to run in non-cluster mode on a single machine. The next three lines add sample files/folders (the example.ipynb file, the cdx folder, and the warc folder, respectively) to the container and create volumes where host files/folders can be mounted at run time to work on real data. Finally, the default command "jupyter notebook --no-browser" is added, which will run when a container instance is spun up without a custom command.


Conclusions


In conclusion, we see this dockerization of ArchiveSpark as a contribution to the web archiving community that eliminates the setup and getting-started friction from a very useful archive analysis tool. We believe that this simplification will encourage increased usage of the tool in web archive related hackathons, quick personal archive explorations, research projects, demonstrations, and classrooms. We also believe there is a need for dockerizing and simplifying other web archiving tools (such as Warcbase) to give new users a friction-free way to get started with them. Going forward, some improvements that can be made to the ArchiveSpark Docker image include (but are not limited to) running the notebook inside the container under a non-root user, adding a handful of ready-to-run sample notebook files for common tasks to the image, and making the image configurable at run time (for example, to allow local or Hadoop cluster mode and HDFS or plain file system storage) while keeping defaults that work well for simple usage.


Resources




--
Sawood Alam

Monday, July 18, 2016

2016-07-18: Tweet Visibility Dynamics in a Tweet Conversation Graph


We conducted another study in the same spirit as the first, as part of our research (funded by IMLS) to build collections for stories or events. This time we sought to understand how to extract not just a single tweet, but the conversation the tweet belongs to. We explored how the visibility of tweets in a conversation graph changes based on the tweet selected.

A need for archiving tweet conversations
Archiving tweets usually involves collecting tweets associated with a given hashtag. Even though this provides a "clean" way of collecting tweets about the event associated with the hashtag, something important is often missed - conversations. Not all tweets about a particular topic will have the given hashtag,  including portions of a threaded conversation, even if the initial tweet contained the hashtag. This is unfortunate because conversations may provide contextual information about tweets.
Consider the following tweet by @THEHermanCain, which contains #TrumpSpeechinFourWords:
Using #TrumpSpeechinFourWords, @THEHermanCain's tweet is collected. However, tweets that replied to his tweet but did not include the hashtag in the reply will be excluded from the tweet collection - conversations will be excluded from the collection:

Tweets in the conversation without #TrumpSpeechinFourWords will be excluded from tweet collection
I consider conversations an important aspect of the collective narrative. Before we can archive conversations, we need to understand their nature and structure.
Fig 1: A Hypothetical Tweet Conversation Graph consisting of 8 tweets. An arrowhead points in the direction of a reply. For example, t8 replied to t5.
It all began when we started collecting tweets about the Ebola virus. After collecting the tweets, Dr. Nelson expressed an interest in seeing not just the collected tweets, but the collected tweets in the context of the tweet conversations they belong to. For example, if through our tweet collection process we collected tweet t8 (Fig. 1), we were interested, at the very least, in discovering t5 (replied to by t8), t2 (replied to by t5), and t1 (replied to by t2). A more ambitious goal was to discover the entire graph containing t8 (Fig. 1: t1 - t8). In order to achieve this, I began by attempting to understand the nature of the tweet graph from two perspectives - the browser's view of the tweets and the Twitter API's view.
Terminology


Fig 2: Root, Parent and Child tweets.
  1. Root tweet: a tweet which is not a reply to another tweet, but may be replied to by other tweets. For example, t1 (Fig. 1).
  2. Parent tweet: a tweet with replies, called children. A parent tweet can also be a child of the tweet it replied to. For example, t2 (Fig. 1) is a parent to t4 - t6, but a child of t1.
  3. Child tweet: a tweet which is a reply to another tweet. The tweet it replied to is called its parent. For example, t8 (Fig. 1) is the child of t5.
  4. Ancestor tweets: all parent tweets which precede a given tweet in the reply chain. For example, the ancestor tweets of t8 are t1, t2 and t5.
  5. Descendant tweets: all child tweets which follow a given tweet. For example, the descendants of t2 are t4, t5, t6 and t8.
Tweet visibility dynamics in a tweet conversation graph - Twitter API's perspective:
The API provides a field called in_reply_to_status_id in a tweet's JSON. With this field, every tweet in the chain of replies that precedes a given tweet can be retrieved. However, it does not let you get tweets which are replies to the given tweet. For example, if we selected tweet t1 (a root tweet), then with the API, since t1 did not reply to another tweet (it has no parent), we would not be able to retrieve any other tweet, because we can only retrieve tweets in one direction (Fig. 3 left). If we selected tweet t2, the in_reply_to_status_id of t2 points to t1, so we can retrieve t1 (Fig. 3 right).

Fig 3: Through the API, from t1, no tweets can be retrieved, from t2, we can retrieve its parent reply tweet, t1
t5's in_reply_to_status_id points to t2, so we retrieve t2 and then t1 (Fig. 4 left). From t8, we retrieve t5, which retrieves t2, which retrieves t1 (Fig. 4 right). So with the last tweet in a tweet conversation reply chain, we can get all of the parent tweets (the parent plus the parent's ancestors).

Fig 4: Through the API, from t8 we can retrieve t5, and from t5 we can retrieve t2, and from t2 we can retrieve t1

To summarize the API's view of tweets in a conversation, given a selected tweet, we can see the parent tweets (plus parent ancestors - above), but NOT children tweets (plus children descendants - below), and NOT sibling tweets (sideways).
Tweet visibility dynamics in a tweet conversation graph - browser's perspective:
By browsing Twitter, we observed that given a selected tweet in a conversation chain, we can see the tweet it replied to (parents and parents' ancestors), as well as the tweet's replies (children and children's descendants). For example, given t8, we will be able to retrieve t5, t2, and t1, just like the API (Fig. 5).
Fig 5: From t8 we can access t5, t2 and t1


However, unlike the API, if we had t1, we would be able to retrieve t1 - t8, since t1 is the root tweet (Fig. 6).

Fig 6: From t1 we can access t2 - t8
To summarize the browser's view of tweets in a conversation, given a selected tweet, we can see the parent tweets (plus parent ancestors - above) and children tweets (plus children descendants - below), but NOT sibling tweets (sideways).
Our findings are summarized in the following slides:

Methods for extracting tweet conversations
1. Scraping: Twitter does not encourage scraping as outlined in its Terms of Service: "...NOTE: crawling the Services is permissible if done in accordance with the provisions of the robots.txt file, however, scraping the Services without the prior consent of Twitter is expressly prohibited...". Therefore, the description provided here for extracting a tweet conversation based on scraping is purely academic. Based on the visibility dynamics of a tweet from the browser's perspective, the best start position for collecting a tweet conversation is the root position. Consequently, find the root, then access the children from the root. However, if you are only interested in the conversation surrounding a single tweet, given the single tweet, from the browser, its parent (plus parent ancestors) and children (plus children descendants) are available for extraction.
  2. API Method 1: This method, which is based on the API's tweet visibility, can only get the parent (plus parent ancestors). Given a tweet, get the tweet's parent (by accessing its in_reply_to_status_id). When you get the parent, get the parent's parent (and so on) through the same method until you reach the root tweet (a minimal sketch is given after this list). 
  3. API Method 2: This method was initially described to me by Sawood Alam and later independently implemented by Ed Summers. It uses the Twitter search API. Here is Ed Summers' description:
Twitter's API doesn't allow you to get replies to a particular tweet. Strange but true. But you can use Twitter's Search API to search for tweets that are directed at a particular user, and then search through the results to see if any are replies to a given tweet. You probably are also interested in the replies to any replies as well, so the process is recursive. The big caveat here is that the search API only returns results for the last 7 days. So you'll want to run this sooner rather than later.
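Here is a minimal sketch of API Method 1; it assumes the tweepy library and an authenticated api object, and it ignores rate-limit pacing and error handling for brevity:

import tweepy

def collect_ancestors(api, tweet_id):
    """Walk up the reply chain via in_reply_to_status_id until the root tweet."""
    chain = []
    tweet = api.get_status(tweet_id)          # the selected tweet, e.g. t8
    chain.append(tweet)
    while tweet.in_reply_to_status_id is not None:
        # one additional API request per hop up the chain
        tweet = api.get_status(tweet.in_reply_to_status_id)
        chain.append(tweet)
    return chain                              # selected tweet first, root tweet last

# Example usage (credentials and the tweet ID below are hypothetical):
# auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
# auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
# api = tweepy.API(auth, wait_on_rate_limit=True)
# conversation = collect_ancestors(api, 757012345678901248)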
Informal time analysis of extracting tweets
We also considered a simple informal analysis (as opposed to asymptotic analysis based on Big-O) to estimate how long (in seconds) it might take to extract tweets by using the Twitter API vs. the browser (by responsibly scraping Twitter). This analysis only counts the number of requests issued in order to access tweets.
Informal time analysis for extracting tweets with the API:
The statuses API access point (used to get tweets by ID) imposes a rate limit of 180 requests per 15 minutes (1 request every 5 seconds). Given a tweet t(i) in a chain of tweets, the amount of time (seconds) to get the previous tweets in the conversation chain is:
5(i-1) seconds.
Informal time analysis for extracting tweets with the browser:
Consider a scraping implementation in which we retrieve tweets as follows:
  1. Load Twitter webpage for a tweet
  2. Sleep randomly based on value of δ in [1, δ], where δ > 1
  3. Scroll to load new tweet content until we reach maxScrollForSingleRequest (maxScrollForSingleRequest > 0). Exit when no new content loads.
  4. Repeat 3.
Based on the implementation described above, given a tweet t(i) with a maximum sleep time represented by a random variable δ in [1, δ] seconds, and a constant maxScrollForSingleRequest, which represents the maximum number of scrolls we make per request, the estimated amount of time to get the conversation is at most:
E[δ] + (E[δ] × maxScrollForSingleRequest) seconds; where E[δ] = (1+δ)/2
since δ ~ U{1, δ} (δ follows the discrete uniform distribution and E[δ] is its expected value).

Our findings are of consequence particularly to tweet archivists, who should understand the visibility dynamics of the tweet conversation graph.
--Nwala

Thursday, July 7, 2016

2016-07-07: Signposting the Scholarly Web

The web site for "Signposting the Scholarly Web" recently went online.  There is a ton of great content available and since it takes some time to process it all, I'll give some of the highlights here.

First, this is the culmination of ideas that have been brewing for some time (see this early 2015 short video, although some of the ideas can arguably be traced to this 2014 presentation).  Most recently, our presentation at CNI Fall 2015, our 2015 D-Lib Magazine article, and our 2016 tech report advanced the concepts.

Here's the short version: the purpose is to make a standard, machine-readable method for web robots and other clients to "follow their nose" as they encounter scholarly material on the web.  Think of it as similar (in purpose if not technique) to Facebook's Open Graph or FOAF, but for publications, slides, data sets, etc. 

Currently there are three basic functions in Signposting:
  1. Discovering rich, structured, bibliographic metadata from web pages.  For example, if my user agent is at a landing page, publication page, PDF, etc., then Signposting allows me to discover where to find the BibTeX, MARC, DC, or whatever metadata format the publisher makes available.  Lots of DC records "point to" scholarly web pages, but this defines how the pages can "point back" to their metadata.
  2. Providing bi-directional linkage between a web page and its DOI.  OK, technically it doesn't have to be a DOI, but that's the most common case.  One can dereference a DOI (e.g., http://dx.doi.org/10.1371/journal.pone.0115253) and be redirected to the URI at the publisher's site (in this case: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115253).  But there isn't a standardized, machine-readable method for discovering the DOI from the landing page, PDF, data set, etc. at the publisher's site (note: rel="canonical" serves a different purpose).  The problem is that few people actually link to DOIs; instead, they link to the final (and not stable) URL.  For example, this popular news story about cholesterol research links to the article at the publisher's site, but not the DOI.  For this purpose, we introduce rel="identifier", which allows items in a scholarly object to point back to their DOI (or PURLs, handles, ARKs, etc.). 
  3. Delineating what's part of the scholarly object and what is not.  Some links are clearly intended to be "part" of the scholarly object: the PDF, the slides, the data set, the code, etc.  Some links are useful, but not part of the scholarly object: navigational links, citation services, bookmarking services, etc.  You can think of this as a greatly simplified version of OAI-ORE (and if you're not familiar with ORE, don't worry about it; it's powerful but complex).  Knowing what is part of the scholarly object will, among other things, allow us to assess how well it has been indexed, archived, etc.
Again, there's a ton of material at the site, both in terms of modeling common patterns as well as proposed HTTP responses for different purposes.  But right now it all comes down to providing links for three simple things: 1) the metadata, 2) the DOI (or other favorite identifier), 3) items "in the object". 
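
To make point 2 concrete, here is one illustrative (not normative) HTTP response from a landing page that points back to its DOI using the proposed rel="identifier" link relation; consult the Signposting site for the exact patterns and additional relation types:

HTTP/1.1 200 OK
Content-Type: text/html
Link: <http://dx.doi.org/10.1371/journal.pone.0115253>; rel="identifier"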

Please take a look at the site, join the Signposting list and provide feedback there about the current three patterns, additional patterns, possible use cases, or anything else. 

--Michael