Merge pull request #413 from oduwsdl/issue-405
Provide better breakdown of HTML pages, add new sample data
machawk1 authored Jul 3, 2018
2 parents e7c3f29 + dae6458 commit f1459bc
Showing 3 changed files with 199 additions and 5 deletions.
19 changes: 15 additions & 4 deletions ipwb/replay.py
@@ -80,8 +80,11 @@ def showWebUI(path):
     if os.path.exists(iFileAbs):
         iFile = iFileAbs # Local file

+    (mCount, uniqueURIRs) = retrieveMemCount(iFile)
     content = content.replace(
-        'MEMCOUNT', str(retrieveMemCount(iFile)))
+        'MEMCOUNT', str(mCount))
+    content = content.replace(
+        'UNIQUE', str(uniqueURIRs))

     content = content.replace(
         'let uris = []',
@@ -836,21 +839,29 @@ def retrieveMemCount(cdxjFilePath=INDEX_FILE):
     print("Retrieving URI-Ms from {0}".format(cdxjFilePath))
     indexFileContents = getIndexFileContents(cdxjFilePath)

+    errReturn = (0, 0)
+
     if not indexFileContents:
-        return 0
+        return errReturn
     lines = indexFileContents.strip().split('\n')

     if not lines:
-        return 0
+        return errReturn
     mementoCount = 0

+    bucket = {}
     for i, l in enumerate(lines):
         validCDXJLine = ipwbConfig.isValidCDXJLine(l)
         metadataRecord = ipwbConfig.isCDXJMetadataRecord(l)
         if validCDXJLine and not metadataRecord:
             mementoCount += 1
+            surtURI = l.split()[0]
+            if surtURI not in bucket:
+                bucket[surtURI] = 1
+            else:  # Unnecessary to keep count now, maybe useful later
+                bucket[surtURI] += 1

-    return mementoCount
+    return mementoCount, len(bucket.keys())


 def objectifyCDXJData(lines, onlyURI):
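
The change to retrieveMemCount above returns a pair: the number of mementos and the number of distinct SURT keys. A minimal standalone sketch of that counting idea follows. It is not the ipwb implementation: the function name count_mementos is hypothetical, and the simple line checks are assumed stand-ins for ipwbConfig.isValidCDXJLine and ipwbConfig.isCDXJMetadataRecord, whose real rules are not shown in this diff. A set is used instead of the dict of counts, since only the number of distinct keys is needed here.

```python
def count_mementos(cdxj_text):
    """Return (memento_count, unique_urir_count) for CDXJ index text."""
    seen = set()  # distinct SURT keys encountered so far
    memento_count = 0
    for line in cdxj_text.strip().split('\n'):
        fields = line.split()
        # Assumption: metadata records start with '!' and a data line has
        # at least a SURT key followed by a timestamp.
        if len(fields) < 2 or line.startswith('!'):
            continue
        memento_count += 1
        seen.add(fields[0])
    return memento_count, len(seen)


# Made-up index lines for illustration only.
sample = """!context ["http://tools.ietf.org/html/rfc7089"]
us,memento)/ 20130202100000 {"mime_type": "text/html"}
us,memento)/ 20140114100000 {"mime_type": "text/html"}
us,someotheruri)/ 20161231110000 {"mime_type": "text/html"}
"""

print(count_mementos(sample))  # (3, 2)
print(count_mementos(""))      # (0, 0)
```

Note that str.split always returns at least one element, so in this sketch empty input is handled by the per-line check rather than by testing whether the list of lines is empty.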
183 changes: 183 additions & 0 deletions ipwb/samples/warcs/5mementosAndFroggie.warc
@@ -0,0 +1,183 @@
WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2017-02-18T10:00:00Z
WARC-Filename: ipwb-memento.warc
WARC-Record-ID: <urn:uuid:7c1adc7b-7f62-49c3-b3d3-42ee1c6345d6>
Content-Type: application/warc-fields
Content-Length: 238

software: Fabricated
ip: 127.0.0.1
hostname: localhost
format: WARC File Format 1.0
conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf
description: SampleCrawl
robots: ignore
http-header-user-agent: WARCFab/1.0



WARC/1.0
WARC-Type: response
WARC-Target-URI: http://memento.us/
WARC-Date: 2014-01-14T10:00:00Z
WARC-Payload-Digest: sha1:3KRQHQ65T23N52AOS5QLFTIMWZIOO7G5
WARC-Record-ID: <urn:uuid:ba892695-eaca-441c-b6f1-1733930df0a9>
Content-Type: application/http; msgtype=response
Content-Length: 186

HTTP/1.1 200 OK
Server: nginx
Date: Mon, 30 Jan 2017 18:39:49 GMT
Content-Type: text/html
Connection: close
Vary: Accept-Encoding

<html><body>Memento for 1/14/2014 10:00am</body></html>




WARC/1.0
WARC-Type: response
WARC-Target-URI: http://memento.us/
WARC-Date: 2014-01-15T10:15:00Z
WARC-Payload-Digest: sha1:3KRQHQ65T23N52AOS5QLFTIMWZIOO7G5
WARC-Record-ID: <urn:uuid:ba892695-eaca-441c-b6f1-1733930df0a9>
Content-Type: application/http; msgtype=response
Content-Length: 186

HTTP/1.1 200 OK
Server: nginx
Date: Mon, 30 Jan 2017 18:39:49 GMT
Content-Type: text/html
Connection: close
Vary: Accept-Encoding

<html><body>Memento for 1/15/2014 10:15am</body></html>



WARC/1.0
WARC-Type: response
WARC-Target-URI: http://memento.us/
WARC-Date: 2013-02-02T10:00:00Z
WARC-Payload-Digest: sha1:3KRQHQ65T23N52AOS5QLFTIMWZIOO7G5
WARC-Record-ID: <urn:uuid:ba892695-eaca-441c-b6f1-1733930df0a9>
Content-Type: application/http; msgtype=response
Content-Length: 186

HTTP/1.1 200 OK
Server: nginx
Date: Mon, 30 Jan 2017 18:39:49 GMT
Content-Type: text/html
Connection: close
Vary: Accept-Encoding

<html><body>Memento for 2/2/2013 10:00am</body></html>



WARC/1.0
WARC-Type: response
WARC-Target-URI: http://memento.us/
WARC-Date: 2016-12-31T11:00:00Z
WARC-Payload-Digest: sha1:3KRQHQ65T23N52AOS5QLFTIMWZIOO7G5
WARC-Record-ID: <urn:uuid:ba892695-eaca-441c-b6f1-1733930df0a9>
Content-Type: application/http; msgtype=response
Content-Length: 187

HTTP/1.1 200 OK
Server: nginx
Date: Mon, 30 Jan 2017 18:39:49 GMT
Content-Type: text/html
Connection: close
Vary: Accept-Encoding

<html><body>Memento for 12/31/2016 11:00am</body></html>



WARC/1.0
WARC-Type: response
WARC-Target-URI: http://memento.us/
WARC-Date: 2016-12-31T11:00:01Z
WARC-Payload-Digest: sha1:3KRQHQ65T23N52AOS5QLFTIMWZIOO7G5
WARC-Record-ID: <urn:uuid:ba892695-eaca-441c-b6f1-1733930df0a9>
Content-Type: application/http; msgtype=response
Content-Length: 187

HTTP/1.1 200 OK
Server: nginx
Date: Mon, 30 Jan 2017 18:39:49 GMT
Content-Type: text/html
Connection: close
Vary: Accept-Encoding

<html><body>Memento for 12/31/2016 11:01am</body></html>




WARC/1.0
WARC-Type: response
WARC-Target-URI: http://someotherURI.us/
WARC-Date: 2016-12-31T11:00:00Z
WARC-Payload-Digest: sha1:3KRQHQ65T23N52AOS5QLFTIMWZIOO7G5
WARC-Record-ID: <urn:uuid:ba892695-eaca-441c-b6f1-1733930df0a9>
Content-Type: application/http; msgtype=response
Content-Length: 170

HTTP/1.1 200 OK
Server: nginx
Date: Mon, 30 Jan 2017 18:39:49 GMT
Content-Type: text/html
Connection: close
Vary: Accept-Encoding

<html><body>SomeotherURI</body></html>





WARC/1.0
WARC-Type: response
WARC-Target-URI: http://anothersite.us/
WARC-Date: 2016-12-31T11:00:00Z
WARC-Payload-Digest: sha1:3KRQHQ65T23N52AOS5QLFTIMWZIOO7G5
WARC-Record-ID: <urn:uuid:ba892695-eaca-441c-b6f1-1733930df0a9>
Content-Type: application/http; msgtype=response
Content-Length: 170

HTTP/1.1 200 OK
Server: nginx
Date: Mon, 30 Jan 2017 18:39:49 GMT
Content-Type: text/html
Connection: close
Vary: Accept-Encoding

<html><body>Another site</body></html>





WARC/1.0
WARC-Type: response
WARC-Record-ID: <urn:uuid:24E90DE8-E7E4-4640-9177-3611F2063389>
WARC-Warcinfo-ID: <urn:uuid:B5CFB241-5F99-468D-8EA8-6639E9A93CC3>
WARC-Target-URI: http://whensAPNGNotAPing.net
WARC-Date: 2017-03-01T19:26:39Z
WARC-Block-Digest: sha1:SZPTTOGV3LYYR6H7OMA7QC6YKZACNQSY
Content-Type: image/png
Content-Length: 154

HTTP/1.1 200 OK
Server: nginx
Date: Mon, 30 Jan 2017 18:39:49 GMT
Content-Type: image/png
Connection: close
Vary: Accept-Encoding

Ceci n'est pas une PNG.
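
The sample file above holds several response records for one URI plus a few records for other URIs, which is the case the new counts are meant to distinguish. As a rough illustration, the sketch below tallies WARC-Target-URI header values from WARC text. The function name target_uris and the shortened sample are assumptions for this example. It scans every line, so a payload line that looks like the header would also be counted; a real reader would honor Content-Length to separate headers from payload.

```python
def target_uris(warc_text):
    """Collect the value of every WARC-Target-URI header line."""
    uris = []
    for line in warc_text.splitlines():
        # Split on the first colon only, so the URI's own colons survive.
        name, sep, value = line.partition(':')
        if sep and name.strip().lower() == 'warc-target-uri':
            uris.append(value.strip())
    return uris


# Shortened, made-up records in the same shape as the sample file.
warc_sample = """WARC/1.0
WARC-Type: response
WARC-Target-URI: http://memento.us/
WARC-Date: 2014-01-14T10:00:00Z

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://memento.us/
WARC-Date: 2014-01-15T10:15:00Z

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://someotherURI.us/
WARC-Date: 2016-12-31T11:00:00Z
"""

uris = target_uris(warc_sample)
print(len(uris), len(set(uris)))  # 3 2
```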
2 changes: 1 addition & 1 deletion ipwb/webui/index.html
@@ -38,7 +38,7 @@ <h1><img src="./webui/logo.png" alt="ipwb" /></h1>
 </footer>
 <div id="uris" class="hidden">
 <h3 id="urisHeader"><abbr title="Uniform Resource Identifiers">URIs</abbr> locally available</h3>
-<h4 id="htmlCountHeader"><span id="htmlPages">0</span> HTML page<span id="htmlPagesPlurality">s</span> listed <button id="showEmbeddedURI">Show all</button></h4>
+<h4 id="htmlCountHeader">MEMCOUNT mementos of UNIQUE resources with <span id="htmlPages">0</span> HTML page<span id="htmlPagesPlurality">s</span> listed <button id="showEmbeddedURI">Show all</button></h4>
 <ul id="uriList"></ul>
 </div>
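
The MEMCOUNT and UNIQUE tokens in this header are the ones showWebUI fills in with str.replace in the replay.py change. A minimal sketch of that substitution, assuming a hypothetical helper named fill_placeholders and a shortened template string:

```python
def fill_placeholders(template, memento_count, unique_count):
    """Substitute the two count tokens into the page template."""
    # str.replace swaps every occurrence, so each token should appear
    # only where a count is wanted.
    return (template
            .replace('MEMCOUNT', str(memento_count))
            .replace('UNIQUE', str(unique_count)))


template = '<h4 id="htmlCountHeader">MEMCOUNT mementos of UNIQUE resources</h4>'
print(fill_placeholders(template, 8, 4))
# <h4 id="htmlCountHeader">8 mementos of 4 resources</h4>
```

Plain token replacement keeps the template readable, at the cost that any other page text containing the same uppercase token would be rewritten as well.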
