<!DOCTYPE html>
<html lang="en">
<head>
<title>COLD (Controlled Object List and Datum (Concept))</title>
<link href='https://fonts.googleapis.com/css?family=Open+Sans' rel='stylesheet' type='text/css'>
<link rel="stylesheet" href="https://caltechlibrary.github.io/css/site.css">
</head>
<body>
<header>
<a href="http://library.caltech.edu" title="link to Caltech Library Homepage"><img src="https://caltechlibrary.github.io/assets/liblogo.gif" alt="Caltech Library logo"></a>
</header>
<nav>
<ul>
<li><a href="/">Home</a></li>
<li><a href="index.html">README</a></li>
<li><a href="LICENSE">LICENSE</a></li>
<li><a href="INSTALL.html">INSTALL</a></li>
<li><a href="user_manual.html">User Manual</a></li>
<li><a href="about.html">About</a></li>
<li><a href="https://github.com/caltechlibrary/cold">GitHub</a></li>
</ul>
</nav>
<section>
<h1 id="development-notes">Development Notes</h1>
<p>NOTE: this document describes my thinking during the development
process. It is not a description of how things actually got
implemented.</p>
<h2 id="application-layout-and-structure">Application layout and
structure</h2>
<p>The primary task of the COLD UI is to provide a means of curating our
lists of objects and vocabularies. Each list is held in a dataset
collection. Datasetd provides a JSON API for curating the
collections. TypeScript compiled via Deno provides the middleware that
ties our JSON API to our static content. The front end web server
(i.e. Apache 2) integrates with Shibboleth, which provides single
sign-on and access control.</p>
<p>I am relying on feeds.library.caltech.edu to provide the public
facing API. Data is transferred to feeds via scripts run on a schedule
or “on demand” via JavaScript contacting external systems. Deno can
compile the TypeScript code to JavaScript for browser consumption using
the “<span class="citation" data-cites="deno/emit">@deno/emit</span>”
module.</p>
<h2 id="the-go-dataset-and-models-package">The Go dataset and models
package</h2>
<p>The latest evolution of dataset includes support for restricting
collections to specific data models. A model is based on data types that
map easily from HTML 5 form elements to SQL data types. Additionally,
there are types associated with libraries and archives, such as support for
ISNI, ORCID and ROR. Dataset still supports ad hoc JSON object storage
if that is needed.</p>
<!--
The data models are enforced only via the datasetd service. Eventually model support will be enforced for the dataset cli.
Data models are expressed in YAML and are shared between dataset and the model YAML used by Newt. Both use the same models package written in Go. The models package provides a means of defining more types as well as adding renders. It is being developed in parallel with Newt, Dataset and COLD, where the latter provides a real world use case to test the approach.
-->
<p>The public API isn’t part of COLD directly. COLD is for curating
object lists but it does export those objects to
feeds.library.caltech.edu which then provides the public API. Content is
exported in JSON, YAML and CSV formats as needed by Caltech Library
systems and services.</p>
<h2 id="data-enhancement">Data enhancement</h2>
<p>The content curated in COLD can be enhanced from external sources.
This is done via scheduled tasks. Initially these tasks are going to be
run from cron. An example is importing biographical information
published in the Caltech Directory. For a subset of CaltechPEOPLE we
know their IMSS userid. Using that we can contact the public directory
website and retrieve biographical details such as their faculty role
and title, division, and educational background. We only harvest those
records that both have a directory user id and are marked for inclusion
in feeds.</p>
<p>External data sources:</p>
<ul>
<li>Caltech Directory</li>
<li>orcid.org</li>
<li>ror.org</li>
</ul>
<h2 id="reports">Reports</h2>
<p>Reports are often needed for managing library data and systems.
COLD’s focus is on managing lists and data but it can also serve as a
report request hub.</p>
<p>Many of the reports require aggregation across data sources and often
these will take too long or require too many resources to be run
directly on our application server. That suggests what should run on the
application server is a simple report request management interface.
This suggests the following requirements.</p>
<ul>
<li>A way to make a request for a report</li>
<li>A means of indicating a report status (e.g. requested or scheduled,
processing, available or problem indicator)</li>
<li>A means of notifying the requester(s) when a report is available</li>
<li>A means of purging old reports from the report status list</li>
</ul>
<p>These features can be implemented as a simple queue. The metadata
needed to manage report requests and their life cycle is as
follows.</p>
<ul>
<li>name of report</li>
<li>any additional options needed by the report program</li>
<li>email address(es) to contact when the report is ready</li>
<li>current status of the report (e.g. requested, processing, available,
problem)</li>
<li>a link to where the report can be “picked up”</li>
<li>the report’s content type, (e.g. application/json, application/yaml,
application/x-sqlite3, text/csv, text/plain, text/x-markdown)</li>
<li>the date the report was requested</li>
<li>the date the report was updated (when the status last changed)</li>
</ul>
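<p>The metadata listed above could be modeled as a single record type. The
following is a minimal sketch, not COLD’s actual schema: the field names, the
<code>newReportRequest</code> helper and the example address are all
assumptions made for illustration.</p>

```typescript
// Hypothetical record shape for one entry in the report queue.
// The field names are illustrative assumptions, not COLD's actual schema.
type ReportStatus = "requested" | "processing" | "available" | "problem";

interface ReportRequest {
  report_name: string;   // name of report
  options: string[];     // additional options needed by the report program
  emails: string[];      // address(es) to contact when the report is ready
  status: ReportStatus;  // current status of the report
  link: string;          // where the report can be picked up (empty until known)
  content_type: string;  // e.g. "text/csv"
  requested: string;     // date the report was requested, ISO 8601
  updated: string;       // date the status last changed, ISO 8601
}

// Build a new queue entry in the "requested" state.
function newReportRequest(
  reportName: string,
  contentType: string,
  emails: string[],
  now: Date,
): ReportRequest {
  const stamp = now.toISOString();
  return {
    report_name: reportName,
    options: [],
    emails: emails,
    status: "requested",
    link: "",
    content_type: contentType,
    requested: stamp,
    updated: stamp,
  };
}

// Example request; the report name and address are made up.
const example = newReportRequest(
  "people_csv",
  "text/csv",
  ["librarian@example.edu"],
  new Date(),
);
console.log(JSON.stringify(example, null, 2));
```

<p>Keeping the status as a closed set of string values means a typo in a
status name is caught when the TypeScript is compiled rather than at run
time.</p>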
<p>The user interface would consist of a simple web form to request one of a
predefined set of reports and a list of reports that are available, processing,
requested or scheduled.</p>
<p>The reports themselves can be implemented as command line programs in
a language of your choice. The report runner will be responsible for
checking the queue and updating the queue. The report would be
responsible for notification (e.g. if there is an email list then send
out an email with the report link). In principle, since our GitHub
actions are accessible via the GitHub APIs, a report could be implemented
as a GitHub action.</p>
<p>The advantage of this approach is that it avoids the problems of slow
running or resource intensive reports running directly on the
application server. COLD just manages the report queue.</p>
<p>An advantage of narrowing COLD’s reporting role to managing a report queue
is that it separates the concerns (e.g. resource management, security,
report access).</p>
<p>For the report management interface to be useful you do need a report
runner. The report runner would be responsible for checking the report
queue, updating the status of entries in the report queue and executing the
requested report.</p>
<p>NOTE: the runner doesn’t need to run on your apps server. It just
needs access to the queue.</p>
<p>A report would need to implement a few things.</p>
<ul>
<li>accept the metadata held in the report queue</li>
<li>store the report result</li>
<li>return a result needed by the runner to update the report queue
(success, failure and the link to the result)</li>
</ul>
<p>QUESTION: Should the report be responsible for notification or the
runner?</p>
<p>The individual reports can be implemented as a script (e.g. Bash), a
program (e.g. something in Python) or even externally (e.g. GitHub
action). The interface for the report system takes advantage of standard
input and standard output. This simplifies writing the report programs.
An example would be to process a JSON expression from standard input and
return a JSON expression via standard output to the runner along with an
exit code (i.e. zero means no problem, non-zero means there was a problem). The
report script or program would use a link to indicate where the report
could be picked up and would be responsible for placing content in a storage
location accessible via the link.</p>
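<p>A minimal sketch of that exchange, assuming the JSON shapes shown here.
<code>ReportInput</code>, <code>ReportOutput</code>,
<code>handleReport</code> and the example.org link are hypothetical. Reading
standard input and setting the process exit code depend on the runtime, so the
sketch is a function that takes the input text and returns the output text plus
the exit code.</p>

```typescript
// Hypothetical shapes for the JSON exchanged between runner and report.
interface ReportInput {
  report_name: string;
  options: string[];
}

interface ReportOutput {
  ok: boolean;
  link: string;   // where the report can be picked up
  error: string;  // details when ok is false
}

// Parse and validate the text read from standard input.
function parseInput(text: string): ReportInput | null {
  let parsed: unknown = null;
  try {
    parsed = JSON.parse(text);
  } catch (_err) {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null) {
    return null;
  }
  const candidate = parsed as { report_name?: unknown; options?: unknown };
  if (typeof candidate.report_name !== "string") {
    return null;
  }
  const options = Array.isArray(candidate.options)
    ? candidate.options.map(String)
    : [];
  return { report_name: candidate.report_name, options: options };
}

// Returns the text to write to standard output and the exit code
// (zero means no problem, non-zero means there was a problem).
function handleReport(stdinText: string): { stdout: string; exitCode: number } {
  const input = parseInput(stdinText);
  if (input === null) {
    const failed: ReportOutput = { ok: false, link: "", error: "invalid request JSON" };
    return { stdout: JSON.stringify(failed), exitCode: 1 };
  }
  // A real report would generate its content and store it here.
  const done: ReportOutput = {
    ok: true,
    link: "https://example.org/reports/" + input.report_name + ".csv",
    error: "",
  };
  return { stdout: JSON.stringify(done), exitCode: 0 };
}

console.log(handleReport('{"report_name": "people", "options": []}').stdout);
```

<p>Because the report only sees text in and text out, the same contract can be
met by a Bash script or a Python program without sharing any code with the
runner.</p>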
<p>Report status:</p>
<dl>
<dt>requested</dt>
<dd>
An entry that a request has been made and is waiting to be serviced
</dd>
<dt>processing</dt>
<dd>
The report request is being serviced but is not yet available
</dd>
<dt>available</dt>
<dd>
A report result is available and the link indicates where you can pick
it up
</dd>
<dt>problem</dt>
<dd>
The report request could not be completed and the link indicates where
the details can be found about what went wrong.
</dd>
</dl>
<p>Report identifiers:</p>
<p>There are two basic report types: those which are run on a schedule
(e.g. recent grant report from thesis or creators report from authors)
and those which are requested then run. For scheduled reports the
identifier would be the report’s unique name. For requested reports
another mechanism may be required. A good candidate for the identifier
would be UUID v5. Since the report script or program is responsible for
storing the results it would also be responsible for versioning the
stored results if needed. By separating the ID from the report instance,
it is left to the report to decide the name of the stored result while it
remains possible to map a request to the instance behind that link.</p>
<p>Reports can be of different content types. Most reports we generate
manually today are either CSV, tab delimited or Excel files. By allowing
reports to have different content types we also allow for the report to
be provided in a relevant type. E.g. a report could be generated as a
PDF or even an SQLite3 database.</p>
<h3 id="exploring-the-report-runner">Exploring the report runner</h3>
<p>COLD provides a collection called “reports.ds”. Assuming that
collection is readable on your data processing machine a runner needs to
be able to do several actions.</p>
<ol type="1">
<li>Retrieve the next report to initiate</li>
<li>Update the report status (e.g. requested -> processing)</li>
<li>Execute the shell command that implements the
report</li>
<li>Update the report status (e.g. processing -> available or
processing -> problem)</li>
</ol>
<p>The report runner repeats these four steps until there are no more
requests available. At that time it can sleep for a designated period of
time then start the loop again when requests are available.</p>
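<p>The four steps can be sketched as a loop over an in-memory queue. This is
only an illustration under stated assumptions: the real queue lives in the
“reports.ds” collection, and <code>drainQueue</code>, <code>RunResult</code>
and the injected <code>execute</code> function are hypothetical names standing
in for whatever actually runs the report command.</p>

```typescript
type ReportStatus = "requested" | "processing" | "available" | "problem";

// Hypothetical, trimmed-down queue entry.
interface ReportRequest {
  id: string;
  report_name: string;
  status: ReportStatus;
  link: string;
}

// Result handed back by whatever executes the report command.
interface RunResult {
  ok: boolean;
  link: string; // report location, or where the error details can be found
}

// Process every "requested" entry once and return the number handled.
function drainQueue(
  queue: ReportRequest[],
  execute: (req: ReportRequest) => RunResult,
): number {
  let handled = 0;
  for (;;) {
    // 1. Retrieve the next report to initiate.
    const next = queue.find((r) => r.status === "requested");
    if (next === undefined) {
      break; // no more requests; the caller can sleep and try again later
    }
    // 2. Update the status: requested -> processing.
    next.status = "processing";
    // 3. Execute the command that implements the report.
    const result = execute(next);
    // 4. Update the status: processing -> available or processing -> problem.
    next.status = result.ok ? "available" : "problem";
    next.link = result.link;
    handled += 1;
  }
  return handled;
}
```

<p>A long-running runner would call <code>drainQueue</code>, sleep for the
designated period, then call it again. Passing <code>execute</code> in as a
parameter keeps the queue logic testable without running any shell
commands.</p>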
<p>To control what is executed it is desirable to have a specific
configurable task runner available. This will prevent arbitrary commands
from running.</p>
<p>Off the shelf task runners include</p>
<ul>
<li>Make, a build system dating back to the early days of Unix</li>
<li>just, a newer, simpler command runner that is cross platform and
language agnostic</li>
</ul>
<p>The report runner would take the report request record, set the status to
processing and then pass the report name and options to the task runner.
When the task completes (either successfully or with a failure) the result
would be captured and stored in a designated storage system
(e.g. G-Drive) and the report request record would be updated
with the final status and a link to the report or error report.</p>
<h2 id="date-handling">Date Handling</h2>
<p>The differences between date formats, languages and representations can
be considerable. The TypeScript/JavaScript Date object does not render
“YYYY-MM-DD” by default; for example the <code>toDateString()</code>
instance method returns a form like “Tue Mar 05 2024”. Our databases and
most of our code base expect dates to
be formatted as “YYYY-MM-DD” so I am using two TypeScript/JavaScript
methods to achieve that. First you use <code>.toJSON()</code> to render
the date in JSON format, then you trim the result to the first 10 characters using
<code>.substring(0,10)</code>.</p>
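<p>A minimal sketch of that two-step conversion. The helper name
<code>toISODate</code> is hypothetical, not taken from the COLD code base.
Note that <code>.toJSON()</code> renders the date in UTC, so close to midnight
the result can differ from the local calendar date.</p>

```typescript
// Hypothetical helper: render a Date as "YYYY-MM-DD".
// toJSON() yields an ISO 8601 string in UTC such as
// "2024-03-05T14:30:00.000Z"; the first 10 characters are the date portion.
function toISODate(d: Date): string {
  return d.toJSON().substring(0, 10);
}

console.log(toISODate(new Date(Date.UTC(2024, 2, 5, 14, 30)))); // 2024-03-05
```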
<h2 id="booleans-and-webforms">Booleans and webforms</h2>
<p>When the web form is transcribed, checkboxes return the value “on” if
checked. We want these to be actual JSON booleans so the middleware has
a function that checks for “true” or “on” before setting the value to
the boolean <code>true</code>. This helps normalize both changed and
saved records.</p>
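<p>A sketch of such a check, assuming the middleware receives form values as
strings. The function name <code>normalizeBoolean</code> is hypothetical, and
treating every other value as <code>false</code> is an assumption of this
sketch.</p>

```typescript
// Hypothetical helper: convert a submitted form value to a JSON boolean.
// Checked checkboxes arrive as "on"; previously saved records may hold
// "true" or an actual boolean.
function normalizeBoolean(value: unknown): boolean {
  if (typeof value === "boolean") {
    return value;
  }
  if (typeof value === "string") {
    return value === "true" || value === "on";
  }
  // An unchecked checkbox is simply absent from the submitted form.
  return false;
}

console.log(normalizeBoolean("on"));      // true
console.log(normalizeBoolean(undefined)); // false
```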
<h2 id="reports-implementation">Reports Implementation</h2>
<p>Reports are implemented as scripts or programs that are defined in a
YAML file (e.g. reports.yaml). Reports can be slow to run so COLD
implements a naive queue system. The reports.ds collection holds report
requests. Those marked as “requested” are picked up by a runner that then
attempts to execute the report. Since reports run as
executables on the system outside the runner, reports MUST be defined in
the YAML configuration file. There are zero user controlled options.
This removes the attack surface of using COLD’s report system to
compromise the application server. Additionally, the scripts/programs
implementing the reports are run by a data processing service on a
different machine. It is on this machine that the reports are defined.
This machine is not directly accessible from the web and should be
configured to restrict non-campus network access as an additional step
to minimize the attack surface.</p>
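<p>The allow-list idea can be sketched as a lookup against the parsed
configuration. This assumes reports.yaml has already been loaded into an
object; the <code>ReportDefinition</code> shape, the report name
<code>people_csv</code> and its script path are hypothetical.</p>

```typescript
// Hypothetical in-memory form of what reports.yaml might define.
interface ReportDefinition {
  cmd: string;          // executable to run
  content_type: string; // content type of the result
}

const reports: { [name: string]: ReportDefinition } = {
  people_csv: { cmd: "./reports/people_csv.bash", content_type: "text/csv" },
};

// Return the configured command for a report name, or null when the name
// is not defined in the configuration. Nothing from the request is ever
// used as a command.
function resolveCommand(
  defined: { [name: string]: ReportDefinition },
  name: string,
): string | null {
  // hasOwnProperty guards against inherited keys such as "constructor".
  if (!Object.prototype.hasOwnProperty.call(defined, name)) {
    return null;
  }
  return defined[name].cmd;
}

console.log(resolveCommand(reports, "people_csv")); // ./reports/people_csv.bash
console.log(resolveCommand(reports, "rm -rf /"));   // null
```

<p>The request only ever selects a key. The command string itself comes from
the configuration file, which is what keeps user input out of the shell.</p>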
<p>The report scripts/programs should return an error message or a link
where the report can be picked up. This is used by the runner to
resolve the final report request status.</p>
<p>The individual reports can be written in your language of choice
(e.g. Python, Bash, TypeScript). The primary requirements are that reports
store their results and write a link or error
message to standard out when completed. Since they are just programs
that write results to standard out they are able to interact with any
necessary systems they are allowed to talk to (e.g. databases, external
services, etc).</p>
<p>A garbage collection script should clear out old requests on a
regular schedule (e.g. once a week or once a month).</p>
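<p>A sketch of such a purge over an in-memory list. The record shape, the
function name <code>purgeOldRequests</code> and the choice to purge only
finished requests (available or problem) are assumptions of this sketch, not
decisions recorded in these notes.</p>

```typescript
// Hypothetical, trimmed-down queue entry.
interface ReportRequest {
  id: string;
  status: string;  // requested, processing, available or problem
  updated: string; // ISO 8601 timestamp of the last status change
}

// Return the entries to keep. An entry is dropped only when it is finished
// and its last status change is older than maxAgeDays.
function purgeOldRequests(
  queue: ReportRequest[],
  now: Date,
  maxAgeDays: number,
): ReportRequest[] {
  const cutoff = now.getTime() - maxAgeDays * 24 * 60 * 60 * 1000;
  return queue.filter((r) => {
    const finished = r.status === "available" || r.status === "problem";
    if (!finished) {
      return true; // never purge work that is still pending
    }
    return new Date(r.updated).getTime() >= cutoff;
  });
}
```

<p>Run weekly with <code>maxAgeDays</code> set to 7, this keeps the status
list to roughly the last week of completed requests plus anything still
waiting.</p>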
<h3 id="requests-and-runner">Requests and Runner</h3>
<p>A request queue is implemented to track report requests made via the COLD UI.
A separate process reads the queue, renders the reports and then updates
the queue upon completion or error. If email addresses are provided then
they will be contacted with the result of the report request. The message
should include the report’s request id, name, status and link or error
message.</p>
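<p>The notification text could be assembled along these lines. The
<code>ReportResult</code> shape, the <code>composeMessage</code> name and the
message layout are hypothetical, and actually sending the email is left
out.</p>

```typescript
// Hypothetical summary of a finished report request.
interface ReportResult {
  id: string;
  report_name: string;
  status: string; // "available" or "problem"
  link: string;   // set when the report is available
  error: string;  // set when there was a problem
}

// Build the plain text body sent to the requester(s).
function composeMessage(r: ReportResult): string {
  const detail = r.status === "available"
    ? "Link: " + r.link
    : "Error: " + r.error;
  return [
    "Report request: " + r.id,
    "Report name: " + r.report_name,
    "Status: " + r.status,
    detail,
  ].join("\n");
}

console.log(composeMessage({
  id: "example-request-id",
  report_name: "people_csv",
  status: "available",
  link: "https://example.org/reports/people.csv",
  error: "",
}));
```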
</section>
<footer>
<span>© 2022 <a href="https://www.library.caltech.edu/copyright">Caltech Library</a></span>
<address>1200 E California Blvd, Mail Code 1-32, Pasadena, CA 91125-3200</address>
<span><a href="mailto:[email protected]">Email Us</a></span>
<span>Phone: <a href="tel:+1-626-395-3405">(626)395-3405</a></span>
</footer>
</body>
</html>