-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy pathCHANGES
executable file
·690 lines (573 loc) · 32.3 KB
/
CHANGES
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
= Changes from previous Cobalt Versions =
== Changes to 1.0.33 ==
* Fix for a bug where upgrading from significantly older versions of Cobalt
with existing reservations set will cause those reservations to be
unresponsive.
== Changes to 1.0.32 ==
* Added a utility qsub2pbs which takes Cobalt qsub command line arguments
and provides equivalents for PBSPro/OpenPBS.
== Changes to 1.0.31 ==
* Fixed a race condition where a K-accounting record could be missed if a
reservation was manually removed as it was ending.
* Reservation accounting records have had numerous fields added.
* cluster_systems now respect the --attrs location attribute with a
','-delimited list of nodes.
* Job running location has been removed from Cobalt email subject lines. Large
jobs would cause the subject line to exceed the maximum permitted length and
could cause the email to be dropped.
* Added system-related PBS-style accounting records for node up and node down
in cluster systems.
== Changes to 1.0.30 ==
* Fixed a race condition for Cray systems where a job draining may be
disrupted briefly on job termination, allowing other jobs to run
instead.
== Changes to 1.0.29 ==
* A timeout for automatic functions that was forcing a once-per-ten-seconds
cycle time has been relaxed. The default is now much faster and should
allow for improved job startup rates if schedule interval is set to a
time.
* A lock has been relaxed that should allow for most Cobalt commands to
be responsive while ALPS updates are being fetched for Cray systems.
== Changes to 1.0.28 ==
* Fix an issue where deferring a cyclic reservation could result in the
reservation start time being sent two cycle periods forward rather
than one.
== Changes to 1.0.27 ==
* Security fix for Cray systems.
== Changes to 1.0.26 ==
* Interactive mode jobs on cluster systems no longer get the Cray-style
double shell execution.
* Fixed an error in the logging timestamps of Cray KNL memory reboot scripts
* Added extended accounting log records for Jobs. Holds are now tracked and
most records have additional information on job specification and resource.
* Added in accounting records for Cobalt reservations.
* Fixed double reporting of all-holds-clear on a dependency hold release.
* Fixed a logging bug in the event that a job fails to create a
.cobaltlog file.
* $COBALT_JOBID now properly expanded for .cobaltlog files
* Users may see a start time estimate for their jobs in qstat by adding
the start_time_est header to their header option. This is a conservative
naive start time estimate based on the running job times and the number of
machine-hours of jobs ahead of a given job in the queue. The minimum of this
estimate may be configured by sites to allow for potentially long resource
cleanups.
* A utility, find-location, has been added. This can be used to find a
location on cluster systems or Cray systems that can be used for setting
reservations or as a location attribute. This may be set to allow or avoid
currently down nodes, restricted to queues, avoid reservations, and on Cray
systems may be restricted to specific node attributes.
* An update to the Cray Cobalt initialization script has been made to tag the
component output redirect with a year and month for easier management of this
log.
== Changes to 1.0.25 ==
* Fixed not writing a TE record for jobs that had started running but
were administratively force-killed.
* Fixed an issue where an error during run initialization was not
entering the Run_Retry state correctly and would cause a job to fail.
* Fixed a condition that could cause a node to be allocated to two jobs
simultaneously on cluster systems if a job is selected to run in both
the primary and backfill scheduling passes.
* Fixed a race condition where a job that was starting up and deleted by a
user could set up a situation where a node would be freed twice, allowing
for a double-allocation of a node.
* Logging for component communication, especially with location lookup has
been greatly enhanced.
== Changes to 1.0.24 ==
* Cweb now reports jobs in the starting state.
* Fixed an issue where a process exit status could be fetched twice, and
overwrite the actual exit status of a job with an erroneous value.
* Heckle and Gram support has been removed.
== Changes to 1.0.23 ==
* Support for cgroups via cgclassify has been added. This is off by default.
Currently, this is only supported for Cray systems, and is intended for systems
that do not have kernel support for using proc connector.
* Fixed an unhandled UnboundLocalError that could arise when querying child status
when interacting with apbasil.
== Changes to 1.0.22 ==
* Fixed error on Cray systems that prevented cqadm --run from working on those
systems.
* Forkers that are de-registered while Cobalt is running, such as a login node
crashing will now be properly removed from the pool for running jobs, until
they are brought back up and re-registered on Cray systems. Jobs that were
running on that forker will be cleaned up (or will continue running
normally) once the forker is re-registered with the rest of Cobalt.
* The default signal sent to processes at the end of a job in Cray systems is
now SIGTERM not SIGINT.
* Cobalt will now detect SSDs on Cray XC40 systems through the CAPMC
interface. This allows users to request to be placed on nodes with optional
SSDs and allows users to specify a minimum free size for their job. By
default no SSDs are requested.
== Changes to 1.0.21 ==
* Fixed numerous locking issues in job startup that could result in a small chance
of failed job initialization.
* Fixed an issue on Cray systems where brief network disruptions on service nodes
can cause a loss of contact with a forker resulting in a loss of the running jobs.
* Added additional logging for ProxyErrors. The original trace/error message should
now be presented to the caller in the event of a ProtocolError.
* Fixed an issue in the cdbwriter where an error in writing job attributes could
cause a recurring error loop preventing database updates.
== Changes to 1.0.20 ==
* Fix for certain IO error messages that were ambiguous.
* Fix for an issue where attrs on a job are changed as a part of a filter script while
using qmove that can cause attrs to become a string instead of a dictionary and
disrupt scheduling on a queue.
== Changes to 1.0.19 ==
* Hotfix for issue preventing rereservations for interactive jobs from being detected.
== Changes to 1.0.18 ==
* Fix for a hang that can occur with interactive jobs if a Cray ALPS reservation
needs to be recreated at job execution time, usually in a rebooting context.
* Fix for a display issue when using the cluster_system component when the cluster
has nodes with a non-uniform prefix, or no prefix at all. This changes some
existing numbering condensation as well, and allows the location string for
jobs and reservations to be used with tools like pdsh
== Changes to 1.0.17 ==
* Added queue and project information to environment in Cray jobs.
== Changes to 1.0.15 ==
* Fix for an issue on Cray systems where Cobalt was not properly passing
a job's node list into the ALPS re-reservation. This could cause a mismatch
in the nodes that Cobalt was actually running a job on and what it thought the job
was running on.
== Changes to 1.0.14 ==
* Fix for cluster systems where nodes that weren't numbered could cause
showres/qstat to fail to error out when displaying locations.
* TS has location added, TE records have location, and seconds from epoch UTC
start and end for tasks
* Memory management scripts have been updated so that they can handle larger
Cray XC40 systems.
* Memory management boot records have been enhanced with start/end seconds from
epoch UTC times and blocked locations on the boot-start record.
* Fix for a potential memory mode script switch crash when rebooting from certain
memory mode mixes.
* Group membership checks for queues are now being handled more efficiently.
* Boot start and boot end records from the memory mode management scripts now have
a corrected date format in the message timestamp that is consistient with other
accounting records.
== Changes to 1.0.13 ==
* Added a web-service interface "cweb" client. This presents job, reservation
and node data over a REST API. This is based on the gronkd daemon from
the ALCF.
* Added TaskStart and TaskEnd (TS and TE) PBS accounting records. These are
specific to Cobalt. They correspond to when the user task is started/ended
by the queue-manager.
* Enhanced the Cray CAPMC boot scripts and filters. PBS style logs are now
provided wtih Boot Start and Boot End records.
* Enhanced the system component node allocator to try and prefer nodes that
match a job's requested memory modes to avoid unnecesssary memory mode shift
reboots.
* alpssystem now updates node attributes from ALPS. Notably this updates the
MCDRAM(HBM) and NUMA information on the nodes. This happens at the standard
hardware update interval.
== Changes to 1.0.12 ==
* The BlueGene/Q now has a more optimistic backfilling algorithm that also
properly accounts for exactly which jobs are currently draining.
== Changes to 1.0.11 ==
* Fix for a race condition on Cray systems where a correctly timed qdel can
cause the nodes for a job to get stuck in cleanup.
== Changes to 1.0.10 ==
* Sites can now specify the qsub comamnd for mom nodes on Cray systems in case
it is in a non-default location. This also fixes a permission issue for
eLogin interactive jobs.
* Preemptive fix for a possible process leak from AlpsBridge.
* Preemptive fix for a possible loss of some output when using string_stdout
in the forkers.
* Adding in DDL for a summary view for reservations for CDB.
* Fix for a failure in loading very old statefiles into system_script_forker
if there are prior runs left to clean up on restart.
== Changes to 1.0.9 ==
* Cobalt database writer now working properly with Cray Systems.
* Updates to the Cobalt Database DDL build scripts
* Interactive jobs enhanded to work in Cray's eLogin environment. Users will
now be sent to a node that can run a real aprun prior to their shell being
launched.
* The Cray init script will now tag all redirected console output files with
the originating node.
* Components now log what host they are launched on at startup
== Changes to 1.0.8 ==
* Fixed a bug where a node going down during a running job could allow that
node to be selected for draining.
* Expanded a buffer used by the forkers that should significantly improve the
speed of reading large ALPS responses and no longer interfere with node
reacquisition
* Cleaned up stray print statements in node information
* Fixed an error where score could be set to a string rather than a float via
the filter scripts
* Fixed an error where the queue-manager statefile could be lost if Cobalt was
brought down with a job in the "Terminal" state.
== Changes to 1.0.7 ==
* Updated a new Cray-system specific init script.
== Changes to 1.0.6 ==
* Backfill now supported on Cray systems.
* Major performance improvement to nodelist and nodeadm -l.
* Fix for an error that can cause loss of job tracking during job startup
resulting in a reported exit status of "1234567".
* Fix for an error in system_script_forker that can cause an error loop
if using stdout to string functionality.
* Fix for an error that caused jobs with a "terminating" status to cause
qstat to fail.
== Changes to 1.0.5 ==
* Fix for an issue that would cause jobs with their location set to not
run if they were submitted to a reservation.
* Performance of setres improved significantly.
* Fixed an issue where reservations and soft-down status on nodes may not
be respected by job placement.
* An enhanced_apkill script is now available to improve job cleanup and
ensure that jobs that ignore signals are otherwise properly terminated.
== Changes to 1.0.4 ==
* Fix for apid not being properly checked when apkill was being invoked.
== Changes to 1.0.3 ==
* Fix for an issue for nodes coming out of disabled state and suddenly
appearing in ALPS.
* Interactive job support
* Support for CAPMC memory management scripts added. Cobalt now gracefully
handles long startup times if these are being used.
* Improved job cleanup using apkill.
== Changes to 1.0.2 ==
* Singleton job filter (maxtotaljobs) now supported
* Multiple alps_script_forkers running simultaneously is now supported. Forkers
are chosen for execution based on which forker has the lowest number of jobs
currently assigned to it.
== Changes to 1.0.0 ==
* Initial support of Cray systems
== Changes to 0.99.42 ==
* Update to database writer to handle attrs fields that are
improperly formatted.
== Changes to 0.99.41 ==
* Minor patch to clean up an issue where the user was not being properly
cleared from BlueGene blocks at the end of a run.
== Changes to 0.99.40 ==
* A nokillpg attribute may be added to a job to cause kill instead of killpg
to be used for job termination so that a signal handler in a script may
handle signal propigation itself.
* Fixed a bug where a job requesting a location could cause Cobalt to drain
for a job that would require the use of offline hardware
* Added a check to the job validator where a job is rejected if the job size
and block geometry have a mismatch.
== Changes to 0.99.39 ==
* Session information added to job submission information. This also includes
submitting host information (Cobalt Ticket #4315)
* A fix for a bug that can cause "phantom drains" to occur on lightly loaded
cluster systems. (Cobalt Ticket #4319)
== Changes to 0.99.38 ==
* Added numerous dependency changes. Notably, jobs in dep_hold states will no
longer accrue eligible wait time or score. Previously, dep_hold jobs would
accrue eligible wait times unlike admin_hold or user_hold states.
(Cobalt Ticket #4308)
* Two display fixes for cluster systems that allow multiple node-name prefixes
to be handled correctly by the nodelist merging code and to correctly
compact names if leading 0's are used in location numbers like
"foo001.machine". (Cobalt Ticket #4313 and #4277)
* Corrected two race conditions in the BlueGene/Q system components where a
location with offline hardware could be selected for draining on a
scheduling pass. Though Cobalt would never try to actually run on the
offline block, It could end up draining resources needlessly, and impact
machine usage. The first could happen at any time, the second only occurred
at the end of booting a block. (Cobalt Ticket #4311)
== Changes to 0.99.37 ==
* Fixed issue where a reservation could hit it's activation check while it is being
deactivated, causing those events should be reversed. Released reservations can't
activate. (Cobalt Ticket #4289)
* COBALT_JOBSIZE and COBALT_PARTSIZE have been added to interactive jobs.
(Cobalt Ticket #4294)
* Job dependencies require the same project to gain score from eachother.
(Cobalt Ticket #4293)
* prologue_helper.py now handles "=" in arguments correctly. (Cobalt Ticket #4292)
== Changes to 0.99.36 ==
* Fix for issues in get-bootable-blocks and with block booting. Block names are now
properly sorted. A race condition between boot-block and update_block_state has
been fixed. The return statuses for boot-block have also been corrected for failed
boots. (Cobalt Ticket #4282 and #4280)
== Changes to 0.99.35 ==
* Fixed an issue where a block being removed in the BGQ control system between block
detection and the block being added for tracking could result in blocks being
erroniously marked as idle. (Cobalt Ticket #4268)
== Changes to 0.99.34 ==
* Fixed an issue where Cobalt could have a memory leak if boot-block.py was
terminated early the boot tracking object may never be cleaned up.
* Numerous enhancements and fixes to handle multiple overlaping queues in the
system with respect to draining and backfilling. (Cobalt Ticket #663)
* Fixed some typos in the documentation for qsub.
* COBALT_RESID is now appropriately set in interactive mode jobs (Cobalt Ticket #703)
== Changes to 0.99.33 ==
* Fixed issue where adding blocks while the system component is running can
cause the system component's block update loop to fail. (Cobalt Ticket #695)
* Preliminary fix for a racecondition that can cause subblock jobs to terminate
early (Cobalt Ticket #690)
* Added T-Minus to showres (Cobalt Ticket #692)
* Fixed an error in qalter where attemtpts to change the users on a running job
caused qalter to fail. (Cobalt Ticket #694)
== Changes to 0.99.32 ==
* Cluster systems now drain for larger jobs, and will backfill into the
draining hardware. (Cobalt Ticket #663)
* Fixed a bug in the prologue_helper where a silent failure on sending the
hostfile to nodes could occur silently (Cobalt Ticket #686)
== Changes to 0.99.31 ==
* Interactive mode jobs (qsub -I) are now supported on the BlueGene/Q.
(Cobalt Ticket #107)
* Timestamps are now included in cobaltlog lines (Cobalt Ticket #344 and #669)
* Multiple mode arguments will result in a qsub rejection (Cobalt Ticket #664)
* Fixed an issue that was causing COBALT_STARTTIME and COBALT_ENDTIME to not be
set for compute node jobs in cX jobs. (Cobalt Ticket #673)
== Changes to 0.99.30 ==
* This is a fix release that corrects a bug that can be encountered when upgrading
to 0.99.27 from prior versions where ion_kerneloptions in cqm is improperly set.
(Cobalt Ticket #665)
== Changes to 0.99.29 ==
* Interactive jobs are experimentally available for cluster systems.
== Changes to 0.99.28 ==
* Cobalt now supports qsub options in script mode jobs similarly to other
job scheduler scripts. The form is #Cobalt <qsub option> <option argument>.
Most qsub options are supported, except for mode. Jobs using this feature must
be script mode jobs. This is similar in style to PBS-directives. (Cobalt Ticket 354)
* Fixed a bug in cluster systems where a failure duing job startup could result
in all nodes for a job being stuck in the allocated state (Cobalt Ticket 662)
* $COBALT_STARTTIME and $COBALT_ENDTIME added to job's environments, these are
the Unix timestamp for when the job started, and when Cobalt will terminate
the job due to exceeding walltime. (Cobat Ticket 593)
* A requested walltime of 0 will set the job to run for the time remaining in
a reservation. (Cobalt Ticket 474)
* The --env flag to qsub now aggregates multiple instances of the flag rather
than overriding the prior instance (Cobalt Ticket 657)
== Changes to 0.99.27 ==
* Alternate kernel support for BG/Q compute nodes and IO nodes is available.
Numerous configuration settings have been added to support this.
(Cobalt ticket 621)
* If $jobid is included in a job name specification or an environment variable,
at submission time, Cobalt will substitute the generated jobid for that job.
See man qsub for more details. (Cobalt ticket 617)
* The Cobalt jobid is now explicitly added in the cobaltlog file. (Cobalt ticket 390)
* The name of a job can now be specified seperately from naming the output
files. (Cobalt ticket 625)
* Multiple locations can be passed to reservations through the -p option as a
':'-delimited list. (Cobalt ticket 405)
* cqadm can now be used to release user hold, as well as put jobs into a user hold
as well as an admin hold (Cobalt ticket 382)
* The users field can be removed from a reservation using setres -m -u '*'. (Cobalt
ticket 381)
* Releaseres should now report the correct partiton count on reservation release
(Cobalt ticket 470)
* Reservations now show time remaining in showres once the reservation is active.
(Cobalt ticket 585)
* Schedctl should handle flipping score and jobid arguments more gracefully and
no longer silently discards invalid options. (Cobalt ticket 502)
* ':' and '=' may now be escaped as '\:' and '\=' in the --envs argument (Cobalt
ticket 589)
* Reservations may now take 'now' for the start time of a reservation (Cobalt
ticket 546)
* "GEOMETRY", "ION_KERNEL" and "ION_KERNELOPTIONS have been added to the JOB_DATA
table of the Cobalt database
== Changes to 0.99.26 ==
* Corrected a bug that was causing messages in the clients that used to go to
stdout to go to stderr instead
== Changes to 0.99.25 ==
* Fixes for clients that were causing erronious default values for proccount
have been fixed.
* The setgid wrapper now calls a site-selectable python interperter with the
'-E' flag by default.
* A bug that could cause TimeRemaining to show negative values has been fixed
* Queues may be added or removed from blocks without overwriting the entire
list with new options in partadm using --rmq or --appq with the --queue option.
== Changes to 0.99.24 ==
* Reverting client changes. Found some issues when deploying to test systems
* --defer flag has been reverted with these as a casualty.
== Changes to 0.99.23 ==
* Client code has undergone a refactor. Numerous bugs have been addressed
with these that were found while improving test coverage of this code
* IO Block booting and automatic control has been added. Please consult the
partadm manpage for details
* This release requires pybgsched-0.1.3
* A user may voluntarily set their job's score to 0 and let it re-accrue using
qalter --defer (Ticket #590)
== Changes to 0.99.22 ==
* Fix for a deadlock that could occur due to a race condition in the
boot-reaping code.
* Fix for a pid collision that could occur after a forced delete of a job in
the system-script-forker component leading to a job getting stuck in the
Job_Epilogue state.
* python 2.7 and it's change to the XMLRPC libraries are now supported.
== Changes to 0.99.21 ==
* Minor fix release to correct a failure mode where we can lose a process group
if a job's boot is delayed.
== Changes to 0.99.20 ==
* Small changeset to fix a race condition that was hanging jobs that came in
on 0.99.19
* Better error logging for situations where the boots and the process group
information haven't lined up yet. This usually happens during high-load and
can result in block boot completion never being detected.
== Changes to 0.99.19 ==
* Manual boot control has been added
* major refactor of compute block rebooting
== Changes to 0.99.18 ==
* Fix for other systems that were seeing erronious geometry data
* SoftwareFailure now accurately reported
== Changes to 0.99.17 ==
* Support for --geometry has been added. This will constrain a job to run
blocks having a specific geometry in nodes, partadm, partlist and qstat have
been updated to show this.
* Support for passthrough blocking reservations has been added.
* Fixed a bug where a job run outside of Cobalt would not properly block resources.
* Removed autoreboot in the event of a block entering a control system
SoftwareFalure state on BG/Q. Normal cleanup should handle this now.
== Changes to 0.99.15 ==
* Fix for an issue that was causing partadm to malfunction when reservations
were set on the system.
== Changes to 0.99.12 ==
* As a note, this is a release of fixes and backported features from BG/Q
* Fixed issues that could cause a bad hardware state to be overridden as
"blocked" allowing the block to be drained
* Fix for a situation where a partition that is inelligible to run due to a
pending reservation is selected for draining.
== Changes to 0.99.11 ==
* Fix for an additional circumstance that can cause a small reservation to
collapse a backfill window to zero.
== Changes to 0.99.10 ==
* Fixed an issue that was introduced in 0.99.6 where children were not being
properly added to a block. This caused strange behavior in reservations,
including failure to run jobs smaller than the reservation or a small job
running on the largest block in a reservation
* Refactored the update_block_state code to operate more similarly to the way
BG/P is currently doing it. This should correct an issue where a block can
have an error state overridden.
* Cleaned up a circumstance where wires between passthrough midplanes can be
added to a block's wire list. This can cause erronious wiring conflicts to be
detected.
* Fixed an issue where duplicate wires were being added
== Changes to 0.99.9 ==
* Fixed an issue where a job could choose a block for draining that is blocked
by a pending reservation. This would cause the backfill time to collapse to zero.
* Added code to take blocks offline if passthrough nodeboards reported their state
as in error. It does not offline them if it is only a compute node down, it must
be the nodeboard itself in error as would be set by a BOARD_IN_ERROR control action.
== Changes to 0.99.8 ==
* Fixed an issue where dimensions without cables (like the A,B, and C
dimensions on a 2 midplane system) would cause the system component to fail
on boot.
== Changes to 0.99.7 ==
* Cable errors are now properly detected
* IO Node errors in available links are detected
* Due to changes, libpybgsched v 0.1.1 is required. This also
means the V1R1M1 driver changes can also be enabled.
== Changes to 0.99.6 ==
* Performance improvements over 0.99.5 in update_block_state and wiring
conflict detection. This makes the partadm commands far more responsive.
* Corrected an error in a block's children that affected cold (non-statefile)
startup.
== Changes to 0.99.0pre36 ==
* Fixed bug in resource reservation that would leave a reserved partition in
the 'idle' state
* Updated the code that sets the kernel profile for a partition to set the
profile even if no profile is currently set
== Changes to 0.99.0pre35 ==
* Fixed bug where pickling IncrID could fail when deleting db related
attributes from the object's dictionary
== Changes to 0.99.0pre34 ==
* Time spent in the hold state is now exported to the utility function
* bgsystem now tracks forked tasks using a (forker, id) tuple instead of just
id in order to prevent id collisions in the active jobs list
* Updated cleanup and resource reservation in bgsystem updated to prevent
multiple partition cleanups for a single job
* Forker classes updated so that logging messages properly identify the
component from which they came
* Added LOGNAME and SHELL to the environment of script jobs
* Resolved a bug in the cdbwriter-api where a job object which had no data
records could endless recurse in _jr_obj.__getattr__ looking for valid fields
* Implemented support for common job ID pool through database ID generation
* PYTHONPATH may now be explicitly set when building the wrapper program
* Fix for cobalt-mpirun where the -env flag was being improperly handled
* Updated Cobalt's database documentation diagrams
== Changes to 0.99.0pre31 ==
* Cleaned up reservation logging and database logging
* Reservations can now be tagged by adding a -A flag to setres
* A view of reservations with resid, cycleid and project can be obtained via
the -x flag to showres.
== Changes to 0.99.0pre30 ==
* Improved logging of reservations to the cobalt database.
* "ending" event changed to "terminated" indicates that the reservaiton
has been ended and cleaned up.
* added "deferred" event for when reservations are deferred (setres -D)
* added "deactivating" event for when reservation come to a natural end
(i.e. run out of time)
* added "releasing" event for a user-requested release of a reservation
* deferrals of reservations now no longer cause new reservation_data
table entries to be generated. The subsequent cycle does that.
* cluster_system has had a variety of enhancements. Notably:
* setting simulation_mode true in the cobalt config file will allow the
cluster_system component to enter a test mode, where it can be run
on a development machine without a supporting cluster.
* nodes are now recognized as being allocated when a location is selected
and sent back to the scheduler component as a runnable location.
A timeout is set for the partition, and by default this is 300 seconds
(5 minutes). This can be changed reset at execution time from the
cobalt config file. Should resources be allocated, but the job be
terminated prior to a run (prescript failure due to a dead node, or
a hasty user kill, for instance), this will ensure that the nodes are
released.
* Jobs submitted to cluster_system will now have arguments passed to the job
recognized, as they are on BlueGene type systems.
== Changes to 0.99.0pre29 ==
* Arguments that get passed to scripts (not script jobs but pre and
postscripts) are now being properly escaped.
* Cobaltlogs are now being written to by cqm in a separate thread, and
should allow scheduling to continue in the event of a filesystem hang.
* qalter has had a bug fixed where it was passing back nodecounts and proccounts
as a string rather than an int.
* jobs in cqm that do not have a timer are appropriately destroyed rather
than entering into an infinite loop of attempting to run the job.
* Fix for the situation where a resid can be duplicated in the case of a
cycling reservation
== Changes to 0.99.0pre28 ==
* Modification to the XMLRPC base proxy class that allows you to turn off
automatic retries from the cobalt side on the proxy by requesting no
retries. This should be used for non-idempotent tasks like job-launching.
* Modified cqm's retry behavior to include a runid when running a job, so and
retrying so that bgforker doesn't try to start the same job twice.
* Timezone display corrected. If available, cobalt will try to use the pytz
library (Olsen Timezone Database).
== Changes to 0.99.0pre27 ==
* Restored COBALT_JOBID and COBALT_RESID in back-end jobs. This was erroniously
removed in 0.99.0pre26.
== Changes to 0.99.0pre26 ==
* Fixed bug where environmental variables passed into a job from the --env flag
of qsub and qalter would leak into the environment of bgforker and from
there into subsequent mpirun process environments. These did not leak
into the backend-job environments. Cluster-system component jobs did not
see issues with this due to an apparent bug in envrionment handling.
* A job's exit status is now added to the Cobalt Database as exit_status
* COBALT_RESID is set in mpirun jobs
== Changes to 0.99.0pre25 ==
* Various timezone formatting changes including:
* setres takes the reservation time depending on what TZ is set to
* qstat and showres show times in a unified format, that includes
offset and three-letter timezone designation
* showres has a flag --oldts where it shows timestamps in the original
format for compatibility with old scripts
* Forker was not handling spaces in arguments to scripts correctly
* schedctl no longer gives an erronious error message about requiring a job id
for --start, --stop
* Statemachine state-transition function pointers are now regenerated at
restart as well as job initialization, to allow for transparent state-
machine adjustments
== Changes to 0.99.0pre24 ==
* Fixes for arguments to pre and post scripts for jobs
* showres reverted to not include id's
* Fixes for script-mode jobs
== Changes to 0.99.0pre23 ==
* Script jobs run with their own shells from forker to prevent 32/64 bit execution environment issues
== Changes to 0.99.0pre22 ==
* Bgforker modified so that helper scripts can be run
* KNOWN ISSUE: job preemption is not working correctly in this version.
* Resid now added to jobs at runtime, if they are run in a reservation.
* Fix for partially overlapping partitions being scheduled at the same time
and properly detecting reservations.
== Changes to 0.99.0pre21 ==
* Fix to reduce chance of cluster system nodes from being hung-up on job exit.
* Adding in backfill prediction functionality. This is by default disabled.
* Various fixes and updates to the mk_* cobalt backup scripts. These can be
used to restore state should a change that is destructive to statefiles is
made (like the 0.99.0pre16 to 0.99.0pre17 change).
* resid and cycleid can be set via setres
* time.sleep() wrapped to prevent kernel level IOError from creeping out on
ppc64 linux platforms.
== Changes to 0.99.0pre20 ==
* stderr log messages now contain timestamps
* ensemble jobs that use -nofree without a -wait free are now cleaned up properly
* environment variables in cert, key and ca options in config files are expanded