Improve "not queued or running" error message.
Include the job ID for debugging purposes.

Use a different error message depending on whether the job had
started up successfully. The debugging steps generally differ for
jobs that started but then died unexpectedly vs. ones that never
started at all, and the "not queued" part is just confusing for jobs
that had started up.
adam-azarchs committed Jul 8, 2024
1 parent 20a0195 commit ee01558
Showing 2 changed files with 14 additions and 5 deletions.
2 changes: 1 addition & 1 deletion jobmanagers/retry.json
@@ -8,7 +8,7 @@
  "^error: .Errno 12. Cannot allocate memory",
  "^error: JSV stderr: error: commlib error: got select error (Connection refused)",
  "^Unable to run job: failed receiving gdi request response",
- "^According to the job manager, the job for .+ was not queued or running,",
+ "^According to the job manager, the job for .+ has not been (?:queued or )?running",
  "^IOError: \\[Errno 116\\] Stale file handle",
  "^OSError: \\[Errno 11\\] Resource temporarily unavailable",
  "^jobcmd error \\(exit status \\d+\\)",
17 changes: 13 additions & 4 deletions martian/core/metadata.go
@@ -846,10 +846,19 @@ func (self *Metadata) endRefresh(lastRefresh time.Time) {
 	// The job is not running but the metadata thinks it still is.
 	// The check for metadata updates was completed since the time that
 	// the queue query completed. This job has failed. Write an error.
-	err := self._writeRawNoLock(Errors, fmt.Sprintf(
-		"According to the job manager, the job for %s was not queued "+
-			"or running, since at least %s.",
-		self.fqname, notRunningSince.Format(util.TIMEFMT)))
+	var err error
+	if state == Running {
+		err = self._writeRawNoLock(Errors, fmt.Sprintf(
+			"According to the job manager, the job for %s (%s) has "+
+				"not been running since at least %s.",
+			self.fqname, jobid, notRunningSince.Format(util.TIMEFMT)))
+	} else {
+		err = self._writeRawNoLock(Errors, fmt.Sprintf(
+			"According to the job manager, the job for %s (%s) "+
+				"has not been queued or running since at least %s, but "+
+				"it does not appear to have started successfully.",
+			self.fqname, jobid, notRunningSince.Format(util.TIMEFMT)))
+	}
 	if err != nil {
 		util.LogError(err, "runtime",
 			"Error writing error message about cluster-mode job not running.")
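
The two files in this commit have to stay in sync: the updated pattern in jobmanagers/retry.json should match both new message variants written by metadata.go. The following is a minimal sketch of that check, not code from the repository. It assumes the retry.json entries are applied as regular expressions against the error text, and `formatNotRunning`, its parameters, and the sample values are hypothetical stand-ins for `self.fqname`, `jobid`, and the `state == Running` test.

```go
package main

import (
	"fmt"
	"regexp"
)

// retryPattern is the updated retry.json entry, with JSON string
// escaping removed. Go's regexp package (RE2 syntax) supports the
// non-capturing group (?:...) used here.
var retryPattern = regexp.MustCompile(
	`^According to the job manager, the job for .+ has not been (?:queued or )?running`)

// formatNotRunning mirrors the two message shapes from the diff.
// Hypothetical helper: `started` stands in for `state == Running`,
// and `since` for the already formatted notRunningSince timestamp.
func formatNotRunning(fqname, jobid, since string, started bool) string {
	if started {
		return fmt.Sprintf(
			"According to the job manager, the job for %s (%s) has "+
				"not been running since at least %s.",
			fqname, jobid, since)
	}
	return fmt.Sprintf(
		"According to the job manager, the job for %s (%s) "+
			"has not been queued or running since at least %s, but "+
			"it does not appear to have started successfully.",
		fqname, jobid, since)
}

func main() {
	// Sample values are made up for illustration only.
	for _, started := range []bool{true, false} {
		msg := formatNotRunning("ID.example.STAGE", "12345", "2024-07-08 12:00:00", started)
		fmt.Println(retryPattern.MatchString(msg), msg)
	}
}
```

Both variants should match the new pattern, while the old "was not queued or running," wording should not, which is why the retry.json entry had to change in the same commit.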
