-
Notifications
You must be signed in to change notification settings - Fork 528
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixed and redefined total peering time (%<tt) #1828
Changes from 31 commits
b2193b4
2313710
8688bb5
9ee4c5a
c834331
a71a763
182b682
a2c176f
c2637ca
94bb99b
0a2b9b4
9e192dc
c8ac2b7
5fb6a31
86033b5
9340dac
8092565
a693219
b9347ee
8c86590
80eed21
a9c2c01
d4b420d
fe72ba9
fa32752
95c3e8a
48b31b2
f3ee47b
95f59c5
8755344
2ccfd0a
02cf233
c7236dc
4443bdf
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -49,6 +49,40 @@ void ResetMarkingsToServer(HttpRequest *, Comm::Connection &); | |
|
||
class HelperReply; | ||
|
||
/// Eliminates excessive Stopwatch pause() calls in a task with multiple code | ||
/// locations that pause a stopwatch. Ideally, there would be just one such | ||
/// location (e.g., a task class destructor), but current code idiosyncrasies | ||
/// necessitate this state. For simplicity sake, this class currently manages a | ||
/// Stopwatch at a hard-coded location: HttpRequest::hier.totalPeeringTime. | ||
class PeeringActivityTimer | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We have actually started with a much more generic ActivityTimer class (even suitable for src/base/), but ran into concerns that TunnelStateData creates its ActivityTimer data member with an address of PR's PeeringActivityTimer is simple and reliable (no raw pointers). It can be generalized if we need to support other HierarchyLogEntry or even ALE members (making code more complex). |
||
{ | ||
public: | ||
PeeringActivityTimer(const HttpRequestPointer &); ///< resumes timer | ||
~PeeringActivityTimer(); ///< \copydoc stop() | ||
|
||
/// pauses timer if stop() has not been called | ||
void stop() | ||
{ | ||
if (!stopped) { | ||
timer().pause(); | ||
stopped = true; | ||
} | ||
} | ||
|
||
private: | ||
/// managed Stopwatch object within HierarchyLogEntry | ||
Stopwatch &timer(); | ||
|
||
/// the owner of managed HierarchyLogEntry | ||
HttpRequestPointer request; | ||
|
||
// We cannot rely on timer().ran(): This class eliminates excessive calls | ||
// within a single task (e.g., an AsyncJob) while the timer (and its ran() | ||
// state) may be shared/affected by multiple concurrent tasks. | ||
/// Whether the task is done participating in the managed activity. | ||
bool stopped = false; | ||
}; | ||
|
||
class FwdState: public RefCountable, public PeerSelectionInitiator | ||
{ | ||
CBDATA_CHILD(FwdState); | ||
|
@@ -212,6 +246,9 @@ class FwdState: public RefCountable, public PeerSelectionInitiator | |
/// Whether the entire reply (including any body) was written to Store. | ||
/// The string literal value is only used for debugging. | ||
const char *storedWholeReply_; | ||
|
||
/// Measures time spent on selecting and communicating with peers. | ||
PeeringActivityTimer peeringTimer; | ||
}; | ||
|
||
class acl_tos; | ||
|
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -4995,10 +4995,33 @@ DOC_START | |||||
[http::]<pt Peer response time in milliseconds. The timer starts | ||||||
when the last request byte is sent to the next hop | ||||||
and stops when the last response byte is received. | ||||||
[http::]<tt Total time in milliseconds. The timer | ||||||
starts with the first connect request (or write I/O) | ||||||
sent to the first selected peer. The timer stops | ||||||
with the last I/O with the last peer. | ||||||
[http::]<tt Total peering time in milliseconds. | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. "peering time" is not a defined concept, and is quite obscure - not being exclusive to peers, nor requiring their use at all. AFAIK other documentation calls this "forwarding" not "peering", so:
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Squid documentation and code uses "peer" to mean "origin server or cache_peer". The term "peering" itself is also used. I think it is a better approach than spelling out "origin server or cache_peer" or forgetting one or the other. All of this is inconsistent and poorly defined, of course, but this PR is not making things worse. Arguably, it even improves the situation a bit. Said that, I have changed that brief There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If that were true, you would not have insisted on me writing "server or peer" in all previous PRs. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I do not know what "it" refers to, but can we focus on moving forward this stalled PR rather than on my alleged past actions in other PRs, please? AFAICT, this change request has been addressed two months ago, using phrasing nearly identical to the one you have suggested. Please resolve this specific change request and, if possible, dismiss your negative PR review. |
||||||
|
||||||
The timer starts when Squid decides to forward the request (to | ||||||
an origin server or cache_peer) and peer selection begins. The | ||||||
timer stops when relevant forwarding activities (including any | ||||||
retries) end. | ||||||
|
||||||
Between those two timer events, Squid may perform DNS lookups, | ||||||
query external ACL helpers, adapt responses using pre-cache | ||||||
RESPMOD services, and participate in other concurrent | ||||||
secondary activities. Most secondary activities increase | ||||||
peering time. In some cases, a secondary activity may start | ||||||
before the timer starts or end after the timer stops, leading | ||||||
Comment on lines
+5006
to
+5011
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. All of this is essentially saying that this timer follows wall-time clock, not a CPU-cycles clock. This list of things that can happen parallel or in sequence with an HTTP response arriving is already quite long and likely to have bitrot quite easily. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I strongly disagree. The text gives the reader a good idea of what activities are included in
The list has 3 specific examples. I disagree that three is "quite long". There is no significant danger of bitrot here because those three examples are very unlikely to go away, and the list is explicitly not exclusive.
The current version is the best I can come up with. Its (moderate, IMHO) level of complexity is necessitated by the complexity of the thing being measured; a simpler version is already available as a brief There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nobody likes the text, but it's a text and thus better than nothing. |
||||||
to misleading results of simple computations like %<tt - %dt. | ||||||
Comment on lines
+5010
to
+5012
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. A secondary activity may start before the transaction peering timer when a transaction is joining an already pending secondary activity. For example, an HTTP request handled by FwdState ( Similarly, a secondary activity may continue beyond a peering transaction phase. For example, a DNS A lookup may still be pending while the transaction proceeds with the DNS AAAA lookup result received earlier (due to Happy Eyeballs optimizations). These natural imperfect/unpredictable concurrent activity/transaction overlaps imply that |
||||||
|
||||||
If this logformat %code is used before its timer starts, the | ||||||
corresponding measurement has no value (and the %code expands | ||||||
to a single dash ("-") character). If this code is used while | ||||||
its timer is running, the time spent so far is used as the | ||||||
measurement value. | ||||||
yadij marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
When Squid re-forwards the request (e.g., due to an HTTP 304 | ||||||
response indicating that the cached response is stale), the | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Small problem:
Can you point me at the logic where this example is happening in latest Squid? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Reforwarding cases correspond to two processMiss() calls in clientReplyContext::handleIMSReply(). There is also an XXX there that (if addressed) would probably result in another processMiss() call. However, the example text does not match those cases. I fixed that bad example. |
||||||
timer may restart. In this case, the new measurement is added | ||||||
to the value accumulated from previous forwarding attempts. | ||||||
The time interval between forwarding attempts is not added | ||||||
to the final result. | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This PR preserves Squid behavior documented in this paragraph. The cumulative nature of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. FWIW; I believe it was clearly implied by the old text stating the value as covering time from start of I/O with peer A to end of I/O with peer B. I do not think we need to go into the level of internal logic this 3rd paragraph has. All reasons for any re-forwarding, re-tries, and redirection are covered by and/or already alluded to by the new paragraph 2 statement " There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
This text is not about switching from peer A to peer B (i.e. retrying peers using one FwdState instance). It is about re-forwarding requests using two FwdState instances. In some of those two FwdStates cases, Squid will use peer A, then peer B, and then peer A again and peer B again. The original description did not document what would happen in those cases. An admin could guess correctly, but they had similar if not greater chances of guessing wrong.
I think we do need to detail that because paragraph 2 only documents that the timer starts at A and stops at B. That documentation is not wrong (for some definition of a "retry"1), but it does not cover what happens when there are two pairs of A and B events (A1 and B1 plus A2 and B2). Interpreting those two pairs as A1 and B2 would be wrong -- the timer is not running between B1 and A2. If you would like, we can replace "starts" and "stops" in paragraph 2 with "starts (or resumes)" and "pauses", but I am not sure it will improve things -- I think it is best to keep that initial paragraph simple and document special cases later. And even if we do make those replacements in paragraph 2, I would still keep this paragraph (in some shape of form) to explain what is going on in those special (but not so rare in collapsed forwarding environments) re-forwarding cases. Said that, if you insist, I will remove this paragraph. Your call. Footnotes
|
||||||
|
||||||
Squid handling related format codes: | ||||||
|
||||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -635,9 +635,16 @@ Format::Format::assemble(MemBuf &mb, const AccessLogEntry::Pointer &al, int logS | |
break; | ||
|
||
case LFT_TOTAL_SERVER_SIDE_RESPONSE_TIME: { | ||
struct timeval totalResponseTime; | ||
if (al->hier.totalResponseTime(totalResponseTime)) { | ||
outtv = totalResponseTime; | ||
// XXX: al->hier.totalPeeringTime is not updated until prepareLogWithRequestDetails(). | ||
// TODO: Avoid the need for updates by keeping totalPeeringTime (or even ALE::hier) in one place. | ||
const auto &timer = (!al->hier.totalPeeringTime.ran() && al->request) ? | ||
al->request->hier.totalPeeringTime : al->hier.totalPeeringTime; | ||
Comment on lines
+638
to
+641
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Implementing this TODO in this PR would create a lot of out-of-scope noise. That refactoring is also likely to require making some difficult design decisions and may result in other bug fixes or surprises. We should do it in a dedicated PR. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do you have some cases where the AFAIK we should not have hierarchy contact without a request message, and the ALE should be setup with that. Perhapse it is possible to improve this code by just accessing the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I am not aware of any.
I hope there are better ways to address this problem than "treating ALE::hier as deprecated" (which, The corresponding official code uses ALE::hier instead of ALE::request. Doing the opposite without serious analysis/tests/etc. is risky (and this PR does not take that risk). ALE::hier is used in several code contexts, including contexts unrelated to peering time. Ignoring it here may introduce bugs and/or more inconsistencies. Let this problem be handled in a dedicated PR (which will focus on addressing these issues properly rather than taking unnecessary risks). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think this PR is the right place to deduplicate |
||
if (timer.ran()) { | ||
using namespace std::chrono_literals; | ||
const auto duration = timer.total(); | ||
outtv.tv_sec = std::chrono::duration_cast<std::chrono::seconds>(duration).count(); | ||
const auto totalUsec = std::chrono::duration_cast<std::chrono::microseconds>(duration); | ||
outtv.tv_usec = (totalUsec % std::chrono::microseconds(1s)).count(); | ||
Comment on lines
+645
to
+647
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This code converts an std::chrono::duration to struct timeval. Eduard has evaluated runtime efficiency of this code and confirmed that its optimized performance is equivalent to traditional manual division using magic constants. We will submit a dedicated followup PR offering a reusable version of this conversion, replacing this and other then-current code doing this conversion. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. IMO code should not be doing this conversion at all. In here at least it should be displaying the duration object, not a timeval. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree that no conversion to timeval should be necessary, but improving Format::assemble() to handle std::chrono durations without such conversion should be done in a PR dedicated to that enhancement. This PR does not even introduce an std::chrono conversion to Format::assemble() -- it already exists in official Format::assemble() code. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'd resist scope creep here. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Done in #1915. |
||
doMsec = 1; | ||
} | ||
} | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -211,6 +211,9 @@ class TunnelStateData: public PeerSelectionInitiator | |
/// over the (encrypted, if needed) transport connection to that cache_peer | ||
JobWait<Http::Tunneler> peerWait; | ||
|
||
/// Measures time spent on selecting and communicating with peers. | ||
PeeringActivityTimer peeringTimer; | ||
|
||
void copyRead(Connection &from, IOCB *completion); | ||
|
||
/// continue to set up connection to a peer, going async for SSL peers | ||
|
@@ -310,7 +313,7 @@ TunnelStateData::serverClosed() | |
{ | ||
server.noteClosure(); | ||
|
||
request->hier.stopPeerClock(false); | ||
peeringTimer.stop(); | ||
|
||
finishWritingAndDelete(client); | ||
} | ||
|
@@ -386,14 +389,25 @@ TunnelStateData::deleteThis() | |
delete this; | ||
} | ||
|
||
// TODO: Replace with a reusable API guaranteeing non-nil pointer forwarding. | ||
/// safely extracts HttpRequest from a never-nil ClientHttpRequest pointer | ||
static auto & | ||
guaranteedRequest(const ClientHttpRequest * const cr) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There are lots of places in Squid code where we store a raw pointer and use it for configuring other members. Some of them assert that the pointer is not nil (and violate data member initialization guidelines to inject those assertions). Some just assume that the pointer is not nil (and occasionally crash). I think we can do better. STL offers some relevant guidance with throwing (but slower) std::optional::value() and std::vector::at() interfaces, but raw pointers will need a different solution. An investigation of various options and a dedicated PR is warranted here. That PR will, among other things, allow us to remove this guaranteedRequest() helper and simplify/improve constructors like TunnelStateData constructor below (that contains similar assertions). |
||
{ | ||
Assure(cr); | ||
Assure(cr->request); | ||
return *cr->request; | ||
} | ||
|
||
TunnelStateData::TunnelStateData(ClientHttpRequest *clientRequest) : | ||
startTime(squid_curtime), | ||
destinations(new ResolvedPeers()), | ||
destinationsFound(false), | ||
committedToServer(false), | ||
n_tries(0), | ||
banRetries(nullptr), | ||
codeContext(CodeContext::Current()) | ||
codeContext(CodeContext::Current()), | ||
peeringTimer(&guaranteedRequest(clientRequest)) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We could simplify a little (eliminating referencing/dereferencing) by making guaranteedRequest() return a pointer: its very name states that there is an existing request. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, we could, but I did not want a function that guarantees request existence to return a pointer to that existing request: If the object is guaranteed to exist, its accessor should return a reference rather than a pointer. Either way, this code is problematic as detailed in #1828 (comment). I hope it will go away with changes discussed in that change request. Said that, if others insist, I will refactor the helper function to remove this |
||
{ | ||
debugs(26, 3, "TunnelStateData constructed this=" << this); | ||
client.readPendingFunc = &tunnelDelayedClientRead; | ||
|
@@ -477,8 +491,7 @@ TunnelStateData::retryOrBail(const char *context) | |
|
||
/* bail */ | ||
|
||
if (request) | ||
request->hier.stopPeerClock(false); | ||
peeringTimer.stop(); | ||
|
||
// TODO: Add sendSavedErrorOr(err_type type, Http::StatusCode, context). | ||
// Then, the remaining method code (below) should become the common part of | ||
|
@@ -1403,8 +1416,7 @@ TunnelStateData::sendError(ErrorState *finalError, const char *reason) | |
{ | ||
debugs(26, 3, "aborting transaction for " << reason); | ||
|
||
if (request) | ||
request->hier.stopPeerClock(false); | ||
peeringTimer.stop(); | ||
|
||
cancelStep(reason); | ||
|
||
|
@@ -1438,15 +1450,12 @@ TunnelStateData::cancelStep(const char *reason) | |
void | ||
TunnelStateData::startConnecting() | ||
{ | ||
if (request) | ||
request->hier.startPeerClock(); | ||
|
||
assert(!destinations->empty()); | ||
assert(!transporting()); | ||
|
||
delete savedError; // may still be nil | ||
savedError = nullptr; | ||
request->hier.peer_reply_status = Http::scNone; // TODO: Move to startPeerClock()? | ||
request->hier.peer_reply_status = Http::scNone; | ||
|
||
const auto callback = asyncCallback(17, 5, TunnelStateData::noteConnection, this); | ||
const auto cs = new HappyConnOpener(destinations, callback, request, startTime, n_tries, al); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would prefer to rely exclusively on RAII (provided by PeeringActivityTimer) instead of sprinkling code with these stop() calls. Unfortunately, FwdState objects are reference-counted, and there is no strong/obvious/maintained guarantee that a FwdState object does not survive long after stopAndDestroy() has been called. Thus, we should call stop() explicitly. RAII will still kick in if we miss a destruction code path, of course.