
Define active and passive modes #15

Open
codefromthecrypt opened this issue Sep 27, 2019 · 8 comments

Comments

@codefromthecrypt
Contributor

While an internal detail to some, it's worth noting that not all sampling keys will be provisioned at the head of the network (the gateway). In some implementations, it will be easier to provision the key at the first sampled hop.

Let's define in the doc "active" and "passive" participation.

"active" is when a participant creates instructions for downstream directly or indirectly. For example, they sample the first request on a key, possibly also creating that key, and as a side-effect add the "spanId" parameter.

"passive" refers to a node that only participates if something upstream has (ex via presence of the "spanId" parameter.

Note: there's a difference between "active" sampling and provisioning of the sampling key itself.

Some deployments will want to limit the control of sampling keys to a gateway role. In those cases, the sampling key will be provisioned, but not sampled unless it literally was the policy to also record the gateway itself. In that scenario, a key would be passed unaltered downstream until a participant activates it by sampling (the side-effect being the presence of the "spanId" field).
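
To make that concrete, here is a minimal sketch of a passive check, assuming a hypothetical participant API (none of these names come from this spec):

    import java.util.Map;

    // Hypothetical sketch: these names are illustrative, not part of the spec.
    final class PassiveParticipant {
      /**
       * A passive node only participates when something upstream already
       * sampled, signaled by the "spanId" parameter on the sampling key.
       */
      boolean participate(Map<String, String> samplingKeyParameters) {
        return samplingKeyParameters.containsKey("spanId");
      }
    }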

A short-cut of this is where policy is decentralized and all nodes have special knowledge of upstream properties like user-id, or do not need them to make a decision. In this case, a sampling key would be provisioned at the same place it is sampled. In other words, deployments like this never see sampling keys unless they were sampled.

I don't think we can rely on passive only (e.g. provisioning only at the first sampled point), as that spreads control-plane logic to all nodes. In b3 today, we already know people do things like controlling IDs upstream for reasons of control and data access.

Also, on many public-facing sites, I would expect data available at the first hop to not necessarily be available later. Rather than propagating, for example, the initial user-id or IP address externally, it could be simpler for those sites to evaluate the sampling key in the gateway and pass it along. This lowers the data dependencies and also centralizes logic. OTOH, this wouldn't prevent doing the same when a site has the ability and interest to propagate extra fields everywhere. There are pros and cons.

To elaborate, here are two potential deployment options for the auth/cache example (setting TTL aside):

Where your gateway controls all sampling keys, it (provisions) the key and later the auth service [samples] it.

(gateway) -> api -> [auth] -> [cache] -> authdb

The other option is when all the data you need is pushed externally to auth to make that decision, or auth needs no extra data (like user-id). This mode looks like a shortcut, as (provisioning) of the key is delayed until the same hop that [samples] it.

gateway -> api -> [(auth)] -> [cache] -> authdb
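
As a hedged sketch of these two options (the plain-set representation of the keys and the "authcache" key name are assumptions, not this spec's API):

    import java.util.Set;

    // Hypothetical sketch of the two placements; names are illustrative only.
    final class ProvisioningOptions {
      // Option 1: the (gateway) provisions the key unconditionally...
      static void gatewayProvisions(Set<String> samplingKeys) {
        samplingKeys.add("authcache"); // provisioned, but not yet sampled
      }

      // ...and [auth] later samples it, which adds the "spanId" side-effect.
      static void authSamples(Set<String> samplingKeys, boolean policyMatches) {
        if (samplingKeys.contains("authcache") && policyMatches) {
          sample();
        }
      }

      // Option 2: [(auth)] provisions and samples in the same hop, so
      // downstream never sees an unsampled key.
      static void authProvisionsAndSamples(Set<String> samplingKeys, boolean policyMatches) {
        if (policyMatches) {
          samplingKeys.add("authcache");
          sample();
        }
      }

      static void sample() {
        // record a span; downstream would now see "spanId" on the key
      }
    }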

moved from #14 (comment)

@connectwithnara
Collaborator

connectwithnara commented Sep 27, 2019

Adding some more thoughts to this discussion:

(gateway) -> api -> [auth] -> [cache] -> authdb

This implies the sampling_key is sent to all downstream services, potentially allowing more than one ACTIVE participant. For example, could authdb, the last hop, also trigger a new sampling rate, different from auth, the previous ACTIVE participant?

gateway -> api -> [(auth)] -> [cache] -> authdb

In this setup, the presence of the sampling_key implies the request is matched and also sampled, and the downstream services can just apply a ttl or dynamic config rule to participate or not. Checking the spanId is a good verification. However, the tracing system operator will know for a fact, by looking at the sampling_key value, that auth is configured to perform both request matching and sampling. So checking the spanId is a nice-to-have verification.

@codefromthecrypt
Contributor Author

Hi, nara. I don't think nodes overriding rules is specific to either mode. In either mode, if a node decides to ignore its configuration, it can. Exactly what you said about passive is possible in active if you stretch out the service graph and choose different nodes.

Right now, there are tracing systems which trigger based on data generated upstream. For example, the first use of this was the "debug" flag sent by a chrome plugin to zipkin. It is state that pre-empts the tracing, and it needs to be carried somehow. For another user-originated example, consider haystack-blobs, which has a UI where users decide to start attaching request/response data to sampled requests. It is nice that you don't feel the need for these upstream use cases, but at some point, just like in b3 today, secondary decisions will need metadata passed somehow, and that is precisely the point of the sampling header.

@codefromthecrypt
Contributor Author

In other words, it would be extremely dangerous to just look at a sampling key and assume that means it is sampled. Do so at your own risk; that behavior definitely won't be done here.

@connectwithnara
Collaborator

Actually you are right. Because the sampling_key is sent even when the request is not sampled, something needs to tell downstream whether to participate or not. Hence we need two modes, ACTIVE and PASSIVE, and we use spanId as a way to know whether the request was sampled upstream.

That said, for our use case I have implemented provisioning of the sampling_key only if the request is matched and sampled.

Basically, the request matcher and the sampler are always coupled together and will provision the sampling_key only if both pass. This can run in the gateway and also in any downstream service.
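
A minimal sketch of that coupling, assuming hypothetical names (RequestMatcher, Sampler, maybeProvision) rather than this project's API:

    import java.util.Set;

    // Hypothetical sketch; names are illustrative only.
    final class MatchThenSample {
      interface RequestMatcher { boolean matches(Object request); }
      interface Sampler { boolean isSampled(); }

      // The matcher and sampler are coupled: the sampling_key is provisioned
      // only when the request matches AND the sampler fires, so an unsampled
      // key never propagates downstream.
      static void maybeProvision(Object request, RequestMatcher matcher,
          Sampler sampler, Set<String> samplingKeys, String key) {
        if (matcher.matches(request) && sampler.isSampled()) {
          samplingKeys.add(key);
        }
      }
    }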

@codefromthecrypt
Contributor Author

ok, please also read through https://gitter.im/openzipkin/secondary-sampling, as there's some context to cover. It isn't easy to jump into propagation design, but I'd like to help you understand the corner cases.

@connectwithnara
Collaborator

Using this PR, I learned that my understanding of PASSIVE is different from what you have in mind.

Let me review these modes with some real-world use cases.

Consider the following call paths:

Use case 1:

[(gateway)] -> api -> [playback] -> license -> cache -> licensedb

Sample 100 req/s for the endpoint gateway:/play&country=US and collect spans for those requests from services gateway and playback. The participation of gateway and playback should be controlled via a config property.

Solution:
The gateway provisions the sampling_key usa-play-requests for all requests that match /play&country=US.

The following triggers are set up as a result of dynamic configuration:

    .addTrigger("usa-play-requests", "gateway", new Trigger().rps(100))
    .addTrigger("usa-play-requests", "playback", new Trigger().mode(PASSIVE));

The trigger set up on gateway would [start a trace] at a rate of 100/s for requests that have the sampling_key (provisioned).
The trigger set up on playback would [sample requests] that are sampled upstream and have a match for the sampling_key in the dynamic config.
The remaining downstream services, license, cache and licensedb, will see the sampling_key; however, no requests will be sampled because no trigger is set up for them.
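
As a hedged sketch, the PASSIVE trigger on playback might combine both conditions like this (the names are invented, not from the dynamic config API):

    import java.util.Set;

    // Hypothetical sketch of a PASSIVE trigger decision.
    final class PassiveTrigger {
      // Participate only if upstream sampled (spanId present) AND the dynamic
      // config has a trigger for this sampling_key at this service.
      static boolean shouldSample(boolean upstreamSampled,
          Set<String> configuredKeys, String samplingKey) {
        return upstreamSampled && configuredKeys.contains(samplingKey);
      }
    }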

Use case 2:

(gateway) -> [auth] -> [cache] -> authdb

Sample all requests for the endpoint gateway:/auth and collect spans starting at service auth and 1 level down. Participation should be controlled by ttl.

Solution:
The gateway (provisions) the sampling_key all-auth-requests for all requests that match /auth.

The following trigger is set up:

    .addTrigger("all-auth-requests", "auth", new Trigger().sampleAll().ttl(1))

The trigger set up on auth would [start a trace] for all requests that have the sampling_key provisioned.
The downstream service cache would [sample requests] that have the sampling_key and an unexpired ttl.
The ttl then expires, and the next hop, authdb, won't sample any requests.
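
As a hedged sketch of the ttl mechanics here, assuming a decrement-and-propagate rule that is my interpretation rather than something this spec defines:

    // Hypothetical sketch; the decrement rule is an assumption, not spec'd.
    final class TtlParticipation {
      // auth starts the trace and propagates ttl=1; cache sees ttl=1, samples,
      // and propagates ttl=0; authdb sees ttl=0 and does not sample.
      static int sampleAndDecrement(int incomingTtl) {
        if (incomingTtl > 0) {
          // record a span for this hop
        }
        return Math.max(0, incomingTtl - 1); // ttl propagated downstream
      }
    }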

In the above cases, everything except the trigger marked PASSIVE is an ACTIVE trigger.

@codefromthecrypt
Contributor Author

Details look very good @narayaruna

@codefromthecrypt
Contributor Author

openzipkin/brave#997 to allow you to specify the rule you mentioned
