[controller] Improved the perf of DaVinci push status polling for inc push (#1059)

Today, the Controller checks the push status system store partition by partition, as follows:
1. Iterate over the partitions and get all the subscribed DaVinci instances for each partition.
2. Check whether each instance's partition status is completed.
3. If not, check the liveness of the instance to decide whether its status should be ignored.

If a DaVinci store is used in an unpartitioned way, the total number of liveness-check calls to the
push status system store will be partition_num * host_num. That can be a very high number, and it can
time out the Controller's status polling request for inc push, which still does a real-time status check
against the push status system store.
With this change, the Controller skips the duplicate liveness calls, so the total call count for each
status polling request will be the partition count + the number of unique host names.
I didn't batch the requests because I would like to reduce the KPS to the push status store to avoid
quota issues, and with this reduction the perf should be good enough.
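
To make the call-count reduction concrete, here is a minimal, self-contained sketch of the same memoization idea. It is an illustration and not the Venice code: the class name `LivenessCacheSketch`, the method `countLivenessCalls`, and the in-memory liveness check are hypothetical stand-ins for the remote call to the push status store, and the sketch assumes the worst case where every host has an incomplete status on every partition.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

public class LivenessCacheSketch {
  /**
   * Counts how many liveness lookups one polling pass performs.
   * incompleteHostsPerPartition.get(p) lists the hosts whose status on partition p is not completed.
   */
  static int countLivenessCalls(List<List<String>> incompleteHostsPerPartition, boolean useCache) {
    AtomicInteger calls = new AtomicInteger();
    // Hypothetical stand-in for the remote liveness check: counts calls and reports every host as alive.
    Predicate<String> isInstanceAlive = host -> {
      calls.incrementAndGet();
      return true;
    };
    // One cache per polling pass, so liveness is re-read on the next pass.
    Map<String, Boolean> livenessCache = new HashMap<>();
    for (List<String> hosts: incompleteHostsPerPartition) {
      for (String host: hosts) {
        if (useCache) {
          // The lookup runs only the first time a host is seen in this pass.
          livenessCache.computeIfAbsent(host, isInstanceAlive::test);
        } else {
          isInstanceAlive.test(host);
        }
      }
    }
    return calls.get();
  }

  public static void main(String[] args) {
    // Unpartitioned usage: all 3 hosts subscribe to all 4 partitions.
    List<String> hosts = Arrays.asList("host-1", "host-2", "host-3");
    List<List<String>> partitions = Collections.nCopies(4, hosts);
    System.out.println("without cache: " + countLivenessCalls(partitions, false)); // 4 * 3 = 12
    System.out.println("with cache: " + countLivenessCalls(partitions, true)); // 3 unique hosts
  }
}
```

In this sketch the per-partition status reads are not counted; adding them gives the partition count + unique host count total described above. Because the cache lives only for one polling pass, a host that goes offline is still detected on the next poll.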
gaojieliu authored Jul 11, 2024
1 parent 3bb03f5 commit 9af4947
Showing 1 changed file with 8 additions and 1 deletion.
@@ -3,6 +3,7 @@
 import com.linkedin.venice.exceptions.VeniceException;
 import com.linkedin.venice.meta.Version;
 import com.linkedin.venice.pushstatushelper.PushStatusStoreReader;
+import java.util.HashMap;
 import java.util.HashSet;
 import java.util.Map;
 import java.util.Optional;
@@ -230,6 +231,10 @@ public static ExecutionStatusWithDetails getDaVinciPartitionLevelPushStatusAndDe
     int completedReplicaCount = 0;
     Set<String> offlineInstanceList = new HashSet<>();
     Set<Integer> incompletePartition = new HashSet<>();
+    /**
+     * This cache is used to reduce the duplicate calls for liveness check as one host can host multiple partitions.
+     */
+    Map<String, Boolean> instanceLivenessCache = new HashMap<>();
     for (int partitionId = 0; partitionId < partitionCount; partitionId++) {
       Map<CharSequence, Integer> instances =
           reader.getPartitionStatus(storeName, version, partitionId, incrementalPushVersion);
@@ -243,7 +248,9 @@ public static ExecutionStatusWithDetails getDaVinciPartitionLevelPushStatusAndDe
           completedReplicaCount++;
           continue;
         }
-        boolean isInstanceAlive = reader.isInstanceAlive(storeName, entry.getKey().toString());
+        String instanceName = entry.getKey().toString();
+        boolean isInstanceAlive = instanceLivenessCache
+            .computeIfAbsent(instanceName, ignored -> reader.isInstanceAlive(storeName, instanceName));
         if (!isInstanceAlive) {
           // Keep at most 5 offline instances for logging purpose.
           if (offlineInstanceList.size() < 5) {
