
HIVE-28029: Make unit tests based on TxnCommandsBaseForTests/DbTxnManagerEndToEndTestBase run on Tez #5559

Open
wants to merge 30 commits into master
Conversation

@kasakrisz (Contributor) commented Nov 25, 2024

What changes were proposed in this pull request?

Run tests derived from TxnCommandsBaseForTests and DbTxnManagerEndToEndTestBase using the Tez execution engine instead of MR.
The test configurations are provided by HiveConfForTest.
Added

    hiveConf.set("tez.grouping.max-size", "10");
    hiveConf.set("tez.grouping.min-size", "1");

because many tests expect more than one bucket.
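As rough intuition only (this is not Tez's actual split-grouping algorithm, and the helper name is hypothetical), a tiny maximum group size forces even small test inputs into many groups, so assertions that expect several buckets/splits still hold:

```java
public class GroupingSketch {
    // Hypothetical helper: the minimum number of split groups when each
    // group may hold at most maxSize bytes (ceiling division). Only an
    // illustration of why tez.grouping.max-size=10 yields many groups on
    // tiny test data; Tez's real grouping logic is more involved.
    static long minGroups(long totalInputBytes, long maxSize) {
        return (totalInputBytes + maxSize - 1) / maxSize;
    }

    public static void main(String[] args) {
        // 100 bytes of test data with max-size=10 -> at least 10 groups
        System.out.println(minGroups(100, 10));
    }
}
```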

Why are the changes needed?

Tez is the default execution engine in Hive, and MR is deprecated.

Does this PR introduce any user-facing change?

It mostly affects tests.
The Tez diagnostics status can contain locking- and heartbeating-related error messages.

Is the change a dependency upgrade?

No.

How was this patch tested?

mvn test -Dtest=TestDbTxnManager2,TestDbTxnManagerIsolationProperties -pl ql
mvn test -Dtest=TestTxnAddPartition,TestTxnCommands,TestTxnCommands2,TestTxnCommands3,TestTxnCommandsForMmTable,TestTxnConcatenate,TestTxnExIm,TestTxnLoadData,TestTxnNoBuckets,TestMSCKRepairOnAcid,TestHiveStrictManagedMigration,TestUpgradeTool -pl ql

@kasakrisz kasakrisz marked this pull request as draft November 25, 2024 13:49
@github-actions github-actions bot requested a review from abstractdog November 25, 2024 13:49
TezBaseForTests.class.getCanonicalName() + "-" + System.currentTimeMillis())
.getPath().replaceAll("\\\\", "/");

protected void setupTez(HiveConf conf) {
@abstractdog (Contributor) commented Nov 25, 2024

Instead of this, can you please try HiveConfForTest if possible? That has been worked on earlier.

With that, there is no need to:

setupTez(hiveConf)

instead you can:

new HiveConfForTest(getClass())

Contributor (Author):

I switched to HiveConfForTest and it works.

@@ -280,6 +280,7 @@ public int monitorExecution() {
// best effort
}
console.printError("Execution has failed. stack trace: " + ExceptionUtils.getStackTrace(e));
diagnostics.append(e.getMessage());
Member:

Is that an intentional change? Just asking.

Contributor (Author):

It is necessary to deliver the error message from Tez to the client (HS2). MR has this feature and the test TestTxnCommands2.testFailHeartbeater exploits it.

@abstractdog (Contributor) commented Dec 6, 2024

While I find this very useful, I'm a bit confused why it's needed. Is there a chance you can repro a failure to see how the error messages are handled?
I mean:

So in the finally block, the final dagStatus (given by the Tez AM) also contains diagnostics, which is then printed to the console and also added to diagnostics.
So this change assumes that there is an error which is swallowed here and is not part of the final dagStatus.
If that's the case, it's okay to append to the diagnostics; we just need to understand what happened.

depending on the nature of this problem, we can also handle this as a follow-up to make this patch clearer

Contributor (Author):

The code you pointed to is about copying messages from the DAGStatus instance. However, in this special test case that instance is null because the LockException is thrown before pulling the status.

status = dagClient.getDAGStatus(opts, checkInterval);

So it seems that heartbeat-related checks are handled independently from the Tez job status, and the response coming from the first one is not fed into diagnostics.

@abstractdog (Contributor) commented Dec 6, 2024

I see, this is an edge case where we don't even reach real DAG monitoring then.
Not sure where the diagnostics object is used, but does it make sense to use the full trace, like:

String reportedException = "Execution has failed, stack trace: " + ExceptionUtils.getStackTrace(e);
console.printError(reportedException);
diagnostics.append(reportedException);

@@ -2450,6 +2435,9 @@ public void testCleanerForTxnToWriteId() throws Exception {
// Keep an open txn which refers to the aborted txn.
Context ctx = new Context(hiveConf);
HiveTxnManager txnMgr = TxnManagerFactory.getTxnManagerFactory().getTxnManager(hiveConf);
// Txn is not considered committed or aborted until TXN_OPENTXN_TIMEOUT expires
// See MinOpenTxnIdWaterMarkFunction, OpenTxnTimeoutLowBoundaryTxnIdHandler
waitUntilAllTxnFinished();
Member:

why do we need to wait for all txn to commit here?

Contributor (Author):

The next command is txnMgr.openTxn(ctx, "u1");. During txn opening a new record is inserted into MIN_HISTORY_LEVEL:

addTxnToMinHistoryLevel(jdbcResource.getJdbcTemplate().getJdbcTemplate(), maxBatchSize, txnIds, minOpenTxnId);

This has an impact on cleaning records from TXN_TO_WRITE_ID.

The column value MHL_MIN_OPEN_TXNID comes from
https://github.com/apache/hive/blob/cf83d4751271622a9a700c6f2330dbfff38801d2/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/txn/jdbc/functions/OpenTxnsFunction.java#L101C26-L101C55
It is calculated by taking the smaller of minOpenTxn and lowWaterMark. minOpenTxn is the last open txn id, which is null here, so we go with lowWaterMark.
lowWaterMark is queried by

return "SELECT MAX(\"TXN_ID\") FROM \"TXNS\" WHERE \"TXN_STARTED\" < (" + getEpochFn(databaseProduct) + " - "
    + openTxnTimeOutMillis + ")";

Maybe my test computer is running the queries faster with Tez than with MR and openTxnTimeOutMillis has not expired.
When I added Thread.sleep(openTxnTimeOutMillis) the test passed, so I assume this is a timing issue.
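A minimal self-contained sketch of the computation described above, with hypothetical method names (the real logic lives in OpenTxnsFunction and the SQL shown):

```java
public class MinHistorySketch {
    // Hypothetical reconstruction of the MHL_MIN_OPEN_TXNID computation:
    // the smaller of minOpenTxn and lowWaterMark, falling back to
    // lowWaterMark when there is no open txn (minOpenTxn == null).
    static long minOpenTxnId(Long minOpenTxn, long lowWaterMark) {
        return minOpenTxn == null ? lowWaterMark : Math.min(minOpenTxn, lowWaterMark);
    }

    // lowWaterMark only sees txns whose TXN_STARTED is older than the
    // open-txn timeout, which is why a fast engine (Tez) can observe a
    // smaller value than a slow one (MR) in the same test. Returns 0 when
    // no txn qualifies (the real SQL MAX would be null).
    static long lowWaterMark(long[] txnIds, long[] txnStartedMillis,
                             long nowMillis, long timeoutMillis) {
        long max = 0;
        for (int i = 0; i < txnIds.length; i++) {
            if (txnStartedMillis[i] < nowMillis - timeoutMillis) {
                max = Math.max(max, txnIds[i]);
            }
        }
        return max;
    }
}
```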

Member:

this part is a simulation of the following sequence of events:

commit-abort-open-abort-commit

I am not sure why we need to wait here, since commit & abort are final states

Contributor (Author):

You can repro the issue on the master branch using MR. Just set
metastore.txn.opentxn.timeout high enough. The default is 1 sec, and the insert statements in the test run within 1 sec with Tez but take more than 1 sec with MR on my test PC.

Member:

I'll check

Member:

I'm sorry, it might take a bit more time, but I don't want to hold it as a blocker. So, if everything else is resolved, let's merge it and I'll come back with the findings later if any.

Member:

@kasakrisz, I've checked out your branch and removed waitUntilAllTxnFinished, but still can't repro, even after increasing metastore.txn.opentxn.timeout to 2 sec.

Member:

Note that runCleaner has the same wait logic before execution.

@@ -91,8 +91,8 @@ public void testRenameTable() throws Exception {
"s/delta_0000001_0000001_0000/bucket_00000_0"},
{"{\"writeid\":2,\"bucketid\":536870913,\"rowid\":0}\t4\t6",
"s/delta_0000002_0000002_0001/bucket_00000_0"}};
checkResult(expected, testQuery, false, "check data", LOG);

List<String> rs = runStatementOnDriver(testQuery);
Member:

Should checkResult run the query and do the validation?

Contributor (Author):

checkResult also checks whether the mappers are vectorized, which is not required here.
We could extract it to a new method, but in that case I would first rename the current checkResult to something like checkResultsAndVectorization, and it is widely used.
WDYT?

Member:

Maybe we can keep an overloaded version without the vectorization check; however, I am OK with the rename as well.

Contributor (Author):

Added checkResultsAndVectorization and checkResult. The first one calls checkResult.

Member:

👍

{"{\"writeid\":0,\"bucketid\":537001984,\"rowid\":1}\t2\t4", "warehouse/t/000002_0"},
{"{\"writeid\":0,\"bucketid\":536870912,\"rowid\":0}\t5\t6", "warehouse/t/000000_0"},
{"{\"writeid\":0,\"bucketid\":536870912,\"rowid\":1}\t9\t10", "warehouse/t/000000_0"},
{"{\"writeid\":0,\"bucketid\":536870912,\"rowid\":1}\t1\t2", "warehouse/t/HIVE_UNION_SUBDIR_1/000000_0"},
Member:

now we have just 1 bucket?

Contributor (Author):

No. We still have several buckets. They are in separate subdirectories though, and HIVE_UNION_SUBDIR_2 has 2:
t/HIVE_UNION_SUBDIR_2/000000_0
t/HIVE_UNION_SUBDIR_2/000001_0

IIUC Hive compiles the union operator in a different way with Tez, hence the HIVE_UNION_SUBDIR directories.

Member:

I do not see 000001_0 in the asserts, only "bucketid":536870912.
After compaction there is also only bucket_00000.

Contributor (Author):

Oh, OK. In the case of bucket files which are not in ACID format, the row_id is generated at read time. The bucket id comes from the file name:
https://github.com/apache/hive/blob/c27d31722c2a7426f3236d3d892dbe1e206e840d/ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java#L547C5-L547C44
In the case of ACID writes it comes from the taskId:
https://github.com/apache/hive/blob/c27d31722c2a7426f3236d3d892dbe1e206e840d/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L969C13-L969C26

With Tez the taskIds are mapped differently to the files.

I changed the update statement to update more than one record so that the new delta contains more than one bucket.
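For reference, the encoded "bucketid" values discussed above can be decoded with simple bit arithmetic. This sketch assumes the BucketCodec V1 layout (version in the top bits, bucket id in bits 27..16, statement id in the low 12 bits); the class and method names are illustrative:

```java
public class BucketIdSketch {
    // Decode the bucket number from an encoded ROW__ID bucketid value,
    // assuming the BucketCodec V1 bit layout described above.
    static int bucketId(int encoded) {
        return (encoded >>> 16) & 0xFFF;
    }

    // Decode the statement id from the low 12 bits.
    static int statementId(int encoded) {
        return encoded & 0xFFF;
    }

    public static void main(String[] args) {
        // Values taken from the asserts discussed in this thread:
        System.out.println(bucketId(536870912));    // bucket 0
        System.out.println(bucketId(537001984));    // bucket 2
        System.out.println(statementId(536870913)); // bucket 0, statement 1
    }
}
```

This is why "bucketid":536870913 in the expected rows still refers to bucket 0: it differs from 536870912 only in the statement id bits.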

Member:

👍
