
HIVE-28029: Make unit tests based on TxnCommandsBaseForTests/DbTxnManagerEndToEndTestBase run on Tez #5559

Open
wants to merge 30 commits into master
Conversation

@kasakrisz (Contributor) commented Nov 25, 2024

What changes were proposed in this pull request?

Run tests derived from TxnCommandsBaseForTests and DbTxnManagerEndToEndTestBase using the Tez execution engine instead of MR.
The test configurations are provided by HiveConfForTest.
Added

    hiveConf.set("tez.grouping.max-size", "10");
    hiveConf.set("tez.grouping.min-size", "1");

because many tests expect more than one bucket.
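As rough intuition only (this is not Tez's actual split-grouping algorithm, and the helper name is hypothetical), a tiny maximum group size forces even small test inputs into many groups, so assertions that expect several buckets/splits still hold:

```java
public class GroupingSketch {
    // Hypothetical helper: the minimum number of split groups when each
    // group may hold at most maxSize bytes (ceiling division). Only an
    // illustration of why tez.grouping.max-size=10 yields many groups on
    // tiny test data; Tez's real grouping logic is more involved.
    static long minGroups(long totalInputBytes, long maxSize) {
        return (totalInputBytes + maxSize - 1) / maxSize;
    }

    public static void main(String[] args) {
        // 100 bytes of test data with max-size=10 -> at least 10 groups
        System.out.println(minGroups(100, 10));
    }
}
```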

Why are the changes needed?

Tez is the default execution engine in Hive, and MR is deprecated.

Does this PR introduce any user-facing change?

It mostly affects tests.
The Tez diagnostics status can contain locking- and heartbeating-related error messages.

Is the change a dependency upgrade?

No.

How was this patch tested?

mvn test -Dtest=TestDbTxnManager2,TestDbTxnManagerIsolationProperties -pl ql
mvn test -Dtest=TestTxnAddPartition,TestTxnCommands,TestTxnCommands2,TestTxnCommands3,TestTxnCommandsForMmTable,TestTxnConcatenate,TestTxnExIm,TestTxnLoadData,TestTxnNoBuckets,TestMSCKRepairOnAcid,TestHiveStrictManagedMigration,TestUpgradeTool -pl ql

@kasakrisz kasakrisz marked this pull request as draft November 25, 2024 13:49
@github-actions github-actions bot requested a review from abstractdog November 25, 2024 13:49
TezBaseForTests.class.getCanonicalName() + "-" + System.currentTimeMillis())
.getPath().replaceAll("\\\\", "/");

protected void setupTez(HiveConf conf) {
@abstractdog (Contributor) commented Nov 25, 2024

Instead of this, can you please try HiveConfForTest if possible? That has been worked on earlier.

With that, there is no need to:

setupTez(hiveConf)

instead you can:

new HiveConfForTest(getClass())

Contributor (Author):

I switched to HiveConfForTest and it works.

@@ -280,6 +280,7 @@ public int monitorExecution() {
// best effort
}
console.printError("Execution has failed. stack trace: " + ExceptionUtils.getStackTrace(e));
diagnostics.append(e.getMessage());
Member:

Is that an intentional change? Just asking.

Contributor (Author):

It is necessary to deliver the error message from Tez to the client (HS2). MR has this feature and the test TestTxnCommands2.testFailHeartbeater exploits it.

@abstractdog (Contributor) commented Dec 6, 2024

While I find this very useful, I'm a bit confused why it's needed. Is there a chance you can repro a failure to see how the error messages are handled?
I mean:

So in the finally block, the final dagStatus (given by the Tez AM) also contains diagnostics, which is then printed to the console and also added to diagnostics.
So this change assumes that there is an error which is swallowed here and is not part of the final dagStatus.
If that's the case, it's okay to append to the diagnostics; we just need to understand what happened.

depending on the nature of this problem, we can also handle this as a follow-up to make this patch clearer

Contributor (Author):

The code you pointed to is about copying messages from the DAGStatus instance. However, in this special test case that instance is null because the LockException is thrown before pulling the status.

status = dagClient.getDAGStatus(opts, checkInterval);

So it seems that heartbeat-related checks are handled independently from the Tez job status, and the response coming from the first one is not fed into diagnostics.

@abstractdog (Contributor) commented Dec 6, 2024

I see, this is an edge case where we don't even reach real DAG monitoring then.
Not sure where the diagnostics object is used, but does it make sense to use the full trace, like:

String reportedException = "Execution has failed, stack trace: " + ExceptionUtils.getStackTrace(e);
console.printError(reportedException);
diagnostics.append(reportedException);

@@ -2450,6 +2435,9 @@ public void testCleanerForTxnToWriteId() throws Exception {
// Keep an open txn which refers to the aborted txn.
Context ctx = new Context(hiveConf);
HiveTxnManager txnMgr = TxnManagerFactory.getTxnManagerFactory().getTxnManager(hiveConf);
// Txn is not considered committed or aborted until TXN_OPENTXN_TIMEOUT expires
// See MinOpenTxnIdWaterMarkFunction, OpenTxnTimeoutLowBoundaryTxnIdHandler
waitUntilAllTxnFinished();
Member:

why do we need to wait for all txn to commit here?

Contributor (Author):

The next command is txnMgr.openTxn(ctx, "u1");. During txn opening a new record is inserted into MIN_HISTORY_LEVEL:

addTxnToMinHistoryLevel(jdbcResource.getJdbcTemplate().getJdbcTemplate(), maxBatchSize, txnIds, minOpenTxnId);

This has an impact on cleaning records from TXN_TO_WRITE_ID.

The column value MHL_MIN_OPEN_TXNID comes from
https://github.com/apache/hive/blob/cf83d4751271622a9a700c6f2330dbfff38801d2/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/txn/jdbc/functions/OpenTxnsFunction.java#L101C26-L101C55
It is calculated by taking the smaller of minOpenTxn and lowWaterMark. minOpenTxn is the last open txn id, which is null here, so we go with lowWaterMark.
lowWaterMark is queried by

return "SELECT MAX(\"TXN_ID\") FROM \"TXNS\" WHERE \"TXN_STARTED\" < (" + getEpochFn(databaseProduct) + " - "
    + openTxnTimeOutMillis + ")";

Maybe my test computer is running the queries faster with Tez than with MR and openTxnTimeOutMillis has not expired.
When I added Thread.sleep(openTxnTimeOutMillis) the test passed, so I assume this is a timing issue.
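A minimal self-contained sketch of the computation described above, with hypothetical method names (the real logic lives in OpenTxnsFunction and the SQL shown):

```java
public class MinHistorySketch {
    // Hypothetical reconstruction of the MHL_MIN_OPEN_TXNID computation:
    // the smaller of minOpenTxn and lowWaterMark, falling back to
    // lowWaterMark when there is no open txn (minOpenTxn == null).
    static long minOpenTxnId(Long minOpenTxn, long lowWaterMark) {
        return minOpenTxn == null ? lowWaterMark : Math.min(minOpenTxn, lowWaterMark);
    }

    // lowWaterMark only sees txns whose TXN_STARTED is older than the
    // open-txn timeout, which is why a fast engine (Tez) can observe a
    // smaller value than a slow one (MR) in the same test. Returns 0 when
    // no txn qualifies (the real SQL MAX would be null).
    static long lowWaterMark(long[] txnIds, long[] txnStartedMillis,
                             long nowMillis, long timeoutMillis) {
        long max = 0;
        for (int i = 0; i < txnIds.length; i++) {
            if (txnStartedMillis[i] < nowMillis - timeoutMillis) {
                max = Math.max(max, txnIds[i]);
            }
        }
        return max;
    }
}
```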

Member:

this part is a simulation of the following sequence of events:

commit-abort-open-abort-commit

I am not sure why we need to wait here, since commit & abort are final states

Contributor (Author):

You can repro the issue on the master branch using MR. Just set
metastore.txn.opentxn.timeout high enough. The default is 1 sec, and the insert statements in the test run within 1 sec with Tez but take more than 1 sec with MR on my test PC.

Member:

I'll check

Member:

I'm sorry, it might take a bit more time, but I don't want to hold it as a blocker. So, if everything else is resolved, let's merge it and I'll come back with the findings later if any.

Member:

@kasakrisz, I've checked out your branch and removed waitUntilAllTxnFinished, but still can't repro, even after increasing metastore.txn.opentxn.timeout to 2 sec.

Member:

Note that runCleaner has the same wait logic before execution.

@@ -91,8 +91,8 @@ public void testRenameTable() throws Exception {
"s/delta_0000001_0000001_0000/bucket_00000_0"},
{"{\"writeid\":2,\"bucketid\":536870913,\"rowid\":0}\t4\t6",
"s/delta_0000002_0000002_0001/bucket_00000_0"}};
checkResult(expected, testQuery, false, "check data", LOG);

List<String> rs = runStatementOnDriver(testQuery);
Member:

Should checkResult run the query and do the validation?

Contributor (Author):

checkResult also checks whether the mappers are vectorized, which is not required here.
We could extract it to a new method, but in that case I would first rename the current checkResult to something like checkResultsAndVectorization, and it is widely used.
WDYT?

Member:

Maybe we can keep an overloaded version without the vectorization check; however, I am OK with the rename as well.

Contributor (Author):

Added checkResultsAndVectorization and checkResult. The first one calls checkResult.

Member:

👍

{"{\"writeid\":0,\"bucketid\":537001984,\"rowid\":1}\t2\t4", "warehouse/t/000002_0"},
{"{\"writeid\":0,\"bucketid\":536870912,\"rowid\":0}\t5\t6", "warehouse/t/000000_0"},
{"{\"writeid\":0,\"bucketid\":536870912,\"rowid\":1}\t9\t10", "warehouse/t/000000_0"},
{"{\"writeid\":0,\"bucketid\":536870912,\"rowid\":1}\t1\t2", "warehouse/t/HIVE_UNION_SUBDIR_1/000000_0"},
Member:

now we have just 1 bucket?

Contributor (Author):

No. We still have several buckets. They are in separate subdirectories though, and HIVE_UNION_SUBDIR_2 has 2:
t/HIVE_UNION_SUBDIR_2/000000_0
t/HIVE_UNION_SUBDIR_2/000001_0

IIUC Hive compiles the union operator in a different way with Tez, hence the HIVE_UNION_SUBDIR directories.

Member:

I do not see 000001_0 in the asserts, only "bucketid":536870912.
After compaction there is also only bucket_00000.

Contributor (Author):

Oh, OK. In the case of bucket files which are not in ACID format, the row_id is generated at read time. The bucket id comes from the file name:
https://github.com/apache/hive/blob/c27d31722c2a7426f3236d3d892dbe1e206e840d/ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java#L547C5-L547C44
In the case of ACID writes it comes from the taskId:
https://github.com/apache/hive/blob/c27d31722c2a7426f3236d3d892dbe1e206e840d/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L969C13-L969C26

With Tez the taskIds are mapped differently to the files.

I changed the update statement to update more than one record so that the new delta contains more than one bucket.
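For reference, the encoded "bucketid" values discussed above can be decoded with simple bit arithmetic. This sketch assumes the BucketCodec V1 layout (version in the top bits, bucket id in bits 27..16, statement id in the low 12 bits); the class and method names are illustrative:

```java
public class BucketIdSketch {
    // Decode the bucket number from an encoded ROW__ID bucketid value,
    // assuming the BucketCodec V1 bit layout described above.
    static int bucketId(int encoded) {
        return (encoded >>> 16) & 0xFFF;
    }

    // Decode the statement id from the low 12 bits.
    static int statementId(int encoded) {
        return encoded & 0xFFF;
    }

    public static void main(String[] args) {
        // Values taken from the asserts discussed in this thread:
        System.out.println(bucketId(536870912));    // bucket 0
        System.out.println(bucketId(537001984));    // bucket 2
        System.out.println(statementId(536870913)); // bucket 0, statement 1
    }
}
```

This is why "bucketid":536870913 in the expected rows still refers to bucket 0: it differs from 536870912 only in the statement id bits.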

Member:

👍
