HIVE-28029: Make unit tests based on TxnCommandsBaseForTests/DbTxnManagerEndToEndTestBase run on Tez #5559
base: master
Conversation
TezBaseForTests.class.getCanonicalName() + "-" + System.currentTimeMillis())
    .getPath().replaceAll("\\\\", "/");

protected void setupTez(HiveConf conf) {
instead of this, can you please try HiveConfForTest if possible? That has been worked on earlier.
with that there is no need to:
setupTez(hiveConf)
instead you can:
new HiveConfForTest(getClass())
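A minimal sketch of what that switch could look like in a test base class (the package of HiveConfForTest and the exact settings it pre-populates are assumptions here; it is assumed to extend HiveConf and carry the Tez/test defaults, so its import is omitted):

import org.apache.hadoop.hive.conf.HiveConf;

public abstract class TxnTestBaseSketch {
  protected HiveConf hiveConf;

  protected void setUpInternal() throws Exception {
    // hypothetical replacement for "hiveConf = new HiveConf(...); ... setupTez(hiveConf);"
    hiveConf = new HiveConfForTest(getClass());
    // test-specific overrides can still be layered on top
    hiveConf.setBoolVar(HiveConf.ConfVars.HIVE_SUPPORT_CONCURRENCY, true);
  }
}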
I switched to HiveConfForTest and it works.
@@ -280,6 +280,7 @@ public int monitorExecution() {
        // best effort
      }
      console.printError("Execution has failed. stack trace: " + ExceptionUtils.getStackTrace(e));
+     diagnostics.append(e.getMessage());
is that an intentional change? just asking
It is necessary to deliver the error message from Tez to the client (HS2). MR has this feature, and the test TestTxnCommands2.testFailHeartbeater relies on it.
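For context, a hedged sketch of the pattern such a test relies on, inside a JUnit test method (the property key, the statement, and the helper names are assumptions, not the actual test body):

// force the heartbeater to fail on purpose (assumed property key)
hiveConf.setBoolean("hive.test.fail.heartbeater", true);
try {
  // any ACID statement that needs heartbeating; runStatementOnDriver is the assumed test-base helper
  runStatementOnDriver("update acidTbl set b = 2 where a = 1");
  Assert.fail("expected the statement to fail");
} catch (Exception clientError) {
  // this only works if the heartbeater's error text reaches the client;
  // on Tez that is what diagnostics.append(e.getMessage()) above makes possible
  Assert.assertTrue(clientError.getMessage().contains("heartbeat"));
}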
while I find this very useful, I'm a bit confused why it's needed. Is there a chance you can repro a failure to see how the error messages are handled?
I mean:
hive/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/monitoring/TezJobMonitor.java
Line 296 in cf83d47
if (rc != 0 && status != null) {
so in the finally block, the final dagStatus (given by the tez AM) also contains diagnostics, which is then printed to the console and also added to diagnostics
so this change assumes that there is an error which is swallowed here and is not part of the final dagStatus
if that's the case, it's okay to append to the diagnostics, we just need to understand what happened
depending on the nature of this problem, we can also handle this as a follow-up to make this patch clearer
The code you pointed to is about copying messages from the DAGStatus instance. However, in this special test case that instance is null because the LockException is thrown before pulling the status.
hive/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/monitoring/TezJobMonitor.java
Line 171 in cf83d47
context.checkHeartbeaterLockException();
hive/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/monitoring/TezJobMonitor.java
Line 183 in cf83d47
status = dagClient.getDAGStatus(opts, checkInterval);
So it seems that heartbeat-related checks are handled independently from the Tez job status, and the response coming from the first one is not fed into diagnostics.
I see, this is an edge case where we don't even reach real DAG monitoring then.
not sure where the diagnostics object is used, but does it make sense to use the full trace, like:
String reportedException = "Execution has failed, stack trace: " + ExceptionUtils.getStackTrace(e);
console.printError(reportedException);
diagnostics.append(reportedException);
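A self-contained sketch of the shape being suggested (class and field names here are placeholders; the real TezJobMonitor prints via a SessionState.LogHelper console and keeps more state around the polling loop):

import org.apache.commons.lang3.exception.ExceptionUtils;

class MonitorSketch {
  private final StringBuilder diagnostics = new StringBuilder();

  int monitorExecution(Runnable pollDag) {
    try {
      pollDag.run(); // stands in for the DAG status polling loop
      return 0;
    } catch (Exception e) {
      // build the message once, then both print it and keep it for the diagnostics
      String reportedException = "Execution has failed, stack trace: " + ExceptionUtils.getStackTrace(e);
      System.err.println(reportedException); // the real code calls console.printError(...)
      diagnostics.append(reportedException);
      return 1;
    }
  }
}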
@@ -2450,6 +2435,9 @@ public void testCleanerForTxnToWriteId() throws Exception {
    // Keep an open txn which refers to the aborted txn.
    Context ctx = new Context(hiveConf);
    HiveTxnManager txnMgr = TxnManagerFactory.getTxnManagerFactory().getTxnManager(hiveConf);
+   // Txn is not considered committed or aborted until TXN_OPENTXN_TIMEOUT expires
+   // See MinOpenTxnIdWaterMarkFunction, OpenTxnTimeoutLowBoundaryTxnIdHandler
+   waitUntilAllTxnFinished();
why do we need to wait for all txn to commit here?
The next command is txnMgr.openTxn(ctx, "u1"). During txn opening a new record is inserted into MIN_HISTORY_LEVEL:
Line 150 in cf83d47
addTxnToMinHistoryLevel(jdbcResource.getJdbcTemplate().getJdbcTemplate(), maxBatchSize, txnIds, minOpenTxnId);
This has an impact on cleaning records from TXN_TO_WRITE_ID. The column value MHL_MIN_OPEN_TXNID is coming from
https://github.com/apache/hive/blob/cf83d4751271622a9a700c6f2330dbfff38801d2/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/txn/jdbc/functions/OpenTxnsFunction.java#L101C26-L101C55
It is calculated by taking the smaller value of minOpenTxn and lowWaterMark. minOpenTxn is the last open txn id, which is null here, so we go with lowWaterMark.
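In other words, roughly (a hedged paraphrase of that calculation, not the actual metastore code):

static long minOpenTxnIdWaterMark(long minOpenTxn, long lowWaterMark) {
  // use the last open txn id when there is one, otherwise fall back to the low water mark
  return minOpenTxn > 0 ? Math.min(minOpenTxn, lowWaterMark) : lowWaterMark;
}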
lowWaterMark is queried by
Lines 42 to 43 in cf83d47
return "SELECT MAX(\"TXN_ID\") FROM \"TXNS\" WHERE \"TXN_STARTED\" < (" + getEpochFn(databaseProduct) + " - "
    + openTxnTimeOutMillis + ")";
Maybe my test computer is running the queries with Tez faster than with MR and openTxnTimeOutMillis has not expired yet. When I added Thread.sleep(openTxnTimeOutMillis) the test passed, so I assume this is a timing issue.
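A hedged sketch of the kind of wait this amounts to (waitUntilAllTxnFinished itself may well be implemented differently, e.g. by polling the TXNS table; the MetastoreConf calls are assumed to be usable from the test):

import java.util.concurrent.TimeUnit;
import org.apache.hadoop.hive.metastore.conf.MetastoreConf;

// wait out TXN_OPENTXN_TIMEOUT so the earlier txns drop below the low water mark
private void waitOutOpenTxnTimeout() throws InterruptedException {
  long openTxnTimeoutMillis = MetastoreConf.getTimeVar(hiveConf,
      MetastoreConf.ConfVars.TXN_OPENTXN_TIMEOUT, TimeUnit.MILLISECONDS);
  Thread.sleep(openTxnTimeoutMillis);
}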
this part is a simulation of the following sequence of events:
commit-abort-open-abort-commit
I am not sure why we need to wait here since commit & abort are finite states
You can repro the issue on the master branch using MR. Just set metastore.txn.opentxn.timeout high enough. The default is 1 sec, and the insert statements in the test run within 1 sec with Tez but take more than 1 sec with MR on my test PC.
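In other words, something like this in the test setup (a sketch; the value is arbitrary, just larger than the inserts' runtime):

// raise the open-txn timeout so the test's inserts finish well within it
MetastoreConf.setTimeVar(hiveConf, MetastoreConf.ConfVars.TXN_OPENTXN_TIMEOUT, 10, TimeUnit.SECONDS);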
I'll check
I'm sorry, it might take a bit more time, but I don't want to hold it as a blocker. So, if everything else is resolved, let's merge it and I'll come back with the findings later if any.
@kasakrisz, I've checked out your branch and removed waitUntilAllTxnFinished, but still can't repro, even after increasing metastore.txn.opentxn.timeout to 2 sec.
note that runCleaner has the same wait logic before execution
@@ -91,8 +91,8 @@ public void testRenameTable() throws Exception {
        "s/delta_0000001_0000001_0000/bucket_00000_0"},
      {"{\"writeid\":2,\"bucketid\":536870913,\"rowid\":0}\t4\t6",
        "s/delta_0000002_0000002_0001/bucket_00000_0"}};
-   checkResult(expected, testQuery, false, "check data", LOG);
+   List<String> rs = runStatementOnDriver(testQuery);
should the checkResult run the query and do the validation?
checkResult also checks whether the mappers are vectorized. It is not required here.
We can extract it to a new method, but in that case I would first rename the current checkResult to something like checkResultsAndVectorization, and it is widely used.
WDYT?
maybe we can keep an overloaded version without the vectorization check, however, I am ok with the rename as well
Added checkResultsAndVectorization and checkResult. The first one calls checkResult.
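A hedged sketch of how that split could look inside the test base (the actual signatures in the patch may differ; assertVectorized and checkExpected are assumed existing helpers):

void checkResultsAndVectorization(String[][] expectedResult, String query, boolean isVectorized,
    String msg, Logger log) throws Exception {
  // keep the old behavior: result comparison plus the vectorization assertion
  checkResult(expectedResult, query, msg, log);
  assertVectorized(isVectorized, query);
}

void checkResult(String[][] expectedResult, String query, String msg, Logger log) throws Exception {
  // only run the query and compare the rows, no vectorization check
  List<String> rs = runStatementOnDriver(query);
  checkExpected(rs, expectedResult, msg, log);
}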
👍
.../java/org/apache/hadoop/hive/metastore/txn/jdbc/functions/MinOpenTxnIdWaterMarkFunction.java
Outdated
{"{\"writeid\":0,\"bucketid\":537001984,\"rowid\":1}\t2\t4", "warehouse/t/000002_0"}, | ||
{"{\"writeid\":0,\"bucketid\":536870912,\"rowid\":0}\t5\t6", "warehouse/t/000000_0"}, | ||
{"{\"writeid\":0,\"bucketid\":536870912,\"rowid\":1}\t9\t10", "warehouse/t/000000_0"}, | ||
{"{\"writeid\":0,\"bucketid\":536870912,\"rowid\":1}\t1\t2", "warehouse/t/HIVE_UNION_SUBDIR_1/000000_0"}, |
now we have just 1 bucket?
No. We still have several buckets. They are in separate subdirectories though; HIVE_UNION_SUBDIR_2, for example, has 2:
t/HIVE_UNION_SUBDIR_2/000000_0
t/HIVE_UNION_SUBDIR_2/000001_0
IIUC Hive compiles the union operator in a different way with Tez, hence the HIVE_UNION_SUBDIR_* directories.
I do not see 000001_0 in the asserts, only "bucketid":536870912; after compaction also only bucket_00000.
Oh, ok. In case of bucket files which are not in ACID format, the row id is generated at read time and the bucket id comes from the file name:
https://github.com/apache/hive/blob/c27d31722c2a7426f3236d3d892dbe1e206e840d/ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java#L547C5-L547C44
In case of ACID writes it comes from the taskId:
https://github.com/apache/hive/blob/c27d31722c2a7426f3236d3d892dbe1e206e840d/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L969C13-L969C26
In case of Tez the taskIds are mapped differently to the files.
I changed the update statement to update more than one record to achieve having more than one bucket in the new delta.
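To make the encoded bucketid values in the asserts readable, here is a small sketch decoding them with BucketCodec (assuming the usual determineVersion/decodeWriterId/decodeStatementId accessors):

import org.apache.hadoop.hive.ql.io.BucketCodec;

public class DecodeBucketIds {
  public static void main(String[] args) {
    // 536870912 -> bucket 0, statement 0; 536870913 -> bucket 0, statement 1;
    // 537001984 -> bucket 2, statement 0 (matching warehouse/t/000002_0 above)
    for (int encoded : new int[] {536870912, 536870913, 537001984}) {
      BucketCodec codec = BucketCodec.determineVersion(encoded);
      System.out.println(encoded + " -> bucket " + codec.decodeWriterId(encoded)
          + ", statement " + codec.decodeStatementId(encoded));
    }
  }
}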
👍
What changes were proposed in this pull request?
Run tests derived from TxnCommandsBaseForTests and DbTxnManagerEndToEndTestBase using the tez execution engine instead of mr. The test configurations are provided by HiveConfForTest.
Added
because lots of tests expect more than one bucket.
Why are the changes needed?
Tez is the default execution engine in Hive and mr is deprecated.
Does this PR introduce any user-facing change?
It affects mostly tests.
Tez diagnostics status can contain locking and heart-beating related error messages.
Is the change a dependency upgrade?
No.
How was this patch tested?