harvest.log summary does not agree with OpenSearch counts #147

Closed
plawton-umd opened this issue Jan 22, 2024 · 13 comments · Fixed by #226


plawton-umd commented Jan 22, 2024

Checked for duplicates

Yes - I've already checked

πŸ› Describe the bug

When I compared information from the harvest.log to the OpenSearch (OS) query results, I noticed differences.

πŸ•΅οΈ Expected behavior

I expected the "count" after the load to equal the "count" before the load plus the harvest.log's number of "Loaded Files".
The harvest.log summary reports 150 fewer loaded files than the OS "count" ( curl -u $REGUSER $OPENSEARCH_URL'/registry/_count?pretty=true' ) shows.
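
A quick way to run this consistency check end to end is sketched below. It assumes the same $REGUSER and $OPENSEARCH_URL used above, that jq is available, and a hypothetical harvest invocation and config file name; treat it as a sketch rather than the exact commands used here.

# Count before, run harvest, pull "Loaded files" from the log, count after.
before=$(curl -s -u "$REGUSER" "$OPENSEARCH_URL/registry/_count" | jq .count)
harvest -c harvest-config.xml > harvest.log 2>&1    # hypothetical config file name
loaded=$(grep -o 'Loaded files: [0-9]*' harvest.log | grep -o '[0-9]\+')
after=$(curl -s -u "$REGUSER" "$OPENSEARCH_URL/registry/_count" | jq .count)
echo "expected $((before + loaded)) documents, observed $after"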

πŸ“œ To Reproduce

  1. Possibly run a harvest that experiences skips?

πŸ–₯ Environment Info

  • Version of this software: 3.8.2
  • Operating System: Linux

🩺 Test Data / Additional context

See above

πŸ¦„ Related requirements

Tightly coupled with

βš™οΈ Engineering Details

N/A

@plawton-umd plawton-umd added bug Something isn't working needs:triage labels Jan 22, 2024
@plawton-umd plawton-umd changed the title harvest.log summary does not agree with OPenSearch counts harvest.log summary does not agree with OpenSearch counts Jan 22, 2024
@jordanpadams jordanpadams removed their assignment Feb 5, 2024
@jordanpadams jordanpadams added B15.0 and removed icebox labels Apr 10, 2024
@github-project-automation github-project-automation bot moved this to Release Backlog in B15.0 Apr 10, 2024
@jordanpadams jordanpadams moved this from Release Backlog to πŸš€ Sprint Backlog in B15.0 Jun 4, 2024
@jordanpadams jordanpadams moved this from πŸš€ Sprint Backlog to Release Backlog in B15.0 Jun 4, 2024
@alexdunnjpl
Contributor

This is a shot in the dark, but I don't want to overlook the potential of it being relevant: if the harvest is experiencing any errors due to timeouts, it's possible for products to be listed as failures (because the client never received confirmation that the insertions succeeded) but to be ingested nonetheless (because the server did receive and process those insertions, but was overloaded at the time and took too long to respond).

@plawton-umd if you have any firm sense of whether this is plausible, let me know
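
One way to check this for a specific product is to count documents matching a lidvid that the log reported as failed. This is a sketch: it assumes the registry index exposes a lidvid keyword field, and the lidvid shown is a placeholder.

# A count of 1 for a lidvid the summary counted as failed would point at the timeout scenario above.
curl -s -u "$REGUSER" "$OPENSEARCH_URL/registry/_count" \
  -H 'Content-Type: application/json' \
  -d '{"query": {"term": {"lidvid": "urn:nasa:pds:example:bundle::1.0"}}}'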

@plawton-umd
Author

@alexdunnjpl No idea. Sometimes in the logs it looks like it:

  • tried,
  • failed,
  • tried again later,
  • succeeded,
  • did not change the 'failed' to 'failed, but retry succeeded', reduce the number of failed products, or update the overall success count

@tloubrieu-jpl
Member

Hi @al-niessner ,

Could the fix made in NASA-PDS/registry-mgr#116 also help here?

@al-niessner
Contributor

Okay, harvest counting is a mess. I am using registry-ref-data/custom-datasets. There are 138 products in this directory. However, when harvest ingests it, it loads 69 of them. Maybe because harvest only loads the latest? Can @jordanpadams and/or @tloubrieu-jpl confirm that harvest should be loading just the latest? Should this be a flag, and should the default be to load all?

Now, let's just assume that half of the files are older versions, so that 69 is the correct number. It then breaks down to:

Loaded files: 69
   Product_Bundle: 7
   Product_Collection: 12
   Product_Document: 3
   Product_Observational: 21
   Product_SPICE_Kernel: 26

Luckily, maybe, the sub-counts add up to 69. When I look at the registry count, it gives me:

{
  "count" : 46,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}

To attack this, we need to figure out which of the three numbers is correct and should be ingested: 138, 69, or 46. Would any of you (@alexdunnjpl @jordanpadams @tloubrieu-jpl) like to take a stab at which number is correct?

Gads, an even bigger mess. I double-checked: ingesting a bundle allows versions to be specified as a comma-separated list, with all being the default when not defined. No other harvest type allows for versioning, so I have no idea what has been hard-coded. Back to the first question then: what is the desired behavior?

@jordanpadams
Member

@al-niessner

Can @jordanpadams and/or @tloubrieu-jpl confirm that harvest should be loading just the latest? Should this be a flag, and should the default be to load all?

No. Harvest should be loading all products. The only condition under which harvest should skip products is if the lidvid already exists in the registry and the overwrite flag is not enabled.

I see 71 products under custom-datasets?

$ find custom-datasets/ -name "*.xml" | wc -l
      71

Either way, 46 is not correct. Was this loaded into AOSS or a local OS? If the former, could this be an indexing delay?
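
If an indexing delay is suspected, forcing a refresh before recounting rules it out. A sketch against a self-managed OpenSearch (serverless collections may not expose _refresh):

curl -s -u "$REGUSER" -X POST "$OPENSEARCH_URL/registry/_refresh"
curl -s -u "$REGUSER" "$OPENSEARCH_URL/registry/_count?pretty=true"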

@al-niessner
Contributor

@jordanpadams

10-minute delay between loading and looking. Not an indexing delay.

Load all it is. Let me see where harvest is going amok and fix it.

@al-niessner al-niessner moved this from ToDo to In Progress in EN Portfolio Backlog Jan 7, 2025
@al-niessner
Contributor

@jordanpadams @tloubrieu-jpl

Okay, starting to find some oddities. What do you want to happen when harvest finds the same lidvid during a harvest? Keep in mind that harvest does not look at all the lidvids prior to processing. In the case of registry-ref-data/custom-data there are repeated lidvids. One that I know of is urn:nasa:pds:mars2020.spice::1.0, but that does not account for all of the dependencies. The question is, what do you want harvest to do after it has already pushed one of the potentially duplicate lidvids into an index? Keep in mind that the order in which files are processed is effectively random. Do you want a new measure for duplicates? An error message, then stop all processing? That does not help me much for debugging. A warning message about unknown content for lidvid X while processing continues?

@al-niessner
Contributor

al-niessner commented Jan 10, 2025

Ah, clarity. Here is why my numbers are not working:

[FATAL] The harvested collection has duplicate lidvids. Double check content of these lidvids:
[FATAL] (RegistryDocBatch.java:showDuplicates:127)    Found 2 of lidvid urn:nasa:pds:mars2020.spice:spice_kernels::2.0
[FATAL] (RegistryDocBatch.java:showDuplicates:127)    Found 3 of lidvid urn:nasa:pds:mars2020.spice:spice_kernels::1.0
[FATAL] (RegistryDocBatch.java:showDuplicates:127)    Found 2 of lidvid urn:nasa:pds:mars2020.spice:spice_kernels:sclk_m2020_168_sclkscet_00007.tsc::1.0
[FATAL] (RegistryDocBatch.java:showDuplicates:127)    Found 3 of lidvid urn:nasa:pds:mars2020.spice:spice_kernels:ck_m2020_surf_ra_tlmres_0000_0089_v1.bc::1.0
[FATAL] (RegistryDocBatch.java:showDuplicates:127)    Found 2 of lidvid urn:nasa:pds:mars2020.spice:spice_kernels:ck_m2020_surf_ra_tlmres_0089_0179_v1.bc::1.0
[FATAL] (RegistryDocBatch.java:showDuplicates:127)    Found 2 of lidvid urn:nasa:pds:mars2020.spice:spice_kernels:sclk_m2020_168_sclkscet_refit_v02.tsc::1.0
[FATAL] (RegistryDocBatch.java:showDuplicates:127)    Found 3 of lidvid urn:nasa:pds:mars2020.spice:document:spiceds::1.0
[FATAL] (RegistryDocBatch.java:showDuplicates:127)    Found 2 of lidvid urn:nasa:pds:mars2020.spice::2.0
[FATAL] (RegistryDocBatch.java:showDuplicates:127)    Found 3 of lidvid urn:nasa:pds:mars2020.spice::1.0
[FATAL] (RegistryDocBatch.java:showDuplicates:127)    Found 3 of lidvid urn:nasa:pds:mars2020.spice:document::1.0
[FATAL] (RegistryDocBatch.java:showDuplicates:127)    Found 3 of lidvid urn:nasa:pds:mars2020.spice:spice_kernels:ck_m2020_surf_rover_tlm_0000_0089_v1.bc::1.0
[FATAL] (RegistryDocBatch.java:showDuplicates:127)    Found 3 of lidvid urn:nasa:pds:mars2020.spice:spice_kernels:sclk_m2020_168_sclkscet_refit_v01.tsc::1.0
[FATAL] (RegistryDocBatch.java:showDuplicates:127)    Found 2 of lidvid urn:nasa:pds:mars2020.spice:spice_kernels:ck_m2020_surf_rover_tlm_0089_0179_v1.bc::1.0
[FATAL] (RegistryDocBatch.java:showDuplicates:127)    Found 2 of lidvid urn:nasa:pds:mars2020.spice:spice_kernels:mk_m2020::2.0
[FATAL] (RegistryDocBatch.java:showDuplicates:127)    Found 3 of lidvid urn:nasa:pds:mars2020.spice:spice_kernels:mk_m2020::1.0
[FATAL] (RegistryDocBatch.java:showDuplicates:135)    Total number of duplicate lidvids: 23
[SUMMARY] (HarvestCmd.java:printSummary:282) Summary:
[SUMMARY] (HarvestCmd.java:printSummary:285) Skipped files: 0
[SUMMARY] (HarvestCmd.java:printSummary:286) Loaded files: 69
[SUMMARY] (HarvestCmd.java:printSummary:292)   Product_Bundle: 7
[SUMMARY] (HarvestCmd.java:printSummary:292)   Product_Collection: 12
[SUMMARY] (HarvestCmd.java:printSummary:292)   Product_Document: 3
[SUMMARY] (HarvestCmd.java:printSummary:292)   Product_Observational: 21
[SUMMARY] (HarvestCmd.java:printSummary:292)   Product_SPICE_Kernel: 26
[SUMMARY] (HarvestCmd.java:printSummary:296) Failed files: 0

Now it makes sense. There are 23 duplicates and 23+46=69. Yay.
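
As a side note, the duplicates can also be confirmed independently of harvest with a rough shell sketch like the one below, which lists repeated lidvids in a directory of PDS4 labels. It assumes each label's first <logical_identifier> and <version_id> belong to the product itself, which holds for typical labels. Each output line is a count followed by a duplicated lidvid.

# Print lid::vid for every label, then show only the values that repeat.
find custom-datasets/ -name "*.xml" -print0 | while IFS= read -r -d '' f; do
  lid=$(grep -m1 -o '<logical_identifier>[^<]*' "$f" | sed 's/<logical_identifier>//')
  vid=$(grep -m1 -o '<version_id>[^<]*' "$f" | sed 's/<version_id>//')
  echo "${lid}::${vid}"
done | sort | uniq -cd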

@jordanpadams
Member

@al-niessner phew. that is good news! I was concerned we were missing something here.

Can we update the RegistryDocBatch to throw a WARNING like:

Found X products with LIDVID: <lidvid>. Please review this data set and re-run Harvest.

I will create a separate ticket to update the test data since that will require some additional consideration.

@jordanpadams
Member

jordanpadams commented Jan 11, 2025

@al-niessner you can doublecheck this all works if you want to try out the data from the branch on this PR: NASA-PDS/registry-ref-data#9

That should now have all unique data.

@al-niessner
Contributor

al-niessner commented Jan 11, 2025

@al-niessner phew. that is good news! I was concerned we were missing something here.

Can we update the RegistryDocBatch to throw a WARNING like:

Found X products with LIDVID: <lidvid>. Please review this data set and re-run Harvest.

I will create a separate ticket to update the test data since that will require some additional consideration.

@jordanpadams

It is hard for RegistryDocBatch to toss an error at the moment because a lidvid can be duplicated within a batch or across batches. I found this by shrinking the batch size to 5 and then checking the DB count, along with the lidvids, after every batch. The second problem is that you get multiple warnings that the same LIDVID is duplicated rather than one warning that it was duplicated N times. The third problem is that keeping track of total duplicates is messier at the moment, because you then need a big map of everything processed plus a static integer. Doing duplicate detection at the end makes for a more concise message block that sits right above the summary and is not lost among a million other messages.

I can change them to warnings, but users do not read warnings, and their database is fatally broken at this point. There is no way to know which of the duplicates is in the database because the crawler blindly walks the tree (probably in whatever order the file system naturally lists them) and loads them first come, first served. It means that a test file containing product X, accidentally found and loaded, now overwrites the good product X. Making it fatal seems to me the better way to catch the user's attention. Everything else is still ingested otherwise.

If you still want inline warnings despite the obvious problems with them, let me know and I will change it. It does not save you from keeping all the LIDVIDs in a giant map.

@al-niessner
Contributor

@al-niessner you can doublecheck this all works if you want to try out the data from the branch on this PR: NASA-PDS/registry-ref-data#9

That should now have all unique data.

@jordanpadams

Using the given branch, harvest says:

2025-01-11 11:19:06 [INFO ] (MetadataWriter.java:writeBatch:106) Wrote 69 product(s)
2025-01-11 11:19:06 [SUMMARY] (HarvestCmd.java:printSummary:282) Summary:
2025-01-11 11:19:06 [SUMMARY] (HarvestCmd.java:printSummary:285) Skipped files: 0
2025-01-11 11:19:06 [SUMMARY] (HarvestCmd.java:printSummary:286) Loaded files: 69
2025-01-11 11:19:06 [SUMMARY] (HarvestCmd.java:printSummary:292)   Product_Bundle: 7
2025-01-11 11:19:06 [SUMMARY] (HarvestCmd.java:printSummary:292)   Product_Collection: 12
2025-01-11 11:19:06 [SUMMARY] (HarvestCmd.java:printSummary:292)   Product_Document: 3
2025-01-11 11:19:06 [SUMMARY] (HarvestCmd.java:printSummary:292)   Product_Observational: 21
2025-01-11 11:19:06 [SUMMARY] (HarvestCmd.java:printSummary:292)   Product_SPICE_Kernel: 26
2025-01-11 11:19:06 [SUMMARY] (HarvestCmd.java:printSummary:296) Failed files: 0
2025-01-11 11:19:06 [SUMMARY] (HarvestCmd.java:printSummary:297) Package ID: 063d4af8-c763-4631-89e8-b3163eaf9d98

and database count says:

{
  "count" : 69,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}

and database refs count says:

{
  "count" : 15,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}

The refs count is tricky because it includes 3 S1 (secondary reference) documents, which harvest is not counting. That seems right given much of our other behavior with P* and S*.

@jordanpadams
Member

@al-niessner roger that. so do we think we can call this closed?

@github-project-automation github-project-automation bot moved this from In Progress to 🏁 Done in EN Portfolio Backlog Jan 28, 2025
@github-project-automation github-project-automation bot moved this from ToDo to 🏁 Done in B15.1 Jan 28, 2025