harvest.log summary does not agree with OpenSearch counts #147

Closed
plawton-umd opened this issue Jan 22, 2024 · 13 comments · Fixed by #226


plawton-umd commented Jan 22, 2024

Checked for duplicates

Yes - I've already checked

πŸ› Describe the bug

When I compared information from the harvest.log to the OpenSearch (OS) query results, I noticed differences.

πŸ•΅οΈ Expected behavior

I expected the "count" after the load to equal the "count" before the load plus the harvest.log's number of "Loaded Files".
The harvest.log summary reports 150 fewer loaded files than the OS "count" ( curl -u $REGUSER $OPENSEARCH_URL'/registry/_count?pretty=true' ) shows.
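
A quick way to run this consistency check end to end is sketched below. It assumes the same $REGUSER and $OPENSEARCH_URL used above, that jq is available, and a hypothetical harvest invocation and config file name; treat it as a sketch rather than the exact commands used here.

# Count before, run harvest, pull "Loaded files" from the log, count after.
before=$(curl -s -u "$REGUSER" "$OPENSEARCH_URL/registry/_count" | jq .count)
harvest -c harvest-config.xml > harvest.log 2>&1    # hypothetical config file name
loaded=$(grep -o 'Loaded files: [0-9]*' harvest.log | grep -o '[0-9]\+')
after=$(curl -s -u "$REGUSER" "$OPENSEARCH_URL/registry/_count" | jq .count)
echo "expected $((before + loaded)) documents, observed $after"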

πŸ“œ To Reproduce

  1. Possibly run a harvest that experiences skips?

πŸ–₯ Environment Info

  • Version of this software: 3.8.2
  • Operating System: Linux

🩺 Test Data / Additional context

See above

πŸ¦„ Related requirements

Tightly coupled with

βš™οΈ Engineering Details

N/A

@plawton-umd plawton-umd added bug Something isn't working needs:triage labels Jan 22, 2024
@plawton-umd plawton-umd changed the title harvest.log summary does not agree with OPenSearch counts harvest.log summary does not agree with OpenSearch counts Jan 22, 2024
@jordanpadams jordanpadams removed their assignment Feb 5, 2024
@jordanpadams jordanpadams added B15.0 and removed icebox labels Apr 10, 2024
@github-project-automation github-project-automation bot moved this to Release Backlog in B15.0 Apr 10, 2024
@jordanpadams jordanpadams moved this from Release Backlog to πŸš€ Sprint Backlog in B15.0 Jun 4, 2024
@jordanpadams jordanpadams moved this from πŸš€ Sprint Backlog to Release Backlog in B15.0 Jun 4, 2024
@alexdunnjpl
Contributor

This is a shot in the dark, but I don't want to overlook the potential of it being relevant: if the harvest is experiencing any errors due to timeouts, it's possible for products to be listed as failures (because the client never received confirmation that the insertions succeeded) but to be ingested nonetheless (because the server did receive and process those insertions, but was overloaded at the time and took too long to respond).

@plawton-umd if you have any firm sense of whether this is plausible, let me know
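
One way to check this for a specific product is to count documents matching a lidvid that the log reported as failed. This is a sketch: it assumes the registry index exposes a lidvid keyword field, and the lidvid shown is a placeholder.

# A count of 1 for a lidvid the summary counted as failed would point at the timeout scenario above.
curl -s -u "$REGUSER" "$OPENSEARCH_URL/registry/_count" \
  -H 'Content-Type: application/json' \
  -d '{"query": {"term": {"lidvid": "urn:nasa:pds:example:bundle::1.0"}}}'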

@plawton-umd
Author

@alexdunnjpl No idea. Sometimes in the logs it looks like it:

  • tried,
  • failed,
  • tried again later,
  • succeeded,
  • did not change the 'failed' to 'failed, but retry succeeded', reduce the number of failed products, or update the overall success count

@tloubrieu-jpl
Member

Hi @al-niessner ,

Could the fix made in NASA-PDS/registry-mgr#116 also help here?

@al-niessner
Contributor

Okay, harvest counting is a mess. I am using registry-ref-data/custom-datasets. There are 138 products in this directory. However, when harvest ingests it, it loads 69 of them. Maybe because harvest only loads the latest? Can @jordanpadams and/or @tloubrieu-jpl confirm that harvest should be loading just the latest? Should this be a flag, and should the default be to load all?

Now, let's just assume that half of the files are older versions, so that 69 is the correct number. It then breaks down to:

Loaded files: 69
   Product_Bundle: 7
   Product_Collection: 12
   Product_Document: 3
   Product_Observational: 21
   Product_SPICE_Kernel: 26

Luckily, maybe, the sub-counts add up to 69. When I look at the registry count, it gives me:

{
  "count" : 46,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}

To attack this, we need to figure out which of the three numbers is correct and should be ingested: 138, 69, or 46. Would any of you (@alexdunnjpl @jordanpadams @tloubrieu-jpl) like to take a stab at which number is correct?

Gads, an even bigger mess. I double-checked: ingesting a bundle allows versions to be specified as a comma-separated list, with all being the default when not defined. No other harvest type allows for versioning, so I have no idea what has been hard-coded. Back to the first question then: what is the desired behavior?

@jordanpadams
Member

@al-niessner

Can @jordanpadams and/or @tloubrieu-jpl confirm that harvest should be loading just the latest? Should this be a flag, and should the default be to load all?

No. Harvest should be loading all products. The only condition under which harvest should skip products is if the lidvid already exists in the registry and the overwrite flag is not enabled.

I see 71 products under custom-datasets?

$ find custom-datasets/ -name "*.xml" | wc -l
      71

Either way, 46 is not correct. Was this loaded into AOSS or a local OS? If the former, could this be an indexing delay?
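
If an indexing delay is suspected, forcing a refresh before recounting rules it out. A sketch against a self-managed OpenSearch (serverless collections may not expose _refresh):

curl -s -u "$REGUSER" -X POST "$OPENSEARCH_URL/registry/_refresh"
curl -s -u "$REGUSER" "$OPENSEARCH_URL/registry/_count?pretty=true"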

@al-niessner
Contributor

@jordanpadams

10-minute delay between loading and looking. Not an indexing delay.

Load all it is. Let me see where harvest is going amok and fix it.

@al-niessner al-niessner moved this from ToDo to In Progress in EN Portfolio Backlog Jan 7, 2025
@al-niessner
Contributor

@jordanpadams @tloubrieu-jpl

Okay, starting to find some oddities. What do you want to happen when harvest finds the same lidvid during a harvest? Keep in mind that harvest does not look at all the lidvids prior to processing. In the case of registry-ref-data/custom-data there are repeated lidvids. One that I know of is urn:nasa:pds:mars2020.spice::1.0, but that does not account for all of the dependencies. The question is, what do you want harvest to do after it has already pushed one of the potentially duplicate lidvids into an index? Keep in mind that the order in which files are processed is effectively random. Do you want a new measure for duplicates? An error message, then stop all processing? That does not help me much for debugging. A warning message about unknown content for lidvid X while processing continues?

@al-niessner
Contributor

al-niessner commented Jan 10, 2025

Ah, clarity. Here is why my numbers are not working:

[FATAL] The harvested collection has duplicate lidvids. Double check content of these lidvids:
[FATAL] (RegistryDocBatch.java:showDuplicates:127)    Found 2 of lidvid urn:nasa:pds:mars2020.spice:spice_kernels::2.0
[FATAL] (RegistryDocBatch.java:showDuplicates:127)    Found 3 of lidvid urn:nasa:pds:mars2020.spice:spice_kernels::1.0
[FATAL] (RegistryDocBatch.java:showDuplicates:127)    Found 2 of lidvid urn:nasa:pds:mars2020.spice:spice_kernels:sclk_m2020_168_sclkscet_00007.tsc::1.0
[FATAL] (RegistryDocBatch.java:showDuplicates:127)    Found 3 of lidvid urn:nasa:pds:mars2020.spice:spice_kernels:ck_m2020_surf_ra_tlmres_0000_0089_v1.bc::1.0
[FATAL] (RegistryDocBatch.java:showDuplicates:127)    Found 2 of lidvid urn:nasa:pds:mars2020.spice:spice_kernels:ck_m2020_surf_ra_tlmres_0089_0179_v1.bc::1.0
[FATAL] (RegistryDocBatch.java:showDuplicates:127)    Found 2 of lidvid urn:nasa:pds:mars2020.spice:spice_kernels:sclk_m2020_168_sclkscet_refit_v02.tsc::1.0
[FATAL] (RegistryDocBatch.java:showDuplicates:127)    Found 3 of lidvid urn:nasa:pds:mars2020.spice:document:spiceds::1.0
[FATAL] (RegistryDocBatch.java:showDuplicates:127)    Found 2 of lidvid urn:nasa:pds:mars2020.spice::2.0
[FATAL] (RegistryDocBatch.java:showDuplicates:127)    Found 3 of lidvid urn:nasa:pds:mars2020.spice::1.0
[FATAL] (RegistryDocBatch.java:showDuplicates:127)    Found 3 of lidvid urn:nasa:pds:mars2020.spice:document::1.0
[FATAL] (RegistryDocBatch.java:showDuplicates:127)    Found 3 of lidvid urn:nasa:pds:mars2020.spice:spice_kernels:ck_m2020_surf_rover_tlm_0000_0089_v1.bc::1.0
[FATAL] (RegistryDocBatch.java:showDuplicates:127)    Found 3 of lidvid urn:nasa:pds:mars2020.spice:spice_kernels:sclk_m2020_168_sclkscet_refit_v01.tsc::1.0
[FATAL] (RegistryDocBatch.java:showDuplicates:127)    Found 2 of lidvid urn:nasa:pds:mars2020.spice:spice_kernels:ck_m2020_surf_rover_tlm_0089_0179_v1.bc::1.0
[FATAL] (RegistryDocBatch.java:showDuplicates:127)    Found 2 of lidvid urn:nasa:pds:mars2020.spice:spice_kernels:mk_m2020::2.0
[FATAL] (RegistryDocBatch.java:showDuplicates:127)    Found 3 of lidvid urn:nasa:pds:mars2020.spice:spice_kernels:mk_m2020::1.0
[FATAL] (RegistryDocBatch.java:showDuplicates:135)    Total number of duplicate lidvids: 23
[SUMMARY] (HarvestCmd.java:printSummary:282) Summary:
[SUMMARY] (HarvestCmd.java:printSummary:285) Skipped files: 0
[SUMMARY] (HarvestCmd.java:printSummary:286) Loaded files: 69
[SUMMARY] (HarvestCmd.java:printSummary:292)   Product_Bundle: 7
[SUMMARY] (HarvestCmd.java:printSummary:292)   Product_Collection: 12
[SUMMARY] (HarvestCmd.java:printSummary:292)   Product_Document: 3
[SUMMARY] (HarvestCmd.java:printSummary:292)   Product_Observational: 21
[SUMMARY] (HarvestCmd.java:printSummary:292)   Product_SPICE_Kernel: 26
[SUMMARY] (HarvestCmd.java:printSummary:296) Failed files: 0

Now it makes sense. There are 23 duplicates and 23+46=69. Yay.
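
As a side note, the duplicates can also be confirmed independently of harvest with a rough shell sketch like the one below, which lists repeated lidvids in a directory of PDS4 labels. It assumes each label's first <logical_identifier> and <version_id> belong to the product itself, which holds for typical labels. Each output line is a count followed by a duplicated lidvid.

# Print lid::vid for every label, then show only the values that repeat.
find custom-datasets/ -name "*.xml" -print0 | while IFS= read -r -d '' f; do
  lid=$(grep -m1 -o '<logical_identifier>[^<]*' "$f" | sed 's/<logical_identifier>//')
  vid=$(grep -m1 -o '<version_id>[^<]*' "$f" | sed 's/<version_id>//')
  echo "${lid}::${vid}"
done | sort | uniq -cd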

@jordanpadams
Member

@al-niessner phew. that is good news! I was concerned we were missing something here.

Can we update the RegistryDocBatch to throw a WARNING like:

Found X products with LIDVID: <lidvid>. Please review this data set and re-run Harvest.

I will create a separate ticket to update the test data since that will require some additional consideration.

@jordanpadams
Member

jordanpadams commented Jan 11, 2025

@al-niessner you can doublecheck this all works if you want to try out the data from the branch on this PR: NASA-PDS/registry-ref-data#9

That should now have all unique data.

@al-niessner
Contributor

al-niessner commented Jan 11, 2025

@al-niessner phew. that is good news! I was concerned we were missing something here.

Can we update the RegistryDocBatch to throw a WARNING like:

Found X products with LIDVID: <lidvid>. Please review this data set and re-run Harvest.

I will create a separate ticket to update the test data since that will require some additional consideration.

@jordanpadams

It is hard for RegistryDocBatch to toss an error at the moment because a lidvid can be duplicated within a batch or across batches. I found this by shrinking the batch size to 5 and then checking the DB count, along with the lidvids, after every batch. The second problem is that you get multiple warnings that the same LIDVID is duplicated rather than one warning that it was duplicated N times. The third problem is that keeping track of total duplicates is messier at the moment, because you then need a big map of everything processed plus a static integer. Doing duplicate detection at the end makes for a more concise message block that sits right above the summary and is not lost among a million other messages.

I can change them to warnings, but users do not read warnings, and their database is fatally broken at this point. There is no way to know which of the duplicates is in the database because the crawler blindly walks the tree (probably in whatever order the file system naturally lists them) and loads them first come, first served. It means that a test file containing product X, accidentally found and loaded, now overwrites the good product X. Making it fatal seems to me the better way to catch the user's attention. Everything else is still ingested otherwise.

If you still want inline warnings despite the obvious problems with them, let me know and I will change it. It does not save you from keeping all the LIDVIDs in a giant map.

@al-niessner
Contributor

@al-niessner you can doublecheck this all works if you want to try out the data from the branch on this PR: NASA-PDS/registry-ref-data#9

That should now have all unique data.

@jordanpadams

Using the given branch, harvest says:

2025-01-11 11:19:06 [INFO ] (MetadataWriter.java:writeBatch:106) Wrote 69 product(s)
2025-01-11 11:19:06 [SUMMARY] (HarvestCmd.java:printSummary:282) Summary:
2025-01-11 11:19:06 [SUMMARY] (HarvestCmd.java:printSummary:285) Skipped files: 0
2025-01-11 11:19:06 [SUMMARY] (HarvestCmd.java:printSummary:286) Loaded files: 69
2025-01-11 11:19:06 [SUMMARY] (HarvestCmd.java:printSummary:292)   Product_Bundle: 7
2025-01-11 11:19:06 [SUMMARY] (HarvestCmd.java:printSummary:292)   Product_Collection: 12
2025-01-11 11:19:06 [SUMMARY] (HarvestCmd.java:printSummary:292)   Product_Document: 3
2025-01-11 11:19:06 [SUMMARY] (HarvestCmd.java:printSummary:292)   Product_Observational: 21
2025-01-11 11:19:06 [SUMMARY] (HarvestCmd.java:printSummary:292)   Product_SPICE_Kernel: 26
2025-01-11 11:19:06 [SUMMARY] (HarvestCmd.java:printSummary:296) Failed files: 0
2025-01-11 11:19:06 [SUMMARY] (HarvestCmd.java:printSummary:297) Package ID: 063d4af8-c763-4631-89e8-b3163eaf9d98

and database count says:

{
  "count" : 69,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}

and database refs count says:

{
  "count" : 15,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}

The refs count is tricky because it includes 3 S1 (secondary reference) documents, which harvest is not counting. That seems right given much of our other behavior with P* and S*.

@jordanpadams
Member

@al-niessner roger that. so do we think we can call this closed?

@github-project-automation github-project-automation bot moved this from In Progress to 🏁 Done in EN Portfolio Backlog Jan 28, 2025
@github-project-automation github-project-automation bot moved this from ToDo to 🏁 Done in B15.1 Jan 28, 2025