harvest.log summary does not agree with OpenSearch counts #147
Comments
This is a shot in the dark, but I don't want to overlook it in case it is relevant: if the harvest is experiencing errors due to timeouts, they may be listed as failures (because the client never received confirmation that the insertions succeeded) even though the documents were ingested (because the server received and processed the insertions, but was overloaded at the time and took too long to respond). @plawton-umd if you have any firm sense of whether this is plausible, let me know.
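A minimal sketch of the timeout scenario described above. `FakeRegistry` and `harvest` are hypothetical stand-ins written for illustration, not the real harvest code or the OpenSearch client; the point is only that a client-side failure count and a server-side document count can disagree.

```python
class FakeRegistry:
    """Hypothetical in-memory server; not the real OpenSearch API."""

    def __init__(self, slow_ids):
        self.docs = {}
        self.slow_ids = set(slow_ids)

    def insert(self, doc_id, body):
        # The server always stores the document...
        self.docs[doc_id] = body
        # ...but for "slow" documents the confirmation never reaches the client.
        if doc_id in self.slow_ids:
            raise TimeoutError(f"no response for {doc_id}")


def harvest(registry, doc_ids):
    """Count what the client believes was loaded versus what failed."""
    loaded, failed = 0, 0
    for doc_id in doc_ids:
        try:
            registry.insert(doc_id, {"id": doc_id})
            loaded += 1
        except TimeoutError:
            failed += 1
    return loaded, failed


registry = FakeRegistry(slow_ids=["p3"])
loaded, failed = harvest(registry, ["p1", "p2", "p3"])
print(loaded, failed, len(registry.docs))
```

In this toy run the client reports two loaded and one failed, while the server holds three documents, which is the direction of mismatch reported in the issue.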
@alexdunnjpl No idea. Sometimes in the logs it looks like it
Hi @al-niessner, could the fix made on NASA-PDS/registry-mgr#116 also help here?
Okay, harvest counting is a mess. I am using registry-ref-data/custom-datasets. There are 138 products in this directory. However, when harvest ingests it, it loads 69 of them. Maybe because harvest only loads the latest? Can @jordanpadams and/or @tloubrieu-jpl confirm that harvest should be loading just the latest? Should this be a flag, and should the default be to load all? Now, let's just assume that half of the files are older versions so that 69 is the correct number. It then breaks down to:
Luckily, maybe, the sub numbers add up to 69. When I look on the registry at count it gives me:
To attack this, we need to figure out which of the 3 numbers is correct and should be ingested: 138, 69, or 46. Does any of you (@alexdunnjpl @jordanpadams @tloubrieu-jpl) want to take a stab at which number is correct? Gads, what a bigger mess. I double checked, and ingesting a bundle allows versions to be specified as a comma separated list with
No. Harvest should be loading all products. The only condition harvest should skip products is if the lidvid already exists in the registry, and the overwrite flag is not enabled. I see 71 products under custom-datasets?
Either way, 46 is not correct. Was this loaded into AOSS or a local OS? If the former, could this be an indexing delay?
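The skip rule stated above can be sketched as a single predicate. This is an illustration only: `should_load`, its parameters, and the example LIDVIDs are hypothetical names, not harvest's actual code or configuration.

```python
def should_load(lidvid, existing_lidvids, overwrite=False):
    """Load every product unless its LIDVID is already registered
    and overwriting is not enabled (hypothetical helper)."""
    return overwrite or lidvid not in existing_lidvids


# Made-up LIDVIDs for illustration.
existing = {"urn:nasa:pds:example:data:a::1.0"}

print(should_load("urn:nasa:pds:example:data:a::1.0", existing))
print(should_load("urn:nasa:pds:example:data:a::2.0", existing))
print(should_load("urn:nasa:pds:example:data:a::1.0", existing, overwrite=True))
```

Note that under this rule an older version (`::1.0`) and a newer version (`::2.0`) of the same product are both loaded, since the check is on the full LIDVID.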
10 minute delay between loading and looking. Not an indexing delay. Load all it is. Let me see where harvest is going amok and fix it.
Okay, starting to find some oddities. What do you want to happen when harvest finds the same lidvid more than once during a harvest? Keep in mind that harvest does not look at all the lidvids prior to processing. In the case of registry-ref-data/custom-data there are repeated lidvids. One that I know of is
Ah, clarity. Here is why my numbers are not working:
Now it makes sense. There are 23 duplicates and 23+46=69. Yay.
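The arithmetic above can be reproduced with a small sketch. The LIDVIDs here are synthetic and arranged only to match the counts quoted in this thread (69 files loaded, 46 documents, 23 duplicates); `simulate_load` is a hypothetical model of a registry keyed by LIDVID, not harvest's implementation.

```python
def simulate_load(lidvids):
    """Model a registry keyed by LIDVID: a repeated LIDVID replaces the
    existing entry instead of adding a new document."""
    registry = {}
    for position, lidvid in enumerate(lidvids):
        # Only one document per LIDVID survives; which copy wins
        # depends on load order.
        registry[lidvid] = position
    return registry


# Synthetic data: 46 unique LIDVIDs, 23 of which appear a second time.
unique = [f"urn:example:product_{i}::1.0" for i in range(46)]
harvested = unique + unique[:23]

registry = simulate_load(harvested)
files_loaded = len(harvested)
documents = len(registry)
duplicates = files_loaded - documents
print(files_loaded, documents, duplicates)
```

Harvest counts files, the registry counts distinct LIDVIDs, and the difference between the two is the number of duplicate files.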
@al-niessner phew. that is good news! I was concerned we were missing something here. Can we update the
I will create a separate ticket to update the test data since that will require some additional consideration.
@al-niessner you can double-check this all works if you want to try out the data from the branch on this PR: NASA-PDS/registry-ref-data#9. That should now have all unique data.
It is hard for RegistryDocBatch to throw an error at the moment because a LIDVID can be duplicated within a batch or across batches. I found it by shrinking the batch to 5 and then checking the DB count after each batch of five along with the lidvids. The second problem is that you will get multiple warnings that the same LIDVID is duplicated rather than one warning that it was duplicated N times. The third problem is that keeping track of total duplicates is messier at the moment because you now need the big map of everything you processed and a static integer. Reporting duplicates at the end makes for a more concise message block that sits right above the summary and is not lost among the million other messages. I can change them to warnings, but users do not read warnings, and their database is fatally broken at this point. There is no way to know which of the duplicates is in the database because the crawler blindly walks the tree (probably in the order the file system naturally has them) and loads them first come, first served. It means that a test file with product X in it that is accidentally found and loaded now overwrites the good product X. That seems fatal enough to me to warrant catching the user's attention. Stuff is ingested otherwise. If you still want inline warnings despite the obvious problems with them, let me know and I will change it. It will not save you from storing all the LIDVIDs in a giant map.
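A minimal sketch of the end-of-run reporting argued for above: one line per duplicated LIDVID with its count, plus a total, instead of a warning at every repeat. The function names and message wording are hypothetical, not harvest's actual log output, and as the comment notes, this approach still has to remember every LIDVID seen.

```python
from collections import Counter


def find_duplicates(lidvids):
    """Return {lidvid: occurrences} for every LIDVID seen more than once.
    Requires holding all processed LIDVIDs in memory."""
    counts = Counter(lidvids)
    return {lidvid: n for lidvid, n in counts.items() if n > 1}


def format_report(duplicates):
    """Build one concise message block to print just above the summary."""
    lines = [f"{lidvid} found {n} times" for lidvid, n in sorted(duplicates.items())]
    extra_files = sum(n - 1 for n in duplicates.values())
    lines.append(f"Total duplicate products: {extra_files}")
    return lines


# Made-up LIDVIDs for illustration.
seen = ["urn:example:a::1.0", "urn:example:b::1.0", "urn:example:a::1.0",
        "urn:example:a::1.0", "urn:example:c::1.0", "urn:example:b::1.0"]
report = format_report(find_duplicates(seen))
for line in report:
    print(line)
```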
Using the given branch harvest says:
and database count says:
and database refs count says:
The refs count is tricky because it has 3 S1 which harvest is not counting. Seems right given much of our other behavior with
@al-niessner roger that. So do we think we can call this closed?
Checked for duplicates
Yes - I've already checked
Describe the bug
When I compared information from the harvest.log to the OpenSearch (OS) query results, I noticed differences.
Expected behavior
I expected the "count" after the load to equal the "count" before the load plus the harvest.log's number of "Loaded Files".
The harvest.log summary says 150 fewer files were loaded than the OS "count" ( curl -u $REGUSER $OPENSEARCH_URL'/registry/_count?pretty=true' ) says.
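The expected relationship can be written as a small check. This is a sketch under assumptions: `parse_count` reads the `count` field from a `_count` response body, and the before, after, and loaded values below are made up so that the result matches the 150-document discrepancy reported here; they are not the actual numbers from this harvest run.

```python
import json


def parse_count(response_text):
    """Extract the document count from a `_count` JSON response body."""
    return json.loads(response_text)["count"]


def unexplained_difference(count_before, count_after, loaded_files):
    """Zero when count_after == count_before + loaded_files; positive when
    the registry gained more documents than harvest.log reports."""
    return count_after - (count_before + loaded_files)


# Hypothetical response bodies and log value, for illustration only.
before = parse_count('{"count": 1000}')
after = parse_count('{"count": 1650}')
loaded = 500

difference = unexplained_difference(before, after, loaded)
print(difference)
```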
To Reproduce
Environment Info
Test Data / Additional context
See above
Related requirements
Tightly coupled with
Engineering Details
N/A