Incorrect date added to database when GPS data is set incorrectly #143

Open
jgyprime opened this issue Jun 22, 2024 · 10 comments

Comments

@jgyprime

Thank you for this great and amazing software.
I've been using it with almost 4 TB of personal photos (only photos).
I think that I have more than 500k photos there...

But I think I found something that can be improved.

After the initial indexing of the photos finished (it took several days on my low-powered Celeron NAS), I noticed that many of my photos had been added to the database with the wrong year, 1970.
And I started investigating the reason.

For example, in the photo I uploaded, the GPS data is set incorrectly in the photo exif:
20240223_160931_029_gpsdata_1970

# exiftool 20240223_160931_029_gpsdata_1970.jpg | grep -i date
File Modification Date/Time     : 2024:02:23 16:09:36+02:00
File Access Date/Time           : 2024:06:21 16:00:10+03:00
File Inode Change Date/Time     : 2024:06:21 15:55:47+03:00
Modify Date                     : 2024:02:23 16:09:35
Date/Time Original              : 2024:02:23 16:09:35
Create Date                     : 2024:02:23 16:09:35
GPS Date Stamp                  : 1970:01:01
GPS Date/Time                   : 1970:01:01 00:00:00Z
Create Date                     : 2024:02:23 16:09:35.449423
Date/Time Original              : 2024:02:23 16:09:35.449423
Modify Date                     : 2024:02:23 16:09:35.449423

When the photo is added to the gallery, its date is set to 1970 because the date is taken from the GPS info in the EXIF data.
I do not know how that GPS date got there. My guess is that the phone tried to read the GPS date and time while GPS was disabled and fell back to a default value, the Unix epoch (1970-01-01).

I also found the source of the problem in the source code here:
https://github.com/xemle/home-gallery/blob/master/packages/database/src/media/date.js#L44
const dateKeys = ['GPSDateTime', 'SubSecDateTimeOriginal', 'DateTimeOriginal', 'CreateDate']
If I remove the 'GPSDateTime' item from line 44, then everything works correctly after rebuilding and re-indexing the database.

What do you think?
Is an improvement possible in this case?
For example:

  • adding an option to ignore the GPS date (or, if possible, other fields too, so that users can decide which fields are used)
  • adding an option to ignore dates before <some_value>
  • reordering the dateKeys values (I think this would be the simplest of them all)
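The first two suggestions can be sketched roughly as follows. This is a hypothetical illustration, not HomeGallery's actual date.js: only the `dateKeys` array comes from the linked source line, while `parseExifDate`, `pickDate`, `ignoreKeys`, and `minYear` are assumed names. Time zones are ignored for brevity, and the default threshold of 1971 is an arbitrary choice that merely rejects the Unix epoch year.

```javascript
// Hypothetical sketch, NOT the actual HomeGallery implementation.
// Only `dateKeys` is taken from the issue; every other name is an assumption.
const dateKeys = ['GPSDateTime', 'SubSecDateTimeOriginal', 'DateTimeOriginal', 'CreateDate']

// Parse an ExifTool-style value such as "2024:02:23 16:09:35.449423".
// Simplification: sub-seconds and time zone are ignored, the value is read as UTC.
function parseExifDate(value) {
  const m = /^(\d{4}):(\d{2}):(\d{2})[ T](\d{2}):(\d{2}):(\d{2})/.exec(String(value ?? ''))
  if (!m) return null
  const [year, month, day, hour, minute, second] = m.slice(1).map(Number)
  return new Date(Date.UTC(year, month - 1, day, hour, minute, second))
}

// Return the first usable date, skipping ignored keys and dates before `minYear`.
function pickDate(exif, { ignoreKeys = [], minYear = 1971 } = {}) {
  for (const key of dateKeys) {
    if (ignoreKeys.includes(key)) continue
    const date = parseExifDate(exif[key])
    if (date && date.getUTCFullYear() >= minYear) return { key, date }
  }
  return null
}

// Usage with the values from the exiftool output above.
const example = pickDate({
  GPSDateTime: '1970:01:01 00:00:00Z',
  SubSecDateTimeOriginal: '2024:02:23 16:09:35.449423',
  DateTimeOriginal: '2024:02:23 16:09:35',
  CreateDate: '2024:02:23 16:09:35'
})
console.log(example.key)
```

With the default threshold, the 1970 GPS date is rejected and the next key in the list is used instead.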

Unfortunately, my knowledge of JavaScript is close to zero, so I would prefer that someone with enough knowledge work out a potential implementation.

Thank you for reading my very long post.
Thank you for creating such nice software.

@xemle
Owner

xemle commented Jun 22, 2024

Hi @jgyprime

Thank you for using HomeGallery. I am glad that you like it.

Further, thank you for reporting your issue with the date. You did a great job nailing the problem and provided a test picture. Awesome.

Yes. My assumption was: if a date is provided by GPS, it should be quite accurate. However, your picture 1) has no GPS coordinates and 2) has the date 1970:01:01 00:00:00Z, which is the Unix epoch and a typical default value.

Do you think it would be sufficient to allow the GPS date only if GPS coordinates are available? This would keep the basic assumption but verify it more carefully.
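A minimal sketch of that proposal, assuming the metadata object uses the ExifTool tag names `GPSLatitude` and `GPSLongitude` next to `GPSDateTime`. The helper names `hasGpsPosition` and `usableDateKeys` are hypothetical and not part of HomeGallery.

```javascript
// Hypothetical sketch, NOT the actual HomeGallery implementation.
const dateKeys = ['GPSDateTime', 'SubSecDateTimeOriginal', 'DateTimeOriginal', 'CreateDate']

// Assumption: a GPS position counts as present when both tags carry a value.
function hasGpsPosition(exif) {
  const isSet = value => value !== undefined && value !== null && value !== ''
  return isSet(exif.GPSLatitude) && isSet(exif.GPSLongitude)
}

// Keep 'GPSDateTime' as a candidate only when a GPS position exists.
function usableDateKeys(exif) {
  return dateKeys.filter(key => key !== 'GPSDateTime' || hasGpsPosition(exif))
}

// The reported photo has a GPS date but no coordinates.
console.log(usableDateKeys({ GPSDateTime: '1970:01:01 00:00:00Z' }))
```

This keeps the existing priority order and only drops the GPS date when nothing supports it.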

@xemle
Owner

xemle commented Jun 22, 2024

@jgyprime Since you report that you want to use about 500k images: please be aware of #134, which discusses some limits of HomeGallery with larger image counts in the database.

@jgyprime
Author

@jgyprime Since you report that you want to use about 500k images: please be aware of #134, which discusses some limits of HomeGallery with larger image counts in the database.

After I removed the GPS date key (as described above), the indexing restarted.
It is still running and has indexed approximately 45k pictures so far. I do not know how long it will take, but I will let it finish.
I have already seen that discussion. If I reach a limit, I will try to figure out which one it is.

My NAS is a TerraMaster F4-421:
CPU: Intel Celeron J3455
RAM: 12 GB DDR3 (it came with 4 GB; I added another 8 GB from an old laptop)
I ditched the proprietary OS and installed Debian plus the utilities I need.
The main drive (OS and utilities) is a 250 GB SSD.
The "storage" drive for the photos is an 8 TB WD Red Pro HDD.

@jgyprime
Author

jgyprime commented Jun 23, 2024

Do you think it would be sufficient to allow the GPS date only if GPS coordinates are available? This would keep the basic assumption but will check it in detail...

Sure.
For me it is good enough.
Right now I am using a version that I built myself from source with my change.
For what I need, it is good enough.

@xemle
Owner

xemle commented Jun 24, 2024

I've already seen that discussion, if I reach any limitation, then I will try to figure out what limitation it has reached.

Alright. Please ping me if you run into problems. It bugs me that there is a problem which should not exist in theory. Since I do not face the problem myself, I need an external push from someone who really wants to have it solved.

Thank you for the details of your system. It helps to know the target systems.

For me it is good enough.
Right now I am using a version that I built myself from source with my change.

Awesome. I am currently implementing a plugin system. When I come across this part, I will ensure that the GPS date is only used if there is also a GPS position.

In the meantime, if you find a better strategy to identify the date, please let me know.

@jgyprime
Author

jgyprime commented Jun 28, 2024

In the meantime, the indexing finished.
I noticed that only ~100k photos were indexed.
When I searched for jpg files, I found ~400k photos.
There are other formats there as well (png, gif, and others).

I have a few questions:

  • Is there any limitation on file or folder naming? A long time ago I organized my photos in one folder per day named "YYYY.MM.DD"; later I switched to "YYYY_MM_DD". I can see all files in the "YYYY_MM_DD" folders but none from the "YYYY.MM.DD" folders. Other folders with a dot in the path are not indexed either, mostly WhatsApp folders such as Android/media/com.whatsapp/WhatsApp/Media/WhatsApp Images/Sent.
  • How does the software handle files with the same name? For example, photos taken with different phones may share a name (something like YYYYMMDD_HHMMSS.jpg), and the same photo (taken with the same phone) may be stored in several places: I copy specific files to another folder to share them with friends, and I have older backups whose identical duplicates I have not yet sorted out and removed. For example, I have a folder that is completely duplicated under another path. If I open one of them, I can see all the files; if I open the other, I see the files for a second, but they disappear immediately and no photos are displayed.

@xemle
Owner

xemle commented Jun 28, 2024

In the meantime, the indexation finished I observed only ~100k photos were indexed. When I searched for jpg files, I found ~400k photos There are other formats there (png, gif and other).

Do you have lots of binary duplicates? Do you have files which lead to the same SHA1 checksum?

* is there any limitation to file / folder naming?

No, there are no limits. Neither in file count nor in folder depth. All files should be considered.

Do you use any file filter which excludes some of the files?

* how is the software handling duplicate named files?

A file needs to be unique by OS filename for the file indexer and unique by SHA1 for the database. Files with the same SHA1 are handled as duplicates, and their file data are merged.

There are corner cases with sidecars of duplicate files; I can go into depth on that if requested.

But basically, if you copy an image or folder byte-by-byte from one place to another OS path, these files are duplicates, even if they are renamed later, since their file content is unchanged. This is a design decision with the goal of showing only unique media, based on the assumption that most people have no clue how many duplicates they are storing, and IMHO showing pictures twice adds no value.
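The idea of content-based duplicates can be illustrated with a small sketch. It is not HomeGallery's code: `sha1` and `mergeBySha1` are hypothetical helpers, and the input is assumed to be a list of `{ filename, content }` objects already held in memory.

```javascript
// Hypothetical sketch of SHA1-based deduplication, NOT the HomeGallery implementation.
const crypto = require('node:crypto')

// Hex-encoded SHA1 checksum of a Buffer or string.
function sha1(content) {
  return crypto.createHash('sha1').update(content).digest('hex')
}

// Group files by checksum: identical bytes yield one entry listing all filenames.
function mergeBySha1(files) {
  const bySha1 = new Map()
  for (const { filename, content } of files) {
    const id = sha1(content)
    if (!bySha1.has(id)) bySha1.set(id, { id, filenames: [] })
    bySha1.get(id).filenames.push(filename)
  }
  return [...bySha1.values()]
}

// Two copies of the same bytes under different names collapse into one entry.
const entries = mergeBySha1([
  { filename: '2024_02_23/IMG_1.jpg', content: Buffer.from('same bytes') },
  { filename: 'share/renamed.jpg', content: Buffer.from('same bytes') },
  { filename: '2024_02_23/IMG_2.jpg', content: Buffer.from('other bytes') }
])
console.log(entries.length) // 2
```

Under this model, 400k files on disk can legitimately shrink to far fewer database entries when many of them are byte-identical copies.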

To identify the files which are indexed you can dump information from index files *.idx like

zcat Pictures.idx | jq '.data[].filename' | wc -l

This should print the count of your files which should be about 400k according to your provided information.

To identify the entries from the database you can run

zcat database.db | jq '.data[].id' | wc -l

To identify unique database entries you can run

zcat database.db | jq '.data[].id' | sort -u | wc -l

The latter should then print about 100k according to the information you provided.
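For readers without `jq`, the same counts can be sketched in Node.js. This assumes only what the commands above imply: the file is gzip-compressed JSON with a top-level `data` array whose entries have an `id`. The function name `countIds` is hypothetical, and the whole file is loaded into memory, which may be too much for a very large database.

```javascript
// Hypothetical sketch, assuming a gzipped JSON file shaped like { data: [{ id: ... }, ...] }.
const zlib = require('node:zlib')

// Count all entries and the distinct ids in a gzipped JSON buffer.
function countIds(gzippedJson) {
  const { data } = JSON.parse(zlib.gunzipSync(gzippedJson).toString('utf8'))
  const ids = data.map(entry => entry.id)
  return { total: ids.length, unique: new Set(ids).size }
}

// Self-contained demo with an in-memory buffer instead of a real database file.
const demo = zlib.gzipSync(JSON.stringify({ data: [{ id: 'a' }, { id: 'b' }, { id: 'a' }] }))
console.log(countIds(demo)) // { total: 3, unique: 2 }
```

To run it against a real file, pass the result of `require('node:fs').readFileSync('database.db')` to `countIds`.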

Maybe it is worth reading the internals of the gallery to gain further insights and to clarify further questions.

Thank you for reporting your experience and questions.

@xemle
Owner

xemle commented Jun 28, 2024

is there any limitation to file / folder naming?

One more thing: HomeGallery imports the files in chunks to deal with internal limitations and to provide early feedback (showing images in the browser). So the media import might still be in an intermediate state, with not all of your files imported yet.

This import process can be restarted and does not need to be run in one single run.

@xemle
Owner

xemle commented Jul 30, 2024

Hi @jgyprime

I would like to inform you that the newest master contains stream-based database creation, which requires less memory. So your 400k files should now be processed and updated without problems.

@xemle
Owner

xemle commented Jul 30, 2024

@jgyprime Further, I am happy to announce the first experimental plugin feature in the current master! See docs.home-gallery.org/plugin for further details!

With plugins, you can easily "fix" the geo data issue with your own database mapper.
