Clarify a FileSet's membership #57
Comments
I don't want a FileSet to have to live as a member of an Object. I have many use cases where free standing FileSet with descriptive metadata is enough. It will not have child objects nor alternate filesets. A FileSet is a very useful concept, but I do not want to enforce the works structure. |
^^ Just a clarification, this only pertains to FileSet getting brought in with '2.0' |
As per discussion in #59, in 2.0 a FileSet should be an independent entity with M:M relationships with Objects or Collections. |
Hey, @dannylamb, would you mind sharing some of those use cases? |
We have to provide off the shelf implementations for smaller, less complex objects like simple images or an audio stream. Essentially just a pres master, one or more derivatives, and some descriptive metadata.
Lather, rinse, repeat for audio, video, generic binaries, etc... I only want to use a pcdm:Object when there's a need to aggregate other objects or filesets. This comes into play for books, newspapers, serials, etc... |
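To make the use case above concrete, here is a minimal sketch of a free-standing FileSet written as plain subject-predicate-object tuples. The `ex:` identifiers, the title value, and the file names are hypothetical. `pcdm:FileSet` is the class being debated for 2.0 in this thread, not a settled term, and no RDF library is assumed.

```python
# Triples as plain tuples; all ex: identifiers are hypothetical, for illustration only.
triples = {
    ("ex:image1", "rdf:type", "pcdm:FileSet"),
    ("ex:image1", "dcterms:title", "Harbour at dusk"),       # descriptive metadata sits on the FileSet
    ("ex:image1", "pcdm:hasFile", "ex:image1/master.tif"),   # preservation master
    ("ex:image1", "pcdm:hasFile", "ex:image1/access.jpg"),   # derivative
    ("ex:image1", "pcdm:hasFile", "ex:image1/thumb.jpg"),    # derivative
}


def objects_of(graph, subject, predicate):
    """Return the sorted objects of all triples matching subject and predicate."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


files = objects_of(triples, "ex:image1", "pcdm:hasFile")

# In this sketch nothing aggregates the FileSet: no pcdm:Object has it as a member.
has_parent = any(p == "pcdm:hasMember" and o == "ex:image1" for s, p, o in triples)
```

Under this reading, a pcdm:Object would only appear once there is something to aggregate, such as the pages of a book.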
@dannylamb OK, thanks for sharing that use case! Let me poke at this a little more, as someone who, like you, also started out somewhat skeptical of the proposed PCDM structure. In your use case, do you see the FileSet as a mere grouping of files or also as a surrogate for an "intellectual object," so to speak? The fact that you might be asserting a bunch of descriptions about the FileSet makes me wonder whether you're coercing the two different concepts together. Back when we collectively started down the road of implementing PCDM, which was important and timely for me as a developer of Sufia and of Penn State's ScholarSphere repository, I was initially resistant to the idea of needing the two-level (Object/Work & FileSet) hierarchy despite agreeing that ultimately we're after developing a common data model that allows interoperability across a wide variety of content types, use cases, and repository systems. Viewing this through the cultural heritage-oriented lens of long-term stewardship, though, I wonder if conflating the concepts of a file bundle and an intellectual entity sets us up for headaches down the road. Assume if you will that your content will be aggregated and re-used by other folks, and assume that your content will be built upon and added to fifteen or fifty years from now. If someone wants to add a file to the intellectual entity which is hidden inside the FileSet, the task is more complex than adding a FileSet to an existing Object -- someone would need to know the FileSet is more than just a file bundle, create a new object (with a brand new URI that did not previously exist on the linked open data web), copy the object's descriptive metadata from the FileSet to the object, etc. I don't believe the above is science fiction. Rather, I believe this is the world we are all hoping to build together. What is the overriding reason for encouraging divergence here? 
Are there specific performance, modeling, or technical reasons for not wanting the extra layer? |
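The later migration described above (mint a new Object, move the description over, link the FileSet) can be sketched roughly as follows. This is an illustration under assumptions, not anyone's actual tooling: the `DESCRIPTIVE` predicate list, the `promote` helper, and the `ex:` identifiers are all hypothetical, and linking the FileSet with `pcdm:hasMember` is one possible choice.

```python
# Hypothetical set of predicates treated as "descriptive" for this sketch.
DESCRIPTIVE = {"dcterms:title", "dcterms:creator"}


def promote(graph, fileset, new_object):
    """Wrap a free-standing FileSet in a newly minted Object.

    Descriptive triples are moved from the FileSet to the new Object;
    file links stay on the FileSet. Returns a new set; the input is unchanged.
    """
    result = set()
    for s, p, o in graph:
        if s == fileset and p in DESCRIPTIVE:
            result.add((new_object, p, o))  # description now belongs to the Object
        else:
            result.add((s, p, o))
    result.add((new_object, "rdf:type", "pcdm:Object"))
    result.add((new_object, "pcdm:hasMember", fileset))
    return result


before = {
    ("ex:fs1", "rdf:type", "pcdm:FileSet"),
    ("ex:fs1", "dcterms:title", "Field recording"),
    ("ex:fs1", "pcdm:hasFile", "ex:fs1/master.wav"),
}
after = promote(before, "ex:fs1", "ex:obj1")
```

The sketch hides the hard part of the argument: someone has to know which FileSets are really intellectual entities, and `ex:obj1` is a URI that did not exist before the migration.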
@mjgiarlo Thanks for the insight into your community's reasoning. And I want to make clear that if FileSet were to remain in the Works extension, then I have no issue with your interpretation of the aggregation. But if you are to import it into the core ontology, I must flatly reject any interpretation beyond restricting the range of ore:aggregates to files. I want the core ontology to be open to extension and refinement, just as you are doing now. And that means with as few rules as possible. By merging the full restrictions imposed by the Works extension into the core ontology, PCDM will be converted into Hydra's application profile. And this will absolutely discourage the use of PCDM outside of the Hydra community. Moreover, interoperability through a shared application profile will be highly problematic. Members of this community are not all working on the same piece of software. And deciding for us all to fuse at the hip in the 'Fedora as a database' layer will only grow more painful and costly as our applications inevitably diverge over time. Why would each of our communities set themselves up for unexpected code updates and data migrations because the other requires changes for new features? That raises serious questions of sovereignty for me. I must be given the freedom to support PCDM without allowing a community other than my own to disrupt scheduling or development. Like you, I know interoperability is a realizable goal. But not by forcing a shared application profile. It's a recipe for disaster. We develop APIs specifically to avoid the pitfalls of that approach. Is it not more in line with the goals of both of our applications to inter-operate through publishing and consuming linked data? Shouldn't we be seeking integration through the semantic web? And then when changes eventually occur, and PCDM 3.0 comes down the pipe, every single Islandora user won't be forced to migrate all of their content in order to remain compatible. |
@dannylamb The intention here is not to impose any notions from Hydra or Fedora, but to share an insight we had in the Hydra PCDM implementation (that groups of Files really aren't a pcdm:Object, but their own thing with their own semantics), and try to find common ground with everyone else involved in PCDM to see if it's a useful thing to include in the core ontology. When I wrote up the original PCDM 2.0 document, I was trying to codify the discussions we had at LDCX, where I thought we had a consensus on this approach. For my part, I was very skeptical of requiring FileSets, and in particular, requiring they be separate from the Page/Component/Part/Whatever they are representing. I agree that most use cases won't have multiple FileSets, and that in practice, you can usually just subsume the group-of-files part into the parent object and be done with it. But I've been convinced that a FileSet shouldn't be a pcdm:Object — it represents a different thing, with its own creation metadata, and from a completely abstract point-of-view should be a different thing. And there are use cases (including some in my own org's collections that we haven't gotten to yet) where having multiple FileSets representing a single Object will be great. So from a practical standpoint, I'd like a community-supported way of handling that. And I think the way Files are grouped and linked is too essential to PCDM to have variation. There are a lot of things I'd be happy to have in an extension (e.g., #60, or File Use). If we can't agree on how to support the use cases we have in front of us in a consistent way, then I have a hard time seeing how we go forward with PCDM. |
In your opinion, how should we attain interoperability then? What's Islandora's goal for "level of interoperability?" I can't speak for Hydra, but I would want "can write a PCDM ingester, and it gets an Islandora object in Hydra." Is the intended level of interop rather "when we look at this structure, it makes sense"?
I don't think any of us intend for that to be the case.
I agree this is a problem, and we're all trying to find the right balance for this too. It seems like Islandora's opinion is that PCDM is done, yes? Objects have Objects, and some of those are collections. If that's the case, then I think we should just gather around that. We lose the ability to automatically crosswalk to other structural schemes (you can only crosswalk without local assumptions if both describe the same levels of structure), but at least PCDM sticks around. So I think my proposal is this:
|
thanks for the list of possible action items, @tpendragon. a few questions:
So does this idea by @escowles become just a Hydra recommended implementation? And what you propose here, is that what this group is agreeing re:Filesets, i.e.:
This ties into the proposed profiles work, IMHO. Happy to email the group to start gathering steam on this with PCDM as it stands now. Might help crystallize stuff being discussed.
Are there links to these things posted, or are they in the discussion of other issues? #60 has suspiciously little discussion on the issue thread, so this seems like perhaps it got hammered out via side channels. Just for transparency's sake, it'd be good to link to those on the specific issue thread.
Looking at this in the context of @dannylamb 's comment here:
I agree with the point that we do not want to add to PCDM, especially as a group, everything in the HydraWorks application profile. But a review of various app profiles is a good place to find and single out proposals, implicit understandings, etc. for modeling discussion, IMHO. For #61, I think we are now clarifying what restrictions everyone does want in PCDM. By discussing the specific cases we do want to (or need to) add, we can get to the how. And this leads to me continuing to ask boring, state-of-discussion questions, like with point 1. |
Yes.
I'm not proposing that - is there tension around hasFile too?
Necessarily, since pcdm:FileSets would remain pcdm:Objects. pcdm:Files aren't subclasses of Object, and have their own predicate - hasFile.
Good point, I retract my statement. Discussion still needs to happen there. |
I'm puzzled by your response, Danny. I offered (what I thought were) non-Hydra-specific thoughts about why our communities might not want to conflate notions of file bundles with intellectual entities with an eye towards long-term stewardship and interoperability, asking questions about your use cases. What I am reading in your response is:
All are valid concerns worth discussing -- I'm mostly puzzled that the non-Hydra-specific questions I asked elicited them, when what I wanted was greater understanding of where you're coming from re: your use cases, and where we're going as a community of folks adopting a shared data model. So, in response to the three good issues raised above:
|
I think maybe I don't understand the specific metadata that would be stored on the FileSets object if there is only one FileSets object and how this would differ from the metadata that would be stored on either the Object or the actual File. How do you see that container being of use or maybe what would you store on the FileSet in a single FileSet scenario? |
@whikloj Howdy, Jared. Good questions.
The only use I have around descriptive metadata on a FileSet is to assert a label describing what the FileSet bundles together. You can see this in the three use cases expressed here: https://github.com/hybox/models/tree/master/notes This differs from the descriptive metadata on the Object of which the FileSet is a member, because the Object's descriptive metadata describes the intellectual entity, so the book or the photograph or the monograph that has the FileSet as a member.
I assume "that container" means the FileSet? As to why I find it useful to separate the FileSet from the Object, see my comment above. |
@mjgiarlo Correct, I did mean the FileSet; sorry for the ambiguity there. Based on your answer and the notes (I think this one was the most relevant), and if I am understanding the argument, the benefit is that of the separation of physical object from virtual representation of that object. So for the question
I see it as a mere grouping of files, which is why I'm not super excited about plugging it into all my resources. We are perhaps behind in our handling of linked data and digitization. This distinction between Real World and Digital is not a concern of anyone I deal with here; not that it is not a valid concern and/or consideration, just that no one handing out projects seems as worried about it as they are with getting the digitized content up and accessible. But if a pcdm:FileSet is a pcdm:Object, and we later want to improve our resources by applying them as "digital representations" of a Real World Object, could we not attach our pcdm:Objects (with the associated pcdm:Files) to this new pcdm:Object (let's call it pcdm:RealObject)? Then our old pcdm:Object becomes a stand-in for a pcdm:FileSet (heck, we can patch it and change it later), but with the benefit of delaying that decision until we are sure what the RWO will be. 'Cause as I said, we aren't having those discussions right now. Thoughts? |
Hi, @whikloj.
Ha, no, I consider that (the distinction between RWOs and digital representation) a separate issue though it's related. The benefit is in keeping file bundlings separate from the intellectual objects that contain them. A while back, I seem to recall you sharing that your primary use case is newspapers. Do I have that right? The distinction here would be disentangling any metadata about a page image from metadata about the page (the digital page object, not the RWO). Why?
I suspect examples will help us more than any additional yada yada I may have to offer. If you have an object that has three distinct files, each of which has a label and 1-2 derivatives, how would you model that in PCDM?
I imagine you can but... oof, so many "objects." We need better language. ;) Would you mind sketching this out a bit (either in prose/snippets or visually, or whatever)? |
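One way to picture the question above is to build the same structure twice, once with the bundles typed as FileSets and once as nested Objects. This is only a sketch of one possible reading: the `model` helper, the labels, the file names, and the `ex:` identifiers are hypothetical, and `pcdm:FileSet` is the class under discussion here rather than a settled term.

```python
def model(parent, parts, bundle_type):
    """Build triples for a parent Object with one bundle per part.

    Each part is (label, [file names]); bundle_type is the rdf:type given
    to every bundle, e.g. "pcdm:FileSet" or "pcdm:Object".
    """
    triples = {(parent, "rdf:type", "pcdm:Object")}
    for i, (label, filenames) in enumerate(parts, start=1):
        bundle = f"{parent}/part{i}"
        triples.add((bundle, "rdf:type", bundle_type))
        triples.add((bundle, "rdfs:label", label))
        triples.add((parent, "pcdm:hasMember", bundle))
        for name in filenames:
            triples.add((bundle, "pcdm:hasFile", f"{bundle}/{name}"))
    return triples


# Three distinct files, each with a label and one or two derivatives.
parts = [
    ("Front", ["master.tif", "access.jpg"]),
    ("Back", ["master.tif", "access.jpg", "thumb.jpg"]),
    ("Spine", ["master.tif", "access.jpg"]),
]

with_filesets = model("ex:book1", parts, "pcdm:FileSet")
with_objects = model("ex:book1", parts, "pcdm:Object")

# The two graphs differ only in the rdf:type of the three bundles.
difference = with_filesets ^ with_objects
```

In this sketch the graph shape is identical either way, which suggests the disagreement is about what the bundle means rather than about how many resources get created.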
Ok @mjgiarlo, I think I understand your reasoning (or I'm getting a better grasp of it). I think essentially my concern is one of sad pragmatic concerns when I compare them with your "preserving the world's knowledge for future generations." What I mean is, stop making me look bad 😉 But long story short I can see a use case for FileSets in newspapers, I think @dannylamb had already seen one for compound objects. So if our current (Fedora 3) structure is: Then I can see a new structure using pcdm:FileSets like: The extra FileSet on the newspaper and issue for the sake of a thumbnail is just extra objects that add up to eat into our storage... and unfortunately this is a concern for me. |
@whikloj Why not reuse thumbnail resources? The |
@scossu that is definitely an option, and I was thinking that same thing once I started drawing it out. But I left the question up because for something like a Collection, you may not want to use an existing thumbnail in that case. |
@scossu no you are correct that in the case of a newspaper, I should shift to using a pointer to the existing thumbnail. Issue derives from 1st page, and Newspaper... gets one from some issue (not always the first one). But I'm not sure that I am getting any benefit from the FileSet on the newspaper level object. Even if I want to use a custom thumbnail I can just store the thumbnail with a hasFile, as there is almost no chance of another person adding a different version. But if it does happen, then I incur the additional costs of creating a FileSet and attaching it to the newspaper level object and moving the Files (read: thumbnail) into the FileSet. Because I don't see that happening too often, I save that storage and complexity until it is needed. That's what I am wondering, about delaying that complexity until I actually need it. |
OK, thanks for sketching that out, @whikloj! One discrepancy I noted between your original diagram and what @scossu drew is that he turned Pages into objects and you had them as FileSets. (FWIW, I'd probably do what @scossu diagrammed so that you can assert descriptive metadata about the Page independent of metadata about the files bundled together as representations of the Page.) Agree too that it'd make sense for the Newspaper and Issue objects to link to thumbnails contained within Page-level FileSets if that'd work for your use cases.
That's the central question of this issue, I suppose, eh? :) Whether you support FileSets as a required layer between Objects and Files probably depends on how you feel about:
I find those compelling and thus I'm down with adding FileSets as a new, required layer (though what I'm not saying, and I'm not hearing, is that Hydra requires this to be so to make use of PCDM). I'm guessing you find them less compelling, thus you'd find the proposition behind this issue less compelling. Case in points? |
I think that is really it from our -- Islandora -- perspective. I think I can honestly say that we do see the usefulness of FileSets, and they can come in handy when they are needed. But, requiring them, or making them mandatory, is not useful from our perspective. It is an extra layer of complexity, and overhead that is not needed 100% of the time. |
Thanks, @ruebot! This has been 👀-opening, in a good way. :) |
Discussion of FileSets has moved on — closing this issue. There is still work going on in the Hydra community about how FileSets should work, and what they represent, and making that compatible with the core ontology. |
A FileSet is the member of only one Object. So, Object:FileSet is 1:M and not M:M. I think this needs to be explicitly stated.
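If the 1:M reading were adopted, it could be checked mechanically. The following is a minimal sketch over plain tuples, assuming FileSets are linked from their parents with `pcdm:hasMember`; the function name and the `ex:` identifiers are hypothetical.

```python
from collections import defaultdict


def filesets_with_multiple_parents(graph):
    """Return {fileset: sorted parents} for FileSets that are members of more than one resource."""
    filesets = {s for s, p, o in graph if p == "rdf:type" and o == "pcdm:FileSet"}
    parents = defaultdict(set)
    for s, p, o in graph:
        if p == "pcdm:hasMember" and o in filesets:
            parents[o].add(s)
    return {fs: sorted(ps) for fs, ps in parents.items() if len(ps) > 1}


graph = {
    ("ex:fs1", "rdf:type", "pcdm:FileSet"),
    ("ex:fs2", "rdf:type", "pcdm:FileSet"),
    ("ex:obj1", "pcdm:hasMember", "ex:fs1"),
    ("ex:obj1", "pcdm:hasMember", "ex:fs2"),
    ("ex:obj2", "pcdm:hasMember", "ex:fs2"),  # second parent: violates the 1:M reading
}
violations = filesets_with_multiple_parents(graph)
```

Under the M:M reading proposed for 2.0 elsewhere in this thread, the same graph would be valid and the check would not apply.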