-
Notifications
You must be signed in to change notification settings - Fork 359
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add backend: ProQuest Federated Search Gateway #3991
base: dev
Are you sure you want to change the base?
Add backend: ProQuest Federated Search Gateway #3991
Conversation
Other than the TODOs, I think this is far enough to be worth a review. Also worth noting that there is no authentication config, it's simply IP-based access. |
Record display suggestionsDisplay Database in record Update: this is done! Display DOI in record Update: this is done! Hide date in Published field in record In every case, the 260 field (usually publisher name and publication date) provides identical or less specific date information than the date information that's present in the 773. (260 populates the Published area, and 773 populates the Published In area.) I suggest we try displaying ONLY 260, subfield b for the "Published" field. We'll still get the completeness from including the publisher name as well as the journal/publication date, and eliminate the redundancy and clutter of repeating the date. Update: this is done! |
@sturkel89 Most of this is unfortunately not possible -- for only facet available (database sources), the API lets you choose one or all, but not any arbitrary mix. So I could do a lightbox, but I'm not sure there would be any particular point? Re: sorting, I also have it hard-coded at the moment to sort by the number of results for each database -- is it needed to sort differently? |
@sturkel89 So this would be controlled by the record data formatter logic, but EDS (being special) appears to ignore that altogether. I see the default RecordDataFormatterFactory does include that Source field in the description tab. To add it the core as well, an implementer could mess with that class, or in RecordDataFormatter.ini could do something like
to get it to display at the bottom of the core fields. Only problem is, this would in theory affect all of the data sources that use Source (and RecordDataFormatter). Which in practice, is either no other data sources, or if there is another, is probably good for consistency. We could globally move Source up into the core fields in RecordDataFormatterFactory, but I think that's a separate PR since the scale is global. Do you think we should do this for ProQuest alone as a special case? |
* | ||
* @return mixed | ||
*/ | ||
public function getCleanDOI() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding this method addresses @sturkel89's suggestion
Display DOI in record
If it's available, can we display the DOI in either the main part of the record or the Description tab? We have it in the Description field in EDS at Lehigh, and in both the main part of the record and in the Description tab in EDS at Villanova.
It shows up in the Description tab because that's globally configured for the DOI field in RecordDataFormatterFactory.
I think it's correct as just a string, i.e. not a clickable link to some DOI website. Please correct if not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we're not expecting links to be automatically applied here, though by exposing the DOI, you do open the possibility of inserting identifier-driven links when the appropriate settings are enabled (which is a good thing!).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy to see the DOI in the Description tab! Good improvement.
I think sorting the Source list by number of results per database is the best default. If filtering is absolutely not possible, then I guess the lightbox is moot. |
I think we should defer this whole issue to a separate PR (or JIRA ticket, if that would make more sense). Thanks for thinking this through! |
@sturkel89 Yeah the API really doesn't support filtering at all; database is specified as part of the URL path, or "all_subcribed". I do have some enhancement suggestions out to ProQuest but I am not holding my breath. I expect they would point people to the Primo Central API for richer functionality. EDIT: It's possible I can do something useful with the subject filters on database; though it wouldn't be separate facet.. will explore. |
This display logic is in templates/RecordDriver/DefaultRecord/result-list.phtml. I could make a ProQuest-specific version of that, but I think if we make this change it's best done in at the DefaultRecord level in a separate PR for more visibility. I do see that the EDS-specific record driver template displays volume, issue and page number, although that's because it's all just coming from a single EDS field rather than being an intentional choice, but in any case it would be a consistent change. IMHO I'm on the fence about this, it maybe information overload on the results page. |
* | ||
* @return string | ||
*/ | ||
public function getContainerTitle() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sturkel89 notes
Hide second copy of ebook titles on results page
When the item is an ebook, the 245|a (title) field tends to be identical to the 773|t (published in|title) field. This looks weird in record browse:
Can we do a comparison of those two fields, and if they're identical can we either suppress the 773 value from appearing in record browse, or can we replace the 773 with the 786 t (database source)?
This is happening because this getContainerTitle
method simply returns the 773|t. I hear what you are saying about how that looks silly for ebooks. I don't though like the idea of just doing a string comparison for equality; there is likely to be too much dirty data, and you could also end up with situations like a book chapter name that echoes the book title. I thought of just using the 786t anytime it's non-empty, but that is definitely wrong.
Maybe the answer is to do test explicitly if it's an (e)book and then to grab the 786t; I could use getOpenUrlFormat == 'Book'. But I'm not sure if that's a good idea, or a valid test for the idea. @demiankatz ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think my preference would be to ensure that getFormats() returns meaningful values, and then to rely on those, if such a thing is possible. I see that the current getFormats() of the MarcBasicTrait relies on 245$h, which is a totally useless and obsolete way of doing things (and a legacy of the old WorldCat v1 API). I'd be in favor of replacing that with something better, but that's probably a whole separate PR. I haven't delved into refreshing my memory on how getOpenUrlFormat works, but if that's a more practical solution in the short term, I'd be open to using it here as an interim solution and trying to improve getFormats as a separate effort. Is that any help? If not, I'll have to find time to get my hands dirtier digging into the code and MARC standards to give more concrete suggestions. ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok for now I'm fine using getOpenUrlFormat
to detect that it's a book, but what then?
I implemented the logic I suggested above (display the 786t if it's a book, otherwise the 773t as expected). But the template links the database source and does a journal search on it, which of course fails. I don't think it makes sense to display two different types of values in that one field.
For now I've gone back to @sturkel89 's other original suggestion to just suppress the container info if it's a book.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's probably the simplest thing for now!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My one concern has to do with dissertations, which seem to be passing the test and are seen as books - so they don't show the Published line that would identify them as dissertations. Examples of recent dissertations from ProQuest Dissertations & Theses Global:
Interestingly, when I switch the sorting and look at old ones, there are several that "fail" the test and display in the way that I think makes things better for the patron:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ummm - one other concern (but I'm not sure if your new code changed this display or not): when it's a book without a pub year in the record, AND the full text is not included so the link just says Citation/Abstract, it's a little ambiguous what you're looking at.
These are records from the database ARTbibliographies Modern. (It may just be that the metadata in these records is kind of thin, so we don't have many options.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My one concern has to do with dissertations, which seem to be passing the test and are seen as books - so they don't show the Published line that would identify them as dissertations. Examples of recent dissertations from ProQuest Dissertations & Theses Global:
Yes, the dissertations are failing every check in getOpenUrlFormat
so they default to being considered books. The reason they fail all the checks is that it internally calls getFormats
which MarcBasicTrait defines to use 245|h, which as @demiankatz said above is obsolete and useless, and as he said we need a solution that's broader than just this record driver.
(I did look at the rest of the MARC just to see if there was anything else there that might be a format, other than some magic in the leader. The 513|a sometimes says "Book" at a useful time, but sometimes something else like "Prose Book Novel Translation" so I can't just match on "Book", and I can't look for it as a substring because that would catch things like "Book Review". So I don't see any solution.)
Ummm - one other concern (but I'm not sure if your new code changed this display or not): when it's a book without a pub year in the record, AND the full text is not included so the link just says Citation/Abstract, it's a little ambiguous what you're looking at.
These are records from the database ARTbibliographies Modern:
I don't think we have that ARTbibliographies Modern database. I do see the title "Hiratsuka: modern master" coming back from several other PQ databases, with more useful results.
Until we have some broader (not ProQuest-specific) logic to replace the 245|h, I think our only choices are either to leave this "book" logic in place and accept the side effects, or to remove it and deal with the duplication. I suppose we can also be more fine-grained and keep/remove the date and make the other decision for the journal "container" info. @sturkel89 what do you think?
public function getHumanReadablePublicationDates() | ||
{ | ||
$dates = $this->marcGetHumanReadablePublicationDates(); | ||
// For books, we should only display the year |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This addresses @sturkel89 's suggestion
Abbreviate date everywhere for ebook records
It looks weird to me to have the full publication date (month, day, year) appear in ebook records; it makes me think that the item is a newspaper article that is tied to a very specific publication date. Year would be sufficient! I think the pub year is present in the first four numeric characters in the 045 field. Would it be possible to make that replacement if the source is Ebook Central, or based on some other qualifier?
Currently the publication date info comes from 264 or 260, and the code (and presumably the MARC standard) seem to allow for multiple 26x fields to appear. Is that something we have to account for? I'm not sure we do, but if so, we can't just grab the 045. Although parsing that would certainly be easier than the "look for a comma" brittle logic I prototyped here.
Also note I'm applying this to items that appear to be books; not sure if that is the right qualifier. I don't really want the code dependent on what ProQuest happens to call their ebook collection today, even assuming there are no other ebooks returned via the API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the 26x fields have gotten more complicated, allowing for different types of dates that may differ from one another. I think at least some of our code was written before the 264 field even existed, and subsequent updates attempted to simplify things as much as possible. But there are likely nuanced edge cases that aren't handled well. (I also can't remember if the logic I'm thinking of is in SolrMarc, our record drivers, or both).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the 045 documentation, it looks like that can include multiple dates as well. This metadata is way beyond my expertise, but I feel like at least with the 26x's we can easily handle that case. So I'm leaning towards the implementation I've built so far, brittle though it may be. If flaws turn up, we can either fix them or remove this date truncation logic altogether.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this change looks great. With book records, we consistently get only the year from the 045; with scholarly journals, we mostly see month and year; and with news sources, we get the full date.
However, we seem to sometimes see the full date for scholarly journal articles. It seems to happen when there is only one entry in the 045: in those cases, we get month, day, and year. If there are two subfield b entries in the 045, we get month and year (which I think is preferred). Is that by design? These are both records from APA PsycInfo:
URL= http://localhost/vufind_test/ProQuestFSGRecord/3092007636
URL = http://localhost/vufind_test/ProQuestFSGRecord/3093180948
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I think I wasn't clear above, when I said
So I'm leaning towards the implementation I've built so far, brittle though it may be.
I meant keeping the 26x logic
Currently the publication date info comes from 264 or 260, and the code (and presumably the MARC standard) seem to allow for multiple 26x fields to appear.
In the two records above, the 260 data is just different, it has the full date for the second record and only month + year for the first one.
Are you advocating that I should look to the 045 instead? If so I think I would need to figure out how to deal with its variations when it's not just a single date.
…error" including you are not on a valid IP
* @throws BackendException | ||
* @return void | ||
*/ | ||
public function checkForSRUDiagnosticsError($resultBody) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sturkel89 pointed out that if you are not on the correct IP range, the error was hidden and there were just zero results. It turns out that SRU likes to return an HTTP 200 no matter what goes wrong; there is a list linked above of dozens of different error codes, including '3' for authentication problems.
This is a general fix to make any such "diagnostic" messages throw an error akin to an HTTP 4xx.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You would think a "diagnostic" message might be benign debug content; but looking at the list, apparently it's always an error of some kind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to see text suggesting that you don't have access to ProQuest content because you are not on the campus network, but this may be the best we can do for now.
No that's just because we're both running VuFind on our local machines. In a production context, the error would be because the VuFind server was not in ProQuest's configured IP range. Would not matter whether the user had VPN or not.
And FWIW if you turn on development mode in your apache config, the error is more specific (and ugly). I unfortunately can't generate a screenshot of that today because I'm on campus.
Probably best to keep things simple for now; I agree that it would be better to implement (perhaps configurable) base template changes rather than creating a custom ProQuest-specific template. But if what we're doing now is not actively incorrect, it's probably easiest to avoid adding complexity in this initial PR and instead tackle that separately as needed as a follow-up. |
I think this is at a good functional point now.
So I think this is ready for some final reviews and whatever cleanup. |
Thanks for all of the incremental improvements! This is looking a lot better than version 1.0. It's probably just about ready to merge (perhaps pending a response to my question about how date display relates to number of 045 fields). However, I just noticed one other thing that is probably a bug. (I showed Demian and he agrees.) Here it is:
First page of results after viewing a record and then going back to see the results list: Later pages of results after viewing a record and going back to see the results list: |
@sturkel89 This is a good catch. As you and @demiankatz say this is a bug. When I query the underlying API with any I'm going to report the issue to ProQuest, but I'm still waiting on a response to an email from November so not holding my breath. In the meantime I could fix this by doing a duplicate query every time for the facet sidebar, but that seems wasteful when probably 90% of faceting will happen on page 1? Could be convinced otherwise. |
Hard to say what the typical user would do. My preference is to always have the filters available, esp. since PQ can provide such a firehose of results. If the user enters a simple search and gets a huge results set, it may take a couple of screens of browsing for them to realize that they should probably limit the results SOMEHOW. |
The ProQuest Federated Search Gateway is an SRU API for searching across research databases licensed via ProQuest. It's old (docs last updated 2016) but still works and per ProQuest customer service:
On the plus side, it has a fairly rich CQL syntax for search. On the minus, there are no facets offered other than the constituent databases.
Implementation is patterned after the soon-to-be-deleted WorldCat backend as they both use SRU.
TODO