Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add backend: ProQuest Federated Search Gateway #3991

Open
wants to merge 67 commits into
base: dev
Choose a base branch
from

Conversation

maccabeelevine
Copy link
Member

@maccabeelevine maccabeelevine commented Oct 7, 2024

The ProQuest Federated Search Gateway is an SRU API for searching across research databases licensed via ProQuest. It's old (docs last updated 2016) but still works and per ProQuest customer service:

The Federated Gateway is still in use ... There are no plans to discontinue it and we would be fine with them posting the VuFind code for integration for other customers.

On the plus side, it has a fairly rich CQL syntax for search. On the minus, there are no facets offered other than the constituent databases.

Implementation is patterned after the soon-to-be-deleted WorldCat backend as they both use SRU.

TODO

@maccabeelevine maccabeelevine marked this pull request as ready for review November 6, 2024 16:34
@maccabeelevine
Copy link
Member Author

Other than the TODOs, I think this is far enough to be worth a review. Also worth noting that there is no authentication config, it's simply IP-based access.

@sturkel89
Copy link
Contributor

sturkel89 commented Dec 13, 2024

Record display suggestions

Display Database in record
Can we add the Database/Source to the record view, toward the bottom of the main section (as in EDS at Lehigh and at Villanova)?

Update: this is done!


Display DOI in record
If it's available, can we display the DOI in either the main part of the record or the Description tab? We have it in the Description field in EDS at Lehigh, and in both the main part of the record and in the Description tab in EDS at Villanova.

Update: this is done!


Hide date in Published field in record
I've reviewed the display of publication information in a lot of records for a variety of publication types.

In every case, the 260 field (usually publisher name and publication date) provides identical or less specific date information than the date information that's present in the 773. (260 populates the Published area, and 773 populates the Published In area.)

I suggest we try displaying ONLY 260, subfield b for the "Published" field. We'll still get the completeness from including the publisher name as well as the journal/publication date, and eliminate the redundancy and clutter of repeating the date.

Update: this is done!

@sturkel89
Copy link
Contributor

sturkel89 commented Dec 13, 2024

Results browse suggestions

Display full Published in data on results page
I'd like to see FULL "Published In" data from the 773 displayed in results browse, rather than only the publication name and date. If you implement this, it would add volume, issue, and page number information where available.

Update: This is not done for various reasons and won't be done in this PR.


Hide second copy of ebook titles on results page
When the item is an ebook, the 245|a (title) field tends to be identical to the 773|t (published in|title) field. This looks weird in record browse:

Can we do a comparison of those two fields, and if they're identical can we either suppress the 773 value from appearing in record browse, or can we replace the 773 with the 786 t (database source)?

Update: this is done for Ebook Central records and probably others!


Abbreviate date everywhere for ebook records
It looks weird to me to have the full publication date (month, day, year) appear in ebook records; it makes me think that the item is a newspaper article that is tied to a very specific publication date. Year would be sufficient! I think the pub year is present in the first four numeric characters in the 045 field. Would it be possible to make that replacement if the source is Ebook Central, or based on some other qualifier?

Update: this is done for Ebook Central records and probably others!

@maccabeelevine
Copy link
Member Author

Source lightbox?
I'd like to be able to bring up a lightbox so I could sort and filter the list of database sources, as we can sort and filter some facet groups in "regular" VuFind. This would make it easier to include or exclude the groups of historical newspaper sources, for example, and would be extra-great when we have the ability to apply multi-filter selection and deselection.

Is it possible? (NB: This functionality doesn't seem to be available for any facet group in EDS, so maybe it's not possible when there are huge lists of facet values.)

@sturkel89 Most of this is unfortunately not possible -- for only facet available (database sources), the API lets you choose one or all, but not any arbitrary mix. So I could do a lightbox, but I'm not sure there would be any particular point? Re: sorting, I also have it hard-coded at the moment to sort by the number of results for each database -- is it needed to sort differently?

@maccabeelevine
Copy link
Member Author

Display Database in record
Can we add the Database/Source to the record view, toward the bottom of the main section (as in EDS at Lehigh and at Villanova)?

@sturkel89 So this would be controlled by the record data formatter logic, but EDS (being special) appears to ignore that altogether. I see the default RecordDataFormatterFactory does include that Source field in the description tab. To add it the core as well, an implementer could mess with that class, or in RecordDataFormatter.ini could do something like

[Defaults]
core[] = "Source"

[Field_Source]
dataMethod = "getSource"
pos = 10000

to get it to display at the bottom of the core fields. Only problem is, this would in theory affect all of the data sources that use Source (and RecordDataFormatter). Which in practice, is either no other data sources, or if there is another, is probably good for consistency.

We could globally move Source up into the core fields in RecordDataFormatterFactory, but I think that's a separate PR since the scale is global.

Do you think we should do this for ProQuest alone as a special case?

*
* @return mixed
*/
public function getCleanDOI()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding this method addresses @sturkel89's suggestion

Display DOI in record
If it's available, can we display the DOI in either the main part of the record or the Description tab? We have it in the Description field in EDS at Lehigh, and in both the main part of the record and in the Description tab in EDS at Villanova.

It shows up in the Description tab because that's globally configured for the DOI field in RecordDataFormatterFactory.

I think it's correct as just a string, i.e. not a clickable link to some DOI website. Please correct if not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we're not expecting links to be automatically applied here, though by exposing the DOI, you do open the possibility of inserting identifier-driven links when the appropriate settings are enabled (which is a good thing!).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to see the DOI in the Description tab! Good improvement.

@sturkel89
Copy link
Contributor

Source lightbox?
I'd like to be able to bring up a lightbox so I could sort and filter the list of database sources, as we can sort and filter some facet groups in "regular" VuFind. This would make it easier to include or exclude the groups of historical newspaper sources, for example, and would be extra-great when we have the ability to apply multi-filter selection and deselection.

Is it possible? (NB: This functionality doesn't seem to be available for any facet group in EDS, so maybe it's not possible when there are huge lists of facet values.)

@sturkel89 Most of this is unfortunately not possible -- for only facet available (database sources), the API lets you choose one or all, but not any arbitrary mix. So I could do a lightbox, but I'm not sure there would be any particular point? Re: sorting, I also have it hard-coded at the moment to sort by the number of results for each database -- is it needed to sort differently?

I think sorting the Source list by number of results per database is the best default.

If filtering is absolutely not possible, then I guess the lightbox is moot.

@sturkel89
Copy link
Contributor

Display Database in record
Can we add the Database/Source to the record view, toward the bottom of the main section (as in EDS at Lehigh and at Villanova)?

@sturkel89 So this would be controlled by the record data formatter logic, but EDS (being special) appears to ignore that altogether. I see the default RecordDataFormatterFactory does include that Source field in the description tab. To add it the core as well, an implementer could mess with that class, or in RecordDataFormatter.ini could do something like

[Defaults]
core[] = "Source"

[Field_Source]
dataMethod = "getSource"
pos = 10000

to get it to display at the bottom of the core fields. Only problem is, this would in theory affect all of the data sources that use Source (and RecordDataFormatter). Which in practice, is either no other data sources, or if there is another, is probably good for consistency.

We could globally move Source up into the core fields in RecordDataFormatterFactory, but I think that's a separate PR since the scale is global.

Do you think we should do this for ProQuest alone as a special case?

I think we should defer this whole issue to a separate PR (or JIRA ticket, if that would make more sense). Thanks for thinking this through!

@maccabeelevine
Copy link
Member Author

maccabeelevine commented Jan 22, 2025

If filtering is absolutely not possible, then I guess the lightbox is moot.

@sturkel89 Yeah the API really doesn't support filtering at all; database is specified as part of the URL path, or "all_subcribed".
image

I do have some enhancement suggestions out to ProQuest but I am not holding my breath. I expect they would point people to the Primo Central API for richer functionality.

EDIT: It's possible I can do something useful with the subject filters on database; though it wouldn't be separate facet.. will explore.

@maccabeelevine
Copy link
Member Author

Display full Published in data on results page
I'd like to see FULL "Published In" data from the 773 displayed in results browse, rather than only the publication name and date. If you implement this, it would add volume, issue, and page number information where available.

This display logic is in templates/RecordDriver/DefaultRecord/result-list.phtml. I could make a ProQuest-specific version of that, but I think if we make this change it's best done in at the DefaultRecord level in a separate PR for more visibility. I do see that the EDS-specific record driver template displays volume, issue and page number, although that's because it's all just coming from a single EDS field rather than being an intentional choice, but in any case it would be a consistent change. IMHO I'm on the fence about this, it maybe information overload on the results page.

*
* @return string
*/
public function getContainerTitle()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sturkel89 notes

Hide second copy of ebook titles on results page
When the item is an ebook, the 245|a (title) field tends to be identical to the 773|t (published in|title) field. This looks weird in record browse:

Can we do a comparison of those two fields, and if they're identical can we either suppress the 773 value from appearing in record browse, or can we replace the 773 with the 786 t (database source)?

This is happening because this getContainerTitle method simply returns the 773|t. I hear what you are saying about how that looks silly for ebooks. I don't though like the idea of just doing a string comparison for equality; there is likely to be too much dirty data, and you could also end up with situations like a book chapter name that echoes the book title. I thought of just using the 786t anytime it's non-empty, but that is definitely wrong.

Maybe the answer is to do test explicitly if it's an (e)book and then to grab the 786t; I could use getOpenUrlFormat == 'Book'. But I'm not sure if that's a good idea, or a valid test for the idea. @demiankatz ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my preference would be to ensure that getFormats() returns meaningful values, and then to rely on those, if such a thing is possible. I see that the current getFormats() of the MarcBasicTrait relies on 245$h, which is a totally useless and obsolete way of doing things (and a legacy of the old WorldCat v1 API). I'd be in favor of replacing that with something better, but that's probably a whole separate PR. I haven't delved into refreshing my memory on how getOpenUrlFormat works, but if that's a more practical solution in the short term, I'd be open to using it here as an interim solution and trying to improve getFormats as a separate effort. Is that any help? If not, I'll have to find time to get my hands dirtier digging into the code and MARC standards to give more concrete suggestions. ;-)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok for now I'm fine using getOpenUrlFormat to detect that it's a book, but what then?

I implemented the logic I suggested above (display the 786t if it's a book, otherwise the 773t as expected). But the template links the database source and does a journal search on it, which of course fails. I don't think it makes sense to display two different types of values in that one field.

For now I've gone back to @sturkel89 's other original suggestion to just suppress the container info if it's a book.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's probably the simplest thing for now!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking great! These examples from Ebook Central previously listed the full title a second time (and month and day parts of the date, in addition to year) and now they look much cleaner:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My one concern has to do with dissertations, which seem to be passing the test and are seen as books - so they don't show the Published line that would identify them as dissertations. Examples of recent dissertations from ProQuest Dissertations & Theses Global:


Interestingly, when I switch the sorting and look at old ones, there are several that "fail" the test and display in the way that I think makes things better for the patron:

Copy link
Contributor

@sturkel89 sturkel89 Jan 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ummm - one other concern (but I'm not sure if your new code changed this display or not): when it's a book without a pub year in the record, AND the full text is not included so the link just says Citation/Abstract, it's a little ambiguous what you're looking at.

These are records from the database ARTbibliographies Modern. (It may just be that the metadata in these records is kind of thin, so we don't have many options.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My one concern has to do with dissertations, which seem to be passing the test and are seen as books - so they don't show the Published line that would identify them as dissertations. Examples of recent dissertations from ProQuest Dissertations & Theses Global:

Yes, the dissertations are failing every check in getOpenUrlFormat so they default to being considered books. The reason they fail all the checks is that it internally calls getFormats which MarcBasicTrait defines to use 245|h, which as @demiankatz said above is obsolete and useless, and as he said we need a solution that's broader than just this record driver.

(I did look at the rest of the MARC just to see if there was anything else there that might be a format, other than some magic in the leader. The 513|a sometimes says "Book" at a useful time, but sometimes something else like "Prose Book Novel Translation" so I can't just match on "Book", and I can't look for it as a substring because that would catch things like "Book Review". So I don't see any solution.)

Ummm - one other concern (but I'm not sure if your new code changed this display or not): when it's a book without a pub year in the record, AND the full text is not included so the link just says Citation/Abstract, it's a little ambiguous what you're looking at.
These are records from the database ARTbibliographies Modern:

I don't think we have that ARTbibliographies Modern database. I do see the title "Hiratsuka: modern master" coming back from several other PQ databases, with more useful results.

image

Until we have some broader (not ProQuest-specific) logic to replace the 245|h, I think our only choices are either to leave this "book" logic in place and accept the side effects, or to remove it and deal with the duplication. I suppose we can also be more fine-grained and keep/remove the date and make the other decision for the journal "container" info. @sturkel89 what do you think?

public function getHumanReadablePublicationDates()
{
$dates = $this->marcGetHumanReadablePublicationDates();
// For books, we should only display the year
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This addresses @sturkel89 's suggestion

Abbreviate date everywhere for ebook records
It looks weird to me to have the full publication date (month, day, year) appear in ebook records; it makes me think that the item is a newspaper article that is tied to a very specific publication date. Year would be sufficient! I think the pub year is present in the first four numeric characters in the 045 field. Would it be possible to make that replacement if the source is Ebook Central, or based on some other qualifier?

Currently the publication date info comes from 264 or 260, and the code (and presumably the MARC standard) seem to allow for multiple 26x fields to appear. Is that something we have to account for? I'm not sure we do, but if so, we can't just grab the 045. Although parsing that would certainly be easier than the "look for a comma" brittle logic I prototyped here.

Also note I'm applying this to items that appear to be books; not sure if that is the right qualifier. I don't really want the code dependent on what ProQuest happens to call their ebook collection today, even assuming there are no other ebooks returned via the API.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the 26x fields have gotten more complicated, allowing for different types of dates that may differ from one another. I think at least some of our code was written before the 264 field even existed, and subsequent updates attempted to simplify things as much as possible. But there are likely nuanced edge cases that aren't handled well. (I also can't remember if the logic I'm thinking of is in SolrMarc, our record drivers, or both).

Copy link
Member Author

@maccabeelevine maccabeelevine Jan 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the 045 documentation, it looks like that can include multiple dates as well. This metadata is way beyond my expertise, but I feel like at least with the 26x's we can easily handle that case. So I'm leaning towards the implementation I've built so far, brittle though it may be. If flaws turn up, we can either fix them or remove this date truncation logic altogether.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this change looks great. With book records, we consistently get only the year from the 045; with scholarly journals, we mostly see month and year; and with news sources, we get the full date.

However, we seem to sometimes see the full date for scholarly journal articles. It seems to happen when there is only one entry in the 045: in those cases, we get month, day, and year. If there are two subfield b entries in the 045, we get month and year (which I think is preferred). Is that by design? These are both records from APA PsycInfo:

URL= http://localhost/vufind_test/ProQuestFSGRecord/3092007636

URL = http://localhost/vufind_test/ProQuestFSGRecord/3093180948

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I think I wasn't clear above, when I said

So I'm leaning towards the implementation I've built so far, brittle though it may be.

I meant keeping the 26x logic

Currently the publication date info comes from 264 or 260, and the code (and presumably the MARC standard) seem to allow for multiple 26x fields to appear.

In the two records above, the 260 data is just different, it has the full date for the second record and only month + year for the first one.

Are you advocating that I should look to the 045 instead? If so I think I would need to figure out how to deal with its variations when it's not just a single date.

* @throws BackendException
* @return void
*/
public function checkForSRUDiagnosticsError($resultBody)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sturkel89 pointed out that if you are not on the correct IP range, the error was hidden and there were just zero results. It turns out that SRU likes to return an HTTP 200 no matter what goes wrong; there is a list linked above of dozens of different error codes, including '3' for authentication problems.

This is a general fix to make any such "diagnostic" messages throw an error akin to an HTTP 4xx.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You would think a "diagnostic" message might be benign debug content; but looking at the list, apparently it's always an error of some kind.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I run a search off the VPN, I now get this 'red box' error. I think this is better than getting zero results. It would be nice to see text suggesting that you don't have access to ProQuest content because you are not on the campus network, but this may be the best we can do for now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to see text suggesting that you don't have access to ProQuest content because you are not on the campus network, but this may be the best we can do for now.

No that's just because we're both running VuFind on our local machines. In a production context, the error would be because the VuFind server was not in ProQuest's configured IP range. Would not matter whether the user had VPN or not.

And FWIW if you turn on development mode in your apache config, the error is more specific (and ugly). I unfortunately can't generate a screenshot of that today because I'm on campus.

@demiankatz
Copy link
Member

This display logic is in templates/RecordDriver/DefaultRecord/result-list.phtml. I could make a ProQuest-specific version of that, but I think if we make this change it's best done in at the DefaultRecord level in a separate PR for more visibility. I do see that the EDS-specific record driver template displays volume, issue and page number, although that's because it's all just coming from a single EDS field rather than being an intentional choice, but in any case it would be a consistent change. IMHO I'm on the fence about this, it maybe information overload on the results page.

Probably best to keep things simple for now; I agree that it would be better to implement (perhaps configurable) base template changes rather than creating a custom ProQuest-specific template. But if what we're doing now is not actively incorrect, it's probably easiest to avoid adding complexity in this initial PR and instead tackle that separately as needed as a follow-up.

@maccabeelevine
Copy link
Member Author

I think this is at a good functional point now.

So I think this is ready for some final reviews and whatever cleanup.

@sturkel89
Copy link
Contributor

Thanks for all of the incremental improvements! This is looking a lot better than version 1.0. It's probably just about ready to merge (perhaps pending a response to my question about how date display relates to number of 045 fields).


However, I just noticed one other thing that is probably a bug. (I showed Demian and he agrees.)

Here it is:

  • When you run a search and view results, you see the Source filter on every page of search results as you page through the search results list.
  • When you're on the first page of search results, view a record, and then use the browser's back arrow or click Search in the breadcrumbs to go back to the full results list, the Source filter is visible on the results page.
  • BUT! When you are on a later page of search results, view one record, and then go back to the search results list, the Source filter disappears!

First page of results after viewing a record and then going back to see the results list:

image

Later pages of results after viewing a record and going back to see the results list:

image

@maccabeelevine
Copy link
Member Author

maccabeelevine commented Jan 28, 2025

BUT! When you are on a later page of search results, view one record, and then go back to the search results list, the Source filter disappears!

@sturkel89 This is a good catch. As you and @demiankatz say this is a bug. When I query the underlying API with any startRecord param, it refuses to pass me back any facets. The reason the bug doesn't appear when you page normally is that it's using AJAX to retrieve the next few records, and leaving the sidebar alone -- if you turn off load_results_with_js then the problem happens even with normal paging.

I'm going to report the issue to ProQuest, but I'm still waiting on a response to an email from November so not holding my breath. In the meantime I could fix this by doing a duplicate query every time for the facet sidebar, but that seems wasteful when probably 90% of faceting will happen on page 1? Could be convinced otherwise.

@sturkel89
Copy link
Contributor

BUT! When you are on a later page of search results, view one record, and then go back to the search results list, the Source filter disappears!

@sturkel89 This is a good catch. As you and @demiankatz say this is a bug. When I query the underlying API with any startRecord param, it refuses to pass me back any facets. The reason the bug doesn't appear when you page normally is that it's using AJAX to retrieve the next few records, and leaving the sidebar alone -- if you turn off load_results_with_js then the problem happens even with normal paging.

I'm going to report the issue to ProQuest, but I'm still waiting on a response to an email from November so not holding my breath. In the meantime I could fix this by doing a duplicate query every time for the facet sidebar, but that seems wasteful when probably 90% of faceting will happen on page 1? Could be convinced otherwise.

Hard to say what the typical user would do. My preference is to always have the filters available, esp. since PQ can provide such a firehose of results. If the user enters a simple search and gets a huge results set, it may take a couple of screens of browsing for them to realize that they should probably limit the results SOMEHOW.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants