Allow users to scrape data from PubMed articles (or from other databases supported by E-utilities, although the script is made to be used with PubMed) in an organized and efficient way. Data will be output as several sheets in an Excel document.
- Run `main_internal_args.py` from the terminal to start the process. Variables that control results can be edited within `main_internal_args.py`.
- The output file will have a filename like `{query_name}_{date}_{number_of_results}res.xlsx`.
- The 'Master Table' will have non-flattened data. Most of the data are in lists, but there is also a dictionary.
  - The dataframe used to make this table is ideal for further work with the specific data pulled from a run.
- After that, there are a number of flattened tables, each relevant to a specific feature.
  - These tables are easier to read but are more difficult to use to gain information on an individual article.
  - These features are:
    - Author
    - Keyword
    - Article ID
    - Abstract
    - Pubtype
    - MeSH Keywords
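The relationship between the Master Table and the flattened tables can be illustrated with a small sketch (plain Python with made-up data and a hypothetical `flatten()` helper; the actual script builds these tables with dataframes):

```python
# Illustration of "flattening": the Master Table keeps a list per article,
# while a per-feature table gets one row per (article, value) pair.
# The data and the flatten() helper are made up for illustration.
master = [
    {"pmid": "1", "authors": ["Smith J", "Lee K"]},
    {"pmid": "2", "authors": ["Chan A"]},
]

def flatten(rows, list_key, out_key):
    """Expand a list-valued column into one row per element."""
    return [
        {"pmid": row["pmid"], out_key: value}
        for row in rows
        for value in row[list_key]
    ]

author_table = flatten(master, "authors", "author")
```

`author_table` now has three rows, one per author, which is easy to scan but harder to use when you want everything about a single article in one place.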
Scripts run in the following order:

1. `main_internal_args.py`
2. `search_ids.py`
3. `pubmed_scraper.py`

**`main_internal_args.py`**

- This script doesn't have a function call within it.
- User can set run parameters here:
  - `query_term`
    - Term/phrase that will be used to query PubMed, e.g.:
      - `translational+AND+microbiome`
      - `cannabis+AND+(inflammation+OR+nausea)`
  - `sort_order`
    - Sort order to return articles in.
    - Not important if pulling ALL articles that match a term; important when pulling fewer than all matching articles, because some need to be left out.
    - Default value is `most+recent`; other options are `journal`, `pub+date`, `relevance`, `title`, and `author`.
  - `results`
    - Number of results to return from the search.
    - No upper limit, but the script will take longer with more results.
    - If the run returns fewer than `results` articles, those are all the matching articles on PubMed.
  - `id_list_filename`
    - Name of a file with a list of IDs to run.
    - Leave blank if using a search query.
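The run parameters above might be set at the top of `main_internal_args.py` roughly like this (a sketch; the variable names follow the list above, but the values are only examples):

```python
# Example run parameters for main_internal_args.py (values are illustrative).

# Term/phrase used to query PubMed; terms joined with +AND+ / +OR+.
query_term = "cannabis+AND+(inflammation+OR+nausea)"

# Sort order for returned articles; only matters when results < total matches.
sort_order = "most+recent"  # or 'journal', 'pub+date', 'relevance', 'title', 'author'

# Number of results to return; larger runs take longer.
results = 200

# Name of a file containing article IDs; leave blank to use the search query.
id_list_filename = ""
```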
**`search_ids.py`**

- This script will take in a list of article IDs or a search query, create an .XML file with data on the specific articles found, and return some info to be passed to the next function.
- Function: `get_article_ids()`
  - Parameters:
    - `query`
      - Query that will be searched.
    - `filename`
      - Filename of an ID list. Okay to leave blank when using a search query.
    - `retmax`
      - Number of results to return.
    - `sort`
      - Sort order for the results.
    - `have_ids`
      - Boolean telling the script whether to use an ID list. Defaults to `False`.
    - `api_key`
      - An API key is not necessary to run but will help with large runs; it increases the access rate from 3 requests/second to 10 requests/second.
  - Returns:
    - A list of 2 elements: `[file_name_fetch, query_str]`
      - `file_name_fetch`: name of the "fetch" .XML file made during the run.
      - `query_str`: the query fed into the function at the beginning.
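A function like `get_article_ids()` maps naturally onto the NCBI E-utilities `esearch` and `efetch` endpoints. The sketch below shows how the parameters above translate into request URLs; the helper functions are hypothetical (not the project's actual code), while the endpoint and parameter names (`db`, `term`, `retmax`, `sort`, `api_key`) are the documented E-utilities ones:

```python
# Sketch: esearch looks up matching article IDs, efetch downloads the full
# XML records for those IDs. No request is sent here; we only build URLs.
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def build_esearch_url(query, retmax=20, sort="relevance", api_key=""):
    """URL that searches PubMed; spaces in `query` are encoded as '+'."""
    params = {"db": "pubmed", "term": query, "retmax": retmax, "sort": sort}
    if api_key:  # an API key raises the limit from 3 to 10 requests/second
        params["api_key"] = api_key
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"

def build_efetch_url(id_list, api_key=""):
    """URL that fetches full XML records for a list of article IDs."""
    params = {"db": "pubmed", "id": ",".join(id_list), "retmode": "xml"}
    if api_key:
        params["api_key"] = api_key
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"

search_url = build_esearch_url("translational AND microbiome", retmax=5)
```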
**`pubmed_scraper.py`**

- This file will take in the filename of the fetch .XML file generated by `search_ids.py`, output a .xlsx file, and return a string with run information.
- Function: `pubmed_xml_parse()`
  - Parameters:
    - `filename`
      - Filename of the .XML 'fetch' file created by `get_article_ids()`.
  - Returns:
    - `return_string`
      - A string made from the filename and number of results; printed in the console after a successful run.
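The parsing step can be sketched with the standard library. The real `pubmed_xml_parse()` extracts many more features and writes them to .xlsx sheets; this minimal version (hypothetical helper) only pulls a PMID and title per article, using element names from the PubMed XML format:

```python
# Sketch: extract a couple of fields from a PubMed 'fetch' XML document.
# Element names (PubmedArticle, PMID, ArticleTitle) follow the PubMed DTD.
import xml.etree.ElementTree as ET

def parse_fetch_xml(xml_text):
    """Return one dict per article with its PMID and title."""
    root = ET.fromstring(xml_text)
    rows = []
    for article in root.iter("PubmedArticle"):
        rows.append({
            "pmid": article.findtext(".//PMID"),
            "title": article.findtext(".//ArticleTitle"),
        })
    return rows

sample = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345</PMID>
      <Article><ArticleTitle>Example title</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

rows = parse_fetch_xml(sample)
```

A list of dicts like `rows` is exactly the shape that feeds cleanly into a dataframe for the Excel export.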
Data in the `data` folder is a list of journals' impact factors and a few other stats for each journal. The data is from 2018 and was downloaded from the InCites Journal Citation Reports. The table is not on GitHub, as it was acquired through a school proxy; the source data is behind a paywall.