This repository has been archived by the owner on Oct 23, 2024. It is now read-only.

Adaptation of sparql_dataframe to Wikidata #10

Open
lbocken opened this issue Jul 26, 2021 · 7 comments

Comments

@lbocken

lbocken commented Jul 26, 2021

Hello,

I am trying to extract dataframes from queries in Wikidata.

For instance, this code, taken from a Wikidata example, works and returns each country as a dictionary:

# pip install sparqlwrapper
# https://rdflib.github.io/sparqlwrapper/
import sparql_dataframe
import sys
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint_url = "https://query.wikidata.org/sparql"

query = """#Countries
SELECT ?item ?itemLabel
WHERE
{
  ?item wdt:P31 wd:Q6256.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}"""

def get_results(endpoint_url, query):
    user_agent = "WDQS-example Python/%s.%s" % (sys.version_info[0], sys.version_info[1])
    # TODO adjust user agent; see https://w.wiki/CX6
    sparql = SPARQLWrapper(endpoint_url, agent=user_agent)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()

results = get_results(endpoint_url, query)

for result in results["results"]["bindings"]:
    print(result)

But when I run this:
df = sparql_dataframe.get(endpoint_url, query)

I receive this error:

C:\ProgramData\Anaconda3\lib\site-packages\SPARQLWrapper\Wrapper.py:1315: RuntimeWarning: Format requested was CSV, but XML (application/sparql-results+xml;charset=utf-8) has been returned by the endpoint
warnings.warn(message % (requested.upper(), format_name, mime), RuntimeWarning)


AttributeError Traceback (most recent call last)
in
----> 1 df = sparql_dataframe.get(endpoint_url, query)

C:\ProgramData\Anaconda3\lib\site-packages\sparql_dataframe\sparql_dataframe.py in get_sparql_dataframe(endpoint, query, post)
28 sparql.setReturnFormat(CSV)
29 results = sparql.query().convert()
---> 30 _csv = StringIO(results.decode('utf-8'))
31 return pd.read_csv(_csv, sep=",")

AttributeError: 'Document' object has no attribute 'decode'
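
For readers hitting the same traceback: the warning says the endpoint answered with XML although CSV was requested, and for XML responses SPARQLWrapper's convert() returns an xml.dom.minidom Document rather than bytes, which is why .decode fails. The following is only a minimal sketch of a defensive check, not code from sparql_dataframe; the helper name csv_text_from_result is hypothetical.

```python
from xml.dom.minidom import Document, parseString


def csv_text_from_result(result):
    """Return CSV text from a converted query result (hypothetical helper).

    Assumes CSV results arrive as bytes and that an XML fallback arrives
    as an xml.dom.minidom Document, as the traceback above suggests.
    """
    if isinstance(result, (bytes, bytearray)):
        return result.decode("utf-8")
    if isinstance(result, Document):
        # The endpoint ignored the CSV request and sent XML instead.
        raise TypeError(
            "Expected CSV bytes but received an XML document; "
            "the endpoint did not return the requested format."
        )
    raise TypeError("Unexpected result type: %s" % type(result).__name__)


# Bytes decode normally.
print(csv_text_from_result(b"item,itemLabel\n"))

# An XML document produces a readable error instead of an AttributeError.
try:
    csv_text_from_result(parseString("<sparql/>"))
except TypeError as err:
    print(err)
```

A check like this does not fix the underlying format mismatch; it only makes the failure easier to diagnose.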

@lawlesst
Owner

Hello,

Try passing post=True. E.g.:

sparql_dataframe.get(endpoint_url, query, post=True)

You can see in the unit tests that queries against Wikidata should work fine with post=True: https://github.com/lawlesst/sparql-dataframe/blob/master/tests/test_sparql_dataframe.py#L65

@lbocken
Author

lbocken commented Jul 26, 2021


HTTPError Traceback (most recent call last)
in
----> 1 df = sparql_dataframe.get(endpoint_url, query, post = True)
2 df

C:\ProgramData\Anaconda3\lib\site-packages\sparql_dataframe\sparql_dataframe.py in get_sparql_dataframe(endpoint, query, post)
27
28 sparql.setReturnFormat(CSV)
---> 29 results = sparql.query().convert()
30 _csv = StringIO(results.decode('utf-8'))
31 return pd.read_csv(_csv, sep=",")

C:\ProgramData\Anaconda3\lib\site-packages\SPARQLWrapper\Wrapper.py in query(self)
1105 :rtype: :class:QueryResult instance
1106 """
-> 1107 return QueryResult(self._query())
1108
1109 def queryAndConvert(self):

C:\ProgramData\Anaconda3\lib\site-packages\SPARQLWrapper\Wrapper.py in _query(self)
1085 raise EndPointInternalError(e.read())
1086 else:
-> 1087 raise e
1088
1089 def query(self):

C:\ProgramData\Anaconda3\lib\site-packages\SPARQLWrapper\Wrapper.py in _query(self)
1071 response = urlopener(request, timeout=self.timeout)
1072 else:
-> 1073 response = urlopener(request)
1074 return response, self.returnFormat
1075 except urllib.error.HTTPError as e:

C:\ProgramData\Anaconda3\lib\urllib\request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
220 else:
221 opener = _opener
--> 222 return opener.open(url, data, timeout)
223
224 def install_opener(opener):

C:\ProgramData\Anaconda3\lib\urllib\request.py in open(self, fullurl, data, timeout)
529 for processor in self.process_response.get(protocol, []):
530 meth = getattr(processor, meth_name)
--> 531 response = meth(req, response)
532
533 return response

C:\ProgramData\Anaconda3\lib\urllib\request.py in http_response(self, request, response)
638 # request was successfully received, understood, and accepted.
639 if not (200 <= code < 300):
--> 640 response = self.parent.error(
641 'http', request, response, code, msg, hdrs)
642

C:\ProgramData\Anaconda3\lib\urllib\request.py in error(self, proto, *args)
567 if http_err:
568 args = (dict, 'default', 'http_error_default') + orig_args
--> 569 return self._call_chain(*args)
570
571 # XXX probably also want an abstract factory that knows when it makes

C:\ProgramData\Anaconda3\lib\urllib\request.py in _call_chain(self, chain, kind, meth_name, *args)
500 for handler in handlers:
501 func = getattr(handler, meth_name)
--> 502 result = func(*args)
503 if result is not None:
504 return result

C:\ProgramData\Anaconda3\lib\urllib\request.py in http_error_default(self, req, fp, code, msg, hdrs)
647 class HTTPDefaultErrorHandler(BaseHandler):
648 def http_error_default(self, req, fp, code, msg, hdrs):
--> 649 raise HTTPError(req.full_url, code, msg, hdrs, fp)
650
651 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 403: Forbidden

@lawlesst
Owner

I think that's an error returned by the actual Wikidata SPARQL endpoint. It aggressively rate limits.
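
One thing that may be worth ruling out, though it is not confirmed anywhere in this thread, is the user agent: the original snippet carries a TODO pointing at https://w.wiki/CX6, and the sparql_dataframe.get call does not set one. The sketch below bypasses sparql_dataframe entirely and uses only the standard library to send a POST request with an explicit User-Agent and an Accept: text/csv header. The function names and the user agent string are hypothetical placeholders.

```python
import csv
import io
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_csv_request(endpoint, query, user_agent):
    """Build a POST request that asks the endpoint for CSV results."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "User-Agent": user_agent,  # placeholder; describe your own tool
            "Accept": "text/csv",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


def parse_csv_rows(raw):
    """Decode a CSV response body into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(raw.decode("utf-8"))))


def run_query(endpoint, query, user_agent):
    """Send the query and return the parsed rows (makes a network call)."""
    request = build_csv_request(endpoint, query, user_agent)
    with urllib.request.urlopen(request, timeout=60) as response:
        return parse_csv_rows(response.read())
```

Calling run_query(WDQS_ENDPOINT, query, "my-tool/0.1 (contact address here)") returns a list of dicts, which pandas.DataFrame(rows) can turn into a dataframe. If the endpoint is rate limiting, as suggested above, this will not help and waiting before retrying is the safer option.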

@lbockenrs

> Hello,
>
> Try passing post=True. E.g.:
>
> sparql_dataframe.get(endpoint_url, query, post=True)
>
> You can see in the unit tests that queries against Wikidata should work fine with post=True: https://github.com/lawlesst/sparql-dataframe/blob/master/tests/test_sparql_dataframe.py#L65

How would I read a query that is saved in a separate file? Thanks for your help!

@lawlesst
Owner

If your queries are saved in a text file, then you would just read them in like any other text file in Python and save them to a query variable that you would use with sparql_dataframe.get.

Here's a tutorial on reading and writing files in Python: https://realpython.com/read-write-files-python/#reading-and-writing-opened-files

@lbockenrs

lbockenrs commented Nov 22, 2021

This works:

import sparql_dataframe

endpoint_url = "https://query.wikidata.org/sparql"

with open('query.rq', 'r') as file:
    query = file.read()

df = sparql_dataframe.get(endpoint_url, query, post=True)
df

@hbruch

hbruch commented Feb 26, 2023

I just had the same issue querying Wikidata. My first thought was that it might be caused by a version change (SPARQLWrapper was installed in version 2.0.0). That version already contains get_sparql_dataframe, so the code below was successful.

Nevertheless, thanks for creating this lib which made it directly into the wrapper!

from SPARQLWrapper import get_sparql_dataframe

endpoint = "https://query.wikidata.org/sparql"

query = """#Countries
SELECT ?item ?itemLabel
WHERE {
  ?item wdt:P31 wd:Q6256.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
"""
df = get_sparql_dataframe(endpoint, query)
