-
Notifications
You must be signed in to change notification settings - Fork 78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
replace base split by regex split to fix ; inside quotes #215
base: master
Are you sure you want to change the base?
Conversation
Thanks for this. I'm worried about the performance hit though. Typically, the string splitting is what takes the most time when creating a db, and python regular expressions are notoriously slow. This benchmarking code: import re
import timeit
r = re.compile(f''';(?=(?:[^"]|"[^"]*")*$)''')
a = '''gene_id "BSU_00010"; transcript_id "unassigned_transcript_1"; db_xref "EnsemblGenomes-Gn:BSU00010"; db_xref "EnsemblGenomes-Tr:CAB11777"; db_xref "GOA:P05648"; db_xref "InterPro:IPR001957"; db_xref "InterPro:IPR003593"; db_xref "InterPro:IPR010921"; db_xref "InterPro:IPR013159"; db_xref "InterPro:IPR013317"; db_xref "InterPro:IPR018312"; '''
def split(a):
a.split()
def regex(a):
r.split(a)
print("split: " , timeit.Timer(f"split('{a}')", "from __main__ import split").timeit())
print("regex: " , timeit.Timer(f"regex('{a}')", "from __main__ import regex").timeit()) gives this on my laptop:
So for this example, the regex is 30x slower! That means a human GTF, instead of taking ~15 min to build a db, would take over 7 hrs. And based on some initial testing, the gap between the methods grows with increasing attribute string length. Unfortunately without getting the performance closer to what Maybe there could be a fallback option, where the regex only happens under certain circumstances? I think the issue in #212 is not hit until after parsing, but maybe there would be some way of detecting, at parse time, cases where it should try harder with the regex. |
That makes sense, I had not thought about the speed implications of regex. An alternative solution is to allow for this using an additional term in the dialect. Again, given that this is an edge case, I didn't want to slow everything down for this - so I followed the logic in the wormbase_gff2 example where the user needs to infer the dialect using The changes I have made are:
Let me know what you think. |
This PR fixes #212 where semicolons inside quoted string values were causing issues. I was not able to replicate the
AttributeStringError
mentioned in the issue since the usage was not specified but this is what the outputs look like before and after the change. I used the file mentioned in the issue for this test.BEFORE:
AFTER:
Let me know what you think!