added skeleton of package for python support #81
base: master
python support
from pyDataExtraction.commonTypes.Graph import Graph

if __name__ == "__main__":
    graph_data1 = {"A": ["B", "C"], "B": ["A,C"], "C": ["A,D"], "D": ["A"]}
I think "A,C" and "A,D" must be separated?
graph_data1 = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "D"], "D": ["A"]}
Multiline formatting of dictionary seems to be more readable
I've mainly been using black for formatting, and it seems to condense a dictionary onto one line if the line is shorter than 88 characters.
Also, the main block currently just shows how this would be used on the Python side, but this will be where TypeScript interfaces with the code, passing information in and getting information back for visualization. So this is going to be reworked extensively once I can get a look at the data format passed from the PyEvaluationEngine.ts file.
    def __repr__(self):
        pass
"""
# NOTE: ran into issue Node object is not json serializable when ecapsulating in own class
We can add a method in Node class to serialize the data into json.
thanks, I'm just now starting to actually learn about serialization due to this project and a personal rust project.
could you point me in the direction of some documentation on how to do so?
@skewballfox you can simply create a dictionary of the desired form and then serialize it into JSON:
def to_json(self):
    return json.dumps({'from': self.from_node, 'to': self.to})
(this is an example for the Edge class; you can do the same for the Node class)
This is a crude approach, which is inconvenient for classes with many fields, but in this case I think it is quite applicable. Also, this way you can use the from_node field name for the 'from' JSON key.
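A minimal sketch of the approach described above. The Edge class and its attribute names (from_node, to_node) are assumptions that follow the review suggestions, not the PR's final code; only the JSON keys 'from' and 'to' come from the discussion.

```python
import json


class Edge:
    """Hypothetical Edge class; attribute names are assumptions."""

    def __init__(self, from_node: str, to_node: str):
        # `from` is a reserved keyword in Python, so the attribute
        # gets a different name than the JSON key.
        self.from_node = from_node
        self.to_node = to_node

    def to_json(self) -> str:
        # Map the Python-friendly attribute names onto the JSON keys.
        return json.dumps({"from": self.from_node, "to": self.to_node})


edge_json = Edge("A", "B").to_json()
print(edge_json)
```

Building the dictionary by hand keeps the keyword problem out of the class definition entirely, at the cost of listing every field once more in to_json.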
to_node would also be more readable.
@insolor this is similar to what I was doing in __repr__, but that definitely helps fix the issue in a way that doesn't involve mangling the language. I don't know why overloading from was my first guess.
Also, I think I figured out why Node wasn't serializable. It was because the __repr__ function doesn't seem to be recursive. As an example, calling print(graph) doesn't automatically behave as if print(node) were called.
Either I could come up with a generic method in DataType that does this, to avoid having to define it in every current and future dataType that uses other objects as components (if present), or we have to overload the function in those cases.
I think I might have an idea using __slots__; I'll have to see, though.
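One possible shape for such a generic method, sketched under assumptions: json.dumps does not call __repr__ on nested objects, but it does call its `default` hook for anything it cannot serialize, so a base class can use that hook to recurse. The to_dict method and the simplified Node and Graph classes below are hypothetical, not the PR's code.

```python
import json


class DataType:
    """Hypothetical base class; not the PR's actual implementation."""

    def to_dict(self) -> dict:
        # Shallow copy of the instance attributes.
        return dict(self.__dict__)

    def __repr__(self) -> str:
        # json.dumps calls `default` for every object it cannot
        # serialize itself, so nested DataType instances are
        # converted one level at a time.
        return json.dumps(self.to_dict(), default=lambda obj: obj.to_dict())


class Node(DataType):
    def __init__(self, node_id):
        self.id = str(node_id)


class Graph(DataType):
    def __init__(self, node_ids):
        self.nodes = [Node(n) for n in node_ids]


graph_repr = repr(Graph(["A", "B"]))
print(graph_repr)
```

With this in place, subclasses that hold other DataType objects need no __repr__ of their own. Objects without a to_dict method would raise AttributeError in the hook, so a real implementation would probably want an explicit check.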
@@ -0,0 +1,10 @@
from json import dumps
Suggest using import json to follow the "Explicit is better than implicit" principle.
class Graph(DataType):
    """

Why have empty lines here?
        super().__init__()
        self.kind["graph"] = True
        # TODO get working for both a dictionary and an nxn array
        # self.nodes = [Node(node) for node in graph data]
why have commented out code?
I commented it out mainly to have a reference to the other possible implementation, and because I kept going back and forth between different implementations.
I would rather have Nodes and Edges encapsulated in their own class, in case we want to implement different types of graphs.
the commented out code was meant to be deleted after the class was complete. I can go ahead and delete it now though
        # self.nodes = [Node(node) for node in graph data]
        for node in graph_data:
            self.nodes.append({"id": str(node)})
            # TODO change prints to log statements
This seems like too minor a change to leave a TODO for instead of implementing it right away.
That's partially because I have no experience implementing logging across a library. I've only implemented it in a single file before, and didn't try to make the log statements differ across classes.
                # print("edge: ", graph_data[node][edge_i])
                self.edges.append({"from": node, "to": edge})
                # edge_i += 1
            # edge_i = 0
why keep commented out code?
@@ -0,0 +1,24 @@
from json import dumps
Suggest just import json for the sake of readability.
@@ -0,0 +1,24 @@
from json import dumps
from abc import ABC, abstractmethod
Is abstractmethod used somewhere?
no, I didn't realize I had left that in there.
Comments mainly about code style
    graph_data1 = {"A": ["B", "C"], "B": ["A,C"], "C": ["A,D"], "D": ["A"]}
    graph_data2 = {1: [2, 3], 2: [1, 3], 3: [1, 4], 4: [1]}
    graph = Graph(graph_data1)
    print(graph)
Please use logging with respective log level
I'm not sure what you mean. I was mainly using __main__ to verify that the graph implementation works as I was writing it, though that should be moved to a separate directory for testing, as seems to be the standard practice for Python libraries.
I'm not sure if we could use the same test directory as node without pytest throwing errors due to the presence of js files. My experience with pytest and how it works is limited.
class Edge:

    def __init__(self,from: str, to: str,):
        self.fromnode
please use self.from_node
"""
class Edge:

    def __init__(self,from: str, to: str,):
add space after comma
        self.fromnode

    def __repr__(self):
        pass
Why have an empty method?
Because it needs to be implemented. This was mainly meant to be a skeleton of a library, to make it easier for people to contribute to the development of one set of dataTypes. Also, we can't use the dataType's __repr__ method for Edge or Node.
For one, json complains that Node objects aren't serializable when calling __repr__ on graph.
Two, from is a reserved keyword in Python, and also the key expected inside the json representation of a graph. So in Edge's case in particular, it's going to require us to either:
- do something hacky to use from as a keyword (not my preferred option),
- parse the json representation of its dict and change every instance of whatever variable we used to represent from, or
- handle all the edge-related stuff inside graph (would rather avoid that)
    def __init__(self, id: Union[int, str], label: Optional[str] = None):
        super().__init__()
        self.id = id
        if label is None:
Suggest simplifying to:
self.label = label or id
that is actually a hella useful feature I didn't know about. Thanks for that.
Also, given the implementation, and what we are passing the information to, this likely wouldn't work, as the attribute would still be listed in the json output (at least with the method currently used to produce json).
I'm not sure if this would be the case, but I'm trying to avoid all of the nodes being labeled None in the visualizer. So, I set it up so that attributes are only present when a value was assigned.
Can you give an example? I'm not able to understand.
As I understand it, if we go with
self.label = label or id
and no one passes label, the default is None and self.label will be populated by id.
Correct?
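A small sketch of how the `or` fallback behaves. The helper name resolve_label is hypothetical and exists only for illustration. Note that this form always produces a label, whereas the author's version leaves the attribute off the object when no label is given.

```python
from typing import Optional, Union


def resolve_label(node_id: Union[int, str], label: Optional[str] = None):
    # `or` returns its first truthy operand, so a missing label (None)
    # and an empty string both fall back to the id.
    return label or node_id


print(resolve_label("A"))           # no label passed: falls back to the id
print(resolve_label("A", "start"))  # an explicit label wins
print(resolve_label(0, ""))         # an empty label also falls back to the id
```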
        # self.nodes = [Node(node) for node in graph data]
        self.nodes = []
        self.edges = []
        if isinstance(graph_data, dict):
We should not check the type here. We expect graph data to be a dictionary; otherwise we should raise an exception, not silently ignore it.
Well, no. Setting aside classes for graphs (just too much trouble to worry about at this moment), you can have a graph represented as a dict or a 2D array.
After getting the dictionary implementation working, I intended to add an elif isinstance(graph_data, list) check, or something similar, explicitly for a list of lists.
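A sketch of how both points could be reconciled: accept either representation, but raise on anything else so that bad input is not silently ignored. The helper name edges_from is hypothetical, and treating the 2D array as an adjacency matrix (a non-zero entry at [i][j] meaning an edge from i to j) is an assumption, since the thread does not specify the array's meaning.

```python
def edges_from(graph_data):
    """Hypothetical helper: build edge dicts from either representation."""
    edges = []
    if isinstance(graph_data, dict):
        # Adjacency dict: node -> list of neighbours.
        for node, neighbours in graph_data.items():
            for neighbour in neighbours:
                edges.append({"from": node, "to": neighbour})
    elif isinstance(graph_data, list):
        # Assumed adjacency matrix: truthy entry at [i][j] is an edge i -> j.
        for i, row in enumerate(graph_data):
            for j, connected in enumerate(row):
                if connected:
                    edges.append({"from": i, "to": j})
    else:
        # Fail loudly instead of producing an empty graph.
        raise TypeError(f"unsupported graph data: {type(graph_data).__name__}")
    return edges


print(edges_from({"A": ["B", "C"], "B": ["A"]}))
print(edges_from([[0, 1], [1, 0]]))
```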
                # print(edge_i)
                # print("edge: ", graph_data[node][edge_i])
                self.edges.append({"from": node, "to": edge})
                # edge_i += 1
for node, edge in graph_data.items():
    self.node.append({"from": node, "to": edge})
    self.edges.append({"from": node, "to": edge})
tested in ipython
for node, edge in graph_data1.items():
    print(node," : ", edge)
output:
A : ['B', 'C']
B : ['A', 'C']
C : ['A', 'D']
D : ['A']
you could do so with a second loop, for edge in edges
    Args:
        DataType (Union[Dict[str,list],Dict[int,list]]):
add spaces after commas
from typing import Union, Dict, Optional
from abc import ABC, abstractmethod
from pyDataExtraction.commonTypes.Graph import Graph
add newline
We can run flake8 (for linting) and black (for code formatting) locally from the console to make all such changes automatically, which keeps the style consistent across all contributors.
I currently use black for formatting, though I've never used flake8, nor have I used black from the command line. I know there's a pre-commit plugin for git that would call both on attempted commits to make sure that they are used.
You just have to install black locally (or in the Python virtual environment) and run black <file_name>
This will format the file.
As for flake8, it just gives us a consistent style guide for Python across this project, one that follows the PEP 8 guidelines.
Some of the comments by @mvoitko are about exactly this.
Just a suggestion 😄
More comments
        self.kind["text"] = True
        self.text = text_data
        if mimeType is None:
            self.mimeType = mimeType
what are you trying to achieve here?
ah, that was probably a missing not
        self.text = text_data
        if mimeType is None:
            self.mimeType = mimeType
        if fileName is None:
I suspect you wanted the opposite:
if file_name:
    self.file_name = file_name
It may only matter in edge cases far removed from this one, or be a leftover from Python 2, but from what I keep reading, using is None seems to be preferred when dealing with values that may be None.
Also, while editing this, I just remembered why I checked. Because this library currently relies on json.dumps(self.__dict__) to print the json representation of an object in the format the plugin expects, I was only conditionally instantiating these variables.
They weren't part of the object unless they were explicitly added to it.
This may be another reason to either find a better method of creating json objects, or use super to get the inherited json representation and return a version altered in some way.
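One alternative to conditionally creating attributes, sketched under assumptions: always assign every attribute in __init__, and drop the unset ones when serializing. The Text class below is a simplified, hypothetical stand-in that keeps the PR's parameter names; it is not the PR's implementation.

```python
import json
from typing import Optional


class Text:
    """Hypothetical stand-in for the PR's text data type."""

    def __init__(
        self,
        text_data: str,
        mimeType: Optional[str] = None,
        fileName: Optional[str] = None,
    ):
        self.kind = {"text": True}
        self.text = text_data
        # Always assigned, so every instance has the same attributes.
        self.mimeType = mimeType
        self.fileName = fileName

    def __repr__(self) -> str:
        # Unset (None) values are filtered out at serialization time.
        present = {k: v for k, v in self.__dict__.items() if v is not None}
        return json.dumps(present)


text_repr = repr(Text("hello", mimeType="text/plain"))
print(text_repr)
```

This keeps the json output free of null entries while letting the rest of the code rely on the attributes existing.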
    def __init__(
        self,
        text_data: str,
        mimeType: Optional[str] = None,
Let's stick to the PEP 8 Python style guide:
mime_type,
file_name
The variable names are mainly due to the use case and the current method of handling it.
Right now I'm directly converting the respective dataTypes to json objects when they are printed or converted to a string, and the attribute names become the keys inside the json string, so I've been matching the case of the specification listed in the Readme.
Though since we already need to do some manipulation of the json representation before returning it in the case of the Edge class, doing so here shouldn't be a major issue.
Btw, if we got rid of the reliance on __dict__, we could use __slots__ to slightly improve performance and reduce memory usage.
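A sketch of how snake_case attributes, camelCase json keys, and __slots__ could fit together. The to_camel helper and this Text class are hypothetical; the only grounded parts are the names mime_type and file_name suggested in review and the mimeType and fileName spellings used by the PR.

```python
import json
from typing import Optional


def to_camel(name: str) -> str:
    # "mime_type" -> "mimeType"; names without underscores are unchanged.
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


class Text:
    """Hypothetical sketch; not the PR's actual class."""

    # __slots__ removes the per-instance __dict__, so serialization
    # iterates over the slot names instead.
    __slots__ = ("text", "mime_type", "file_name")

    def __init__(
        self,
        text: str,
        mime_type: Optional[str] = None,
        file_name: Optional[str] = None,
    ):
        self.text = text
        self.mime_type = mime_type
        self.file_name = file_name

    def to_json(self) -> str:
        data = {
            to_camel(name): getattr(self, name)
            for name in self.__slots__
            if getattr(self, name) is not None
        }
        return json.dumps(data)


text_json = Text("hello", mime_type="text/plain").to_json()
print(text_json)
```

The Python side stays PEP 8 compliant, and the key spelling expected by the visualizer is produced in one place.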
    def __init__(
        self,
        text_data: str,
        mimeType: Optional[str] = None,
please name params in snake case:
mime_type, file_name
    def __init__(
        self,
        text_data: str,
        mimeType: Optional[str] = None,
please use snake case:
mime_type, file_name
I added a skeleton of a Python library that can be fleshed out to handle the currently supported dataTypes. I can make any changes that are needed. This isn't meant to be working at the current stage so much as to provide a bit of structure that makes it easier to contribute one aspect of functionality at a time.