Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

e_forward() not returning results without IDs named 'src' and 'dest' #614

Open
DataBoyTX opened this issue Nov 20, 2024 · 7 comments
Open

Comments

@DataBoyTX
Copy link
Contributor

DataBoyTX commented Nov 20, 2024

Describe the bug

Customer reported bug with GFQL's e_forward() not returning the correct results unless the edge df IDs are named 'src' and 'dest'

To Reproduce

import graphistry
import pandas as pd
from graphistry import (
    # graph operators
    n, e_undirected, e_forward, e_reverse, e,
    # attribute predicates
    is_in, ge, startswith, contains, match as match_re
)
import pandas as pd

graphistry.register(...)

edges_df = pd.read_csv('blueprint_edges.csv')
nodes_df = pd.read_csv('blueprint_nodes.csv')

edges_df['src'] = edges_df['parent_elid']
edges_df['dest'] = edges_df['elid']

# create two different graphs with the same cols, just named differently: 
g = graphistry.edges(edges_df, 'parent_elid', 'elid').nodes(nodes_df, 'elid') 
g2 = graphistry.edges(edges_df, 'src', 'dest').nodes(nodes_df, 'elid') 

# Test: 
# expected behavior: be able to use e_forward to get edges
# observed behavior: get no edges back unless using e() or e_undirected()
def test_eforward(g):
    guc = g.chain([
        n({"elid":"905e3174aa"}),
        e_forward(),
        n()
    ])

    print('nodes:\n', guc._nodes, '\n' )
    print('edges:\n', guc._edges)


test_eforward(g)
test_eforward(g2)

# notice that the first call with 'g' does not return any edges, where the call with 'g2' does 

blueprint_nodes.zip

Expected behavior

be able to use e_forward to get edges

Actual behavior

get no edges back unless using e() or e_undirected(), or change the IDs to src and dest

Screenshot showing differences of counts with same column contents for IDs, but different names:

image

Graphistry GPU server environment

Hub v2.41.10

PyGraphistry API client environment

  • Where run Jupyter Lab local
  • Version 0.34.17
  • Python Version Python 3.8.5
@DataBoyTX DataBoyTX added the bug label Nov 20, 2024
@lmeyerov
Copy link
Contributor

@DataBoyTX any way to know if they have additional surprisingly named columns? we don't assume hard-coded names, so the issue is generally excess results, not missing...

@DataBoyTX
Copy link
Contributor Author

@lmeyerov - I am not sure what you mean by surprisingly named columns, the column names that don't work are: parent_elid for src ID and elid for dest ID, that returns 0 results, but using src and dest IDs named src and dest does work.

Also curious what you mean by we don't assume hard-coded names?

@lmeyerov
Copy link
Contributor

lmeyerov commented Nov 26, 2024

elid is not a variable name in this repo so I don't know why it would collide, so I'm wondering if their data has more columns , and the bug report does not provide the full schema

@DataBoyTX
Copy link
Contributor Author

@lmeyerov - sorry I missed this comment. I added the link to the dataset above a few weeks ago, there's elid in both the files, and the source code (renamed to src/dest for the edges to show that it works), but it's definitely there.

data file (also in description above): link

$  ls -lh
total 548K
-rwxrwxrwx 1 root root 118K Dec 10 00:58 blueprint_edges.csv
-rwxrwxrwx 1 root root 426K Dec 10 00:58 blueprint_nodes.csv

$  head -n 1 *
==> blueprint_edges.csv <==
elid,parent_elid,depth,bottom_flag,top_flag

==> blueprint_nodes.csv <==
elid,ui_role,bottom_flag

@lmeyerov
Copy link
Contributor

This is fascinating, I think the actual issue is when the node id col matches one of the edge src/dst col names:

g = graphistry.nodes(df1, n).edges(df2, e1, e2)
assert g._node != g._source
assert g._node != g._destination

This is surprising behavior, so digging in

@lmeyerov
Copy link
Contributor

Yep that's it:

=================================== FAILURES ===================================
____________________________ test_hop_binding_reuse ____________________________

    def test_hop_binding_reuse():
        edges_df = pd.DataFrame({'s': ['a', 'b'], 'd': ['b', 'c']})
        nodes1_df = pd.DataFrame({'v': ['a', 'b', 'c']})
        nodes2_df = pd.DataFrame({'s': ['a', 'b', 'c']})
        nodes3_df = pd.DataFrame({'d': ['a', 'b', 'c']})
    
        g1 = CGFull().nodes(nodes1_df, 'v').edges(edges_df, 's', 'd')
        g2 = CGFull().nodes(nodes2_df, 's').edges(edges_df, 's', 'd')
        g3 = CGFull().nodes(nodes3_df, 'd').edges(edges_df, 's', 'd')
    
        g1_hop = g1.hop()
        g2_hop = g2.hop()
        g3_hop = g3.hop()
    
        assert g1_hop._nodes.shape == g2_hop._nodes.shape
        assert g1_hop._edges.shape == g2_hop._edges.shape
>       assert g1_hop._nodes.shape == g3_hop._nodes.shape
E       assert (3, 1) == (2, 1)
E         
E         At index 0 diff: 3 != 2
E         
E         Full diff:
E           (
E         -     2,
E         ?     ^
E         +     3,
E         ?     ^
E               1,
E           )

@lmeyerov
Copy link
Contributor

lmeyerov commented Dec 24, 2024

This can likely be solved through renaming src/dst col names during .hop() (see test in #623) . It's a bit tricky because predicates may use the original column name. I gave it a go and was a bit trickier than hoped, so backed out and switched to an interim workaround of detecting the issue and raising an informative exception.

Longer-term, we should be using symbol tables for names to avoid this kind of issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants