Max weighted matching #107

Open
wants to merge 5 commits into
base: master

jpweytjens
Contributor

This pull request implements max weighted bipartite graph matching as a method for the OneToOne class. It uses networkx for the actual matching algorithm. Multiple helper functions are defined to transform the MultiIndex output from recordlinkage to a graph object and back.
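As a rough illustration of the round trip described above (MultiIndex to graph, matching, and back), here is a minimal sketch. It is not the code from this pull request: the function name max_weight_one_to_one and the ("left", ...) / ("right", ...) node tagging are assumptions made for the example. Only nx.max_weight_matching is taken from networkx itself.

```python
import networkx as nx
import pandas as pd


def max_weight_one_to_one(weights):
    """Keep a one-to-one subset of candidate links with maximum total weight.

    `weights` is a Series indexed by a two-level MultiIndex of record pairs.
    Hypothetical helper for illustration, not the recordlinkage API.
    """
    graph = nx.Graph()
    for (left, right), weight in weights.items():
        # Tag each node with its side so that identical labels in the two
        # files do not collapse into a single node.
        graph.add_edge(("left", left), ("right", right), weight=weight)

    matching = nx.max_weight_matching(graph)

    pairs = []
    for u, v in matching:
        # Matched edges come back in arbitrary orientation.
        if u[0] == "right":
            u, v = v, u
        pairs.append((u[1], v[1]))

    if not pairs:
        return weights.iloc[0:0]
    # Filter the original Series so its order and index names are preserved.
    return weights[weights.index.isin(pairs)]


candidates = pd.Series(
    [0.9, 0.8, 0.85, 0.1],
    index=pd.MultiIndex.from_tuples(
        [("a1", "b1"), ("a1", "b2"), ("a2", "b1"), ("a2", "b2")]
    ),
)
result = max_weight_one_to_one(candidates)
```

In this toy input the single best pair is (a1, b1), but keeping it leaves only (a2, b2) for a total of 1.0. The matching instead keeps (a1, b2) and (a2, b1), whose total of 1.65 is higher.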

@jpweytjens
Contributor Author

You can easily visualize (small) graphs with the following code:

import networkx as nx  # drawing also requires matplotlib


def draw_weighted_bipartite_graph(graph):
    """Draw a weighted bipartite graph.

    No guarantees are made about the order of the nodes.
    """
    # Use the edge weights as edge labels.
    labels = nx.get_edge_attributes(graph, "weight")
    # Nodes tagged bipartite == 0 form one side of the layout.
    nodes = [n for n, d in graph.nodes(data=True) if d["bipartite"] == 0]
    pos = nx.drawing.layout.bipartite_layout(graph, nodes)

    nx.draw_networkx_edge_labels(graph, pos, edge_labels=labels)
    nx.draw_networkx(graph, pos)

though I'm not sure recordlinkage should contain such a function?
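The drawing helper expects every node to carry a "bipartite" attribute and every edge a "weight" attribute. A minimal sketch of a graph in that shape follows. The builder name build_weighted_bipartite_graph and the tuple node labels are assumptions for the example and do not come from this pull request.

```python
import networkx as nx


def build_weighted_bipartite_graph(weighted_pairs):
    """Build a bipartite graph from (left, right, weight) triples.

    Hypothetical helper: left nodes get bipartite=0, right nodes bipartite=1.
    """
    graph = nx.Graph()
    for left, right, weight in weighted_pairs:
        graph.add_node(("left", left), bipartite=0)
        graph.add_node(("right", right), bipartite=1)
        graph.add_edge(("left", left), ("right", right), weight=weight)
    return graph


graph = build_weighted_bipartite_graph(
    [("a1", "b1", 0.9), ("a1", "b2", 0.8), ("a2", "b1", 0.85)]
)
```

A graph built this way can be passed directly to draw_weighted_bipartite_graph, since both attributes it reads are present.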

@codecov

codecov bot commented Jul 8, 2019

Codecov Report

Merging #107 into master will increase coverage by 0.54%.
The diff coverage is 95.71%.

@@            Coverage Diff             @@
##           master     #107      +/-   ##
==========================================
+ Coverage    77.2%   77.75%   +0.54%     
==========================================
  Files          33       33              
  Lines        2338     2400      +62     
  Branches      376      391      +15     
==========================================
+ Hits         1805     1866      +61     
+ Misses        408      407       -1     
- Partials      125      127       +2

Impacted Files              Coverage Δ
recordlinkage/network.py    91.96% <95.71%> (+9.27%) ⬆️
recordlinkage/base.py       80.31% <0%> (-0.17%) ⬇️

Last update 87a5f4a...470985f.

@jpweytjens
Contributor Author

If I find the time, I'm interested in adding the symmetric best matching and stable matching methods from Franke et al. as well. These methods seem to produce comparable results at a much lower computational cost.
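For readers unfamiliar with the idea, a symmetric best match is commonly understood as keeping a pair only when each record is the other's highest-weighted candidate. The sketch below follows that general reading and is not taken from Franke et al. or from this pull request. The function name symmetric_best_match is an assumption, and ties are kept rather than broken.

```python
import pandas as pd


def symmetric_best_match(weights):
    """Keep pairs where each record is the other's best-scoring candidate.

    `weights` is a Series indexed by a two-level MultiIndex of record pairs.
    Hypothetical helper for illustration only.
    """
    # Best weight seen for each left record and for each right record,
    # broadcast back onto every candidate pair.
    best_for_left = weights.groupby(level=0).transform("max")
    best_for_right = weights.groupby(level=1).transform("max")
    return weights[(weights == best_for_left) & (weights == best_for_right)]


candidates = pd.Series(
    [0.9, 0.8, 0.85, 0.1],
    index=pd.MultiIndex.from_tuples(
        [("a1", "b1"), ("a1", "b2"), ("a2", "b1"), ("a2", "b2")]
    ),
)
best = symmetric_best_match(candidates)
```

This needs only two grouped maxima over the candidate pairs, which is where the lower cost comes from, but it can leave records unmatched: here only (a1, b1) survives.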

Owner

@J535D165 J535D165 left a comment

Many thanks for this well-written pull request. A few change requests and remarks below.

@J535D165
Owner

Thanks @jpweytjens

You can easily visualize (small) graphs with the following code

def draw_weighted_bipartite_graph(graph):
    """Draw a weighted bipartite graph.
    No guarantees are made about the order of the nodes."""

    labels = nx.get_edge_attributes(graph, "weight")
    nodes = [n for n, d in graph.nodes(data=True) if d["bipartite"] == 0]
    pos = nx.drawing.layout.bipartite_layout(graph, nodes)

    nx.draw_networkx_edge_labels(graph, pos, labels)
    nx.draw_networkx(graph, pos)

though I'm not sure recordlinkage should contain such a function?

That would be nice to have in the documentation or on the examples page. If it turns out to be a success, we can add it to the toolkit.

Documentation: https://recordlinkage.readthedocs.io/en/latest/ref-classifiers.html?highlight=onetoone#network

@jpweytjens
Contributor Author

jpweytjens commented Nov 27, 2019

I've added a new commit that should address all the requested changes.

  • A new function, add_weights, adds weights to candidate matches using one of three methods. It returns a Series instead of a DataFrame.
  • Series are now used throughout the max_weighted matching function instead of DataFrames.
  • networkx has been removed from tox and added to setup.py.
  • Documentation has been added for the max_weighted method, and an example is given for the new add_weights function.
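
The add_weights function and its three methods are not shown in this conversation, so the following is only a hypothetical sketch of what "weights as a Series" could look like. It assumes a comparison DataFrame with one row per candidate pair and one column per compared feature, and reduces it to one score per pair with a row sum. The name sum_feature_scores is made up for the example.

```python
import pandas as pd


def sum_feature_scores(features):
    """Collapse a comparison DataFrame into one weight per candidate pair.

    Hypothetical stand-in, not the add_weights function from this PR.
    """
    # The row sum keeps the pair MultiIndex, so the result is a Series
    # that a matching step can consume directly.
    return features.sum(axis=1).rename("weight")


features = pd.DataFrame(
    {"name": [1.0, 0.0, 1.0], "city": [1.0, 1.0, 0.0]},
    index=pd.MultiIndex.from_tuples(
        [("a1", "b1"), ("a1", "b2"), ("a2", "b1")]
    ),
)
pair_weights = sum_feature_scores(features)
```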

I'm curious to hear your thoughts on this new version. If you're happy with it, I would like to add a test for the max_weighted method.

@jpweytjens
Contributor Author

Can you review the changes, @J535D165?
