Hash Functions

In simple terms, a hash function a mapping between a space of arbitrarily sized objects to a finite space.

A famous example of hash functions are the cryptographic hash functions, which map the set of all strings (which is infinite) to a finite set of fixed-length strings.

These functions have the property that identical input strings will map to identical outputs, and provide good probabilistic guarantees that different input strings will not map to the same output (collision resistance).

Here’s an example of a cryptographic hash function:

>>> import hashlib
>>> m = hashlib.sha256(b"Hello world!").hexdigest()
c0535e4be2b79ffd93291305436bf889314e4a3faec05ecffcbb7df31ad9e51a

Now if we change the string by a tiny amount, we should get a drastically different hash.

>>> m = hashlib.sha256(b"hello world!").hexdigest()
7509e5bda0c762d2bac7f90d758b5b2263fa01ccbc542ab5e3df163be08e6ca9

The sha256 function is able to take strings of any size and map them to a 256 bit string (i.e. to a space of $2^{256}$ possible values). Obviously that is a finite space so there is a possibility that collisions will occur, but it is quite miniscule for practical purposes.

Hash functions can be very useful for storing and organizing data.

In this post, I’ll discuss one algorithm for hashing objects which are more complex than linear strings, namely graphs.

A graph is just a set of relations between objects (nodes), and thus it doesnt have an ordering. If we rearrange the characters in a string we get different strings, but if we rearrange nodes in the same graph, we still have the same object.

This invariance is something that a good graph hash function has to capture.

The Weisfeiler-Lehman isomorphism test is a powerful way to check if two graphs are isomorphic (they are structually identical).

This algorithm turns out to be an important foundation of the theory behind the popular Graph Neural Networks used in AI.

Here, I present a hash function which is based on the WL test, which instead of returning True/False for a pair of graphs depending on their structural equivalence, it takes as input a single graph and outputs a string code which will be the same for all graphs which are isomorphic to it.

Neighbourhood aggregation

The fundamental operation in the WL test (and hash version) is the neighbourhood aggregation.

That is, given a node in a graph, we collect information from its neighbours.

Obviously, graphs with different structures will have different local neighbourhoods, so this operation is sensitive to graph structure.

The important part is therefore, to make the operation invariant to permutations of the neighbors.

So, let’s say we have a graph G, a node n, and a dictionary with labels for each node.

The node label can be either the degree of the node, or some kind of hashable node attribute (e.g. name of a person in a social network, element type in a molecule, etc.).

We can also have labels on the edges (see full implementation at the end for that).

So our starting graph will always have some kind of label in the nodes, and optionally a label on the edges.

Of course, the label has to have some graph-level meaning, and not just be some kind of unique identifier which might change across equivalent graphs.)

Here’s how we obtain a new label for n.

    def nei_agg(G, n, node_labels, edge_attr=None):
	"""
	Aggregate neighbour labels for node `n`
	"""
	#start with the current node's label
        x = [node_labels[n]]
        for nei in G.neighbors(n):
            x.append(node_labels[nei])
        return ''.join(sorted(x))

The key operation is the sorted at the return statement, this ensures that no matter how the neighbors are ordered, we always obtain the same label in the end.

I’m using the popular Python network library, networkx here.

Node updating

Now we just repeat the same process for each node in the graph, to obtain a new dictionary of node labels.

    def wl_step(G, labels, edge_attr=None, node_attr=None):
        """
            Aggregate neighbor labels and edge label.
        """
        new_labels = dict()
        for n in G.nodes():
            new_labels[n] = nei_agg(G, n, labels, edge_attr=edge_attr)
        return new_labels

Graph hashing

Finally, we repeat the WL step a number of times, each time aggregating node info from larger distances.

An node in a one-step WL would get labels from its neighbors, a two-step from its neighbors and its neighbors neighbors, and so on..

The more steps we do the more resolution we get in the resulting hash.

So after doing a WL step, we have a dictionary of node labels and now we want a final hash value, again in a permutation invariant manner.

Since the node labels are just strings, we can apply hash functions that work on strings that we saw above to get a fixed-size encoding.

At each WL iteration, we then count the number of times each label occured as a histogram.

Histograms are also permutation invariant.

The full hash is then a concatenation of histograms, one for each iteration of the WL update.

Finally for real, we hash the histogram to get a fixed size representation.


def wl_hash():
    #the label of each node is its degree (for simple graphs)
    node_labels = {str(G.degree(n)) for n in G.nodes()}
    for k in range(iterations):
        node_labels = wl_step(G, node_labels)
        c = Counter()
        # count node labels (histogram)
        for node, d in node_labels.items():
            h = blake2b(digest_size=digest_size)
            h.update(d.encode('ascii'))
            c.update([h.hexdigest()])
        # sort the counter, extend total counts
        items.extend(sorted(c.items(), key=lambda x: x[0]))

    # hash the final counter
    h = blake2b(digest_size=digest_size)
    h.update(str(tuple(items)).encode('ascii'))
    h = h.hexdigest()
    return h

Examples

These are two isomorphic graphs, they are a triangle with a tail.

Even though they have different node IDs, their hash should be identical.

>>> import networkx as nx
>>> G1 = nx.Graph()
>>> G1.add_edges_from([(1, 2),\
		      (2, 3),\
                      (3, 1),\
                      (1, 4)])
>>> G2 = nx.Graph()
>>> G2.add_edges_from([(5,6),\
                       (6,7),\
                       (7,5),\
                       (7,8)])
>>> wl_hash(G1)
0db442538bb6dc81d675bd94e6ebb7ca
>>> wl_hash(G2)
0db442538bb6dc81d675bd94e6ebb7ca
"""

Now let’s remove an edge from the first graph.

>>> G1.remove_edge(3, 1)
>>> wl_hash(G1)
ee902cfa5889b4b918e056864319a871

We get a completely different hash.

Below is a full implementation that lets you have edge and node-labeled graphs.

Full code


"""
    Functions for hashing graphs to strings.
    Isomorphic graphs should be assigned identical hashes.
    For now, only Weisfeiler-Lehman hashing is implemented.
"""

from collections import Counter
from hashlib import blake2b


def wl_hash(G, edge_attr=None, node_attr=None, iterations=3, digest_size=16):
    """
    Returns WL hash of a graph.
    We iteratively aggregate and hash neighbourhoods of each node.
    After each node's neighbors are hashed to obtain updated node labels,
    we hash a histogram of resulting labels as the
    final hash.
    Implementation of
    (http://jmlr.org/papers/volume12/shervashidze11a/shervashidze11a.pdf)
    Parameters
    ----------
    G: graph
        The graph to be hashed.
        Can have node and/or edge attributes. Can also have no attributes.
    edge_attr: string
        The key in edge attribute dictionary to be used for hashing.
        If None, edge labels are ignored.
    node_attr: string
        The key in node attribute dictionary to be used for hashing.
        If None, and no edge_attr given, use
        degree of node as label.
    iterations: int
        Number of neighbor aggregations to perform. Should be larger for larger graphs.
    digest_size: int
        Size of blake2b hash digest to use for hashing node labels.
    In example below,
    we have two triangle graphs with a tail node that
    are isomorphic except for edge labels.
    By specifying the edge_attr option, the graphs receive different hashes.
    Returns
    -------
    h : string
        Hexadecimal string corresponding to hash of the input graph.
    >>> import networkx as nx
    >>> G1 = nx.Graph()
    >>> G1.add_edges_from([(1, 2, {'label': 'A'}),\
                           (2, 3, {'label': 'A'}),\
                           (3, 1, {'label': 'A'}),\
                           (1, 4, {'label': 'B'})])
    >>> G2 = nx.Graph()
    >>> G2.add_edges_from([(5,6, {'label': 'B'}),\
                           (6,7, {'label': 'A'}),\
                           (7,5, {'label': 'A'}),\
                           (7,8, {'label': 'A'})])
    >>> wl_hash(G1) == wl_hash(G2)
    True
    >>> wl_hash(G1, edge_attr='label') == wl_hash(G2, edge_attr='label')
    False
    """

    def nei_agg(G, n, node_labels, edge_attr=None):
        x = [node_labels[n]]
        for nei in G.neighbors(n):
            prefix = "" if not edge_attr else G[n][nei][edge_attr]
            x.append(prefix + node_labels[nei])
        return ''.join(sorted(x))

    def wl_step(G, labels, edge_attr=None, node_attr=None):
        """
            Aggregate neighbor labels and edge label.
        """
        new_labels = dict()
        for n in G.nodes():
            new_labels[n] = nei_agg(G, n, labels, edge_attr=edge_attr)
        return new_labels

    items = []
    node_labels = dict()
    # set initial node labels
    for n in G.nodes():
        if (not node_attr) and (not edge_attr):
            node_labels[n] = str(G.degree(n))
        elif node_attr:
            node_labels[n] = str(G.nodes[n][node_attr])
        else:
            node_labels[n] = ''

    for k in range(iterations):
        node_labels = wl_step(G, node_labels, edge_attr=edge_attr)
        c = Counter()
        # count node labels
        for node, d in node_labels.items():
            h = blake2b(digest_size=digest_size)
            h.update(d.encode('ascii'))
            c.update([h.hexdigest()])
        # sort the counter, extend total counts
        items.extend(sorted(c.items(), key=lambda x: x[0]))

    # hash the final counter
    h = blake2b(digest_size=digest_size)
    h.update(str(tuple(items)).encode('ascii'))
    h = h.hexdigest()
    return h