Huon Wilson huonw Sydney, Australia http://huonw.github.io/ @data61. Formerly: Swift frontend @apple, core team @rust-lang.

Gankra/collect-rs 66

Miscellaneous Collections

huonw/brainfuck_macros 56

A brainfuck procedural macro for Rust

brendanzab/algebra 54

Abstract algebra for Rust (still very much a WIP!)

huonw/2048-4D 19

A small clone of 1024 (https://play.google.com/store/apps/details?id=com.veewo.a1024)

brendanzab/rusp 14

A minimal scripting and data language for Rust.

huonw/alias 10

alias offers some basic ways to mutate data while aliased.

huonw/cfor 8

A C-style for loop macro

huonw/char-iter 3

An iterator over a linear range of characters.

Pull request review comment stellargraph/stellargraph

Enhancement/1591 de duplicate neo4j clucter gcn

 def to_adjacency_matrix(self, node_ids):

     def is_directed(self):
         return self._is_directed

+    def check_graph_for_ml(self):
+        """
+        Checks if all properties required for machine learning training/inference are set up.
+        An error will be raised if the graph is not correctly setup.
+        """
+        num_nodes_with_feats_query = f"""
+            MATCH(n)
+            WHERE "features" in keys(n)
+            return COUNT(*)
+        """
+        result = self.graph_db.run(num_nodes_with_feats_query)
+        num_nodes_with_feats = result.data()[0]["COUNT(*)"]
+        if num_nodes_with_feats == 0:
+            raise RuntimeError(
+                "This StellarGraph has no numeric feature attributes for nodes"
+                "Node features are required for machine learning"
+            )
+
+    def unique_node_type(self, error_message=None):
+        """
+        Return the unique node type, for a homogeneous-node graph.
+
+        Args:
+            error_message (str, optional): a custom message to use for the exception; this can use
+                the ``%(found)s`` placeholder to insert the real sequence of node types.
+
+        Returns:
+            If this graph has only one node type, this returns that node type, otherwise it raises a
+            ``ValueError`` exception.
+        """
+        node_type_query = f"""MATCH (n) RETURN distinct labels(n)"""
+        result = self.graph_db.run(node_type_query)
+        all_types = result.data()[0]["labels(n)"]
+        if len(all_types) == 1:
+            return all_types[0]

Is it a problem that the labels returns a list of individual labels (because Neo4j allows multiple labels per node), and so this is likely to be returning a list?
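To make the concern concrete, here is a hypothetical illustration (the sample rows are invented, not real query output) of how each row of `labels(n)` is itself a list, and what flattening across rows would look like:

```python
# Hypothetical rows from `MATCH (n) RETURN distinct labels(n)`: Neo4j allows
# multiple labels per node, so each row holds a *list* of labels.
rows = [{"labels(n)": ["Person"]}, {"labels(n)": ["Person", "Employee"]}]

# Reading only the first row's list can miss types; flattening every row
# yields the true set of individual labels.
all_types = sorted({label for row in rows for label in row["labels(n)"]})
print(all_types)  # ['Employee', 'Person']
```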

kieranricardo

comment created time in 2 hours

push event stellargraph/stellargraph

Huon Wilson

commit sha 9204034b914b563192a376b8a44687c510df3ba9

Add a demo for fine-tuning a DGI-trained model (#1587)

This adds a demo that uses Deep Graph Infomax and GCN for semi-supervised training. In particular, it:

- pre-trains a GCN model using Deep Graph Infomax
- fine-tunes the trained model for node classification

It does this in a _very_ data-scarce setting: 8 training examples, with one or two examples of each class.

To validate that using Deep Graph Infomax for pre-training is helpful, the demo also compares to a GCN model without pretraining, that is just trained on the 8 training examples directly.

See: #1588

view details

push time in 2 hours

delete branch stellargraph/stellargraph

delete branch : feature/fine-tuning-dgi-demo

delete time in 2 hours

PR merged stellargraph/stellargraph

Reviewers
Add a demo for fine-tuning a DGI-trained model

This adds a demo that uses Deep Graph Infomax and GCN for semi-supervised training. In particular, it:

  • pre-trains a GCN model using Deep Graph Infomax
  • fine-tunes the trained model for node classification

It does this in a very data-scarce setting: 8 training examples, with one or two examples of each class.

To validate that using Deep Graph Infomax for pre-training is helpful, the demo also compares to a GCN model without pretraining, that is just trained on the 8 training examples directly.

See: #1588

+887 -2

2 comments

7 changed files

huonw

pr closed time in 2 hours

pull request comment stellargraph/stellargraph

Share more code between knowledge graph models

I initially inherited from Layer but decided against it, because I could imagine one having a complicated scoring "layer" that was actually a full model containing sublayers (e.g. there are algorithms like ConvE that do some sort of convolutions for scoring).

I just have one question: does subclassing the scoring layers from KGScore and Layer interfere with loading/saving models?

No, I don't think so. AFAIK, the deserialisation either calls the model's constructor or unpickles the class instance; either way, the parent classes would get handled appropriately. Just so I can cover all bases, did you have specific concerns about the extra superclasses?

An empirical test:

import stellargraph as sg
import tensorflow as tf
import numpy as np

G, _, test, _ = sg.datasets.WN18RR().load()

gen = sg.mapper.KGTripleGenerator(G, batch_size=100)

complex = sg.layer.ComplEx(gen, embedding_dimension=10)

model = tf.keras.Model(*complex.in_out_tensors())

original_predictions = model.predict(gen.flow(test))

model.save("/tmp/complex")
loaded_model = tf.keras.models.load_model("/tmp/complex")

loaded_model.summary()

loaded_predictions = loaded_model.predict(gen.flow(test))
np.array_equal(original_predictions, loaded_predictions) # True

summary() output:

Model: "model_4"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_13 (InputLayer)           [(None, 1)]          0                                            
__________________________________________________________________________________________________
input_14 (InputLayer)           [(None, 1)]          0                                            
__________________________________________________________________________________________________
input_15 (InputLayer)           [(None, 1)]          0                                            
__________________________________________________________________________________________________
embedding_10 (Embedding)        (None, 1, 20)        818860      input_13[0][0]                   
                                                                 input_15[0][0]                   
__________________________________________________________________________________________________
embedding_11 (Embedding)        (None, 1, 20)        220         input_14[0][0]                   
__________________________________________________________________________________________________
compl_ex_score_5 (ComplExScore) (None, 1)            0           embedding_10[0][0]               
                                                                 embedding_11[0][0]               
                                                                 embedding_10[1][0]               
==================================================================================================
Total params: 819,080
Trainable params: 819,080
Non-trainable params: 0
__________________________________________________________________________________________________
huonw

comment created time in 2 hours

push event stellargraph/stellargraph

Huon Wilson

commit sha d4f708ad793ba447ba284172750095f97a928d8d

Rename to require_gpu

view details

push time in 3 hours

Pull request review comment stellargraph/stellargraph

Allow overriding the buildkite queue used for tests and notebooks

 services:
       # We need to manually bring in buildkite-agent
       # (https://github.com/buildkite-plugins/docker-compose-buildkite-plugin/issues/158)
       - ${BUILDKITE_AGENT_BINARY_PATH:-/usr/bin/buildkite-agent}:/usr/bin/buildkite-agent
+      # Needed for GPU g4dn.xlarge EC2 instance
+      - /usr/local/cuda-10.1/:/usr/local/cuda10.1/
+      - /var/lib/buildkite-agent/TensorRT-6.0.1.5/targets/x86_64-linux-gnu/lib/:/var/lib/buildkite-agent/TensorRT-6.0.1.5/targets/x86_64-linux-gnu/lib/

Yes, but the versions here are static and controlled by us (https://github.com/stellargraph/stellar-devops/pull/374 https://github.com/stellargraph/stellar-devops/blob/develop/miscellaneous/buildkite-agent-gpu-cuda-bootstrap/buildkite-agent-gpu-cuda-bootstrap.sh ).

huonw

comment created time in 3 hours

issue comment stellargraph/stellargraph

Acceleration for (Directed)GraphSAGELinkGenerator

It's not directly answering the question (I'll let @kieranricardo talk to that), but if you're interested in training an algorithm in an unsupervised fashion, another approach might be DeepGraphInfomax (docs, demo). For instance: https://nbviewer.jupyter.org/gist/huonw/aabad84802783927c9bdd6190b04f713

guomk

comment created time in 3 hours

issue comment stellargraph/stellargraph

Allow the use of edge features for GAT

"Exploiting Edge Features for Graph Neural Networks" https://ieeexplore.ieee.org/document/8954414 / https://arxiv.org/abs/1809.02709 works through some potential approaches to this in more detail (one of the various EGNN variants there is potentially equivalent to the brief description in Eq (5) of the Neural Execution of Graph Algorithms paper).

kjun9

comment created time in 3 hours

Pull request review comment stellargraph/stellargraph

Add SlidingFeaturesNodeGenerator for time series/sequence data

 def __init__(
         bias_regularizer=None,
         bias_constraint=None,
     ):
+        if generator is not None:
+            if not isinstance(generator, SlidingFeaturesNodeGenerator):
+                raise ValueError(
+                    f"generator: expected a SlidingFeaturesNodeGenerator, found {type(generator).__name__}"
+                )
+
+            if seq_len is not None or adj is not None:
+                raise ValueError(
+                    "expected only one of generator and (seq_len, adj) to be specified, found multiple"
+                )
+
+            adj = generator.graph.to_adjacency_matrix(weighted=True).todense()
+            seq_len = generator.window_size

I'd prefer not to "contaminate" this PR too much with things that may require changes elsewhere (e.g. in the demo, and/or merge conflicts with #1560). In addition, I think there's a bit more work required here to make this class super clean, e.g.:

  • switching the generator parameter to be more usable (making seq_len etc. optional)
  • potentially even removing the seq_len parameters and just using generator
huonw

comment created time in 3 hours

Pull request review comment stellargraph/stellargraph

Add SlidingFeaturesNodeGenerator for time series/sequence data

+# -*- coding: utf-8 -*-
+#
+# Copyright 2020 Data61, CSIRO
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+__all__ = [
+    "SlidingFeaturesNodeGenerator",
+    "SlidingFeaturesNodeSequence",
+]
+
+import numpy as np
+from . import Generator
+from tensorflow.keras.utils import Sequence
+
+from ..core.validation import require_integer_in_range
+
+
+class SlidingFeaturesNodeGenerator(Generator):
+    def __init__(self, G, window_size, batch_size=1):
+        require_integer_in_range(window_size, "window_size", min_val=1)
+        require_integer_in_range(batch_size, "batch_size", min_val=1)
+
+        self.graph = G
+
+        node_type = G.unique_node_type(
+            "G: expected a graph with a single node type, found a graph with node types: %(found)s"
+        )
+        self._features = G.node_features(node_type=node_type)
+
+        self.window_size = window_size
+        self._batch_size = batch_size
+
+    def num_batch_dims(self):
+        return 1
+
+    def flow(self, sequence_iloc_slice, target_distance=None):

Oh, yeah, this whole class lacks documentation. I'll add.

huonw

comment created time in 3 hours

Pull request review comment stellargraph/stellargraph

Add SlidingFeaturesNodeGenerator for time series/sequence data

+# -*- coding: utf-8 -*-
+#
+# Copyright 2020 Data61, CSIRO
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+__all__ = [
+    "SlidingFeaturesNodeGenerator",
+    "SlidingFeaturesNodeSequence",
+]
+
+import numpy as np
+from . import Generator
+from tensorflow.keras.utils import Sequence
+
+from ..core.validation import require_integer_in_range
+
+
+class SlidingFeaturesNodeGenerator(Generator):
+    def __init__(self, G, window_size, batch_size=1):
+        require_integer_in_range(window_size, "window_size", min_val=1)
+        require_integer_in_range(batch_size, "batch_size", min_val=1)
+
+        self.graph = G
+
+        node_type = G.unique_node_type(
+            "G: expected a graph with a single node type, found a graph with node types: %(found)s"
+        )
+        self._features = G.node_features(node_type=node_type)
+
+        self.window_size = window_size
+        self._batch_size = batch_size
+
+    def num_batch_dims(self):
+        return 1
+
+    def flow(self, sequence_iloc_slice, target_distance=None):
+        return SlidingFeaturesNodeSequence(
+            self._features,
+            self.window_size,
+            self._batch_size,
+            sequence_iloc_slice,
+            target_distance,
+        )
+
+
+class SlidingFeaturesNodeSequence(Sequence):
+    def __init__(
+        self, features, window_size, batch_size, sequence_iloc_slice, target_distance
+    ):
+        if target_distance is not None:
+            require_integer_in_range(target_distance, "target_distance", min_val=1)
+
+        if not isinstance(sequence_iloc_slice, slice):
+            raise TypeError(
+                f"sequence_iloc_slice: expected a slice(...) object, found {type(sequence_iloc_slice).__name__}"
+            )
+
+        if sequence_iloc_slice.step not in (None, 1):
+            raise TypeError(
+                f"sequence_iloc_slice: expected a slice object with a step = 1, found step = {sequence_iloc_slice.step}"
+            )
+
+        self._features = features[:, sequence_iloc_slice, ...]
+        shape = self._features.shape
+        self._num_nodes = shape[0]
+        self._num_sequence_samples = shape[1]
+        self._num_sequence_variates = shape[2:]
+
+        self._window_size = window_size
+        self._target_distance = target_distance
+        self._batch_size = batch_size
+
+        query_length = window_size + (0 if target_distance is None else target_distance)
+        self._num_windows = self._num_sequence_samples - query_length + 1
+
+        # if there's not enough data to fill one window, there's a problem!
+        if self._num_windows <= 0:
+            if target_distance is None:
+                target_str = ""
+            else:
+                target_str = f" + target_distance={target_distance}"
+
+            total_sequence_samples = features.shape[1]
+            start, stop, step = sequence_iloc_slice.indices(total_sequence_samples)
+            assert step == 1

Yeah, for readability:

  • it's an assert (rather than a real raise ValueError(...)) because it is just syncing with the code on line 69 (e.g. if that one is changed, this one should be too)
  • it's communicating that step is being ignored on purpose, because it has an uninteresting value

I'll add a brief comment.
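For readers unfamiliar with the `sequence_iloc_slice.indices(...)` call under discussion, a small standalone illustration of Python's `slice.indices` (the length 10 is arbitrary):

```python
# slice.indices(length) resolves None/negative bounds into concrete
# (start, stop, step) for a sequence of the given length.
s = slice(2, None)
start, stop, step = s.indices(10)
print(start, stop, step)  # 2 10 1

# A step left as None resolves to 1, which is what an `assert step == 1`
# after such a call relies on.
```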

huonw

comment created time in 3 hours

Pull request review comment stellargraph/stellargraph

Feature/1531 generalize cluster gcn

 def call(self, inputs):
                 "Currently full-batch methods only support a batch dimension of one"
             )
 
-        # Remove singleton batch dimension
-        features = K.squeeze(features, 0)
-        propagated_features = K.squeeze(propagated_features, 0)
-
         # Propagate the node features
         A = As[0]
-        output = (1 - self.teleport_probability) * K.dot(
-            A, propagated_features
-        ) + self.teleport_probability * features
+        if K.is_sparse(A):
+            propagated_features = K.squeeze(propagated_features, 0)

@kieranricardo did you ignore this on purpose? (Not a problem if you don't think it applies, but a reply at least would be appreciated...)

kieranricardo

comment created time in 5 hours

started DrTimothyAldenDavis/GraphBLAS

started time in 2 days

pull request comment stellargraph/stellargraph

Add SlidingFeaturesNodeGenerator for time series/sequence data

This is currently a very naive implementation of the windowing, but one could do it more cleverly using strides, making each window O(1). I am starting with this "obvious" form for ease of review of the whole architecture; I'm thinking we can upgrade to a fancier form in future as a dedicated PR focused on just that.
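A hedged sketch of the strided idea, using NumPy's `sliding_window_view` (available in NumPy 1.20+); the shapes here are invented for illustration and this is not the PR's actual code:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Assumed layout: features is nodes x timesteps.
features = np.arange(12, dtype=np.float32).reshape(2, 6)  # 2 nodes, 6 time steps

# Every length-3 window is a strided *view* into `features`, so building all
# windows costs O(1) extra memory rather than one copy per window.
windows = sliding_window_view(features, window_shape=3, axis=1)
print(windows.shape)  # (2, 4, 3): 4 windows of length 3 per node
```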

The current target selection scheme only works for predicting all elements of a time series from itself. This doesn't work if one wanted to use observations from one time series to predict another, or one had a multivariate time-series (#1527, #1580) and only wanted to predict a single variate.

Some ideas I had:

  • support taking a function to transform the target feature array into the target
  • potentially, to make the above more useful support target_distance being a slice or list in addition to an integer (e.g. have a single model learn 5, 15 and 60 minute ahead speed predictions)
  • support taking a whole extra targets array of (multivariate) time series values of shape A x T x B, where A and B are arbitrary (although A would usually be the number of nodes) and T is at least as large as the number of windows. The first window would have targets targets[:, 0, ...] and so on.
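The last idea can be sketched in plain NumPy; the shapes A, T and B below are arbitrary placeholders, not values from the PR:

```python
import numpy as np

# Hypothetical extra targets array of shape A x T x B, as described above.
A, T, B = 3, 5, 2
targets = np.arange(A * T * B).reshape(A, T, B)

# The first window would use targets[:, 0, ...], the w-th window targets[:, w, ...].
first_window_targets = targets[:, 0, ...]
print(first_window_targets.shape)  # (3, 2)
```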
huonw

comment created time in 4 days

pull request comment stellargraph/stellargraph

Use interleaved real/imag embeddings for complex KG models

Good idea. I'll investigate that next week.

huonw

comment created time in 4 days

pull request comment stellargraph/stellargraph

Support multivariate time series in GCN-LSTM

As with #1564, this depends on other PRs but could be reviewed in parallel, to at least validate the approach is vaguely reasonable, and to discuss the TODOs, like whether they're important for this PR or could be done as later work.

huonw

comment created time in 4 days

pull request comment stellargraph/stellargraph

Change input data to (batch, nodes, timesteps) in GCN-LSTM

@kieranricardo I unrequested you, because you've already got a lot of other reviews from me, but if you're enthusiastic feel free to review (especially since you've reviewed the other GCN-LSTM work).

huonw

comment created time in 4 days

pull request comment stellargraph/stellargraph

Add SlidingFeaturesNodeGenerator for time series/sequence data

This builds on #1579, but can be reviewed before that lands. (Even if #1579 doesn't land, this just needs to be updated to swap some dimensions.)

huonw

comment created time in 4 days

pull request comment stellargraph/stellargraph

Share more code between knowledge graph models

This builds on #1572, and so isn't ready to land directly, but it could be at least briefly reviewed now, to parallelise any feedback.

huonw

comment created time in 4 days

pull request comment python/cpython

bpo-40630: add tracemalloc.reset_peak

It's fine to land for 3.10; I was just a little disconcerted and unhappy that we missed the feature cut-off by days, and so will probably have to wait for an extra year for this to be released.

huonw

comment created time in 4 days

Pull request review comment python/cpython

bpo-40630: add tracemalloc.reset_peak

 Functions
    :mod:`tracemalloc` module as a tuple: ``(current: int, peak: int)``.
 
+.. function:: reset_peak()
+
+   Set the peak size of memory blocks traced by the :mod:`tracemalloc` module
+   to the current size.
+
+   Do nothing if the :mod:`tracemalloc` module is not tracing memory
+   allocations.
+
+   This function only modifies the recorded peak size, and does not modify or
+   clear any traces, unlike :func:`clear_traces`. Snapshots taken with
+   :func:`take_snapshot` before a call to :func:`reset_peak` can be
+   meaningfully compared to snapshots taken after the call.
+
+   See also :func:`get_traced_memory`.
+

Done.

huonw

comment created time in 4 days

push event huonw/cpython

Huon Wilson

commit sha e44ca731dc97433ba8c66caed77faafd8d11ef8c

Add versionadded directive

view details

push time in 4 days

push event stellargraph/stellargraph

Huon Wilson

commit sha cd616bdb9958fe0717e571f52c82b755df0b0b35

Support edge features in StellarGraph objects (#1574)

This adds support for storing edge features in StellarGraph objects. This works by taking any non-special columns of the edges DataFrame(s) as feature columns, similar to nodes (except nodes have no special columns). In particular, features are columns other than `source_column`, `target_column`, `edge_weight_column` and (if specified) `edge_type_column`.

This opens up the ability for us to have algorithms that use edge information, such as #1328 and #1329.

This diff here is quite messy, but there's a few core changes:

- moving the "features" support into `ElementData` instead of `NodeData`, by changing `type_starts` that was previously `[(type_name, first_type_iloc), ...]` into `type_info` that is `[(type_name, type_features), ...]`; the length (of the first axis) of the `type_features` array thus defines which nodes are of that type
- making sure that the edges conversion code can create the appropriate NumPy arrays
- duplicating `unique_node_type`, `node_feature_sizes`, `node_features` to the `edge` equivalent, by creating private versions that switch between them
- updating the docs, including the Pandas demo

See: #1327

view details

push time in 4 days

delete branch stellargraph/stellargraph

delete branch : feature/1327-edge-features

delete time in 4 days

PR merged stellargraph/stellargraph

Reviewers
Support edge features in StellarGraph objects

This adds support for storing edge features in StellarGraph objects. This works by taking any non-special columns of the edges DataFrame(s) as feature columns, similar to nodes (except nodes have no special columns). In particular, features are columns other than source_column, target_column, edge_weight_column and (if specified) edge_type_column.

This opens up the ability for us to have algorithms that use edge information, such as #1328 and #1329.

This diff here is quite messy, but there's a few core changes:

  • moving the "features" support into ElementData instead of NodeData, by changing type_starts, which was previously [(type_name, first_type_iloc), ...], into type_info, which is [(type_name, type_features), ...]; the length (of the first axis) of the type_features array thus defines which nodes are of that type
  • making sure that the edges conversion code can create the appropriate NumPy arrays
  • duplicating unique_node_type, node_feature_sizes, node_features to the edge equivalent, by creating private versions that switch between them
  • updating the docs, including the Pandas demo

See: #1327

+1016 -377

3 comments

7 changed files

huonw

pr closed time in 4 days

pull request comment stellargraph/stellargraph

Support multivariate time series in GCN-LSTM

Yeah, I agree it doesn't match the Keras convention, but with #1564, users will be able to use StellarGraph as input, and so won't have to deal with this as directly.

huonw

comment created time in 4 days

push event stellargraph/stellargraph

Huon Wilson

commit sha 10865f038ac662e40267f47925fcd4129c30706d

Another merge conflict resolution failure

view details

push time in 4 days

pull request comment stellargraph/stellargraph

Use interleaved real/imag embeddings for complex KG models

I can't prove it with tracemalloc for some reason (maybe because I'm wrong?) but check out this colab notebook and watch the RAM usage. When it hits the tf.complex step, RAM usage doubles. I observe the same thing locally, which makes me think that tf.complex is copying the tensors around.

Hm, interesting experiment. I can reproduce the memory increase, but only for the first call to tf.complex. For instance, if I add two cells like:

del complex_tnsr
complex_tnsr = tf.complex(tnsr, tnsr)

I see the memory usage decrease after the del, and not increase after the second complex_tnsr assignment.

Also, that'll be running in eager mode, whereas our layer code will not.

Out of curiosity have you tested and timed the consecutive/concatenated approach?

I think so; I believe it was the same as the interleaved approach, both with tf.complex and with manually implemented operations.

huonw

comment created time in 4 days

push event stellargraph/stellargraph

Huon Wilson

commit sha 898183114b2a6b0f933d8b3f0419811420e9bc7c

Add nblink file

view details

push time in 4 days

push event stellargraph/stellargraph

Kieran Ricardo

commit sha 7000273f477b9e8e1afefa8b9a0f636f7a3985d0

Feature/neo4j cluster gcn (#1567)

This implements a Neo4jClusterNodeGenerator using neo4j to look up the adjacency structure and node features, along with a bare-bones demo. To minimize code duplication this PR also implements a very minimal Neo4jStellarGraph.

See: #1530

* basic working cluster gcn for neo4j
* refactor neo4j query functions into a separate file
* minimal Neo4jStellarGraph implementation: this commit adds a `Neo4jStellarGraph` class with a few basic methods, and minimizes code duplication for ClusterGCN by passing this new class to `ClusterGCNNodeSequence`
* added a FIXME
* very minimal demo
* use directed neighbours in adj
* review suggestion query cleanup
* review suggestion remove comment
* review suggestion clean-up nodes query
* simplify to_adjacency matrix
* remove collect from node_ids and node_features queries for a ~30% speedup
* improve demo
* Update stellargraph/connector/neo4j/graph.py (co-authored-by: Tim Pitman <4321744+timpitman@users.noreply.github.com>)
* docstrings and issues
* formatting
* add demo to neo4j notebooks for ci
* edited demo
* edited demo index
* update parallelism
* Update stellargraph/connector/neo4j/graph.py (co-authored-by: Huon Wilson <Huon.Wilson@data61.csiro.au>)
* review suggestions: demo typo fix + added a link to
* add a reference to Cluster-GCN on Neo4j
* add a reference to Cluster-GCN on Neo4j
* added nblink
* fix whitespace

Co-authored-by: Tim Pitman <4321744+timpitman@users.noreply.github.com>
Co-authored-by: Huon Wilson <Huon.Wilson@data61.csiro.au>

view details

Huon Wilson

commit sha 2d3a53d4bb641915b465bbac7aa304f68efdf321

Add a notebook for fine-tuning a DGI-trained model

view details

push time in 4 days

issue opened stellargraph/stellargraph

Add a demo using Deep Graph Infomax to pretrain a model for fine-tuning

Description

Deep Graph Infomax trains a base model. This base model can be further trained after DGI, to fine tune it for a downstream task like node classification. This probably helps in data-scarce environments. We could have a demo of this.

created time in 4 days

create branch stellargraph/stellargraph

branch : feature/fine-tuning-dgi-demo

created branch time in 4 days

push event stellargraph/stellargraph

Huon Wilson

commit sha 1e96cd2122ae6b8bcc89e8bb59a16bddc8e4fe3a

Fix incorrect merge conflict resolution

view details

push time in 4 days

pull request comment stellargraph/stellargraph

Use interleaved real/imag embeddings for complex KG models

I don't think there's any copying of tensors here? The slicing is cheap, including strided slicing. I believe the complex function doesn't copy things either, but maintains separate real and imag tensors.

I initially had things done on the manual strided tensors and it was noticeably slower than using TF's native complex support.
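As a rough NumPy analogue of why strided slicing is cheap (this is an illustration, not the actual layer code): basic slices are views sharing the original buffer, so no data is copied when splitting interleaved real/imag components.

```python
import numpy as np

# Interleaved embedding layout: [re0, im0, re1, im1, ...]
emb = np.arange(8.0)
re, im = emb[0::2], emb[1::2]

# Both strided slices are views into `emb`, not copies.
print(re.base is emb and im.base is emb)  # True

# Combining into complex values, analogous to tf.complex(re, im).
c = re + 1j * im
```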

huonw

comment created time in 4 days

Pull request review comment stellargraph/stellargraph

Feature/1531 generalize cluster gcn

 def call(self, inputs):
                 "Currently full-batch methods only support a batch dimension of one"
             )
 
-        # Remove singleton batch dimension
-        features = K.squeeze(features, 0)
-        propagated_features = K.squeeze(propagated_features, 0)
-
         # Propagate the node features
         A = As[0]
-        output = (1 - self.teleport_probability) * K.dot(
-            A, propagated_features
-        ) + self.teleport_probability * features
+        if K.is_sparse(A):
+            propagated_features = K.squeeze(propagated_features, 0)

This could have a FIXME(#1222), since I think that this could be generalised in the same manner.

kieranricardo

comment created time in 4 days

pull request comment stellargraph/stellargraph

Use interleaved real/imag embeddings for complex KG models

Do you have a specific idea for the best way to do that? I don't think the Embedding layer allows for complex types, e.g.:

import tensorflow as tf
idx = tf.constant([[0, 1]])

c = tf.keras.layers.Embedding(10, 20, dtype=tf.complex64)
c(idx) # ValueError: Expected float or integer dtype, got <dtype: 'complex64'>.

i = tf.keras.layers.Embedding(10, 20, embeddings_initializer="zeros", dtype=tf.int32)
i(idx)  
# array([[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
#         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]],
#       dtype=int32)>

We'd have to do something like implement our own ComplexEmbedding layer, using tf.nn.embedding_lookup directly internally. I guess this is possible, but it seems kinda unfortunate to have to maintain more code.

It also doesn't generalise to a model like RotationE or RotationH, where relations end up with two separate chunks of embeddings. See https://nbviewer.jupyter.org/gist/huonw/52d585e571fde62e252c54fbe668faa3#RotationE-and-RotationH : The m_r are computed like RotatE, with a phase argument, while the r_r are complex numbers. This would be phrased in the scheme in this PR, as an embedding layer of size 3n, that gets split into (for instance) 0:n for the phase of m_r, n::2 for the real part of r_r and n+1::2 for the imag part of r_r.
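A concrete (hypothetical) NumPy rendering of that size-3n splitting scheme, with n chosen arbitrarily:

```python
import numpy as np

n = 4
emb = np.arange(3 * n, dtype=float)  # one size-3n embedding vector

phase = emb[0:n]        # phase of m_r
r_real = emb[n::2]      # real part of r_r
r_imag = emb[n + 1::2]  # imag part of r_r

# Each chunk has exactly n entries, as the description above requires.
assert len(phase) == len(r_real) == len(r_imag) == n
```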

huonw

comment created time in 4 days

push event stellargraph/stellargraph

Huon Wilson

commit sha cd41384883b07730069017f15d06fc7ccf4e9f2d

Allow multidimensional features in StellarGraph (#1561)

This builds on the construct-from-NumPy support in #1556 to allow node features to be multidimensional. That is, instead of just a 2D matrix of size N × F, corresponding to feature vectors of length F for each of the N nodes, a `StellarGraph` object can now have a feature array of shape N × F1 × F2 × ..., corresponding to feature tensors of shape F1 × F2 × ... for each node.

This is useful for multivariate time series, where the most natural modelling of the data is a time_steps × num_variates matrix (or the transpose) for each node.

None of our models will likely do anything useful with higher rank feature tensors, at the moment (#1527). They'll all throw errors because `StellarGraph.check_graph_for_ml` calls `StellarGraph.node_feature_sizes`, which emits an error for node types with complicated shapes, for backwards compatibility.

See: #1524

view details

Huon Wilson

commit sha c674ef909edd63bd4548806d46cbfc59d0159db8

Merge remote-tracking branch 'origin/develop' into feature/1327-edge-features

view details

push time in 5 days

Pull request review commentstellargraph/stellargraph

Support edge features in StellarGraph objects

 class ElementData:      Args:         ids (sequence): the IDs of each element-        type_starts (list of tuple of type name, int): the starting iloc of the elements of each type within ``shared``+        type_info (list of tuple of type name, numpy array): the associated feature vectors of each type, where the size of the first dimension defines the elements of that type     """ -    def __init__(self, ids, type_starts):-        if not isinstance(type_starts, list):+    def __init__(self, ids, type_info):+        if not isinstance(type_info, list):             raise TypeError(-                f"type_starts: expected list, found {type(type_starts).__name__}"+                f"type_info: expected list, found {type(type_info).__name__}"             )          type_ranges = {}-        type_stops = type_starts[1:] + [(None, len(ids))]-        consecutive_types = zip(type_starts, type_stops)-        for idx, ((type_name, start), (_, stop)) in enumerate(consecutive_types):-            if idx == 0 and start != 0:-                raise ValueError(-                    f"type_starts: expected first type ({type_name!r}) to start at index 0, found start {start}"-                )-            if start > stop:+        features = {}+        all_types = []+        type_sizes = []++        rows_so_far = 0++        # validation+        for type_name, data in type_info:+            if not isinstance(data, np.ndarray):                 raise TypeError(-                    f"type_starts (for {type_name!r}): expected valid type range, found start ({start}) after stop ({stop})"+                    f"type_info (for {type_name!r}): expected numpy array, found {type(data).__name__}"                 )++            if len(data.shape) != 2:+                raise ValueError(+                    f"type_info (for {type_name!r}): expected 2 dimensions, found {len(data.shape)}"+                )++            rows, _columns = data.shape

It was on purpose in the original code (that's what the leading _ conveys, in my mind), trying to express the rank-2 nature of data... however, that was all changed by the multidimensional feature change in #1561 that's now merged, and I guess it'll be resolved by the merge with develop.

huonw

comment created time in 5 days

Pull request review commentstellargraph/stellargraph

Make metapath walk faster by grouping adjacency lists by node type

 def degrees(self, *, ins=True, outs=True):         adj = self._adj_lookup(ins=ins, outs=outs)         return defaultdict(int, ((key, len(value)) for key, value in adj.items())) -    def edge_ilocs(self, node_id, *, ins, outs) -> np.ndarray:+    def edge_ilocs(+        self, node_iloc, *, ins, outs, other_node_type_iloc=None+    ) -> np.ndarray:         """-        Return the integer locations of the edges for the given node_id+        Return the integer locations of the edges for the given node iloc          Args:-            node_id: the ID of the node+            node_iloc: the iloc of the node           Returns:-            The integer locations of the edges for the given node_id.+            The integer locations of the edges for the given node iloc.         """ -        return self._adj_lookup(ins=ins, outs=outs)[node_id]+        if other_node_type_iloc is not None:+            return self._adj_lookup_by_other_node_type(ins=ins, outs=outs).get(+                other_node_type_iloc, defaultdict(list)+            )[node_iloc]

What's the motivation for returning empty, rather than emitting an error?

kjun9

comment created time in 5 days

push eventstellargraph/stellargraph

Huon Wilson

commit sha 8c67594d98f2559ee5ff3346b67336d31929ba90

Add queue override to python 3.8 testing too

view details

push time in 5 days

push eventstellargraph/stellargraph

Huon Wilson

commit sha e84e4088e29d8c0d8438c00259e5b0d1352391e0

Add skipifs to reproducibility tests

view details

push time in 5 days

Pull request review commentstellargraph/stellargraph

Make metapath walk faster by grouping adjacency lists by node type

 def f(): @pytest.mark.parametrize(     "num_nodes,num_edges", [(100, 200), (1000, 5000), (20000, 100000)] )-@pytest.mark.parametrize("force_adj_lists", ["directed", "undirected"])+@pytest.mark.parametrize(

In order to capture some of the performance of the filtering in the inner loop, I wonder if it's worth also parameterising this benchmark by the number of node types, e.g.

@pytest.mark.parametrize("n_types", [1, 4, 20])

and passing this through to example_benchmark_graph.

(Potentially with an if force_adj_lists in ("directed", "undirected") and n_types != 1: pytest.skip("...") guard to avoid spending time on retesting uninteresting things.)

kjun9

comment created time in 5 days

Pull request review commentstellargraph/stellargraph

Make metapath walk faster by grouping adjacency lists by node type

 def degrees(self, *, ins=True, outs=True):         adj = self._adj_lookup(ins=ins, outs=outs)         return defaultdict(int, ((key, len(value)) for key, value in adj.items())) -    def edge_ilocs(self, node_id, *, ins, outs) -> np.ndarray:+    def edge_ilocs(+        self, node_iloc, *, ins, outs, other_node_type_iloc=None+    ) -> np.ndarray:         """-        Return the integer locations of the edges for the given node_id+        Return the integer locations of the edges for the given node iloc          Args:-            node_id: the ID of the node+            node_iloc: the iloc of the node           Returns:-            The integer locations of the edges for the given node_id.+            The integer locations of the edges for the given node iloc.         """ -        return self._adj_lookup(ins=ins, outs=outs)[node_id]+        if other_node_type_iloc is not None:+            return self._adj_lookup_by_other_node_type(ins=ins, outs=outs).get(+                other_node_type_iloc, defaultdict(list)+            )[node_iloc]

I take it the get+defaultdict(list) here is in case there are node types that don't have any incoming edges or any outgoing edges (e.g. all edges might be a->b)? I'm slightly concerned that creating a defaultdict for every call may be adding noticeable overhead, and it's also returning a list rather than a NumPy array, so consumers may get confused (and/or have to spend more work converting types).

One option might be pre-filling the _edges..._dict_by_other_node_type values with every node type, e.g.:

class EmptyFlatAdjacencyList:
    def __init__(self, dtype):
        self.empty = np.array([], dtype=dtype)

    def __getitem__(self, idx):
        return self.empty

Or just creating an O(nodes) FlatAdjacencyList, like FlatAdjacencyList(np.array([], dtype=...), np.zeros(len(nodes) + 1))

Another option would be moving the defaulting to a separate code path, like:

adj_list = self._adj_lookup_by_other_node_type(ins=ins, outs=outs).get(other_node_type_iloc)
if adj_list is None:
    return np.array([], dtype=...)
return adj_list[node_iloc]
kjun9

comment created time in 5 days

Pull request review commentstellargraph/stellargraph

Make metapath walk faster by grouping adjacency lists by node type

 def _create_undirected_adj_lists(self):             filtered_targets = filtered_targets[:-num_self_loops]          flat_array %= num_edges-        neigh_counts = np.bincount(self.sources, minlength=self.number_of_nodes)-        neigh_counts += np.bincount(filtered_targets, minlength=self.number_of_nodes)-        splits = np.zeros(len(neigh_counts) + 1, dtype=dtype)-        splits[1:] = np.cumsum(neigh_counts, dtype=dtype)+        splits = _get_splits(+            [self.sources, filtered_targets], self.number_of_nodes, dtype+        )          return FlatAdjacencyList(flat_array, splits) +    def _init_directed_adj_lists_by_other_node_type(self):+        (+            self._edges_in_dict_by_other_node_type,+            self._edges_out_dict_by_other_node_type,+        ) = self._create_directed_adj_lists_by_other_node_type()++    def _create_directed_adj_lists_by_other_node_type(self):+        # record the edge ilocs of incoming, outgoing and both-direction edges+        source_types = self.node_data.type_ilocs[self.sources]+        target_types = self.node_data.type_ilocs[self.targets]++        def _to_directed_adj_list(arr, other_node_types):+            index = {}++            # edge ilocs sorted in order of node iloc+            flat = np.argsort(arr).astype(self._id_index.dtype, copy=False)++            for other_node_type in np.unique(other_node_types):++                # filter node ilocs based on other node type+                arr_filtered = arr[other_node_types == other_node_type]+                splits = _get_splits(+                    [arr_filtered], self.number_of_nodes, self._id_index.dtype+                )++                # filter edge ilocs based on other node type+                flat_filtered = flat[other_node_types[flat] == other_node_type]++                index[other_node_type] = FlatAdjacencyList(flat_filtered, splits)++            return index++        return (+            _to_directed_adj_list(self.targets, source_types),+            
_to_directed_adj_list(self.sources, target_types),+        )++    def _init_undirected_adj_lists_by_other_node_type(self):+        self._edges_dict_by_other_node_type = (+            self._create_undirected_adj_lists_by_other_node_type()+        )++    def _create_undirected_adj_lists_by_other_node_type(self):

This function is... impressive. It definitely seems super hard to phrase some of this.

kjun9

comment created time in 5 days

Pull request review commentstellargraph/stellargraph

Make metapath walk faster by grouping adjacency lists by node type

 def _create_undirected_adj_lists(self):             filtered_targets = filtered_targets[:-num_self_loops]          flat_array %= num_edges-        neigh_counts = np.bincount(self.sources, minlength=self.number_of_nodes)-        neigh_counts += np.bincount(filtered_targets, minlength=self.number_of_nodes)-        splits = np.zeros(len(neigh_counts) + 1, dtype=dtype)-        splits[1:] = np.cumsum(neigh_counts, dtype=dtype)+        splits = _get_splits(+            [self.sources, filtered_targets], self.number_of_nodes, dtype+        )          return FlatAdjacencyList(flat_array, splits) +    def _init_directed_adj_lists_by_other_node_type(self):+        (+            self._edges_in_dict_by_other_node_type,+            self._edges_out_dict_by_other_node_type,+        ) = self._create_directed_adj_lists_by_other_node_type()++    def _create_directed_adj_lists_by_other_node_type(self):+        # record the edge ilocs of incoming, outgoing and both-direction edges+        source_types = self.node_data.type_ilocs[self.sources]+        target_types = self.node_data.type_ilocs[self.targets]++        def _to_directed_adj_list(arr, other_node_types):+            index = {}++            # edge ilocs sorted in order of node iloc+            flat = np.argsort(arr).astype(self._id_index.dtype, copy=False)++            for other_node_type in np.unique(other_node_types):++                # filter node ilocs based on other node type+                arr_filtered = arr[other_node_types == other_node_type]

Because of the various whole-array filters here, this loop is O(edges * node types + nodes * node types); it's probably fine, assuming there's not many node types, but we could potentially consider other approaches, e.g.

  • doing a Pandas groupby, which might be O(edges + nodes * node types), although it'll likely have more overhead, e.g.
    for other_node_type, flat_filtered in pd.Series(flat).groupby(other_node_types):
        splits = _get_splits([arr[flat_filtered]], self.number_of_nodes, self._id_index.dtype)
        index[other_node_type] = FlatAdjacencyList(flat_filtered, splits)
    
  • sorting other_node_types (and the other bits too) so that the index returned by np.unique(other_node_types, return_index=True) gives the first index of each node type, and thus we can do O(1) slicing instead of O(edges) comparisons

Both of these are likely to have more overhead and potentially be a bit fiddly (I don't know if the first one is the right code), so I don't know if they'll make sense. I don't know if we have a good "real world" dataset with many node types to test this, but this may also indicate that it's not actually a concern (knowledge graphs can end up with thousands of edge types... but only one node type).
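The second option (sort once so each type occupies a contiguous run, then slice) might look something like this sketch, with made-up stand-in data rather than the PR's real arrays:

```python
import numpy as np

# Hypothetical stand-in data: the node type of the "other" endpoint of each
# edge, and the corresponding edge ilocs.
other_node_types = np.array([1, 0, 2, 0, 1, 0])
flat = np.array([10, 11, 12, 13, 14, 15])

# Sort everything by node type once...
order = np.argsort(other_node_types, kind="stable")
sorted_flat = flat[order]

# ...so each type occupies one contiguous run, and return_index gives the
# start of each run: O(1) slicing instead of an O(edges) comparison per type.
types, first_index = np.unique(other_node_types[order], return_index=True)
bounds = np.append(first_index, len(order))

per_type = {
    t: sorted_flat[start:stop]
    for t, start, stop in zip(types, bounds[:-1], bounds[1:])
}
```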

Thoughts?

kjun9

comment created time in 5 days

Pull request review commentstellargraph/stellargraph

Make metapath walk faster by grouping adjacency lists by node type

 def _create_undirected_adj_lists(self):             filtered_targets = filtered_targets[:-num_self_loops]          flat_array %= num_edges-        neigh_counts = np.bincount(self.sources, minlength=self.number_of_nodes)-        neigh_counts += np.bincount(filtered_targets, minlength=self.number_of_nodes)-        splits = np.zeros(len(neigh_counts) + 1, dtype=dtype)-        splits[1:] = np.cumsum(neigh_counts, dtype=dtype)+        splits = _get_splits(+            [self.sources, filtered_targets], self.number_of_nodes, dtype+        )          return FlatAdjacencyList(flat_array, splits) +    def _init_directed_adj_lists_by_other_node_type(self):+        (+            self._edges_in_dict_by_other_node_type,+            self._edges_out_dict_by_other_node_type,+        ) = self._create_directed_adj_lists_by_other_node_type()++    def _create_directed_adj_lists_by_other_node_type(self):+        # record the edge ilocs of incoming, outgoing and both-direction edges+        source_types = self.node_data.type_ilocs[self.sources]+        target_types = self.node_data.type_ilocs[self.targets]++        def _to_directed_adj_list(arr, other_node_types):

If you felt like it, this nested method could become a top level method, like:

def _to_directed_adj_list(self, node_ilocs, other_node_ilocs):
    ...

def _create_directed_adj_lists_by_other_node_type(self):
    return (
        self._to_directed_adj_list(self.targets, self.sources),
        self._to_directed_adj_list(self.sources, self.targets),
    )
kjun9

comment created time in 5 days

Pull request review commentstellargraph/stellargraph

Make metapath walk faster by grouping adjacency lists by node type

 def _create_undirected_adj_lists(self):             filtered_targets = filtered_targets[:-num_self_loops]          flat_array %= num_edges-        neigh_counts = np.bincount(self.sources, minlength=self.number_of_nodes)-        neigh_counts += np.bincount(filtered_targets, minlength=self.number_of_nodes)-        splits = np.zeros(len(neigh_counts) + 1, dtype=dtype)-        splits[1:] = np.cumsum(neigh_counts, dtype=dtype)+        splits = _get_splits(+            [self.sources, filtered_targets], self.number_of_nodes, dtype+        )          return FlatAdjacencyList(flat_array, splits) +    def _init_directed_adj_lists_by_other_node_type(self):+        (+            self._edges_in_dict_by_other_node_type,+            self._edges_out_dict_by_other_node_type,+        ) = self._create_directed_adj_lists_by_other_node_type()++    def _create_directed_adj_lists_by_other_node_type(self):+        # record the edge ilocs of incoming, outgoing and both-direction edges+        source_types = self.node_data.type_ilocs[self.sources]+        target_types = self.node_data.type_ilocs[self.targets]++        def _to_directed_adj_list(arr, other_node_types):+            index = {}++            # edge ilocs sorted in order of node iloc+            flat = np.argsort(arr).astype(self._id_index.dtype, copy=False)++            for other_node_type in np.unique(other_node_types):++                # filter node ilocs based on other node type+                arr_filtered = arr[other_node_types == other_node_type]+                splits = _get_splits(+                    [arr_filtered], self.number_of_nodes, self._id_index.dtype+                )++                # filter edge ilocs based on other node type+                flat_filtered = flat[other_node_types[flat] == other_node_type]

The double indexing here is potentially suboptimal, especially because other_node_types[flat] doesn't change between loop iterations but does have to be recomputed each time. I wonder if there's another way we could phrase this loop to be slightly more efficient even if we don't completely switch it:

            other_node_types = other_node_types[flat]

            for other_node_type in np.unique(other_node_types):
                # filter edge ilocs based on other node type
                flat_filtered = flat[other_node_types == other_node_type]
                # choose the relevant node ilocs based on the edge ilocs
                arr_filtered = arr[flat_filtered]
                splits = _get_splits(
                    [arr_filtered], self.number_of_nodes, self._id_index.dtype
                )

(Not a GitHub suggestion because the rendering is really ugly.)

This reduces us from doing two O(edges) comparisons to doing one O(edges) one, and one O(selected edges) one. This doesn't change the overall asymptotics of the loop but should reduce the overhead. (Moving the other_node_types[flat] out of the loop could be applied even without changing the order of flat_filtered and arr_filtered.)

kjun9

comment created time in 5 days

Pull request review commentstellargraph/stellargraph

Make metapath walk faster by grouping adjacency lists by node type

 def _create_undirected_adj_lists(self):             filtered_targets = filtered_targets[:-num_self_loops]          flat_array %= num_edges-        neigh_counts = np.bincount(self.sources, minlength=self.number_of_nodes)-        neigh_counts += np.bincount(filtered_targets, minlength=self.number_of_nodes)-        splits = np.zeros(len(neigh_counts) + 1, dtype=dtype)-        splits[1:] = np.cumsum(neigh_counts, dtype=dtype)+        splits = _get_splits(+            [self.sources, filtered_targets], self.number_of_nodes, dtype+        )          return FlatAdjacencyList(flat_array, splits) +    def _init_directed_adj_lists_by_other_node_type(self):+        (+            self._edges_in_dict_by_other_node_type,+            self._edges_out_dict_by_other_node_type,+        ) = self._create_directed_adj_lists_by_other_node_type()++    def _create_directed_adj_lists_by_other_node_type(self):+        # record the edge ilocs of incoming, outgoing and both-direction edges+        source_types = self.node_data.type_ilocs[self.sources]+        target_types = self.node_data.type_ilocs[self.targets]++        def _to_directed_adj_list(arr, other_node_types):

The peak memory use may be able to be reduced by creating the ..._types arrays more transiently, so that they don't both exist at once. In particular, I think they can probably be moved inside the helper method:

        def _to_directed_adj_list(node_ilocs, other_node_ilocs):
            other_node_types = self.node_data.type_ilocs[other_node_ilocs]

And then call it like _to_directed_adj_list(self.targets, self.sources) (or vice versa) below.

kjun9

comment created time in 5 days

Pull request review commentstellargraph/stellargraph

Make metapath walk faster by grouping adjacency lists by node type

 def items(self):             yield (idx, self[idx])  +def _get_splits(node_ilocs_list, number_of_nodes, dtype):+    neigh_counts = sum(+        np.bincount(ilocs, minlength=number_of_nodes) for ilocs in node_ilocs_list+    )

I think this will allocate len(node_ilocs_list) extra arrays, because it's not mutating in place like the old form did (in particular it'll do something like (0 + bincount) + bincount, and each of those + will create a whole new array). We could resolve this with something more complicated like:

    neigh_counts =  np.bincount(node_ilocs_list[0], minlength=number_of_nodes)
    for ilocs in node_ilocs_list[1:]:
        neigh_counts += np.bincount(ilocs, minlength=number_of_nodes)

But, based on the benchmarks not really changing, I don't think this is worth it, especially because this would be entirely better using https://github.com/numpy/numpy/issues/7998.
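For context, the splits that _get_splits builds are CSR-style offsets. A minimal stand-alone sketch of the bincount + cumsum pattern and how the offsets are consumed, using hypothetical data rather than the PR's code:

```python
import numpy as np

# Hypothetical data: the source node iloc of each of 5 edges, in a 4-node graph.
number_of_nodes = 4
sources = np.array([0, 0, 2, 3, 2])

# Per-node neighbour counts become CSR-style offsets, so node i's entries
# live at flat[splits[i]:splits[i + 1]].
neigh_counts = np.bincount(sources, minlength=number_of_nodes)
splits = np.zeros(number_of_nodes + 1, dtype=np.int64)
splits[1:] = np.cumsum(neigh_counts)

flat = np.argsort(sources, kind="stable")  # edge ilocs grouped by source node
node_2_edges = flat[splits[2]:splits[3]]   # the edge ilocs with source 2
```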

kjun9

comment created time in 5 days

pull request commentstellargraph/stellargraph

Change input data to (batch, nodes, timesteps) in GCN-LSTM

Hi @habiba-h, similar to #1580, I'd appreciate your input here. Does this change seem reasonable? Is there some key aspect of time series that means we should retain the old ordering?

huonw

comment created time in 5 days

pull request commentstellargraph/stellargraph

Support multivariate time series in GCN-LSTM

Hi @habiba-h, if you get a chance, I'd appreciate it if you could look over the PR description and let me know if what I've described seems like a reasonable approach to generalising to multivariate input series. (Feel free to also look over the code 😄 )

huonw

comment created time in 5 days

push eventstellargraph/stellargraph

Huon Wilson

commit sha 22a92d2c65dc29b91396778002b26bfde819aee2

Allow StellarGraph construction from internal NodeData/EdgeData (#1545) This allows us to construct a `StellarGraph` without going via Pandas DataFrames: construct an `EdgeData` and `NodeData` in some better way, and then pass them into `StellarGraph`. This opens the possibility of an efficient `StellarGraph.from_numpy` function, as well as potentially more efficient internal construction of StellarGraphs (such as in `subgraph`). This isn't documented, because it's internal-focused, and downstream users should generally not be working with a `NodeData` or `EdgeData` instance at the moment. See: #1524

view details

Huon Wilson

commit sha f5e519be3104f4682b25c973bb5fa5d3229446c3

Add node/edge type iloc conversions to StellarGraph (#1366) This extends the ilocs we expose from just node ilocs, to ilocs for node and edge types as well. This is helpful for algorithms that manipulate these types, such as: - knowledge graphs models, which often compute embeddings per-type, and thus ilocs are great as indices - HinSAGE and Metapath2Vec, which can, theoretically, do comparison and manipulation of integers rather than more complicated types like strings (potentially including using lists/arrays instead of dicts, for any mappings with node/edge types as keys). See: #1237

view details

Huon Wilson

commit sha 9b81deb26191f226a1a213a1c304a3fbe03c45fc

Implement RotatE (#1522) This implements the core of the RotatE algorithm from "RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space" Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, Jian Tang https://arxiv.org/abs/1902.10197 That core is the scoring function and model, as well as some basic tests. I intend to generalise this model to share more code with `ComplEx` and `DistMult` (#1549). Documentation can happen as part of that effort. #1550 tracks adding a demo notebook, which can happen in parallel with #1549. See: #1521

view details

Daokun Zhang

commit sha 08396d057d2306604ddb7e1ac54c07a772c1c1f0

enhance documentation for keras node2vec

view details

Kieran Ricardo

commit sha ef81955da232007bd7d2c731308515dd783aeac3

Ehancement/1238 no zip np array (#1253) This PR adds the following methods: - `StellarGraph.edges_arrays` - `StellarGraph.neighbors_arrays` - `StellarGraph.in_nodes_arrays` - `StellarGraph.out_nodes_arrays` which return a tuple of np.arrays instead of the list of tuples returned by the old corresponding methods without the `_arrays` suffix. The old methods now simply wrap the new `_arrays` methods to preserve the API. See #1238 * _transform edges no returns a tuple of numpy arrays * edges now returns a tuple of numpy arrays * fixes * fix tests * fix tests * fix tests * formatting * update doc strings * fix tests * add compatability for passing edgelists from edge splitter * add in `edge_arrays` function * add to_edges function in EdgeList * added .*_arrays methods for neighbor methods * docstring update * docstring update * docstring update * docstring update * backwards compatability * test new conversion methods * Update tests/data/test_edge_splitter.py Co-Authored-By: kevin <33508488+kjun9@users.noreply.github.com> * Update tests/datasets/test_datasets.py Co-Authored-By: kevin <33508488+kjun9@users.noreply.github.com> * fix * revert fullbatch tests back * revert tests back * revert demo * removed edgelist * Update stellargraph/core/graph.py Co-authored-by: kevin <33508488+kjun9@users.noreply.github.com> * doc string update * removed redundant code * merge fixes * reverted files back * reverted files back * reverted files back * docstring fixes * fix typo * samples use neighbor arrays * benchmark uses neighbor arrays Co-authored-by: kevin <33508488+kjun9@users.noreply.github.com>

view details

Daokun Zhang

commit sha e2086659d8b1c3921978cec6749a78699e6f522a

Merge pull request #1566 from stellargraph/bugfix/keras_node2vec enhance documentation for keras node2vec

view details

Huon Wilson

commit sha d5b55e31c2034b63d92c15541ed8a3e1a0f861a3

Add a notebook capturing StellarGraph class resource usage (#1547) In many PRs, we've recorded the time and/or memory usage of several real-world datasets, such as #1220, #1284, #1291, #1535. This was done in an ad-hoc way, passing around various versions of a notebook as gists, e.g. 1. initial version: https://gist.github.com/huonw/630907bf02f4ff443bb3d83139bb32aa 2. version with better source/target memory usage: https://gist.github.com/kieranricardo/536776d95bfbc56e8ab709f1d261cb50 3. version with end-to-end memory usage via `tracemalloc`: https://gist.github.com/huonw/deb88a9179cfb588f9d5fff59955f56d This PR expands the last version to something that's appropriate to add to the repo, so that it's easily accessible. The notebook records both time and both long term and peak memory usage for constructing five graphs: - Cora - BlogCatalog3 - FB15k with and without edge types - reddit (optional, if manually downloaded from http://snap.stanford.edu/graphsage/#datasets) These have a variety of sizes and "shapes" (Cora and reddit are homogeneous, BlogCatalog3 has multiple node types and edge types, FB15k has multiple edge types), although there's no heterogeneous graph with node features. The memory usage is measured with [the `tracemalloc` Python stdlib module](https://docs.python.org/3/library/tracemalloc.html). This counts every single allocation, including ones from NumPy. The notebook loads the data freshly from disk when tracing allocations, and thus captures every piece of memory that goes into creating the raw data and the `StellarGraph` object. After creating the `StellarGraph`, the notebook cleans up the input data, leaving only the graph instance. At various points, the peak memory usage and snapshots of the individual allocations are recorded. 
This allows computing various metrics: - memory usage of the memory controlled by the `StellarGraph` instance itself (`memory (graph)`) - memory usage of the _new_ memory controlled by the instance, that isn't shared with the raw data (`memory (graph, not shared with data)`) - peak memory usage during construction of the graph (`peak memory (graph)`) - peak memory usage during construction, ignoring the base memory usage of the nodes/edges input data (`peak memory (graph, ignoring data)`) - base memory usage of input data (`memory (data)`) - as a check, peak memory usage when loading the input data (`peak memory (data)`); if this is equal to `peak memory (graph)` it likely means that the peak memory usage during graph construction is less than loading data, and so the `peak memory (graph)` is no longer meaningful. All of this is more accurate than using a retroactive analysis like [`asizeof` from `pympler`](https://pythonhosted.org/Pympler/asizeof.html#asizeof), because it correctly counts allocations from NumPy arrays once and only once. Pympler can miss some memory: for `asizeof(obj)`, suppose `obj` contains an array that is computed as a slice of another one (so `array.flags.owndata == False`), but doesn't contain the parent/base array (`array.base`). `asizeof` seems to count `array` as 0, and thus a potentially-large allocation isn't being counted. (Gist 2 above was attempting to fix this by inserting `np.copy`s at appropriate places, but this is a fragile/inaccurate battle.) This notebook is designed to be used by StellarGraph contributors, and so is data-dense without too much explanation, and is placed into a new `demos/zzz-internal-developers` directory. It is still added to `demos` and our RtD, so that it runs on CI and for consistency. See: #1544
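The tracemalloc pattern the notebook relies on is simple at its core; a minimal sketch, with a toy allocation standing in for graph construction:

```python
import tracemalloc

# Trace every allocation, then read back both the currently-held and the
# peak traced memory, as the notebook does around graph construction.
tracemalloc.start()

data = [list(range(1000)) for _ in range(100)]  # stand-in for building a graph

current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
```

`current` corresponds to the long-term memory still held afterwards, while `peak` captures transient spikes during construction.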

view details

Huon Wilson

commit sha ee3bd69435ba0a7b630dd0ae566f095c99167e66

Use ilocs end-to-end for knowledge graph ranking (#1367) With #1366, there are native and supported APIs for working with edge type ilocs. This allows us to use ilocs end-to-end in knowledge graph ranking. This both simplifies the code and makes it faster, e.g. ranking is performed in the DistMult notebook approximately as follows (the model doesn't need to be trained, because we don't care about the actual ranks, just the time to compute): ```python import stellargraph as sg import tensorflow as tf def run(dataset): graph, train, test, valid = dataset.load() gen = sg.mapper.KGTripleGenerator(graph, batch_size=5000) test_seq = gen.flow(test) model = sg.layer.DistMult(gen, embedding_dimension=100) tf.keras.Model(*model.in_out_tensors()).compile("adam", loss="mse") %time model.rank_edges_against_all_nodes(test_seq, graph) run(sg.datasets.WN18()) run(sg.datasets.FB15k()) ``` Results: | dataset | pre-ilocs f2a96aab (s) | develop f5e519be (s) | this PR (s) | |---------|------------------------|----------------------|-------------| | WN18 | 4.99 | 5.87 | 3.80 | | FB15k | 42.6 | 59.2 | 20.7 | That is, it's somewhere between 1.5-3× faster for DistMult on these datasets.

view details

Huon Wilson

commit sha cc15f176c6658d122d30cf7af3e08d3e139b3974

Add a glossary (#1570) This starts a glossary with a few terms. It's easy to append to it. See: #1268

view details

Kieran Ricardo

commit sha 65930c47ebc274f93555f6ac2207a8565ea99ba5

Store adj dictionaries as one contiguous np array (#1296) This PR stores adj lists in one contiguous numpy array. For each adj dict, this PR instead stores the edge ilocs in one contiguous numpy array sorted by the index (source/target node for in/out dicts). This causes significant memory savings and faster constructor time. On our largest benchmarks compared to develop: - Undirected adj lists are 13x smaller, 10x smaller peak, and are 16x faster to construct - Directed adj lists are 7x smaller, 4x smaller peak, and 8x faster to construct - The only other benchmark affected is get_neighbors; it looks like this PR might potentially slow this down by 5%, but the error bars of this PR and develop overlap * fix * make use of node ilocs * code climate suggestion * use ilocs in tests * improve construction time * fix issues with not storing isolated nodes * numpy-ised directed adj list creation * handle no edges * numpy-ise undirected adj list creation * format commments better * fix _edges_out_dict bug + element data test * Update stellargraph/core/element_data.py Co-authored-by: Huon Wilson <Huon.Wilson@data61.csiro.au> * Update stellargraph/core/element_data.py Co-authored-by: Huon Wilson <Huon.Wilson@data61.csiro.au> * reverted test back * use bincount + make things more inplace * do things even more inplace * fix: actually sort in place * memory optimizations * document FlatAdjacencyList * make dtype public * refactor * refactor * refactor * put ilocs back in tests * fix index issues * fix index issues * Update stellargraph/core/element_data.py Co-authored-by: Huon Wilson <Huon.Wilson@data61.csiro.au> * Update stellargraph/core/element_data.py Co-authored-by: Huon Wilson <Huon.Wilson@data61.csiro.au> * review suggestion - remove unecessary type caste * review suggestion - use dtype not _dtype * review suggestion - add isolated node explicitly * review suggestion - add test and comment for `dtype` edge case (no pun intended) * black * remove 0 case because max
called is removed * Update tests/core/test_graph.py Co-authored-by: Huon Wilson <Huon.Wilson@data61.csiro.au> * Update stellargraph/core/element_data.py Co-authored-by: Huon Wilson <Huon.Wilson@data61.csiro.au> * Update stellargraph/core/element_data.py Co-authored-by: Huon Wilson <Huon.Wilson@data61.csiro.au> * add 0 to splits * fix merge error * Update tests/core/test_graph.py Co-authored-by: Huon Wilson <Huon.Wilson@data61.csiro.au> * updated FlatAdjacencyList doc string * remove get and index checks in FlatAdajcencyList * Update tests/core/test_graph.py Co-authored-by: Huon Wilson <Huon.Wilson@data61.csiro.au> * non-neg iloc check in adj list Co-authored-by: Huon Wilson <Huon.Wilson@data61.csiro.au>
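The contiguous layout described above can be sketched in plain NumPy. Note this is an illustrative reconstruction, not StellarGraph's actual `FlatAdjacencyList` code: edge ilocs are grouped by source node via a stable sort, and a `splits` array delimits each node's slice, so looking up neighbours is a zero-copy slice instead of a dict access.

```python
import numpy as np


def build_flat_adj(sources, num_nodes):
    """Hypothetical sketch of a CSR-style flat adjacency list: edge ilocs
    grouped by source node, plus split points delimiting each node's run."""
    sources = np.asarray(sources)
    # edge ilocs reordered so edges with the same source are contiguous
    flat = np.argsort(sources, kind="stable")
    # number of edges per node, then cumulative split points (with leading 0)
    counts = np.bincount(sources, minlength=num_nodes)
    splits = np.concatenate([[0], np.cumsum(counts)])
    return flat, splits


def edge_ilocs(flat, splits, node):
    # edge ilocs whose source is `node`
    return flat[splits[node] : splits[node + 1]]
```

For example, with edge sources `[2, 0, 2, 1]` over 4 nodes, node 2's edge ilocs are `[0, 2]` and node 3 (isolated) gets an empty slice, without needing a per-node allocation.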

view details

Huon Wilson

commit sha ee2130499eecdec72ad57863a06fa5fda8521199

Add IndexedArray and allow construction from it and Numpy array (#1556) This adds a very basic data frame class, `IndexedArray`, which consists of:

- a (multidimensional) NumPy array `values`, where the first axis corresponds to "rows" (i.e. `values[0, ...]` is the first row, `values[1, ...]` is the second, etc.)
- an index of row labels, `index`, with one per row, so `index[idx]` is the label for the `values[idx, ...]` row.

This class is generally much more limited than a Pandas DataFrame (`IndexedArray` basically exists to be a nicer-to-use `(values, index)` tuple), but also has _much_ less overhead. For instance, the following shows the cost of constructing the two classes with or without data:

```python
import stellargraph as sg
import pandas as pd
import numpy as np

# empty
%timeit -n 100 sg.IndexedArray()
# 2.69 µs ± 129 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit -n 100 pd.DataFrame()
# 365 µs ± 24.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# data
N = 10000
idx = np.random.randint(2**63 - 1, size=N)
data = np.random.rand(N, 100)
%timeit -n 100 sg.IndexedArray(data, index=idx)
# 1.14 µs ± 40.2 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit -n 100 pd.DataFrame(data, index=idx)
# 144 µs ± 4.12 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

By allowing construction from this `IndexedArray` type (and a NumPy array too, by assuming that the IDs are 0, 1, 2, ...), we can reduce the overhead of building a `StellarGraph` object (#1336). The rows correspond to nodes. Construction from a NumPy array/`IndexedArray` also allows for more precise control over memory usage, because the feature arrays are passed around directly, without copies. In addition, and more importantly, it opens the possibility of multidimensional node features (#1524), where each node has an associated matrix or higher-rank object.

For a specific example: multivariate time series are most naturally modelled as an array of shape `(num time steps, num variates)` (or vice versa). `IndexedArray` supports multidimensional rows in this PR, by passing in a high-rank array. However, `StellarGraph` does not, because there are a few places that assume vector-per-node features. Changing these would be better handled independently of this core change because it may involve some deprecations and other changes (#1561).

General notes:

- This is a more limited form of http://xarray.pydata.org/ , but I decided that (a) it was probably good to not add another dependency unless required, and (b) its overhead is similar to Pandas:

```python
import xarray as xr

%timeit -n 100 xr.DataArray()
# 17.7 µs ± 825 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit -n 100 xr.DataArray(data, coords={"dim_0": idx})
# 148 µs ± 12.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

- We should have a demo discussing this: #1559
- This doesn't think about how to handle edges via NumPy at all (so we've still got Pandas overhead in any `StellarGraph` construction), although it will be applicable to edge features (#1327).

See: #1524

## Performance

With the benchmarks from #1558:

Construction time (`test_benchmark_creation`), median, in ms:

| features | nodes | edges  | Pandas | IndexedArray | speedup |
|----------|-------|--------|--------|--------------|---------|
| None     | 0     | 0      | 3.77   | 1.28         | 3.0×    |
| None     | 20000 | 100000 | 5.36   | 2.90         | 1.8×    |
| 100      | 20000 | 100000 | 17.3   | 2.86         | 6.0×    |

(It's potentially unexpected that the `IndexedArray` `None-20000-100000` case is slower than `100-20000-100000`, but there's nothing that depends on the feature size in the `IndexedArray` code path, and they're well within error range of each other.)

Long-term memory usage (`test_allocation_benchmark_creation`), median, in KB:

| features | nodes | edges | Pandas | IndexedArray | change        |
|----------|-------|-------|--------|--------------|---------------|
| None     | 0     | 0     | 7.0    | 5.3          | -1.7 (-24%)   |
| None     | 1000  | 5000  | 64.1   | 58.5         | -5.6 (-9%)    |
| 100      | 1000  | 5000  | 459.6  | 58.5         | -401.1 (-87%) |

This reflects the features being guaranteed-shared with the original array, meaning that having features vs. not having features are the same. The memory cost of the `StellarGraph` object is just the graph structure.

Peak memory usage during construction (`test_allocation_benchmark_creation_peak`), median, in KB:

| features | nodes | edges | Pandas | IndexedArray | change        |
|----------|-------|-------|--------|--------------|---------------|
| None     | 0     | 0     | 9.0    | 7.2          | -1.8 (-20%)   |
| None     | 1000  | 5000  | 84.2   | 78.3         | -5.9 (-7%)    |
| 100      | 1000  | 5000  | 703.5  | 78.3         | -625.2 (-89%) |

Real world datasets (using an extended form of the notebook from #1547), formatted like `pandas → numpy`:

| dataset | time (s)       | memory (MB) | memory not shared with data (MB) | peak memory (MB) |
|---------|----------------|-------------|----------------------------------|------------------|
| Cora    | 0.020 → 0.0019 | 15.6 → 15.6 | 15.6 → 0.062                     | 46.8 → 32.0      |
| Reddit  | 3.01 → 0.55    | 712 → 712   | 712 → 151                        | 1868 → 1682(*)   |

(*) This is the same as the peak memory required for loading the data, so the true peak for just creating the `StellarGraph` is likely lower.

In summary, construction from a NumPy array/`IndexedArray` is much faster (10× on Cora, 5.5× on Reddit), uses less "new" memory (i.e. doesn't duplicate feature arrays) and has lower peak memory usage.
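At its core the class is just a `(values, index)` pairing. A toy version — an illustrative sketch, not the real StellarGraph implementation, which has more validation — might look like:

```python
import numpy as np


class TinyIndexedArray:
    """Toy sketch of an IndexedArray-like (values, index) pair: a NumPy
    array of rows plus one label per row, with no Pandas machinery."""

    def __init__(self, values=None, index=None):
        self.values = np.empty((0, 0)) if values is None else np.asarray(values)
        # default index: row numbers 0, 1, 2, ...
        self.index = (
            np.arange(len(self.values)) if index is None else np.asarray(index)
        )
        if len(self.index) != len(self.values):
            raise ValueError("need exactly one index label per row")

    def row(self, label):
        # look up a row by its label (first match)
        (locs,) = np.nonzero(self.index == label)
        return self.values[locs[0]]
```

Because `values` is stored as-is, construction is O(1) in the data size and the caller's array is shared, which is where the memory numbers above come from.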

view details

Huon Wilson

commit sha 370745cbc5a7ed2a2ce35a54139ba576968053df

Add custom 404 page (#1577) This uses https://sphinx-notfound-page.readthedocs.io to add a custom 404 error page, so that someone landing on our documentation can attempt to find something useful, rather than having the completely link-less default RtD "maze" 404 page (e.g. https://abasdfdas.readthedocs.io). This gives two suggestions to a user who lands on an invalid page:

- using the sidebar
- doing a search, with a prefilled link built by pulling out the alphanumeric parts of the URL path (ignoring the version section), e.g. `/en/latest/demos/graph-classification/gcn.html` gets turned into a link to a search for `demos graph classification gcn` (a potential improvement here would be to do that search directly in the 404 page)

See: #1576
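The path-to-search-terms step can be approximated with a small function. This is a sketch of the idea, not the actual template logic the extension uses: drop the language/version prefix and the file extension, then keep the alphanumeric runs.

```python
import re


def search_terms(path):
    """Sketch: turn a 404'd docs path like
    /en/latest/demos/graph-classification/gcn.html into search terms."""
    parts = path.strip("/").split("/")
    # ignore the language/version section, e.g. en/latest
    parts = parts[2:]
    words = []
    for part in parts:
        # strip the file extension, then keep alphanumeric runs
        part = re.sub(r"\.\w+$", "", part)
        words.extend(re.findall(r"[A-Za-z0-9]+", part))
    return " ".join(words)
```

For instance, `search_terms("/en/latest/demos/graph-classification/gcn.html")` yields `"demos graph classification gcn"`, which can be dropped into the docs search URL.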

view details

Huon Wilson

commit sha cd41384883b07730069017f15d06fc7ccf4e9f2d

Allow multidimensional features in StellarGraph (#1561) This builds on the construct-from-NumPy support in #1556 to allow node features to be multidimensional. That is, instead of just a 2D matrix of size N × F, corresponding to feature vectors of length F for each of the N nodes, a `StellarGraph` object can now have a feature array of shape N × F1 × F2 × ..., corresponding to feature tensors of shape F1 × F2 × ... for each node. This is useful for multivariate time series, where the most natural modelling of the data is a time_steps × num_variates matrix (or the transpose) for each node. None of our models will likely do anything useful with higher rank feature tensors, at the moment (#1527). They'll all throw errors because `StellarGraph.check_graph_for_ml` calls `StellarGraph.node_feature_sizes` which emits an error for node types with complicated shapes, for backwards compatibility. See: #1524
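The shape convention is easy to see with plain NumPy (constructor details aside, this just illustrates the N × F1 × F2 layout): a graph whose nodes each carry a multivariate time series uses a rank-3 feature array.

```python
import numpy as np

num_nodes, time_steps, num_variates = 5, 10, 3

# one (time_steps × num_variates) feature matrix per node: shape N × F1 × F2
features = np.random.rand(num_nodes, time_steps, num_variates)

# the per-node feature "size" is now a shape, not a single length F
per_node_shape = features.shape[1:]
```

A 2D `features` of shape N × F remains the vector-per-node special case, which is why models that assume vectors error out via `node_feature_sizes` rather than silently misbehaving.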

view details

Huon Wilson

commit sha 4aa30ab28b0c05272957b74782d53ce987245347

Merge remote-tracking branch 'origin/develop' into feature/1227-gpu-testing

view details

push time in 5 days

push eventstellargraph/stellargraph

Huon Wilson

commit sha 370745cbc5a7ed2a2ce35a54139ba576968053df

Add custom 404 page (#1577)

view details

Huon Wilson

commit sha cd41384883b07730069017f15d06fc7ccf4e9f2d

Allow multidimensional features in StellarGraph (#1561)

view details

Huon Wilson

commit sha 613daad6be7b468b6fd82ec14f366892b205cd28

Change input data to (batch, nodes, time) in GCN-LSTM

view details

Huon Wilson

commit sha 9689f159a70b85c07e6a1e583452e5789e45a4f7

Add SlidingFeaturesNodeGenerator for time series/sequence data

view details

Huon Wilson

commit sha 5f9c6bce85fe1c9ec8c75efaa0b64c00275bd2ea

Add basic support for SlidingFeaturesNodeGenerator to GraphConvolutionLSTM

view details

Huon Wilson

commit sha 9425d373467f8cfff1f3634d0e0ade9d3cc271a9

Support multivariate time series in GCN-LSTM

view details

push time in 5 days

delete branch stellargraph/stellargraph

delete branch : ehancement/adj-list-by-type

delete time in 5 days

push eventstellargraph/stellargraph

Huon Wilson

commit sha b1e85efa4d7295f41d5f75e615b214d59b44eeeb

Add extra sections

view details

push time in 5 days

pull request commentstellargraph/stellargraph

Feature/neo4j cluster gcn

I've added the demo, but I'm not sure what you mean by adding an nblink. How do I go about that?

One way is to look at any of the nblink files that already exist and duplicate it, and change the path inside. One of the Neo4j ones will make this easiest, e.g. https://github.com/stellargraph/stellargraph/blob/develop/docs/demos/connector/neo4j/load-cora-into-neo4j.nblink .

There is a scripts/create-nbsphinx-links.sh but it's a bit fragile at the moment.

kieranricardo

comment created time in 5 days

push eventstellargraph/stellargraph

Huon Wilson

commit sha b9d2e2fd94b771e7c267848ca641ccd9952f58a6

Add word list

view details

push time in 5 days

push eventstellargraph/stellargraph

Huon Wilson

commit sha a35feb499fee070ca8464759fbf01fa944aca0be

pass through opts explicitly

view details

push time in 5 days

push eventstellargraph/stellargraph

Huon Wilson

commit sha 370745cbc5a7ed2a2ce35a54139ba576968053df

Add custom 404 page (#1577)

view details

Huon Wilson

commit sha cd41384883b07730069017f15d06fc7ccf4e9f2d

Allow multidimensional features in StellarGraph (#1561)

view details

Huon Wilson

commit sha 613daad6be7b468b6fd82ec14f366892b205cd28

Change input data to (batch, nodes, time) in GCN-LSTM

view details

Huon Wilson

commit sha 9689f159a70b85c07e6a1e583452e5789e45a4f7

Add SlidingFeaturesNodeGenerator for time series/sequence data

view details

Huon Wilson

commit sha 5f9c6bce85fe1c9ec8c75efaa0b64c00275bd2ea

Add basic support for SlidingFeaturesNodeGenerator to GraphConvolutionLSTM

view details

push time in 5 days

push eventstellargraph/stellargraph

Huon Wilson

commit sha 370745cbc5a7ed2a2ce35a54139ba576968053df

Add custom 404 page (#1577)

view details

Huon Wilson

commit sha cd41384883b07730069017f15d06fc7ccf4e9f2d

Allow multidimensional features in StellarGraph (#1561)

view details

Huon Wilson

commit sha 613daad6be7b468b6fd82ec14f366892b205cd28

Change input data to (batch, nodes, time) in GCN-LSTM

view details

push time in 5 days

push eventstellargraph/stellargraph

Huon Wilson

commit sha 0876f269db88f6131fdbb9716320a0236fc2c24d

Run spelling command

view details

push time in 5 days

push eventstellargraph/stellargraph

Huon Wilson

commit sha 845f696908553132b4fb7449f26a0de511f89936

Install enchant

view details

push time in 5 days

create barnchstellargraph/stellargraph

branch : feature/1368-spellcheck-docs

created branch time in 5 days

issue commentstellargraph/stellargraph

Integrate and run on OGB datasets

See: https://github.com/stellargraph/stellargraph/pull/1177#issuecomment-631782568

huonw

comment created time in 5 days

delete branch stellargraph/stellargraph

delete branch : feature/1169-ogb-datasets

delete time in 5 days

pull request commentstellargraph/stellargraph

Add loaders for some OGB datasets

The 1.0.1 versions that we were testing with here failed on CI in the OGB code itself, because that used too much memory for the testing instances, before even getting to the StellarGraph code.

There have been a few recent releases of OGB (see https://github.com/snap-stanford/ogb/releases); I don't know if they will help.

In any case, this is stalled and lower priority for now, so closing.

huonw

comment created time in 5 days

pull request commentstellargraph/stellargraph

Allow the sidebar ToC to be deeper

The theme seems to be building up to a 0.5 release (e.g. https://github.com/readthedocs/sphinx_rtd_theme/releases/tag/0.5.0rc1 was released two weeks ago). That works better for this case, e.g. https://stellargraph--1512.org.readthedocs.build/en/1512/demos/basics/loading-pandas.html


huonw

comment created time in 5 days

delete branch stellargraph/stellargraph

delete branch : feature/faster-doc-builds-via-commonmark

delete time in 5 days

PR closed stellargraph/stellargraph

Try using new nbsphinx_markdown_renderer option for doc builds

This is an experiment with making our doc builds (much) faster via https://github.com/spatialaudio/nbsphinx/pull/455. It's not "production-ready", e.g. there's no mathematics support.

Timing on macOS (via $ make clean && time make html in docs/), compared to the current state of develop (4e706203):

| time | develop (s) | this PR (s) |
|------|-------------|-------------|
| real | 121         | 45          |
| user | 64          | 41          |
| sys  | 41          | 2.0         |

The basic rendering seems reasonable, for instance:

  • https://stellargraph--1517.org.readthedocs.build/en/1517/demos/basics/loading-pandas.html
  • https://stellargraph--1517.org.readthedocs.build/en/1517/demos/node-classification/gcn-node-classification.html

Equations fail, though: https://stellargraph--1517.org.readthedocs.build/en/1517/demos/graph-classification/dgcnn-graph-classification.html#Create-the-Keras-graph-classification-model

The model’s input is the graph represented by its adjacency and node features matrices. The first four layers are Graph Convolutional as in [2] but using the adjacency normalisation from [1], $D^{-1}A$ where $A$ is the adjacency matrix with self loops and $D$ is the corresponding degree matrix. The graph convolutional layers each have 32, 32, 32, 1 units and tanh activations.
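The quoted normalisation, $D^{-1}A$ with self loops, can be computed directly. This is a sketch with a toy 3-node path graph, not code from the notebook:

```python
import numpy as np

# toy adjacency matrix for a 3-node path graph: 0 - 1 - 2
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)

# add self loops: A becomes A + I
a_tilde = adj + np.eye(3)

# degree matrix D from the row sums, then the D^{-1} A normalisation,
# i.e. dividing each row by its degree
degrees = a_tilde.sum(axis=1)
norm_adj = a_tilde / degrees[:, None]
```

Each row of the result sums to 1, so the graph convolution averages (rather than sums) over each node's neighbourhood including itself.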

+3 -1

2 comments

2 changed files

huonw

pr closed time in 5 days

pull request commentstellargraph/stellargraph

Try using new nbsphinx_markdown_renderer option for doc builds

No immediate further progress on this experiment, so closing.

huonw

comment created time in 5 days

push eventstellargraph/stellargraph

Huon Wilson

commit sha cd41384883b07730069017f15d06fc7ccf4e9f2d

Allow multidimensional features in StellarGraph (#1561)

view details

push time in 5 days

delete branch stellargraph/stellargraph

delete branch : feature/1524-multi-dimensional-features

delete time in 5 days

PR merged stellargraph/stellargraph

Reviewers
Allow multidimensional features in StellarGraph

This builds on the construct-from-NumPy support in #1556 to allow node features to be multidimensional. That is, instead of just a 2D matrix of size N × F, corresponding to feature vectors of length F for each of the N nodes, a StellarGraph object can now have a feature array of shape N × F<sub>1</sub> × F<sub>2</sub> × ..., corresponding to feature tensors of shape F<sub>1</sub> × F<sub>2</sub> × ... for each node.

This is useful for multivariate time series, where the most natural modelling of the data is a time_steps × num_variates matrix (or the transpose) for each node.

None of our models will likely do anything useful with higher rank feature tensors, at the moment (#1527). They'll all throw errors because StellarGraph.check_graph_for_ml calls StellarGraph.node_feature_sizes which emits an error for node types with complicated shapes, for backwards compatibility.

See: #1524

+77 -27

1 comment

4 changed files

huonw

pr closed time in 5 days

delete branch stellargraph/stellargraph

delete branch : enhancement/1216-ragged-adj-list

delete time in 6 days

delete branch stellargraph/stellargraph

delete branch : feature/1043-graph-classification

delete time in 6 days

delete branch stellargraph/stellargraph

delete branch : feature/graph/generator

delete time in 6 days

pull request commentstellargraph/stellargraph

Ehancement/adj list by type

What's the status of this PR? Does it relate to #1399?

kieranricardo

comment created time in 6 days

delete branch stellargraph/stellargraph

delete branch : bugfix/1292-rewrite-readme-links

delete time in 6 days

delete branch stellargraph/stellargraph

delete branch : feature/1521-rotate

delete time in 6 days

push eventstellargraph/stellargraph

Huon Wilson

commit sha 519b0761b14e74fea3eb0c727599f96e793ed233

Fix tests properly

view details

push time in 6 days

delete branch stellargraph/stellargraph

delete branch : feature/1237-type-ilocs

delete time in 6 days

delete branch stellargraph/stellargraph

delete branch : feature/1524-from-numpy

delete time in 6 days

pull request commentpython/cpython

bpo-40630: add tracemalloc.reset_peak

I've added it. It's unfortunate that the review cycles meant this missed the 3.9 beta cut off, and so will presumably have to wait a year+ to be released, despite being such a small improvement.
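For context, `reset_peak` makes it possible to measure a peak per phase without stopping and restarting tracing; measuring a single overall peak already works with the long-standing `tracemalloc` APIs, along these lines:

```python
import tracemalloc

tracemalloc.start()

# allocate something measurable
data = [bytes(1000) for _ in range(1000)]
current, peak = tracemalloc.get_traced_memory()
del data

tracemalloc.stop()

# the peak is at least the current size, and includes the ~1 MB allocation
assert peak >= current
assert peak >= 1000 * 1000
```

Without `reset_peak`, the peak only ever grows, so comparing the peaks of two successive phases in one process isn't possible with `get_traced_memory()` alone.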

huonw

comment created time in 6 days

push eventhuonw/cpython

Lumír 'Frenzy' Balhar

commit sha e77d428856fbd339faee44ff47214eda5fb51d57

bpo-40495: compileall option to hardlink duplicate pyc files (GH-19901) compileall is now able to use hardlinks to prevent duplicates in a case when .pyc files for different optimization levels have the same content. Co-authored-by: Miro Hrončok <miro@hroncok.cz> Co-authored-by: Victor Stinner <vstinner@python.org>

view details

Victor Stinner

commit sha 97f33c35445e6d67df24dcbafef7b78333feb778

bpo-40549: posixmodule.c uses defining_class (GH-20075) Pass PEP 573 defining_class to os.DirEntry methods. The module state is now retrieve from defining_class rather than Py_TYPE(self), to support subclasses (even if DirEntry doesn't support subclasses yet). * Pass the module rather than defining_class to DirEntry_fetch_stat(). * Only get the module state once in _posix_clear(), _posix_traverse() and _posixmodule_exec().

view details

Victor Stinner

commit sha f2c3b6823bc4777d4a14eb0c3615b719521f763a

Revert "bpo-32604: [_xxsubinterpreters] Propagate exceptions. (GH-19768)" (GH-20089) * Revert "bpo-40613: Remove compiler warning from _xxsubinterpretersmodule (GH-20069)" This reverts commit fa0a66e62d087765dbc5c1b89d6149a23ecfb0a6. * Revert "bpo-32604: [_xxsubinterpreters] Propagate exceptions. (GH-19768)" This reverts commit a1d9e0accd33af1d8e90fc48b34c13d7b07dcf57.

view details

Victor Stinner

commit sha a482dc500b6ec4889f6a126ba08cbad6c11e37bc

bpo-40602: Write unit tests for _Py_hashtable_t (GH-20091) Cleanup also hashtable.c. Rename _Py_hashtable_t members: * Rename entries to nentries * Rename num_buckets to nbuckets

view details

Pablo Galindo

commit sha bcc30360951a303aa72b0502b77aad2c5f09f30d

bpo-40619: Correctly handle error lines in programs without file mode (GH-20090)

view details

Lysandros Nikolaou

commit sha ce21cfca7bb2d18921bc4ac27cb064726996c519

bpo-40618: Disallow invalid targets in augassign and except clauses (GH-20083) This commit fixes the new parser to disallow invalid targets in the following scenarios: - Augmented assignments must only accept a single target (Name, Attribute or Subscript), but no tuples or lists. - `except` clauses should only accept a single `Name` as a target. Co-authored-by: Pablo Galindo <Pablogsal@gmail.com>
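The augmented-assignment behaviour described here can be checked directly: compiling a tuple target with `+=` raises `SyntaxError` (this is a standalone check, and the old parser rejects it too):

```python
# a tuple is not a valid augmented-assignment target
try:
    compile("a, b += 1", "<test>", "exec")
except SyntaxError:
    rejected = True
else:
    rejected = False

assert rejected
```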

view details

Victor Stinner

commit sha d2dc827d16479d99927a6923a0347199d7c694fb

bpo-40602: _Py_hashtable_set() reports rehash failure (GH-20077) If _Py_hashtable_set() fails to grow the hash table (rehash), it now fails rather than ignoring the error.

view details

Filipe Laíns

commit sha 75d7257b201a56f950c20cd9f5753a83fff4742b

bpo-40548: GitHub Action workflow: skip jobs on doc only PRs (GH-19983) Signed-off-by: Filipe Laíns <lains@archlinux.org>

view details

Victor Stinner

commit sha edf2643bbb9859403239fe1cb3c212b1a2a8e65c

bpo-40460: Fix typo in idlelib/zzdummy.py (GH-20093) Replace ztest with ztext.

view details

Victor Stinner

commit sha 4b972faf605912092013a1fdbf486c498d002926

bpo-40462: Fix typo in test_json (GH-20094)

view details

Joannah Nanjekye

commit sha 7ba1f75f3f02b4b50ac6d7e17d15e467afa36aac

bpo-38872: Document exec symbol for codeop.compile_command (GH-20047) * Document exec symbol for codeop.compile_command * Remove extra statements Co-authored-by: nanjekyejoannah <joannah.nanjekye@ibm.com>

view details

Pablo Galindo

commit sha 16ab07063cb564c1937714bd39d6915172f005b5

bpo-40334: Correctly identify invalid target in assignment errors (GH-20076) Co-authored-by: Lysandros Nikolaou <lisandrosnik@gmail.com>

view details

Filipe Laíns

commit sha 6a78589b6b22878491a4b042bb8b3161e1d120f6

bpo-40548: github actions: pass the changes check on no source changes (GH-20097) Signed-off-by: Filipe Laíns <lains@archlinux.org>

view details

Chris Jerdonek

commit sha 1aa8767baf498a920f0461d1088772a12dcb4d20

Update code comment re: location of struct _is. (GH-20067)

view details

Guido van Rossum

commit sha 15bc9ab301d73f20bff47a12ef05326feb40f797

bpo-40612: Fix SyntaxError edge cases in traceback formatting (GH-20072) This fixes both the traceback.py module and the C code for formatting syntax errors (in Python/pythonrun.c). They now both consistently do the following: - Suppress caret if it points left of text - Allow caret pointing just past end of line - If caret points past end of line, clip to *just* past end of line The syntax error formatting code in traceback.py was mostly rewritten; small, subtle changes were applied to the C code in pythonrun.c. There's still a difference when the text contains embedded newlines. Neither handles these very well, and I don't think the case occurs in practice. Automerge-Triggered-By: @gvanrossum

view details

Chris Jerdonek

commit sha 4a12d121860fb60d56cdcc212817577cac2356d0

Fix typo in code comment in main_loop label. (GH-20068)

view details

Andrew York

commit sha 003708bcf8f2c58d4b65f68318acf164d713e008

Trivial typo fix in _tkinter.c (GH-19622) Change spelling of a #define in _tkinter.c from HAVE_LIBTOMMAMTH to HAVE_LIBTOMMATH, since this is used to keep track of tclTomMath.h, not tclTomMamth.h. No other file seems to refer to this variable.

view details

Victor Stinner

commit sha 6e57237faf0da8904e0130a11350cae3c5062b82

bpo-40055: test_distutils leaves warnings filters unchanged (GH-20095) distutils.tests now saves/restores warnings filters to leave them unchanged. Importing tests imports docutils which imports pkg_resources which adds a warnings filter.

view details

Christian Heimes

commit sha 16d4e6f6f559b4fd21c9d29fea303489f658674f

bpo-40479: Fix hashlib issue with OpenSSL 3.0.0 (GH-20107) OpenSSL 3.0.0-alpha2 was released today. The FIPS_mode() function has been deprecated and removed. It no longer makes sense with the new provider and context system in OpenSSL 3.0.0. EVP_default_properties_is_fips_enabled() is good enough for our needs in unit tests. It's an internal API, too. Signed-off-by: Christian Heimes <christian@python.org>

view details

Christian Heimes

commit sha 62d618c06bd395308b7163dbcb26c7e6d0922033

bpo-40479: Test with latest OpenSSL versions (GH-20108) * 1.0.2u (EOL) * 1.1.0l (EOL) * 1.1.1g * 3.0.0-alpha2 (disabled for now) Build the FIPS provider and create a FIPS configuration file for OpenSSL 3.0.0. Signed-off-by: Christian Heimes <christian@python.org> Automerge-Triggered-By: @tiran

view details

push time in 6 days

push eventstellargraph/stellargraph

Huon Wilson

commit sha ab51a329e50cdbe855aa955d9de5d4db4e12630a

Remove outdated "allow_features"

view details

push time in 6 days

issue commentstellargraph/stellargraph

Node prediction/classification by using a trained model for a new graph with unlabelled nodes

It's been a month without any update, so I'm going to assume that the problem is resolved. If we haven't heard otherwise in a week or so, we'll close. (But, even if we close, feel free to just let us know later, and we can reopen/continue the discussion.)

qillbel

comment created time in 6 days

issue commentstellargraph/stellargraph

Node classification over multiple disconnected graphs

It's been a month without any update, so I'm going to assume that the problem is resolved. If we haven't heard otherwise in a week or so, we'll close. (But, even if we close, feel free to just let us know later, and we can reopen/continue the discussion.)

qillbel

comment created time in 6 days

issue commentstellargraph/stellargraph

MemoryError when creating StellarGraph object with large node_features

It's been 2 months without any update, so I'm going to assume that the problem is resolved or otherwise not relevant. If we haven't heard otherwise in a week or so, we'll close. (But, even if we close, feel free to just let us know later, and we can reopen/continue the discussion.)

lamurian

comment created time in 6 days

issue commentstellargraph/stellargraph

how to train on own dataset for node classification.

It's been 2 months without any update, so I'm going to assume that the problem is resolved. If we haven't heard otherwise in a week or so, we'll close. (But, even if we close, feel free to just let us know later, and we can reopen/continue the discussion.)

vigneshgig

comment created time in 6 days

issue commentstellargraph/stellargraph

dataload() giving error: G, node_subjects = dataset.load() Core has no attribute load

It's been 2 months without any update, so I'm going to assume that the problem is resolved. If we haven't heard otherwise in a week or so, we'll close. (But, even if we close, feel free to just let us know later, and we can reopen/continue the discussion.)

engrAsmasaattar

comment created time in 6 days

Pull request review commentstellargraph/stellargraph

Feature/neo4j cluster gcn

+{

<p>It'd be good to refer to the real Cluster-GCN example ( <code>../../node-classification/cluster-gcn-node-classification.ipynb</code> ) in at least one place in this notebook, similar to how the Neo4j + GraphSAGE ones refer to the main GraphSAGE one.</p> <br/>

Reply via <a href='https://app.reviewnb.com/stellargraph/stellargraph/pull/1567/discussion/'>ReviewNB</a>

kieranricardo

comment created time in 6 days

Pull request review comment stellargraph/stellargraph

Feature/neo4j cluster gcn

+{

<p>demonsrates -> demonstrates</p><p> </p><p>cora -> Cora</p> <br/>

Reply via <a href='https://app.reviewnb.com/stellargraph/stellargraph/pull/1567/discussion/'>ReviewNB</a>

kieranricardo

comment created time in 6 days

push event stellargraph/stellargraph

Huon Wilson

commit sha cc15f176c6658d122d30cf7af3e08d3e139b3974

Add a glossary (#1570)

This starts a glossary with a few terms. It's easy to append to it.

See: #1268

view details

Kieran Ricardo

commit sha 65930c47ebc274f93555f6ac2207a8565ea99ba5

Store adj dictionaries as one contiguous np array (#1296)

This PR stores adj lists in one contiguous numpy array. For each adj dict, this PR instead stores the edge ilocs in one contiguous numpy array sorted by the index (source/target node for in/out dicts). This causes significant memory savings and faster constructor time.

On our largest benchmarks compared to develop:

- Undirected adj lists are 13x smaller, 10x smaller peak, and are 16x faster to construct
- Directed adj lists are 7x smaller, 4x smaller peak, and 8x faster to construct
- The only other benchmark affected is get_neighbors; it looks like this PR might slow it down by 5%, but the error bars of this PR and develop overlap

* fix
* make use of node ilocs
* code climate suggestion
* use ilocs in tests
* improve construction time
* fix issues with not storing isolated nodes
* numpy-ised directed adj list creation
* handle no edges
* numpy-ise undirected adj list creation
* format commments better
* fix _edges_out_dict bug + element data test
* Update stellargraph/core/element_data.py
* Update stellargraph/core/element_data.py
* reverted test back
* use bincount + make things more inplace
* do things even more inplace
* fix: actually sort in place
* memory optimizations
* document FlatAdjacencyList
* make dtype public
* refactor
* refactor
* refactor
* put ilocs back in tests
* fix index issues
* fix index issues
* Update stellargraph/core/element_data.py
* Update stellargraph/core/element_data.py
* review suggestion - remove unecessary type caste
* review suggestion - use dtype not _dtype
* review suggestion - add isolated node explicitly
* review suggestion - add test and comment for `dtype` edge case (no pun intended)
* black
* remove 0 case because max called is removed
* Update tests/core/test_graph.py
* Update stellargraph/core/element_data.py
* Update stellargraph/core/element_data.py
* add 0 to splits
* fix merge error
* Update tests/core/test_graph.py
* updated FlatAdjacencyList doc string
* remove get and index checks in FlatAdajcencyList
* Update tests/core/test_graph.py
* non-neg iloc check in adj list

Co-authored-by: Huon Wilson <Huon.Wilson@data61.csiro.au>
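The contiguous adjacency storage this commit describes can be sketched as a CSR-style layout: one flat array of neighbour ilocs plus a `splits` array marking each node's slice. The class name `FlatAdjacencyList` appears in the commit, but the construction details below (stable argsort, `bincount`, `cumsum`) are illustrative assumptions, not the real implementation:

```python
import numpy as np

class FlatAdjacencyList:
    """Sketch: all neighbour ilocs in one contiguous array, with
    per-node slices located via a splits (prefix-sum) array."""

    def __init__(self, num_nodes, sources, targets):
        # degree of each source node, then prefix sums give slice bounds
        degrees = np.bincount(sources, minlength=num_nodes)
        self.splits = np.zeros(num_nodes + 1, dtype=np.int64)
        np.cumsum(degrees, out=self.splits[1:])
        # stable sort groups each node's neighbours contiguously
        order = np.argsort(sources, kind="stable")
        self.flat = targets[order]

    def neighbors(self, node_iloc):
        start, end = self.splits[node_iloc], self.splits[node_iloc + 1]
        return self.flat[start:end]

# edges: 0->1, 2->0, 0->2, 1->2
adj = FlatAdjacencyList(3, np.array([0, 2, 0, 1]), np.array([1, 0, 2, 2]))
print(adj.neighbors(0))  # [1 2]
```

Because `flat` and `splits` are two dense arrays rather than a dict of lists, memory is contiguous and construction is fully vectorised, which is consistent with the savings the commit reports.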

view details

Huon Wilson

commit sha ee2130499eecdec72ad57863a06fa5fda8521199

Add IndexedArray and allow construction from it and Numpy array (#1556)

This adds a very basic data frame class, `IndexedArray`, which consists of:

- a (multidimensional) NumPy array `values`, where the first axis corresponds to "rows" (i.e. `values[0, ...]` is the first row, `values[1, ...]` is the second, etc.)
- an index of row labels, `index`, with one per row, so `index[idx]` is the label for the `values[idx, ...]` row.

This class is generally much more limited than a Pandas DataFrame (`IndexedArray` basically exists to be a nicer-to-use `(values, index)` tuple), but also has _much_ less overhead. For instance, the following shows the cost of constructing the two classes with or without data:

```python
import stellargraph as sg
import pandas as pd
import numpy as np

# empty
%timeit -n 100 sg.IndexedArray()
# 2.69 µs ± 129 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit -n 100 pd.DataFrame()
# 365 µs ± 24.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# data
N = 10000
idx = np.random.randint(2**63 - 1, size=N)
data = np.random.rand(N, 100)
%timeit -n 100 sg.IndexedArray(data, index=idx)
# 1.14 µs ± 40.2 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit -n 100 pd.DataFrame(data, index=idx)
# 144 µs ± 4.12 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

By allowing construction from this `IndexedArray` type (and a NumPy array too, by assuming that the IDs are 0, 1, 2, ...), we can reduce the overhead of building a `StellarGraph` object (#1336). The rows correspond to nodes.

Construction from a NumPy array/`IndexedArray` also allows for more precise control over memory usage, because the feature arrays are passed around directly, without copies. In addition, and more importantly, it opens the possibility of multidimensional node features (#1524), where each node has an associated matrix or higher-rank object. For a specific example: multivariate time series are most naturally modelled as an array of shape `(num time steps, num variates)` (or vice versa). `IndexedArray` supports multidimensional rows in this PR, by passing in a high-rank array. However, `StellarGraph` does not, because there are a few places that assume vector-per-node features. Changing these would be better handled independently of this core change because it may involve some deprecations and other changes (#1561).

General notes:

- This is a more limited form of http://xarray.pydata.org/ , but I decided that (a) it was probably good to not add another dependency unless required, and (b) its overhead is similar to Pandas:

  ```python
  import xarray as xr
  %timeit -n 100 xr.DataArray()
  # 17.7 µs ± 825 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
  %timeit -n 100 xr.DataArray(data, coords={"dim_0": idx})
  # 148 µs ± 12.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
  ```

- We should have a demo discussing this: #1559
- This doesn't think about how to handle edges via NumPy at all (so we've still got Pandas overhead in any `StellarGraph` construction), although it will be applicable to edge features (#1327).

See: #1524

## Performance

With the benchmarks from #1558:

Construction time (`test_benchmark_creation`), median, in ms:

| features | nodes | edges  | Pandas | IndexedArray | speedup |
|----------|-------|--------|--------|--------------|---------|
| None     | 0     | 0      | 3.77   | 1.28         | 3.0×    |
| None     | 20000 | 100000 | 5.36   | 2.90         | 1.8×    |
| 100      | 20000 | 100000 | 17.3   | 2.86         | 6.0×    |

(It's potentially unexpected that the `IndexedArray` `None-20000-100000` is slower than `100-20000-100000`, but there's nothing that depends on the feature size in the `IndexedArray` code path, and they're well within error range of each other.)

Long-term memory usage (`test_allocation_benchmark_creation`), median, in KB:

| features | nodes | edges | Pandas | IndexedArray | change        |
|----------|-------|-------|--------|--------------|---------------|
| None     | 0     | 0     | 7.0    | 5.3          | -1.7 (-24%)   |
| None     | 1000  | 5000  | 64.1   | 58.5         | -5.6 (-9%)    |
| 100      | 1000  | 5000  | 459.6  | 58.5         | -401.1 (-87%) |

This reflects the features being guaranteed-shared with the original array, meaning that having features vs. not having features are the same. The memory cost of the `StellarGraph` object is just the graph structure.

Peak memory usage during construction (`test_allocation_benchmark_creation_peak`), median, in KB:

| features | nodes | edges | Pandas | IndexedArray | change        |
|----------|-------|-------|--------|--------------|---------------|
| None     | 0     | 0     | 9.0    | 7.2          | -1.8 (-20%)   |
| None     | 1000  | 5000  | 84.2   | 78.3         | -5.9 (-7%)    |
| 100      | 1000  | 5000  | 703.5  | 78.3         | -625.2 (-89%) |

Real world datasets (using an extended form of the notebook from #1547), formatted like `pandas → numpy`:

| dataset | time (s)       | memory (MB) | memory not shared with data (MB) | peak memory (MB) |
|---------|----------------|-------------|----------------------------------|------------------|
| Cora    | 0.020 → 0.0019 | 15.6 → 15.6 | 15.6 → 0.062                     | 46.8 → 32.0      |
| Reddit  | 3.01 → 0.55    | 712 → 712   | 712 → 151                        | 1868 → 1682(*)   |

(*) This is the same as the peak memory required for loading the data, so the true peak for just creating the `StellarGraph` is likely lower.

In summary, construction from a NumPy array/`IndexedArray` is much faster (10× on Cora, 5.5× on Reddit), uses less "new" memory (i.e. doesn't duplicate feature arrays) and has lower peak memory usage.
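The `(values, index)` idea the commit describes can be sketched in a few lines. This is a hypothetical minimal version built only from the behaviour stated above (first axis is rows, a NumPy array is accepted without copying, IDs default to 0, 1, 2, ...); the real `IndexedArray` has more validation and features:

```python
import numpy as np

class IndexedArray:
    """Sketch of a nicer-to-use (values, index) tuple: `values` is a
    NumPy array whose first axis is rows, `index` labels each row."""

    def __init__(self, values=None, index=None):
        if values is None:
            values = np.empty((0, 0))
        # np.asarray does not copy an existing ndarray, so features
        # stay shared with the caller's array
        self.values = np.asarray(values)
        if index is None:
            # default row labels: 0, 1, 2, ...
            index = np.arange(len(self.values))
        self.index = np.asarray(index)
        if len(self.index) != len(self.values):
            raise ValueError("index must have exactly one label per row")

# rows may be multidimensional: here each "node" has a 2x3 feature matrix
features = IndexedArray(np.zeros((4, 2, 3)), index=["a", "b", "c", "d"])
```

Keeping construction to a couple of `asarray` calls and a length check is what makes this so much cheaper than building a `pd.DataFrame`, which validates and potentially converts every column.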

view details

Huon Wilson

commit sha 370745cbc5a7ed2a2ce35a54139ba576968053df

Add custom 404 page (#1577)

This uses https://sphinx-notfound-page.readthedocs.io to add a custom 404 error page, so that someone landing on our documentation can attempt to find something useful, rather than having the completely link-less default RtD "maze" 404 page (e.g. https://abasdfdas.readthedocs.io). This gives two suggestions to a user who lands on an invalid page:

- using the sidebar
- doing a search, with a prefilled link built by pulling out the alphanumeric parts of the URL path (ignoring the version section), e.g. `/en/latest/demos/graph-classification/gcn.html` gets turned into a link to a search for `demos graph classification gcn` (a potential improvement here would be to do that search directly in the 404 page)

See: #1576
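The path-to-search-terms transformation the commit describes could look roughly like this; the function name and the exact splitting rules are assumptions for illustration, not the code the commit actually adds:

```python
import re

def search_terms(path):
    """Turn an invalid docs URL path into search keywords, dropping the
    language/version prefix (e.g. /en/latest/) and file extensions."""
    # skip the first two path components, e.g. "en" and "latest"
    parts = path.strip("/").split("/")[2:]
    words = []
    for part in parts:
        # pull out alphanumeric runs: "graph-classification" -> two words
        words.extend(re.findall(r"[A-Za-z0-9]+", part))
    return " ".join(w for w in words if w != "html")

print(search_terms("/en/latest/demos/graph-classification/gcn.html"))
# demos graph classification gcn
```

The result matches the example in the commit message: the version section is ignored and each alphanumeric fragment becomes a search keyword.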

view details

Huon Wilson

commit sha e976ad5424c18a3d3a295841dfe2f6434ebaed69

Code review: rename, simplify

view details

Huon Wilson

commit sha d9383f357c8f1cb8ef98785465325525e123a1d1

Merge remote-tracking branch 'origin/develop' into feature/1327-edge-features

view details

push time in 6 days

pull request comment stellargraph/stellargraph

Support edge features in StellarGraph objects

> I did briefly wonder if it would make more sense now for TemporalRandomWalk to use a single edge feature instead of edge weights once this lands, but I'm not sure if that's actually better.

Yeah, that's an interesting thought. I guess one could have it be a function of edge weights and edge features, which defaults to just returning the edge weights for backwards compatibility, but would open it to more complicated things.
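That suggestion could be sketched as a hook that combines weights and features, defaulting to the weights alone. The names here (`walk_bias`, `combine`) are hypothetical, not an actual StellarGraph API:

```python
import numpy as np

def walk_bias(edge_weights, edge_features, combine=None):
    """Compute per-edge walk biases. With no `combine` function the
    weights pass through unchanged (backwards-compatible default);
    a custom `combine` can mix in edge features."""
    if combine is None:
        return edge_weights
    return combine(edge_weights, edge_features)

weights = np.array([1.0, 2.0, 0.5])
feats = np.array([[0.1], [0.9], [0.4]])

# default: identical to the current edge-weight behaviour
print(walk_bias(weights, feats))
# custom: e.g. scale each weight by the first feature column
print(walk_bias(weights, feats, combine=lambda w, f: w * f[:, 0]))
```

Defaulting `combine` to "return the weights" keeps existing temporal-walk behaviour intact while leaving room for the more complicated weightings mentioned above.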

huonw

comment created time in 6 days
