
amygdala/tensorflow-workshop 623

This repo contains materials for use in a TensorFlow workshop.

mrry/ciel 71

A distributed execution engine for cloud computing

mrry/skywriting 19

Language and execution engine for cloud programming

avsm/skywriting-www 4

Website for Skywriting

mrry/ciel-skywriting 4

A scripting language for distributed, parallel computation

mrry/ciel-java 2

CIEL bindings for the Java programming language

mrry/ciel-c 1

C bindings for CIEL

mrry/docs 1

TensorFlow documentation

mrry/magenta 1

Magenta: Music and Art Generation with Machine Intelligence

mrry/mrry.github.io 1

Derek Murray's personal site

issue closed microsoft/onnxruntime

Segmentation fault while training mnist

Describe the bug
I've pulled the code from master, built it from source, and tried to run onnxruntime_training_mnist, but I got a segmentation fault.

Urgency
Not urgent

System information
  • OS Platform and Distribution: Ubuntu 18.04.5 LTS
  • ONNX Runtime installed from: source
  • ONNX Runtime version: onnxruntime-gpu-1.5.0
  • GCC/Compiler version: 7.5.0
  • CUDA/cuDNN version: CUDA 11.0 (and NCCL 2.7.8)
  • GPU model and memory: Tesla V100-PCIE-16GB

To Reproduce

Built onnxruntime with:

./build.sh --enable_training --use_cuda --cuda_home /usr/local/cuda --cudnn_home /usr/local/cuda --config=RelWithDebInfo --build_wheel  --skip_tests --parallel

Tried to train mnist with:

onnxruntime_training_mnist --model_name mnist_gemm --train_data_dir mnist_data --log_dir logs/ --use_cuda

I got mnist_gemm and mnist_data from here: https://github.com/microsoft/onnxruntime/issues/3706

closed time in 8 days

jupvfranco

issue comment microsoft/onnxruntime

Segmentation fault while training mnist

Thanks for confirming Juliana!

jupvfranco

comment created time in 8 days

issue comment microsoft/onnxruntime

Segmentation fault while training mnist

Sorry about that @jupvfranco! I notice that @edgchen1 just submitted #5186, which fixes a segfault and involves a change to nccl_service.cc:266... Edward: do you think that your fix will also fix this problem?

jupvfranco

comment created time in 9 days

issue comment microsoft/onnxruntime

Segmentation error when using graph optimization

Thanks for the update. A couple more questions:

  • Does the same error occur when loading a different ONNX model (e.g. the simple MNIST model from https://github.com/onnx/models/blob/master/vision/classification/mnist/model/mnist-8.tar.gz)?
  • Can you include the full stack trace that gdb reports? (It probably won't tell us much more, but it may have some useful information.)
NagarajSMurthy

comment created time in 9 days

issue comment microsoft/onnxruntime

Segmentation error when using graph optimization

Sorry about that, @NagarajSMurthy! Could you capture a stack trace for the segmentation error (e.g. by running the Python process under gdb)? That will help us identify which component is causing the problem.

Also, to clarify, does the failure still occur if you do not pass any SessionOptions when creating the InferenceSession?

NagarajSMurthy

comment created time in 10 days

pull request comment microsoft/onnxruntime

Fix mnist example

Thanks for fixing this!

thiagocrepaldi

comment created time in a month


Pull request review comment microsoft/onnxruntime

Port legacy checkpoint API into new front-end

+from collections import OrderedDict
+import numpy as np
+import onnx
+import os
+import torch
+import warnings
+
+
+################################################################################
+# Experimental Checkpoint APIs
+################################################################################
+
+
+def experimental_state_dict(ort_trainer):
+    if not ort_trainer._training_session:
+        warnings.warn("ONNX Runtime training session is not initialized yet. "
+                        "Please run train_step or eval_step at least once before calling state_dict().")
+        return {}
+
+    # extract trained weights
+    session_state = ort_trainer._training_session.get_state()
+    torch_state = {}
+    for name in session_state:
+        torch_state[name] = torch.from_numpy(session_state[name])
+
+    # extract untrained weights and buffer
+    for n in ort_trainer._onnx_model.graph.initializer:
+        if n.name not in torch_state:
+            torch_state[n.name] = torch.from_numpy(np.array(onnx.numpy_helper.to_array(n)))
+
+    # Need to remove redundant initializers and name suffices to map back to original torch state names
+    torch_state_to_return = {key: torch_state[key] for key in ort_trainer._original_model_state_keys if key in torch_state} \
+                            if ort_trainer._original_model_state_keys else torch_state
+    return torch_state_to_return
+
+def experimental_load_state_dict(ort_trainer, state_dict, strict=False):

This works for me! (Originally I had imagined that we'd just have onnxruntime.experimental.load_state_dict() etc. without the experimental_ on the function name, but I actually prefer this "uglier"/more verbose version, since it'll be easier to grep for and more robust to namespace aliasing.)

thiagocrepaldi

comment created time in a month


delete branch microsoft/onnxruntime

delete branch : mrry-patch-1

delete time in a month

PR closed microsoft/onnxruntime

[WIP] Set `GraphResolveNeeded(true)` after adding a node

Description: Adds a call to GraphResolveNeeded(true) in Graph::AddNode(), and calls Graph::Resolve() after modifying the graph in MatMulScaleFusion.

Motivation and Context

  • At present, after calling Graph::AddNode() (e.g. in an optimization pass) the Node::op_ field for the added nodes may be nullptr. This can cause segfaults in subsequent code that accesses this field.
  • We must also actually resolve the graph during the optimization pass, to ensure that the nodes we access during the same pass are resolved.
+2 -0

0 comment

2 changed files

mrry

pr closed time in a month
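The motivation above can be illustrated with a toy sketch (Python, illustrative names only — not the real onnxruntime Graph API): adding a node marks the graph as needing resolution, and until Resolve() runs again the new node's op is unset, which is the analogue of the nullptr dereference described in the PR.

```python
# Toy stand-ins for onnxruntime's Graph/Node (hypothetical, for illustration).
class Node:
    def __init__(self, op_type):
        self.op_type = op_type
        self.op = None  # populated only by resolve()

class Graph:
    def __init__(self):
        self.nodes = []
        self.resolve_needed = True

    def add_node(self, op_type):
        node = Node(op_type)
        self.nodes.append(node)
        self.resolve_needed = True  # the fix: mark the graph dirty on mutation
        return node

    def resolve(self):
        if not self.resolve_needed:
            return
        for n in self.nodes:
            if n.op is None:
                n.op = ("resolved", n.op_type)
        self.resolve_needed = False

g = Graph()
a = g.add_node("MatMul")
g.resolve()
assert a.op is not None

# An optimization pass adds a node mid-pass...
fused = g.add_node("TransposeScaleMatMul")
# ...until the graph is re-resolved, the new node's op is None — the
# analogue of the nullptr dereference.
assert fused.op is None and g.resolve_needed
g.resolve()  # re-resolving populates the new node's op
```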

Pull request review comment microsoft/onnxruntime

[WIP] Set `GraphResolveNeeded(true)` after adding a node

 Status ProcessNode(
     }
   }
+  graph.Resolve();
   modified = true;

That makes sense! Anyway, @edgchen1 has an actual fix for the underlying problem in #4775, so this shouldn't be necessary. I'm going to drop this PR for now.

mrry

comment created time in a month

Pull request review comment microsoft/onnxruntime

[WIP] Set `GraphResolveNeeded(true)` after adding a node

 Status ProcessNode(
     }
   }
+  graph.Resolve();
   modified = true;

The problem we're seeing is that the test on L174 dereferences node.Op() on a "TransposeScaleMatMul" op, and node.Op() returns nullptr; that op appears to have been added in an earlier call to this function.

Looking at the GraphViewer logic in ApplyImpl(), however, I'm not sure why a newly added node in the same pass would appear in the node_indices.

mrry

comment created time in a month
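A minimal sketch of why the iteration strategy matters (plain Python lists rather than the real GraphViewer/ApplyImpl, all names hypothetical): a pass that walks a snapshot of node indices taken up front never visits nodes added during the pass, while a pass that walks the live node list does visit them, in an unresolved state.

```python
nodes = ["MatMul", "Scale"]

def pass_over(node_list, indices):
    # Walk a pre-computed snapshot of indices (GraphViewer-style).
    visited = []
    for i in indices:
        visited.append(node_list[i])
        if node_list[i] == "MatMul":
            node_list.append("TransposeScaleMatMul")  # fusion adds a node
    return visited

snapshot = list(range(len(nodes)))
visited = pass_over(nodes, snapshot)
assert "TransposeScaleMatMul" not in visited  # new node is never visited

# Iterating the live list instead would reach the newly added node.
nodes2 = ["MatMul", "Scale"]
visited_live = []
i = 0
while i < len(nodes2):
    visited_live.append(nodes2[i])
    if nodes2[i] == "MatMul":
        nodes2.append("TransposeScaleMatMul")
    i += 1
assert "TransposeScaleMatMul" in visited_live  # would hit the unresolved node
```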

push event microsoft/onnxruntime

Derek Murray

commit sha b24b407676f565d4b4f2065089d63ee5d5a5454f

Add call to Graph::Resolve() in MatMulScaleFusion. After modifying a node, we need to resolve the graph, because we may end up processing a node that was added by the same optimization pass.

view details

push time in a month

push event microsoft/onnxruntime

Derek Murray

commit sha 2b877fc88a78aaaaf9abb9b57d629b00f0eb0219

Remove added whitespace.

view details

push time in a month

PR opened microsoft/onnxruntime

Set `GraphResolveNeeded(true)` after adding a node

Description: Adds a call to GraphResolveNeeded(true) in Graph::AddNode().

Motivation and Context

  • At present, after calling Graph::AddNode() (e.g. in an optimization pass) the Node::op_ field for the added nodes may be nullptr. This can cause segfaults in subsequent code that accesses this field.
+2 -1

0 comment

1 changed file

pr created time in a month

create branch microsoft/onnxruntime

branch : mrry-patch-1

created branch time in a month

pull request commentmicrosoft/onnxruntime

Use single thread when pipeline is not enabled in TrainingRunner

Hi Wei-Sheng! Is this PR blocked on anything? Would be good to get it in before more conflicts creep in :).

wschin

comment created time in 2 months

push event microsoft/onnxruntime

Derek Murray

commit sha 3e48ffd21c4c924010b1a50a6de785a9486a790b

Move AutoPadType to common.h (#4474)

Extracting some common code related to "AutoPadType" from the cpu execution provider into "common.h".

Motivation and Context
  • Sharing code with authors of other execution providers that need the same functionality.
  • I didn't modify the code in shared_library or dnnl EP to avoid changing their dependency structure, so there is still a redundant copy of the AutoPadType code in there.

view details

push time in 2 months

PR merged microsoft/onnxruntime

[WIP] Move AutoPadType to common.h

Description: Extracting some common code related to "AutoPadType" from the cpu execution provider into "common.h".

Motivation and Context

  • Sharing code with authors of other execution providers that need the same functionality.
  • I didn't modify the code in shared_library or dnnl EP to avoid changing their dependency structure, so there is still a redundant copy of the AutoPadType code in there.
+81 -89

0 comment

8 changed files

mrry

pr closed time in 2 months

Pull request review comment microsoft/onnxruntime

[WIP] Move AutoPadType to common.h

 inline float clamp(float v, float lo, float hi) {
   return v;
 }
+
+enum class AutoPadType {
+  NOTSET = 0,
+  VALID = 1,
+  SAME_UPPER = 2,
+  SAME_LOWER = 3,
+};
+
+inline AutoPadType StringToAutoPadType(const std::string& str) {
+  if (str.empty()) {
+    return AutoPadType::NOTSET;
+  }
+  if (str == "NOTSET") {  // in onnx spec, default value is "NOTSET"
+    return AutoPadType::NOTSET;
+  }
+  if (str == "VALID") {
+    return AutoPadType::VALID;
+  }
+  if (str == "SAME_UPPER") {
+    return AutoPadType::SAME_UPPER;
+  }
+  if (str == "SAME_LOWER") {
+    return AutoPadType::SAME_LOWER;
+  }
+  ORT_ENFORCE(false, "Unknown AutoPadType String");
+}
+
+// helper function
+template <bool ForceSymmetricAutoPadding>
+Status ComputePadAndOutputShape(

That sounds like a reasonable idea! I'm going to go ahead and merge this change as is to keep it small, but I'd support the idea of splitting the functions in the future.

mrry

comment created time in 2 months
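For illustration, the string-to-enum mapping in the hunk above could also be written table-driven. This is a Python sketch of that idea (the real code is C++; names here loosely mirror it and are otherwise hypothetical). Per the ONNX spec, the default auto_pad value is "NOTSET", which also covers the empty string.

```python
from enum import Enum

class AutoPadType(Enum):
    NOTSET = 0
    VALID = 1
    SAME_UPPER = 2
    SAME_LOWER = 3

# Lookup table replaces the chain of string comparisons.
_AUTO_PAD_BY_NAME = {
    "": AutoPadType.NOTSET,  # empty string defaults to NOTSET
    "NOTSET": AutoPadType.NOTSET,
    "VALID": AutoPadType.VALID,
    "SAME_UPPER": AutoPadType.SAME_UPPER,
    "SAME_LOWER": AutoPadType.SAME_LOWER,
}

def string_to_auto_pad_type(s):
    try:
        return _AUTO_PAD_BY_NAME[s]
    except KeyError:
        raise ValueError("Unknown AutoPadType string: %r" % s)
```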

push event mrry/onnxruntime

Derek Murray

commit sha 84d70539aed9015373eb9bea1fd920e2fdf18209

Revert dnnl code to the original version.

view details

push time in 2 months

push event mrry/onnxruntime

Derek Murray

commit sha 75aea8cd2073595229cd26fe8d37fc44685ef88f

Completely inline roundUpPow2 and fix typo.

view details

push time in 2 months

push event mrry/onnxruntime

Derek Murray

commit sha 41fe516f95fcb4c039134a167e3c42d071f25222

Instead of including math.h, inline the missing function.

view details

push time in 2 months

push event mrry/onnxruntime

Derek Murray

commit sha 3c2c1ff11c550fed419466b1f5db2c90b9a1c2b2

Add missing include for math.h

view details

push time in 2 months

PR opened microsoft/onnxruntime

[WIP] Move AutoPadType to common.h

Description: Extracting some common code related to "AutoPadType" from the cpu and dnnl execution providers into "common.h".

Motivation and Context

  • Sharing code with authors of other execution providers that need the same functionality.
+80 -136

0 comment

9 changed files

pr created time in 3 months

create branch mrry/onnxruntime

branch : autopadtype

created branch time in 3 months

pull request commentmicrosoft/onnxruntime

support pipeline partition with shared initializer

Thanks for making those changes! I've added @tlh20 as a reviewer, so that he can take a look as well.

xzhu1900

comment created time in 3 months

Pull request review comment microsoft/onnxruntime

support pipeline partition with shared initializer

 void RetrieveSendRecvOperators(
   }
 }
-TEST(GradientGraphBuilderTest, PipelineOnlinePartition) {
+TEST(GradientGraphBuilderTest, PipelineOnlinePartition_bert_tiny) {
+  const auto model_path = ORT_TSTR("testdata/bert-tiny.onnx");
+
+  const size_t total_partition_count = 3;
+  TrainingSession::TrainingConfiguration::PipelineConfiguration pipe{};
+  pipe.do_partition = true;
+
+  // evenly model in 3 partitions
+  TrainingSession::TrainingConfiguration::CutInfo cut0 = {
+      onnxruntime::training::TrainingSession::TrainingConfiguration::CutEdge("186"),
+      onnxruntime::training::TrainingSession::TrainingConfiguration::CutEdge("71", {"273"})};
+
+  TrainingSession::TrainingConfiguration::CutInfo cut1 = {
+      onnxruntime::training::TrainingSession::TrainingConfiguration::CutEdge("308"),
+      onnxruntime::training::TrainingSession::TrainingConfiguration::CutEdge("71", {"395"})};
+
+  pipe.cut_list.emplace_back(cut0);
+  pipe.cut_list.emplace_back(cut1);
+
+  TrainingSession::TrainingConfiguration::MixedPrecisionConfiguration mixed_precision_config{};
+  mixed_precision_config.use_fp16_initializers = true;
+
+  // 2 test variations - full precision and mixed precision
+  const std::vector<bool> test_with_fp32{true, false};
+  for (auto is_fp32 : test_with_fp32) {
+    // graph is partitioned into 4 parts.
+    for (int i = 0; i < static_cast<int>(total_partition_count); ++i) {
+#ifdef _WIN32
+      auto surfix = std::to_wstring(i);

Typo: surfix -> suffix.

xzhu1900

comment created time in 3 months

Pull request review comment microsoft/onnxruntime

support pipeline partition with shared initializer

 void RetrieveSendRecvOperators(
   }
 }
-TEST(GradientGraphBuilderTest, PipelineOnlinePartition) {
+TEST(GradientGraphBuilderTest, PipelineOnlinePartition_bert_tiny) {
+  const auto model_path = ORT_TSTR("testdata/bert-tiny.onnx");
+
+  const size_t total_partition_count = 3;
+  TrainingSession::TrainingConfiguration::PipelineConfiguration pipe{};
+  pipe.do_partition = true;
+
+  // evenly model in 3 partitions
+  TrainingSession::TrainingConfiguration::CutInfo cut0 = {
+      onnxruntime::training::TrainingSession::TrainingConfiguration::CutEdge("186"),
+      onnxruntime::training::TrainingSession::TrainingConfiguration::CutEdge("71", {"273"})};
+
+  TrainingSession::TrainingConfiguration::CutInfo cut1 = {
+      onnxruntime::training::TrainingSession::TrainingConfiguration::CutEdge("308"),
+      onnxruntime::training::TrainingSession::TrainingConfiguration::CutEdge("71", {"395"})};
+
+  pipe.cut_list.emplace_back(cut0);
+  pipe.cut_list.emplace_back(cut1);
+
+  TrainingSession::TrainingConfiguration::MixedPrecisionConfiguration mixed_precision_config{};
+  mixed_precision_config.use_fp16_initializers = true;
+
+  // 2 test variations - full precision and mixed precision
+  const std::vector<bool> test_with_fp32{true, false};
+  for (auto is_fp32 : test_with_fp32) {
+    // graph is partitioned into 4 parts.

3 parts?

xzhu1900

comment created time in 3 months

Pull request review comment microsoft/onnxruntime

support pipeline partition with shared initializer

 void RetrieveSendRecvOperators(
   }
 }
-TEST(GradientGraphBuilderTest, PipelineOnlinePartition) {
+TEST(GradientGraphBuilderTest, PipelineOnlinePartition_bert_tiny) {
+  const auto model_path = ORT_TSTR("testdata/bert-tiny.onnx");
+
+  const size_t total_partition_count = 3;
+  TrainingSession::TrainingConfiguration::PipelineConfiguration pipe{};
+  pipe.do_partition = true;
+
+  // evenly model in 3 partitions
+  TrainingSession::TrainingConfiguration::CutInfo cut0 = {
+      onnxruntime::training::TrainingSession::TrainingConfiguration::CutEdge("186"),
+      onnxruntime::training::TrainingSession::TrainingConfiguration::CutEdge("71", {"273"})};
+
+  TrainingSession::TrainingConfiguration::CutInfo cut1 = {
+      onnxruntime::training::TrainingSession::TrainingConfiguration::CutEdge("308"),
+      onnxruntime::training::TrainingSession::TrainingConfiguration::CutEdge("71", {"395"})};
+
+  pipe.cut_list.emplace_back(cut0);
+  pipe.cut_list.emplace_back(cut1);
+
+  TrainingSession::TrainingConfiguration::MixedPrecisionConfiguration mixed_precision_config{};
+  mixed_precision_config.use_fp16_initializers = true;
+
+  // 2 test variations - full precision and mixed precision
+  const std::vector<bool> test_with_fp32{true, false};
+  for (auto is_fp32 : test_with_fp32) {
+    // graph is partitioned into 4 parts.
+    for (int i = 0; i < static_cast<int>(total_partition_count); ++i) {
+#ifdef _WIN32
+      auto surfix = std::to_wstring(i);
+#else
+      auto surfix = std::to_string(i);
+#endif
+      PathString output_file = ORT_TSTR("pipeline_partition_") + surfix + ORT_TSTR("_back.onnx");
+
+      auto config = MakeBasicTrainingConfig();
+
+      if (i == static_cast<int>(total_partition_count - 1)) {
+        config.loss_function_config = TrainingSession::TrainingConfiguration::LossFunctionConfiguration{};
+        config.loss_function_config.value().loss_function_info =
+            LossFunctionInfo(OpDef("BertLoss", kOnnxDomain),
+                             "total_loss",
+                             {/*prediction_masked_lm*/ "output1",
+                              /*prediction_next_sentence*/ "output2",
+                              /*masked_lm_positions*/ "masked_lm_positions",
+                              /*masked_lm_ids*/ "masked_lm_ids",
+                              /*masked_lm_weights*/ "masked_lm_weights",
+                              /*next_sentence_labels*/ "next_sentence_labels",
+                              /*mlm_loss*/ "mlm_loss",
+                              /*nsp_loss*/ "nsp_loss"});
+      }
+
+      config.immutable_weights = {
+          {"Div", {{1, 8.0f}, {1, 1.4142135381698608f}}},
+          {"Add", {{1, 1.0f}, {1, 9.999999960041972e-13f}}},
+          {"Mul", {{1, 0.5f}, {1, -10000.0f}}},
+          {"Sub", {{0, 1.0f}}}};

What are these magic numbers? (I see similar code elsewhere in the codebase, but I'm not sure what they have to do with the pipelining code....)

Are these related to initializers, or to weight_names_not_to_train in the TrainingConfiguration? Do we/should we have a test for the case where two pipeline stages depend on the same immutable weight? (Presumably in that case we'd replicate the initializer rather than sending it down the pipeline on each step.)

xzhu1900

comment created time in 3 months

Pull request review comment microsoft/onnxruntime

support pipeline partition with shared initializer

 void RetrieveSendRecvOperators(
   }
 }
-TEST(GradientGraphBuilderTest, PipelineOnlinePartition) {
+TEST(GradientGraphBuilderTest, PipelineOnlinePartition_bert_tiny) {

Is it possible to have some smaller unit tests for this functionality, to ensure that we have full coverage for the code paths in the transformation? Off the top of my head, it would be good to see how we handle multiple shared initializers with different usage patterns (e.g. stages that do all three of receiving, using, and sending a value).

The end-to-end test here is nice to have, but it would be good to have smaller tests to confirm correctness as we modify the code.

xzhu1900

comment created time in 3 months

Pull request review comment microsoft/onnxruntime

support pipeline partition with shared initializer

 void AddNewScalarNodeArgAndInitializer(Graph& graph,
   graph.AddInitializedTensor(proto_data);
 }
+
+Status FindAllConnectedNodes(Graph& graph,
+                             const Node* node,
+                             std::vector<const Node*>& connected_nodes,
+                             std::set<const NodeArg*>& connected_inputs,
+                             std::set<const NodeArg*>& connected_outputs
+                             ) {
+  ORT_THROW_IF_ERROR(node->ForEachWithIndex(
+      node->InputDefs(),
+      [&](const NodeArg& node_arg, size_t /*index*/) {
+        if (graph.IsInputsIncludingInitializers(&node_arg)) {
+          connected_inputs.insert(&node_arg);
+        } else {
+          const Node* producer_node = graph.GetProducerNode(node_arg.Name());
+          if (producer_node == nullptr) {
+            // got nullptr as producer node. This could be because the input is a constant op which will be optimized
+            // away. Print out this information and continue.
+            // LOGS_DEFAULT(WARNING) << "Cannot find producer node for node_arg: " << node_arg.Name() << ". Skipping this node.";
+          } else {
+            connected_nodes.push_back(producer_node);
+          }
+        }
+        return Status::OK();
+      }));
+
+  ORT_THROW_IF_ERROR(node->ForEachWithIndex(
+      node->OutputDefs(),
+      [&](const NodeArg& node_arg, size_t /*index*/) {
+        if (!graph.IsOutput(&node_arg)) {
+          std::vector<const Node*> consumer_nodes = graph.GetConsumerNodes(node_arg.Name());
+          connected_nodes.insert(std::end(connected_nodes), consumer_nodes.begin(), consumer_nodes.end());
+
+        } else {
+          connected_outputs.insert(&node_arg);
+        }
+        return Status::OK();
+      }));
+  return Status::OK();
+}
+
+// PipelineStageNodeGroup groups nodes that share the same input initializer and belong to the same stage.
+// It is used to distinguish other nodes that share the same input initializer but belong to
+// other pipeline partitions after split.
+struct PipelineStageNodeGroup {
+  const size_t stage_id;
+
+  // Vector of nodes that have the same initializer input and belong to the same stage. Noted that
+  // the consumer nodes of a particular initializer can be more than one, so we need a vector to store those
+  // nodes.
+  std::vector<Node*> nodes;
+  PipelineStageNodeGroup(const size_t stage, std::vector<Node*>& node) : stage_id(stage), nodes(std::move(node)){};
+};
+
+common::Status AddPassthroughInitializer(Graph& graph,
+                                         const NodeArg* initializer,
+                                         const std::vector<PipelineStageNodeGroup>& node_groups,
+                                         const std::vector<Node*>& send_nodes,
+                                         const std::vector<Node*>& recv_nodes) {
+  ORT_ENFORCE(node_groups.size() >= 2, "Initializer ", initializer->Name(),
+              " is not shared across stages. It only exits in partition: ", node_groups[0].stage_id);
+
+  const size_t from_stage = node_groups.front().stage_id;
+  const size_t to_stage = node_groups.back().stage_id;
+
+  ORT_ENFORCE(from_stage < to_stage, "Pass through from_stage (", from_stage,
+              ") is not less than the to_stage (", to_stage, ").");
+
+  auto dtype = initializer->TypeAsProto()->tensor_type().elem_type();
+
+  // new_node_args tracks newly created node_args in the pass through stages
+  std::vector<NodeArg*> new_node_args;
+  auto current_node_arg = const_cast<NodeArg*>(initializer);
+
+  for (auto i = from_stage; i < to_stage; ++i) {
+    // processing send node in cut i
+    auto& send_attributes = send_nodes[i]->GetMutableAttributes();
+    auto& send_element_types = send_attributes["element_types"];
+    send_element_types.add_ints(static_cast<int64_t>(dtype));
+    send_nodes[i]->MutableInputDefs().push_back(current_node_arg);
+    send_nodes[i]->MutableInputArgsCount().back()++;
+
+    // Create a new node_arg for the recv, as the new node_arg from recv node should possess a differnet id
+    // than the one in send
+    auto& new_node_arg = CreateNodeArg(graph, current_node_arg);
+    new_node_args.push_back(&new_node_arg);
+    current_node_arg = &new_node_arg;
+
+    // process recv node in cut i
+    auto& recv_attributes = recv_nodes[i]->GetMutableAttributes();
+    auto& recv_element_types = recv_attributes["element_types"];
+    recv_element_types.add_ints(static_cast<int64_t>(dtype));
+    recv_nodes[i]->MutableOutputDefs().push_back(current_node_arg);
+  }
+
+  // update the consumer node's input if the node's group is not in the first partition
+  for (size_t i = 1u; i < node_groups.size(); ++i) {
+    ORT_ENFORCE(node_groups[i].stage_id > from_stage, "node group id (", node_groups[i].stage_id,
+                ") is less than first stage id (", from_stage, ").");
+    size_t new_node_arg_index = node_groups[i].stage_id - from_stage - 1;
+    for(auto node : node_groups[i].nodes){
+      for (auto& input_node : node->MutableInputDefs()) {
+        if (input_node == initializer) {
+          input_node = new_node_args[new_node_arg_index];
+          break;
+        }
+      }
+    }
+  }
+  return Status::OK();
+}
+
+void TraverseGraphWithConnectedElement(Graph& graph,
+                                       const Node* startNode,
+                                       std::set<const Node*>& visited_nodes,
+                                       std::set<const NodeArg*>& visited_inputs,
+                                       std::set<const NodeArg*>& visited_outputs) {
+  visited_nodes.clear();
+  visited_inputs.clear();
+  visited_outputs.clear();
+
+  std::queue<const Node*> node_queue;
+  node_queue.push(startNode);
+
+  while (!node_queue.empty()) {
+    auto node = node_queue.front();
+    node_queue.pop();
+    if (visited_nodes.count(node) == 0) {
+      visited_nodes.insert(node);
+      std::vector<const Node*> connected_nodes;
+      ORT_THROW_IF_ERROR(FindAllConnectedNodes(graph, node, connected_nodes, visited_inputs, visited_outputs));
+
+      for (auto n : connected_nodes) {
+        ORT_ENFORCE(n != nullptr, "Found nullptr in searching for connected nodes");
+        node_queue.push(n);
+      }
+    }
+  }
+}
+
+// If an initializer is shared across partitions, instead of creating a separate all_reduce op to
+// sync with those tensors in selected partitions, we save only one copy of that initializer in
+// the very first partition it appears, and pass that data down to all following partitions
+// where this initializer is used.
+common::Status HandleSharedInitializer(Graph& graph,
+                                       const std::vector<Node*>& send_nodes,
+                                       const std::vector<Node*>& recv_nodes) {
+  // Map an given initializer to all the partitions that its consumer nodes reside. The size of
+  // the mapped vector reflects how many partitions this initializer's consumer nodes distribute.
+  // If its size is greater than 1, it means this initializer is being used in more than one partition and
+  // we need to proceed those cases.
+  std::map<const NodeArg*, std::vector<PipelineStageNodeGroup>> input_consumer_stage_map;
+
+  for (size_t stage = 0; stage <= send_nodes.size(); ++stage) {
+    std::set<const Node*> visited_nodes;
+    std::set<const NodeArg*> visited_inputs;
+    std::set<const NodeArg*> visited_outputs;
+
+    if (stage < send_nodes.size()) {
+      TraverseGraphWithConnectedElement(graph, send_nodes[stage],
+                                        visited_nodes, visited_inputs, visited_outputs);
+    } else {
+      TraverseGraphWithConnectedElement(graph, recv_nodes.back(),
+                                        visited_nodes, visited_inputs, visited_outputs);
+    }
+
+    for (const auto input : visited_inputs) {
+      // If the node is an input instead of an initializer, continue
+      if (!graph.IsInitializerTensor(input->Name())){
+        continue;
+      }
+
+      // group all consumer nodes that shares the same input initializer in visited_consumer_nodes
+      std::vector<Node*> consumer_nodes = graph.GetMutableConsumerNodes(input->Name());
+      std::vector<Node*> visited_consumer_nodes;
+      for(auto consumer_node : consumer_nodes){
+        if (visited_nodes.count(consumer_node) != 0){
+          visited_consumer_nodes.push_back(consumer_node);
+        }
+      }
+
+      if (input_consumer_stage_map.count(input) == 0) {
+        std::vector<PipelineStageNodeGroup> stage_node_group{PipelineStageNodeGroup(stage, visited_consumer_nodes)};
+        input_consumer_stage_map[input] = std::move(stage_node_group);
+      } else {
+        input_consumer_stage_map[input].push_back({stage, visited_consumer_nodes});
+      }
+    }
+  }
+
+  for (const auto entry : input_consumer_stage_map) {

Can this be const auto& entry?

xzhu1900

comment created time in 3 months

Pull request review comment microsoft/onnxruntime

support pipeline partition with shared initializer

 void AddNewScalarNodeArgAndInitializer(Graph& graph,
   graph.AddInitializedTensor(proto_data);
 }
+
+Status FindAllConnectedNodes(Graph& graph,
+                             const Node* node,
+                             std::vector<const Node*>& connected_nodes,
+                             std::set<const NodeArg*>& connected_inputs,
+                             std::set<const NodeArg*>& connected_outputs
+                             ) {
+  ORT_THROW_IF_ERROR(node->ForEachWithIndex(
+      node->InputDefs(),
+      [&](const NodeArg& node_arg, size_t /*index*/) {
+        if (graph.IsInputsIncludingInitializers(&node_arg)) {
+          connected_inputs.insert(&node_arg);
+        } else {
+          const Node* producer_node = graph.GetProducerNode(node_arg.Name());
+          if (producer_node == nullptr) {
+            // got nullptr as producer node. This could be because the input is a constant op which will be optimized
+            // away. Print out this information and continue.
+            // LOGS_DEFAULT(WARNING) << "Cannot find producer node for node_arg: " << node_arg.Name() << ". Skipping this node.";
+          } else {
+            connected_nodes.push_back(producer_node);
+          }
+        }
+        return Status::OK();
+      }));
+
+  ORT_THROW_IF_ERROR(node->ForEachWithIndex(
+      node->OutputDefs(),
+      [&](const NodeArg& node_arg, size_t /*index*/) {
+        if (!graph.IsOutput(&node_arg)) {
+          std::vector<const Node*> consumer_nodes = graph.GetConsumerNodes(node_arg.Name());
+          connected_nodes.insert(std::end(connected_nodes), consumer_nodes.begin(), consumer_nodes.end());
+
+        } else {
+          connected_outputs.insert(&node_arg);
+        }
+        return Status::OK();
+      }));
+  return Status::OK();
+}
+
+// PipelineStageNodeGroup groups nodes that share the same input initializer and belong to the same stage.
+// It is used to distinguish other nodes that share the same input initializer but belong to
+// other pipeline partitions after split.
+struct PipelineStageNodeGroup {
+  const size_t stage_id;
+
+  // Vector of nodes that have the same initializer input and belong to the same stage. Noted that
+  // the consumer nodes of a particular initializer can be more than one, so we need a vector to store those
+  // nodes.
+  std::vector<Node*> nodes;
+  PipelineStageNodeGroup(const size_t stage, std::vector<Node*>& node) : stage_id(stage), nodes(std::move(node)){};
+};
+
+common::Status AddPassthroughInitializer(Graph& graph,
+                                         const NodeArg* initializer,
+                                         const std::vector<PipelineStageNodeGroup>& node_groups,
+                                         const std::vector<Node*>& send_nodes,
+                                         const std::vector<Node*>& recv_nodes) {
+  ORT_ENFORCE(node_groups.size() >= 2, "Initializer ", initializer->Name(),
+              " is not shared across stages. It only exits in partition: ", node_groups[0].stage_id);
+
+  const size_t from_stage = node_groups.front().stage_id;
+  const size_t to_stage = node_groups.back().stage_id;
+
+  ORT_ENFORCE(from_stage < to_stage, "Pass through from_stage (", from_stage,
+              ") is not less than the to_stage (", to_stage, ").");
+
+  auto dtype = initializer->TypeAsProto()->tensor_type().elem_type();
+
+  // new_node_args tracks newly created node_args in the pass through stages
+  std::vector<NodeArg*> new_node_args;
+  auto current_node_arg = const_cast<NodeArg*>(initializer);
+
+  for (auto i = from_stage; i < to_stage; ++i) {
+    // processing send node in cut i
+    auto& send_attributes = send_nodes[i]->GetMutableAttributes();
+    auto& send_element_types = send_attributes["element_types"];
+    send_element_types.add_ints(static_cast<int64_t>(dtype));
+    send_nodes[i]->MutableInputDefs().push_back(current_node_arg);
+    send_nodes[i]->MutableInputArgsCount().back()++;
+
+    // Create a new node_arg for the recv, as the new node_arg from recv node should possess a differnet id
+    // than the one in send
+    auto& new_node_arg = CreateNodeArg(graph, current_node_arg);
+    new_node_args.push_back(&new_node_arg);
+    current_node_arg = &new_node_arg;
+
+    // process recv node in cut i
+    auto& recv_attributes = recv_nodes[i]->GetMutableAttributes();
+    auto& recv_element_types = recv_attributes["element_types"];
+    recv_element_types.add_ints(static_cast<int64_t>(dtype));
+    recv_nodes[i]->MutableOutputDefs().push_back(current_node_arg);
+  }
+
+  // update the consumer node's input if the node's group is not in the first partition
+  for (size_t i = 1u; i < node_groups.size(); ++i) {
+    ORT_ENFORCE(node_groups[i].stage_id > from_stage, "node group id (", node_groups[i].stage_id,
+                ") is less than first stage id (", from_stage, ").");
+    size_t new_node_arg_index = node_groups[i].stage_id - from_stage - 1;
+    for(auto node : node_groups[i].nodes){
+      for (auto& input_node : node->MutableInputDefs()) {
+        if (input_node == initializer) {
+          input_node = new_node_args[new_node_arg_index];
+          break;
+        }
+      }
+    }
+  }
+  return Status::OK();
+}
+
+void TraverseGraphWithConnectedElement(Graph& graph,
+                                       const Node* startNode,

Style nit: use start_node instead of camelCase startNode for consistency.

xzhu1900

comment created time in 3 months

Pull request review comment microsoft/onnxruntime

support pipeline partition with shared initializer

 void AddNewScalarNodeArgAndInitializer(Graph& graph,
   graph.AddInitializedTensor(proto_data);
 }
+Status FindAllConnectedNodes(Graph& graph,
+                             const Node* node,
+                             std::vector<const Node*>& connected_nodes,
+                             std::set<const NodeArg*>& connected_inputs,
+                             std::set<const NodeArg*>& connected_outputs
+                             ) {
+  ORT_THROW_IF_ERROR(node->ForEachWithIndex(
+      node->InputDefs(),
+      [&](const NodeArg& node_arg, size_t /*index*/) {
+        if (graph.IsInputsIncludingInitializers(&node_arg)) {
+          connected_inputs.insert(&node_arg);
+        } else {
+          const Node* producer_node = graph.GetProducerNode(node_arg.Name());
+          if (producer_node == nullptr) {
+            // got nullptr as producer node. This could be because the input is a constant op which will be optimized
+            // away. Print out this information and continue.
+            // LOGS_DEFAULT(WARNING) << "Cannot find producer node for node_arg: " << node_arg.Name() << ". Skipping this node.";
+          } else {
+            connected_nodes.push_back(producer_node);
+          }
+        }
+        return Status::OK();
+      }));
+
+  ORT_THROW_IF_ERROR(node->ForEachWithIndex(
+      node->OutputDefs(),
+      [&](const NodeArg& node_arg, size_t /*index*/) {
+        if (!graph.IsOutput(&node_arg)) {
+          std::vector<const Node*> consumer_nodes = graph.GetConsumerNodes(node_arg.Name());
+          connected_nodes.insert(std::end(connected_nodes), consumer_nodes.begin(), consumer_nodes.end());
+
+        } else {
+          connected_outputs.insert(&node_arg);
+        }
+        return Status::OK();
+      }));
+  return Status::OK();
+}
+
+// PipelineStageNodeGroup groups nodes that share the same input initializer and belong to the same stage.
+// It is used to distinguish other nodes that share the same input initializer but belong to
+// other pipeline partitions after split.
+struct PipelineStageNodeGroup {
+  const size_t stage_id;
+
+  // Vector of nodes that have the same initializer input and belong to the same stage. Noted that
+  // the consumer nodes of a particular initializer can be more than one, so we need a vector to store those
+  // nodes.
+  std::vector<Node*> nodes;
+  PipelineStageNodeGroup(const size_t stage, std::vector<Node*>& node) : stage_id(stage), nodes(std::move(node)){};
+};
+
+common::Status AddPassthroughInitializer(Graph& graph,
+                                         const NodeArg* initializer,
+                                         const std::vector<PipelineStageNodeGroup>& node_groups,
+                                         const std::vector<Node*>& send_nodes,
+                                         const std::vector<Node*>& recv_nodes) {
+  ORT_ENFORCE(node_groups.size() >= 2, "Initializer ", initializer->Name(),
+              " is not shared across stages. It only exits in partition: ", node_groups[0].stage_id);
+
+  const size_t from_stage = node_groups.front().stage_id;
+  const size_t to_stage = node_groups.back().stage_id;
+
+  ORT_ENFORCE(from_stage < to_stage, "Pass through from_stage (", from_stage,
+              ") is not less than the to_stage (", to_stage, ").");
+
+  auto dtype = initializer->TypeAsProto()->tensor_type().elem_type();
+
+  // new_node_args tracks newly created node_args in the pass through stages
+  std::vector<NodeArg*> new_node_args;
+  auto current_node_arg = const_cast<NodeArg*>(initializer);
+
+  for (auto i = from_stage; i < to_stage; ++i) {
+    // processing send node in cut i
+    auto& send_attributes = send_nodes[i]->GetMutableAttributes();
+    auto& send_element_types = send_attributes["element_types"];
+    send_element_types.add_ints(static_cast<int64_t>(dtype));
+    send_nodes[i]->MutableInputDefs().push_back(current_node_arg);
+    send_nodes[i]->MutableInputArgsCount().back()++;
+
+    // Create a new node_arg for the recv, as the new node_arg from recv node should possess a differnet id
+    // than the one in send
+    auto& new_node_arg = CreateNodeArg(graph, current_node_arg);
+    new_node_args.push_back(&new_node_arg);
+    current_node_arg = &new_node_arg;
+
+    // process recv node in cut i
+    auto& recv_attributes = recv_nodes[i]->GetMutableAttributes();
+    auto& recv_element_types = recv_attributes["element_types"];
+    recv_element_types.add_ints(static_cast<int64_t>(dtype));
+    recv_nodes[i]->MutableOutputDefs().push_back(current_node_arg);
+  }
+
+  // update the consumer node's input if the node's group is not in the first partition
+  for (size_t i = 1u; i < node_groups.size(); ++i) {
+    ORT_ENFORCE(node_groups[i].stage_id > from_stage, "node group id (", node_groups[i].stage_id,
+                ") is less than first stage id (", from_stage, ").");
+    size_t new_node_arg_index = node_groups[i].stage_id - from_stage - 1;
+    for(auto node : node_groups[i].nodes){
+      for (auto& input_node : node->MutableInputDefs()) {
+        if (input_node == initializer) {
+          input_node = new_node_args[new_node_arg_index];
+          break;
+        }
+      }
+    }
+  }
+  return Status::OK();
+}
+
+void TraverseGraphWithConnectedElement(Graph& graph,
+                                       const Node* startNode,
+                                       std::set<const Node*>& visited_nodes,
+                                       std::set<const NodeArg*>& visited_inputs,
+                                       std::set<const NodeArg*>& visited_outputs) {
+  visited_nodes.clear();
+  visited_inputs.clear();
+  visited_outputs.clear();
+
+  std::queue<const Node*> node_queue;
+  node_queue.push(startNode);
+
+  while (!node_queue.empty()) {
+    auto node = node_queue.front();
+    node_queue.pop();
+    if (visited_nodes.count(node) == 0) {
+      visited_nodes.insert(node);
+      std::vector<const Node*> connected_nodes;
+      ORT_THROW_IF_ERROR(FindAllConnectedNodes(graph, node, connected_nodes, visited_inputs, visited_outputs));
+
+      for (auto n : connected_nodes) {
+        ORT_ENFORCE(n != nullptr, "Found nullptr in searching for connected nodes");
+        node_queue.push(n);
+      }
+    }
+  }
+}
+
+// If an initializer is shared across partitions, instead of creating a separate all_reduce op to
+// sync with those tensors in selected partitions, we save only one copy of that initializer in
+// the very first partition it appears, and pass that data down to all following partitions
+// where this initializer is used.
+common::Status HandleSharedInitializer(Graph& graph,
+                                       const std::vector<Node*>& send_nodes,
+                                       const std::vector<Node*>& recv_nodes) {
+  // Map an given initializer to all the partitions that its consumer nodes reside. The size of
+  // the mapped vector reflects how many partitions this initializer's consumer nodes distribute.
+  // If its size is greater than 1, it means this initializer is being used in more than one partition and
+  // we need to proceed those cases.
+  std::map<const NodeArg*, std::vector<PipelineStageNodeGroup>> input_consumer_stage_map;
+
+  for (size_t stage = 0; stage <= send_nodes.size(); ++stage) {
+    std::set<const Node*> visited_nodes;
+    std::set<const NodeArg*> visited_inputs;
+    std::set<const NodeArg*> visited_outputs;
+
+    if (stage < send_nodes.size()) {
+      TraverseGraphWithConnectedElement(graph, send_nodes[stage],
+                                        visited_nodes, visited_inputs, visited_outputs);
+    } else {
+      TraverseGraphWithConnectedElement(graph, recv_nodes.back(),
+                                        visited_nodes, visited_inputs, visited_outputs);
+    }
+
+    for (const auto input : visited_inputs) {
+      // If the node is an input instead of an initializer, continue
+      if (!graph.IsInitializerTensor(input->Name())){
+        continue;
+      }
+
+      // group all consumer nodes that shares the same input initializer in visited_consumer_nodes
+      std::vector<Node*> consumer_nodes = graph.GetMutableConsumerNodes(input->Name());
+      std::vector<Node*> visited_consumer_nodes;
+      for(auto consumer_node : consumer_nodes){
+        if (visited_nodes.count(consumer_node) != 0){
+          visited_consumer_nodes.push_back(consumer_node);
+        }
+      }
+
+      if (input_consumer_stage_map.count(input) == 0) {
+        std::vector<PipelineStageNodeGroup> stage_node_group{PipelineStageNodeGroup(stage, visited_consumer_nodes)};
+        input_consumer_stage_map[input] = std::move(stage_node_group);

Would it work to do this in a single line:

input_consumer_stage_map[input] = {PipelineStageNodeGroup(stage, visited_consumer_nodes)};

...or is there a C++ type inference problem that prevents this? :)
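(For what it's worth, the one-liner does compile for a copyable element type: `operator[]` default-constructs the vector and then assigns it from the braced list. The sketch below uses a simplified stand-in type, since the real `PipelineStageNodeGroup` has a `const size_t` member that deletes copy assignment, which is exactly the kind of detail that could block `vector`'s initializer-list assignment; treat this as an illustration, not a verdict on the PR's type.)

```cpp
#include <map>
#include <string>
#include <vector>

// Simplified, hypothetical stand-in for PipelineStageNodeGroup
// (the real struct lives in the PR and has a const stage_id member).
struct StageGroup {
  size_t stage_id;
  std::vector<std::string> nodes;
  StageGroup(size_t stage, std::vector<std::string>& n)
      : stage_id(stage), nodes(std::move(n)) {}
};

// Builds the map with the single-line braced-initializer form from the
// review comment: operator[] default-constructs the mapped vector, then
// the braced list is assigned into it.
std::map<std::string, std::vector<StageGroup>> BuildStageMap() {
  std::map<std::string, std::vector<StageGroup>> stage_map;
  std::vector<std::string> consumers{"node_a", "node_b"};
  stage_map["weight0"] = {StageGroup(2, consumers)};
  return stage_map;
}
```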

xzhu1900

comment created time in 3 months

Pull request review comment microsoft/onnxruntime

support pipeline partition with shared initializer

+  // Map an given initializer to all the partitions that its consumer nodes reside. The size of
+  // the mapped vector reflects how many partitions this initializer's consumer nodes distribute.

Typo: an -> a

xzhu1900

comment created time in 3 months

Pull request review comment microsoft/onnxruntime

support pipeline partition with shared initializer

+    if (visited_nodes.count(node) == 0) {
+      visited_nodes.insert(node);

Nit: It's more efficient to write this as:

if (visited_nodes.insert(node).second) {
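(Background on the suggestion: `std::set::insert` returns a `pair<iterator, bool>` whose `.second` is `true` only when the element was newly added, so the `count` + `insert` pair, which performs two lookups, collapses into one. A minimal sketch; `CountNewInsertions` is a made-up helper for illustration:)

```cpp
#include <initializer_list>
#include <set>

// Counts how many values were newly inserted. insert() performs the
// membership test and the insertion in a single lookup; the returned
// pair's .second is true only when the value was not already present.
int CountNewInsertions(std::set<int>& visited, std::initializer_list<int> values) {
  int newly_added = 0;
  for (int v : values) {
    if (visited.insert(v).second) {
      ++newly_added;
    }
  }
  return newly_added;
}
```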
xzhu1900

comment created time in 3 months

Pull request review comment microsoft/onnxruntime

support pipeline partition with shared initializer

+  ORT_ENFORCE(node_groups.size() >= 2, "Initializer ", initializer->Name(),
+              " is not shared across stages. It only exits in partition: ", node_groups[0].stage_id);

Is it possible that node_groups could be empty here (and so node_groups[0] could be out of bounds)? (I'm not sure if this error check is more of an assertion/internal integrity check, or if the user can specify invalid input that would trigger this check.)
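(The hazard being raised: if the message arguments are only evaluated when the check fails, an empty node_groups would make the failing path itself index out of bounds via node_groups[0]. A hedged sketch of one way to make message construction safe; FirstStageForMessage is a hypothetical helper, not part of the PR:)

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: formats the first stage id for an error message
// without ever indexing into an empty vector.
std::string FirstStageForMessage(const std::vector<std::size_t>& stage_ids) {
  return stage_ids.empty() ? std::string("<none>")
                           : std::to_string(stage_ids.front());
}
```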

xzhu1900

comment created time in 3 months

Pull request review comment microsoft/onnxruntime

support pipeline partition with shared initializer

+    // Create a new node_arg for the recv, as the new node_arg from recv node should possess a differnet id
+    // than the one in send

Typo: differnet -> different.

xzhu1900

comment created time in 3 months

Pull request review comment microsoft/onnxruntime

support pipeline partition with shared initializer

+            // got nullptr as producer node. This could be because the input is a constant op which will be optimized
+            // away. Print out this information and continue.
+            // LOGS_DEFAULT(WARNING) << "Cannot find producer node for node_arg: " << node_arg.Name() << ". Skipping this node.";

Can you please delete the commented-out logging code?

xzhu1900

comment created time in 3 months

Pull request review comment microsoft/onnxruntime

support pipeline partition with shared initializer

+  PipelineStageNodeGroup(const size_t stage, std::vector<Node*>& node) : stage_id(stage), nodes(std::move(node)){};

Nit: Looks like a missing space before {. Can you please run clang-format again?

(Relatedly, it looks like a lot of the changes in the diff are from formatting. I'm not sure what the team convention is for formatting changes, and whether it should be a separate commit or not. Hopefully we could have something that enforces auto-formatting :).)

xzhu1900

comment created time in 3 months

issue comment microsoft/onnxruntime

run(output_names, input_feed, run_options)

The documentation for creating an onnxruntime.RunOptions object, which can be passed as the run_options parameter, is here:

https://microsoft.github.io/onnxruntime/python/api_summary.html#onnxruntime.RunOptions

However, I don't think that is the source of your error. This message:

onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Non-zero status code returned while running SkipLayerNormalization node. Name:'SkipLayerNorm_AddBias_25' Status Message: input is expected to have 3 dimensions, got 2

...indicates that there is a shape error somewhere in the model. Most likely, one of the arrays in input_feed has the wrong shape.
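As a quick way to debug this, you can inspect the rank of each input array before calling run. The sketch below uses NumPy with hypothetical names and shapes (the actual input names and dimensions depend on your model):

```python
import numpy as np

# Hypothetical input: 2-D (seq_len, hidden_size), where the
# SkipLayerNormalization node expects 3-D (batch, seq_len, hidden_size).
hidden_states = np.zeros((128, 768), dtype=np.float32)
print(hidden_states.ndim)  # 2 -> this would trigger the error above

# Adding a leading batch dimension restores the expected 3-D layout.
hidden_states = np.expand_dims(hidden_states, axis=0)
print(hidden_states.shape)  # (1, 128, 768)
```

Printing the shape of every array in input_feed this way usually pinpoints which one disagrees with the model's declared input ranks.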

Tony1236

comment created time in 3 months

Pull request review comment microsoft/onnxruntime

Add Optimizer classes to the new frontend

+class _OptimizerConfig(object):
+    r"""Base class for optimizer configuration
+
+    This private class is not an optimizer, but a means to configure existing ones from ORT backend.
+    Once the optimizer is configured, no user intervention is needed to update weights or zero gradients during training.
+    The 'parameter group' was inspired by `Pytorch <https://pytorch.org/docs/stable/optim.html#per-parameter-options>`_.
+
+    Args:
+        name (str): optimizer names.
+            One of 'SGDOptimizer', 'AdamOptimizer' and 'LambOptimizer'
+        hyper_parameters (dict): optimizer hyper-parameters applied to all model parameters.
+                                 Every optimizer must have a 'lr' entry on this dictionary.
+        param_groups (list of dict, default is []): list of parameters groups.
+            Each dict must contain a 'params' key with a list of model parameters that will
+            be optimized with the group's custom hyper-parameters values.
+            In other words, parameter groups override the default :py:attr:`.hyper_parameters` for specific model parameters
+
+    Example:
+
+    .. code-block:: python
+
+        lamb_optim = _OptimizerConfig(name = 'LambOptimizer',
+                                    hyper_parameters = {'lr': 0.001, 'alpha' : 0.01, 'beta' : 0.9},
+                                    param_groups = [ { 'params' : ['model_param_0', 'model_param1'],
+                                                       'epsilon' : 0.03, 'beta' : 0.5},
+                                                     { 'params' : ['model_param_2'],
+                                                       'alpha' : 0.04},
+                                                   ]
+                    )
+    """
+
+    def __init__(self, name, hyper_parameters, param_groups=[]):
+        assert isinstance(name, str), "'name' must be a string"
+        assert name in ['AdamOptimizer', 'LambOptimizer', 'SGDOptimizer'], \
+            "'name' must be one of 'AdamOptimizer', 'LambOptimizer' or 'SGDOptimizer'"
+        assert isinstance(hyper_parameters,
+                          dict), "'hyper_parameters' must be a dict"
+        assert 'lr' in hyper_parameters, "'hyper_parameters' must contain a {'lr' : positive number} entry"
+        assert (isinstance(hyper_parameters['lr'], float) or
+                isinstance(hyper_parameters['lr'], int)) and hyper_parameters['lr'] >= 0, "lr must be a positive number"
+        assert isinstance(param_groups, list), "'param_groups' must be a list"
+        for group in param_groups:
+            assert isinstance(group, dict) and len(group) > 1 and 'params' in group, \
+                ("Each dict inside 'param_groups' must contain a {'params' : [model parameter names]} entry"
+                 " and additional entries for custom hyper parameter values")
+            for k, _ in group.items():
+                if k != 'params':
+                    assert k in hyper_parameters, f"'param_groups' has 'k' hyper parameter not present at 'hyper_parameters'"
+
+        self.name = name
+        self.lr = hyper_parameters['lr']
+        self.hyper_parameters = hyper_parameters
+        self.param_groups = []
+
+        # TODO: monitor this for perf issues
+        # Maybe we don't have to do this to populate TrainingParameters,
+        # but it does make code easier to maintain
+        for param_group in param_groups:
+            self._add_param_group(param_group)
+
+    def _add_param_group(self, param_group):
+        r"""Add a parameter group to the :py:class:`_OptimizerConfig` s `param_groups`."""
+        assert isinstance(param_group, dict), "param group must be a dict"
+
+        # Each parameter group must have all hyper parameters set
+        for name, value in self.hyper_parameters.items():
+            if name not in param_group:
+                param_group.setdefault(name, value)
+
+        self.param_groups.append(param_group)
+
+
+class SGD(_OptimizerConfig):

I have a slight preference for optim.SGDConfig rather than the nested namespace (to defend against ambiguity if the user does a wildcard import or renames the imported module).
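To illustrate the ambiguity concern, here is a contrived, self-contained sketch; the classes below are stand-ins, not the real torch or ORT classes:

```python
# Suppose a user wildcard-imports both a training framework and this
# package. A bare name like SGD could then refer to either class:
class SGD:                      # stands in for e.g. torch.optim.SGD
    purpose = "an actual optimizer"

class SGDConfig:                # the flat, self-describing name
    purpose = "an ORT optimizer configuration"

# After two wildcard imports that both export SGD, the bare name
# silently resolves to whichever module was imported last, while
# SGDConfig remains unambiguous.
print(SGD.purpose)        # an actual optimizer
print(SGDConfig.purpose)  # an ORT optimizer configuration
```

The self-describing name also makes it obvious at call sites that the object is a configuration, not an optimizer that updates weights itself.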

thiagocrepaldi

comment created time in 3 months
