
tensorflow/models 66361

Models and examples built with TensorFlow

tensorflow/benchmarks 871

A benchmark framework for Tensorflow

annarev/tensorflow 1

Computation using data flow graphs for scalable machine learning

annarev/benchmarks 0

Benchmark code

annarev/community 0

Stores documents used by the TensorFlow developer community

annarev/convnet-benchmarks 0

Easy benchmarking of all publicly accessible implementations of convnets

annarev/models 0

Models built with TensorFlow

annarev/tensorboard 0

TensorFlow's Visualization Toolkit

issue comment tensorflow/tensorflow

Is SYCL on Tensorflow officially dead?

Hi Anna, what is the difference between https://cs.opensource.google/tensorflow/tensorflow/ (e.g. https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/c/experimental/stream_executor/stream_executor.h?q=stream_executor.h) and https://github.com/tensorflow ? Thank you, Kevin

https://github.com/tensorflow is the actual repo that contains the TF code; cs.opensource.google just provides a UI for browsing the TF GitHub code.

kevint324

comment created time in 4 days

issue comment tensorflow/tensorflow

Is SYCL on Tensorflow officially dead?

@rodburns I would be happy to help out with integrating with the new API. That being said, it is still a work in progress, but you can check the RFCs @sanjoy linked and let us know if you have any questions. The current StreamExecutor C API is checked in here: https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/c/experimental/stream_executor/stream_executor.h?q=stream_executor.h (note that Pluggable Device needs to be implemented before this API is useful; also, the API might still have some changes coming up.)

kevint324

comment created time in 5 days

pull request comment tensorflow/community

RFC: StreamExecutor C API

I updated the doc to add a way to use a custom allocator: https://github.com/tensorflow/community/commit/21be6e9065d5c3d8bc811be3f8e887e87276931f. The addition of a custom allocator was requested in the Pluggable Device RFC: https://github.com/tensorflow/community/pull/262

annarev

comment created time in 9 days

pull request comment tensorflow/community

RFC: Adding Pluggable Device For TensorFlow

Thanks @annarev. I think unified_memory_allocate/unified_memory_deallocate is still needed. AllocatorHandle is a higher-level API that allows plug-in authors to implement a specific cached allocator, the counterpart of BFCAllocator. But for plug-in authors who want to reuse the existing TensorFlow BFC allocator, the low-level allocation API in StreamExecutor (AllocateArray/UnifiedMemoryAllocate) is still needed. We can make it work like this: if a plug-in registers an allocator handle, TensorFlow proper will use that handle as its allocator; if not, the BFC allocator will be the default allocator and allocations will be forwarded to the StreamExecutor allocation API through a sub-allocator.

I wonder if it will get confusing with two separate ways to set up allocator functionality. We need to think of a clear way to annotate which one to use when in the API. But I will try it out and see how it looks.

I added a proposed implementation here: https://github.com/tensorflow/community/pull/257/commits/21be6e9065d5c3d8bc811be3f8e887e87276931f. This adds two sets of functions to SP_PlatformFns: create_allocator/destroy_allocator and create_custom_allocator/destroy_custom_allocator. I thought keeping the two options near each other would make it less confusing. @kulinseth, @jzhoulon PTAL.
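For illustration only, here is a rough sketch of how the two callback pairs could sit next to each other in SP_PlatformFns. The parameter and handle types (SP_Allocator, SP_CustomAllocator, SE_CreateAllocatorParams, SE_CreateCustomAllocatorParams) are hypothetical placeholders, and the struct is renamed so the sketch stands on its own; the actual signatures are in the linked commit.

```cpp
// Sketch only; the real signatures are in the linked commit.
typedef struct TF_Status TF_Status;
typedef struct SP_Platform SP_Platform;
typedef struct SP_Allocator SP_Allocator;                                      // hypothetical
typedef struct SP_CustomAllocator SP_CustomAllocator;                          // hypothetical
typedef struct SE_CreateAllocatorParams SE_CreateAllocatorParams;              // hypothetical
typedef struct SE_CreateCustomAllocatorParams SE_CreateCustomAllocatorParams;  // hypothetical

typedef struct SP_PlatformFnsSketch {
  // Default path (per the discussion above): the plug-in exposes raw
  // allocation callbacks and core TensorFlow wraps them with its BFC
  // allocator through a sub-allocator.
  void (*create_allocator)(const SP_Platform* platform,
                           SE_CreateAllocatorParams* params, TF_Status* status);
  void (*destroy_allocator)(const SP_Platform* platform, SP_Allocator* allocator);

  // Custom path: the plug-in registers its own allocator and bypasses the
  // BFC allocator entirely.
  void (*create_custom_allocator)(const SP_Platform* platform,
                                  SE_CreateCustomAllocatorParams* params,
                                  TF_Status* status);
  void (*destroy_custom_allocator)(const SP_Platform* platform,
                                   SP_CustomAllocator* allocator);
} SP_PlatformFnsSketch;
```

Keeping both pairs in one struct matches the stated goal of making the two options easy to compare side by side.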

jzhoulon

comment created time in 10 days

push event annarev/community

Anna Revinskaya

commit sha 21be6e9065d5c3d8bc811be3f8e887e87276931f

Custom and default allocator added

view details

push time in 10 days

issue comment tensorflow/tensorflow

tf.keras.callbacks.TensorBoard histogram summary breaks on tf.bool layer weights

I am not really familiar with the WriteHistogramSummary op. Adding +@alextp, who might know more about that op.

unsatcore

comment created time in 11 days

pull request comment tensorflow/community

RFC: TFRT Kernel Fallback

@ematejska This RFC is now ready to accept and merge.

annarev

comment created time in 12 days

pull request comment tensorflow/community

RFC: StreamExecutor C API

I am planning to add the following SP_DeviceDescription struct. I made some fields optional and some required. @wchao1115, @kulinseth, @jzhoulon, let me know if this looks reasonable (e.g., whether some other fields, such as pci_bus_id, should be optional as well, or whether some optional fields should be required).

Thanks! This looks good to me. Just to confirm, all the versioning information is subsumed in struct_size, and we don't need any extra fields in DeviceDescription for version compatibility. Is that correct?

Yes, we rely on struct_size and version value in SE_PlatformRegistrationParams.
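To make the struct_size convention concrete, here is a minimal, self-contained sketch. The SP_DeviceDescription layout and the optional pci_bus_id field are simplified for illustration, and TF_OFFSET_OF_END is redefined locally so the snippet compiles on its own.

```cpp
#include <stddef.h>

// Local stand-in for TensorFlow's TF_OFFSET_OF_END macro.
#define TF_OFFSET_OF_END(TYPE, MEMBER) \
  (offsetof(TYPE, MEMBER) + sizeof(((TYPE*)0)->MEMBER))

// Simplified, illustrative layout (not the real SP_DeviceDescription).
typedef struct SP_DeviceDescription {
  size_t struct_size;
  void* ext;
  const char* name;        // required
  const char* pci_bus_id;  // optional; imagine it was added in a later version
} SP_DeviceDescription;

// Core TensorFlow only reads a field if the caller's struct is large enough to
// contain it, so plug-ins built against an older header remain compatible.
static const char* GetPciBusId(const SP_DeviceDescription* desc) {
  if (desc->struct_size >= TF_OFFSET_OF_END(SP_DeviceDescription, pci_bus_id)) {
    return desc->pci_bus_id;
  }
  return NULL;  // the plug-in's struct predates this field
}
```

The version values in SE_PlatformRegistrationParams complement this check for changes that a size comparison alone cannot detect.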

annarev

comment created time in 15 days

pull request comment tensorflow/community

RFC: TFRT Kernel Fallback

Yes, I don't know all the internals, but I suppose it is a little bit hard to isolate the compile time of a good sample of kernels. In general, of course, ISO C++ guidance itself notes the increased compile time of templates, but we still need to check this in our empirical case, just so we don't become aware of it too late (once we have accumulated a significant number of registered kernels).

Quick update. I ran the build 3 times at head and 3 times with OpKernelContext and OpKernelConstruction updated to inherit from interfaces (our preferred approach right now is inheritance, so I didn't test templates). Time measurements at head (seconds): base = np.array([6776.767, 6595.033, 6926.461]). Time measurements when using inheritance (seconds): inheritance = np.array([7109.257, 6722.884, 6660.029]). The p-value (calculated with stats.ttest_ind(inheritance, base)) is quite high at 0.72. Therefore, most likely there is no build-time regression.
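As a rough sanity check on those numbers (stats.ttest_ind defaults to a pooled-variance two-sample t-test), the arithmetic works out approximately as follows:

$$\bar{x}_{\text{base}} \approx 6766.1,\quad \bar{x}_{\text{inheritance}} \approx 6830.7,\quad s_{\text{base}} \approx 166,\quad s_{\text{inheritance}} \approx 243$$

$$s_p = \sqrt{\tfrac{s_{\text{base}}^2 + s_{\text{inheritance}}^2}{2}} \approx 208,\qquad t = \frac{6830.7 - 6766.1}{s_p \sqrt{\tfrac{1}{3} + \tfrac{1}{3}}} \approx \frac{64.6}{170} \approx 0.38,\qquad p \approx 0.72 \;(\mathrm{df} = 4)$$

The difference in means is about 65 seconds (roughly 1%), well inside the run-to-run spread, which is why the test finds no significant build-time regression.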

annarev

comment created time in 16 days

push event annarev/community

Penporn Koanantakool

commit sha 22558548c56700a5ab3a38d6e937db1de2041bc1

Merge remote-tracking branch 'annarev/stream_executor_rfc'

view details

Penporn Koanantakool

commit sha 3bb5b4dca4ed67e79d42445b2b96d8b984bc064c

Update 20200612-stream-executor-c-api.md * Added the Implementation Conventions and Usage Overview section. * Replaced the Stability / User Impact with the Versioning Strategy and Stability section. Also moved the section higher. * Updated the API design according to the real implementation. * Updated code examples.

view details

Penporn Koanantakool

commit sha 9787cb128f421bf18b5bc29673c77f8f802c2265

Added/updated versioning contents based on the discussion on the PluggableDevice RFC (PR #262). * Referenced semver and TensorFlow's Version Compatibility page. * Added the Updating Guidelines, Convention, and Detecting Incompatibility sections. * Merged the current code examples with @yisitu's newer ones on PR #262 and simplified them a bit.

view details

Penporn Koanantakool

commit sha b3360a8959c5712f2e5ee5f2cac22e4a5e7b4700

Merge remote-tracking branch 'annarev/stream_executor_rfc'

view details

Penporn Koanantakool

commit sha d1ec35114f43ba66b04d4afb494f106d932f7bfe

Fixes according to annarev@'s comments. Also some more minor edits.

view details

Penporn Koanantakool

commit sha 1a6f2cf9a85e218caec00ec1ca5dd5789c07cced

Apply suggestions from code review Co-authored-by: Situ Yi <60493024+yisitu@users.noreply.github.com>

view details

Penporn Koanantakool

commit sha 0218043d6019af519adc7acd87b10a571089e4d0

Apply suggestions from code review Co-authored-by: Situ Yi <60493024+yisitu@users.noreply.github.com>

view details

Penporn Koanantakool

commit sha cc0f721764de733951438f9664ea05b61bf62e47

More clean-ups. Changed "shall" to "must" to make sure people would not misinterpret "shall".

view details

Anna Revinskaya

commit sha 2c53747396d1d3917f0801678c8378bfe97f2116

Merge pull request #5 from penpornk/master Updates based on recent changes and discussion on RFC PR #262

view details

push time in 16 days

PR merged annarev/community

Updates based on recent changes and discussion on RFC PR #262

Changes to the Main RFC File

  • Added the Implementation Conventions and Usage Overview section.
  • Replaced the Stability / User Impact section with the Versioning Strategy and Stability section. Also moved the section higher.
  • Updated the API design according to the real implementation.
  • Updated code examples.

Changes to the Versioning Strategy File:

Added/updated contents based on the discussion on the PluggableDevice RFC.

  • Referenced semver and TensorFlow's Version Compatibility page.
  • Added the Updating Guidelines, Convention, and Detecting Incompatibility sections.
  • Merged the current code examples with @yisitu's newer ones on PR #262 and simplified them a bit.

Specifically, addressed the following points in the comments on the PluggableDevice RFC:

  • Restructured the contents so the conventions/assumptions/procedures are clearer.
  • Showed that the struct size check works with struct members in any position, not just the last member of the struct.
  • Added more details on major version bump conventions. Specified that breaking changes require an RFC.
  • Explicitly mentioned that nontrivial deprecations (that cannot use zero initialization) need a major version bump.
  • Added a requirement that fields that treat 0 and NULL as invalid values must have explicit comments saying so, to ensure all plug-ins have consistent behavior, e.g., none of them is using 0 or NULL for special cases.
  • Updated the deprecation examples to use 'producer' and 'consumer' keywords (instead of plug-in and core TensorFlow) to cover more cases.
+713 -440

1 comment

2 changed files

penpornk

pr closed time in 16 days

Pull request review comment annarev/community

Updates based on recent changes and discussion on RFC PR #262

```diff
 typedef struct SP_StreamExecutor {
                            TF_Status* status);
 
   // Causes the host code to synchronously wait for the event to complete.
-  void (*block_host_for_event)(
-      SP_Device* executor, SP_Event event, TF_Status* status);
+  void (*block_host_for_event)(const SP_Device* device, SP_Event event,
+                               TF_Status* status);
+
+  // [Optional]
+  // Causes the host code to synchronously wait for operations entrained onto
+  // stream to complete. Effectively a join on the asynchronous device
+  // operations enqueued on the stream before this program point.
+  // If not set, then corresponding functionality will be implemented
+  // by registering an event on the `stream` and waiting for it using
+  // `block_host_for_event`.
+  void (*block_host_until_done)(const SP_Device* device, SP_Stream stream,
+                                TF_Status* status);
 
   // Synchronizes all activity occurring in the StreamExecutor's context (most
   // likely a whole device).
-  TF_BOOL (*synchronize_all_activity)(SP_Device* executor);
-
-  // Obtains metadata about the underlying device.
-  void (*fill_device_description)(SP_Device* executor,
-                                  SP_DeviceDescription* description,
-                                  TF_Status* status);
+  void (*synchronize_all_activity)(const SP_Device* device, TF_Status* status);
 
   // Enqueues on a stream a user-specified function to be run on the host.
-  TF_BOOL (*host_callback)(SP_Device* executor, SP_Stream* stream,
-                           TF_StatusCallbackFn callback_fn, void* ctx);
+  // `callback_arg` should be passed as the first argument to `callback_fn`.
```

All sound good to me. As a non-native speaker, I can't really tell the small differences between the three :) I just thought "will" sounds like the argument will be passed by itself. However, it is the plug-in's responsibility to make sure it is passed in.

penpornk

comment created time in 17 days


Pull request review comment annarev/community

Updates based on recent changes and discussion on RFC PR #262

 typedef struct SP_StreamExecutor {                            TF_Status* status);    // Causes the host code to synchronously wait for the event to complete.-  void (*block_host_for_event)(-      SP_Device* executor, SP_Event event, TF_Status* status);+  void (*block_host_for_event)(const SP_Device* device, SP_Event event,+                               TF_Status* status);++  // [Optional]+  // Causes the host code to synchronously wait for operations entrained onto+  // stream to complete. Effectively a join on the asynchronous device+  // operations enqueued on the stream before this program point.+  // If not set, then corresponding functionality will be implemented+  // by registering an event on the `stream` and waiting for it using+  // `block_host_for_event`.+  void (*block_host_until_done)(const SP_Device* device, SP_Stream stream,+                                TF_Status* status);    // Synchronizes all activity occurring in the StreamExecutor's context (most   // likely a whole device).-  TF_BOOL (*synchronize_all_activity)(SP_Device* executor);--  // Obtains metadata about the underlying device.-  void (*fill_device_description)(SP_Device* executor,-                                  SP_DeviceDescription* description,-                                  TF_Status* status);+  void (*synchronize_all_activity)(const SP_Device* device, TF_Status* status);    // Enqueues on a stream a user-specified function to be run on the host.-  TF_BOOL (*host_callback)(SP_Device* executor, SP_Stream* stream,-                           TF_StatusCallbackFn callback_fn, void* ctx);+  // `callback_arg` should be passed as the first argument to `callback_fn`.+  TF_Bool (*host_callback)(SP_Device* device, SP_Stream stream,+                           SE_StatusCallbackFn callback_fn, void* callback_arg); } SP_StreamExecutor; -#define SP_STREAMEXECUTOR_STRUCT_SIZE TF_OFFSET_OF_END(SP_StreamExecutor, host_callback)+#define SP_STREAMEXECUTOR_STRUCT_SIZE \+  TF_OFFSET_OF_END(SP_StreamExecutor, host_callback)++typedef struct SE_CreateStreamExecutorParams {+  size_t struct_size;+  void* ext;  // reserved for future use++  SP_StreamExecutor* stream_executor;  // output, to be filled by plugin+} SE_CreateStreamExecutorParams;++#define SE_CREATE_STREAM_EXECUTOR_PARAMS_STRUCT_SIZE \+  TF_OFFSET_OF_END(SE_CreateStreamExecutorParams, stream_executor)  typedef struct SP_Platform {   size_t struct_size;-  -  // Free form data set by plugin.-  void* ext;-  -  // Platform name++  void* ext;  // free-form data set by plugin++  // Platform name. Must be null-terminated.   const char* name;-  size_t name_len;-  -  // Device type name, for example GPU.-  char* type;-  size_t type_len;-  -  // Callbacks for creating/destroying.-  void (*create_device)(-      SP_Device* device,  \\ out-      SE_Options* options, \\ in-      TF_Status* status);  \\ out-  void (*destroy_device)(SP_Device* device);-  -  // Callbacks for creating/destroying SE_StreamExecutor.-  void (*create_stream_executor)(-      SP_StreamExecutor*,  \\ out-      TF_Status* status);  \\ out-  void (*destroy_stream_executor)(SP_StreamExecutor* stream_executor);++  // Device type name, for example GPU. 
Must be null-terminated.+  const char* type;++  // Number of visible devices+  size_t visible_device_count;++  // Whether this platform supports unified memory.+  // Unified memory is a single memory address space accessible from any device.+  TF_Bool supports_unified_memory; } SP_Platform; -#define SP_PLATFORM_SIZE TF_OFFSET_OF_END(SP_Platform, destroy_stream_executor)+#define SP_PLATFORM_STRUCT_SIZE \+  TF_OFFSET_OF_END(SP_Platform, supports_unified_memory)++typedef struct SP_PlatformFns {+  size_t struct_size;++  void* ext;  // reserved for future use++  // Callbacks for creating/destroying SP_Device.+  void (*create_device)(const SP_Platform* platform,+                        SE_CreateDeviceParams* params, TF_Status* status);++  // Clean up fields inside SP_Device that were allocated+  // by the plugin. `device` itself should not be deleted here.+  void (*destroy_device)(const SP_Platform* platform, SP_Device* device);++  // Callbacks for creating/destroying SP_StreamExecutor.+  void (*create_stream_executor)(const SP_Platform* platform,+                                 SE_CreateStreamExecutorParams* params,+                                 TF_Status* status);+  // Clean up fields inside SP_StreamExecutor that were allocated+  // by the plugin. `stream_executor` itself should not be deleted here.+  void (*destroy_stream_executor)(const SP_Platform* platform,+                                  SP_StreamExecutor* stream_executor);++  // Callbacks for creating/destroying SP_TimerFns.+  void (*create_timer_fns)(const SP_Platform* platform, SP_TimerFns* timer,+                           TF_Status* status);++  void (*destroy_timer_fns)(const SP_Platform* platform,+                            SP_TimerFns* timer_fns);+} SP_PlatformFns;++#define SP_PLATFORM_FNS_STRUCT_SIZE \+  TF_OFFSET_OF_END(SP_PlatformFns, destroy_timer_fns)  typedef struct SE_PlatformRegistrationParams {   size_t struct_size;-  void* ext;-  +  void* ext;  // reserved for future use+   // StreamExecutor C API version.   int32_t major_version;   int32_t minor_version;   int32_t patch_version;-  -  // Must be filled by the plugin.-  SP_Platform platform;  // out++  SP_Platform* platform;         // output, set by plugin+  SP_PlatformFns* platform_fns;  // output, set by plugin+  // Clean up fields inside SP_Platform that were allocated+  // by the plugin. `platform` itself should not be deleted here.+  void (*destroy_platform)(SP_Platform* platform);  // out, set by plugin+  void (*destroy_platform_fns)(+      SP_PlatformFns* platform_fns);  // out, set by plugin } SE_PlatformRegistrationParams; -#define SE_PLATFORM_REGISTRATION_PARAMS_SIZE TF_OFFSET_OF_END(SE_PlatformRegistrationParams, platform)+#define SE_PLATFORM_REGISTRATION_PARAMS_STRUCT_SIZE \+  TF_OFFSET_OF_END(SE_PlatformRegistrationParams, destroy_platform_fns) -void SE_InitializePlugin(SE_PlatformRegistrationParams* params, TF_Status* status);+void SE_InitPlugin(SE_PlatformRegistrationParams* params, TF_Status* status);  #ifdef __cplusplus-} // extern "C"+}  // extern "C" #endif ``` -## Registration implementation+### PlatformId -Registration will be implemented by registering a new StreamExecutor platform as well as a new TensorFlow device with [DeviceFactory](https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/core/common_runtime/device_factory.h;l=30?q=DeviceFactory).+StreamExecutor [Platform](https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/stream_executor/platform.h;l=114) has an id parameter. 
This parameter will be hidden from the C API and set+internally by TensorFlow instead. ++## Usage Example+Code example for [PluggableDevice](https://github.com/tensorflow/community/pull/262)+registration outlined in the [Usage Overview](#Usage overview) section.++### Core TensorFlow ```cpp-typedef (*SEPluginInitFn)(SE_PlatformRegistrationParams*, TF_Status*);+typedef void (*SEInitPluginFn)(SE_PlatformRegistrationParams*, TF_Status*); ... -void* plugin = dlopen("myplugin.so", ...);-if (!plugin) {-  ... output error and skip this plugin ...-}-void* initialize_sym = dlsym(plugin, "SE_InitializePlugin");+void* initialize_sym = dlsym(plugin_dso_handle, "SE_InitPlugin"); if (!initialize_sym) {-  ... output error and skip this plugin ...+  // Output error and skip this plug-in. }-SEPluginInitFn initialize_fn = reinterpret_cast<SEPluginInitFn>(initialize_sym);+SEInitPluginFn initialize_fn = reinterpret_cast<SEInitPluginFn>(initialize_sym);  SE_PlatformRegistrationParams params; TF_Status* status = TF_NewStatus();

Sounds good.

penpornk

comment created time in 17 days


Pull request review comment annarev/community

Updates based on recent changes and discussion on RFC PR #262

 typedef struct SP_StreamExecutor {                            TF_Status* status);    // Causes the host code to synchronously wait for the event to complete.-  void (*block_host_for_event)(-      SP_Device* executor, SP_Event event, TF_Status* status);+  void (*block_host_for_event)(const SP_Device* device, SP_Event event,+                               TF_Status* status);++  // [Optional]+  // Causes the host code to synchronously wait for operations entrained onto+  // stream to complete. Effectively a join on the asynchronous device+  // operations enqueued on the stream before this program point.+  // If not set, then corresponding functionality will be implemented+  // by registering an event on the `stream` and waiting for it using+  // `block_host_for_event`.+  void (*block_host_until_done)(const SP_Device* device, SP_Stream stream,+                                TF_Status* status);    // Synchronizes all activity occurring in the StreamExecutor's context (most   // likely a whole device).-  TF_BOOL (*synchronize_all_activity)(SP_Device* executor);--  // Obtains metadata about the underlying device.-  void (*fill_device_description)(SP_Device* executor,-                                  SP_DeviceDescription* description,-                                  TF_Status* status);+  void (*synchronize_all_activity)(const SP_Device* device, TF_Status* status);    // Enqueues on a stream a user-specified function to be run on the host.-  TF_BOOL (*host_callback)(SP_Device* executor, SP_Stream* stream,-                           TF_StatusCallbackFn callback_fn, void* ctx);+  // `callback_arg` should be passed as the first argument to `callback_fn`.+  TF_Bool (*host_callback)(SP_Device* device, SP_Stream stream,+                           SE_StatusCallbackFn callback_fn, void* callback_arg); } SP_StreamExecutor; -#define SP_STREAMEXECUTOR_STRUCT_SIZE TF_OFFSET_OF_END(SP_StreamExecutor, host_callback)+#define SP_STREAMEXECUTOR_STRUCT_SIZE \+  TF_OFFSET_OF_END(SP_StreamExecutor, host_callback)++typedef struct SE_CreateStreamExecutorParams {+  size_t struct_size;+  void* ext;  // reserved for future use++  SP_StreamExecutor* stream_executor;  // output, to be filled by plugin+} SE_CreateStreamExecutorParams;++#define SE_CREATE_STREAM_EXECUTOR_PARAMS_STRUCT_SIZE \+  TF_OFFSET_OF_END(SE_CreateStreamExecutorParams, stream_executor)  typedef struct SP_Platform {   size_t struct_size;-  -  // Free form data set by plugin.-  void* ext;-  -  // Platform name++  void* ext;  // free-form data set by plugin++  // Platform name. Must be null-terminated.   const char* name;-  size_t name_len;-  -  // Device type name, for example GPU.-  char* type;-  size_t type_len;-  -  // Callbacks for creating/destroying.-  void (*create_device)(-      SP_Device* device,  \\ out-      SE_Options* options, \\ in-      TF_Status* status);  \\ out-  void (*destroy_device)(SP_Device* device);-  -  // Callbacks for creating/destroying SE_StreamExecutor.-  void (*create_stream_executor)(-      SP_StreamExecutor*,  \\ out-      TF_Status* status);  \\ out-  void (*destroy_stream_executor)(SP_StreamExecutor* stream_executor);++  // Device type name, for example GPU. 
Must be null-terminated.+  const char* type;++  // Number of visible devices+  size_t visible_device_count;++  // Whether this platform supports unified memory.+  // Unified memory is a single memory address space accessible from any device.+  TF_Bool supports_unified_memory; } SP_Platform; -#define SP_PLATFORM_SIZE TF_OFFSET_OF_END(SP_Platform, destroy_stream_executor)+#define SP_PLATFORM_STRUCT_SIZE \+  TF_OFFSET_OF_END(SP_Platform, supports_unified_memory)++typedef struct SP_PlatformFns {+  size_t struct_size;++  void* ext;  // reserved for future use++  // Callbacks for creating/destroying SP_Device.+  void (*create_device)(const SP_Platform* platform,+                        SE_CreateDeviceParams* params, TF_Status* status);++  // Clean up fields inside SP_Device that were allocated+  // by the plugin. `device` itself should not be deleted here.+  void (*destroy_device)(const SP_Platform* platform, SP_Device* device);++  // Callbacks for creating/destroying SP_StreamExecutor.+  void (*create_stream_executor)(const SP_Platform* platform,+                                 SE_CreateStreamExecutorParams* params,+                                 TF_Status* status);+  // Clean up fields inside SP_StreamExecutor that were allocated+  // by the plugin. `stream_executor` itself should not be deleted here.+  void (*destroy_stream_executor)(const SP_Platform* platform,+                                  SP_StreamExecutor* stream_executor);++  // Callbacks for creating/destroying SP_TimerFns.+  void (*create_timer_fns)(const SP_Platform* platform, SP_TimerFns* timer,+                           TF_Status* status);++  void (*destroy_timer_fns)(const SP_Platform* platform,+                            SP_TimerFns* timer_fns);+} SP_PlatformFns;++#define SP_PLATFORM_FNS_STRUCT_SIZE \+  TF_OFFSET_OF_END(SP_PlatformFns, destroy_timer_fns)  typedef struct SE_PlatformRegistrationParams {   size_t struct_size;-  void* ext;-  +  void* ext;  // reserved for future use+   // StreamExecutor C API version.   int32_t major_version;   int32_t minor_version;   int32_t patch_version;-  -  // Must be filled by the plugin.-  SP_Platform platform;  // out++  SP_Platform* platform;         // output, set by plugin+  SP_PlatformFns* platform_fns;  // output, set by plugin+  // Clean up fields inside SP_Platform that were allocated+  // by the plugin. `platform` itself should not be deleted here.+  void (*destroy_platform)(SP_Platform* platform);  // out, set by plugin+  void (*destroy_platform_fns)(+      SP_PlatformFns* platform_fns);  // out, set by plugin } SE_PlatformRegistrationParams; -#define SE_PLATFORM_REGISTRATION_PARAMS_SIZE TF_OFFSET_OF_END(SE_PlatformRegistrationParams, platform)+#define SE_PLATFORM_REGISTRATION_PARAMS_STRUCT_SIZE \+  TF_OFFSET_OF_END(SE_PlatformRegistrationParams, destroy_platform_fns) -void SE_InitializePlugin(SE_PlatformRegistrationParams* params, TF_Status* status);+void SE_InitPlugin(SE_PlatformRegistrationParams* params, TF_Status* status);  #ifdef __cplusplus-} // extern "C"+}  // extern "C" #endif ``` -## Registration implementation+### PlatformId -Registration will be implemented by registering a new StreamExecutor platform as well as a new TensorFlow device with [DeviceFactory](https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/core/common_runtime/device_factory.h;l=30?q=DeviceFactory).+StreamExecutor [Platform](https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/stream_executor/platform.h;l=114) has an id parameter. 
This parameter will be hidden from the C API and set+internally by TensorFlow instead. ++## Usage Example+Code example for [PluggableDevice](https://github.com/tensorflow/community/pull/262)+registration outlined in the [Usage Overview](#Usage overview) section.++### Core TensorFlow ```cpp-typedef (*SEPluginInitFn)(SE_PlatformRegistrationParams*, TF_Status*);+typedef void (*SEInitPluginFn)(SE_PlatformRegistrationParams*, TF_Status*); ... -void* plugin = dlopen("myplugin.so", ...);-if (!plugin) {-  ... output error and skip this plugin ...-}-void* initialize_sym = dlsym(plugin, "SE_InitializePlugin");+void* initialize_sym = dlsym(plugin_dso_handle, "SE_InitPlugin"); if (!initialize_sym) {-  ... output error and skip this plugin ...+  // Output error and skip this plug-in. }-SEPluginInitFn initialize_fn = reinterpret_cast<SEPluginInitFn>(initialize_sym);+SEInitPluginFn initialize_fn = reinterpret_cast<SEInitPluginFn>(initialize_sym);  SE_PlatformRegistrationParams params; TF_Status* status = TF_NewStatus();

`TF_Status* status = TF_NewStatus();` is correct. But `initialize_fn(&params, &status);` should instead be changed to `initialize_fn(&params, status);`.
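For reference, a minimal sketch of the corrected call sequence; it assumes the declarations from the RFC's usage example, and the error handling shown is abbreviated.

```cpp
// Corrected registration sketch: pass the TF_Status* directly.
SE_PlatformRegistrationParams params;
params.struct_size = SE_PLATFORM_REGISTRATION_PARAMS_STRUCT_SIZE;

TF_Status* status = TF_NewStatus();
initialize_fn(&params, status);  // not &status
if (TF_GetCode(status) != TF_OK) {
  // Report the error and skip this plug-in.
}
TF_DeleteStatus(status);
```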

penpornk

comment created time in 17 days


Pull request review comment annarev/community

Updates based on recent changes and discussion on RFC PR #262

```diff
 typedef struct SP_StreamExecutor {
                            TF_Status* status);
 
   // Causes the host code to synchronously wait for the event to complete.
-  void (*block_host_for_event)(
-      SP_Device* executor, SP_Event event, TF_Status* status);
+  void (*block_host_for_event)(const SP_Device* device, SP_Event event,
+                               TF_Status* status);
+
+  // [Optional]
+  // Causes the host code to synchronously wait for operations entrained onto
+  // stream to complete. Effectively a join on the asynchronous device
+  // operations enqueued on the stream before this program point.
+  // If not set, then corresponding functionality will be implemented
+  // by registering an event on the `stream` and waiting for it using
+  // `block_host_for_event`.
+  void (*block_host_until_done)(const SP_Device* device, SP_Stream stream,
+                                TF_Status* status);
 
   // Synchronizes all activity occurring in the StreamExecutor's context (most
   // likely a whole device).
-  TF_BOOL (*synchronize_all_activity)(SP_Device* executor);
-
-  // Obtains metadata about the underlying device.
-  void (*fill_device_description)(SP_Device* executor,
-                                  SP_DeviceDescription* description,
-                                  TF_Status* status);
+  void (*synchronize_all_activity)(const SP_Device* device, TF_Status* status);
 
   // Enqueues on a stream a user-specified function to be run on the host.
-  TF_BOOL (*host_callback)(SP_Device* executor, SP_Stream* stream,
-                           TF_StatusCallbackFn callback_fn, void* ctx);
+  // `callback_arg` should be passed as the first argument to `callback_fn`.
+  TF_Bool (*host_callback)(SP_Device* device, SP_Stream stream,
+                           SE_StatusCallbackFn callback_fn, void* callback_arg);
 } SP_StreamExecutor;
 
-#define SP_STREAMEXECUTOR_STRUCT_SIZE TF_OFFSET_OF_END(SP_StreamExecutor, host_callback)
+#define SP_STREAMEXECUTOR_STRUCT_SIZE \
+  TF_OFFSET_OF_END(SP_StreamExecutor, host_callback)
+
+typedef struct SE_CreateStreamExecutorParams {
+  size_t struct_size;
+  void* ext;  // reserved for future use
+
+  SP_StreamExecutor* stream_executor;  // output, to be filled by plugin
+} SE_CreateStreamExecutorParams;
+
+#define SE_CREATE_STREAM_EXECUTOR_PARAMS_STRUCT_SIZE \
+  TF_OFFSET_OF_END(SE_CreateStreamExecutorParams, stream_executor)
 
 typedef struct SP_Platform {
   size_t struct_size;
-  
-  // Free form data set by plugin.
-  void* ext;
-  
-  // Platform name
+
+  void* ext;  // free-form data set by plugin
```

SP_Platform is populated by the plug-in. Therefore, void* ext is meant to be a field that the plug-in can set to anything it wants. It is not reserved for future use.
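A hypothetical illustration of that intent; MyPluginState and the values are made up, and the sketch assumes the SP_Platform, SE_PlatformRegistrationParams, SE_CreateDeviceParams, and SE_InitPlugin declarations from the RFC.

```cpp
// The plug-in stashes its own state in SP_Platform::ext and reads it back
// in a later callback. Core TensorFlow never interprets this pointer.
typedef struct MyPluginState { int driver_version; } MyPluginState;
static MyPluginState my_state = {451};

void SE_InitPlugin(SE_PlatformRegistrationParams* params, TF_Status* status) {
  params->platform->ext = &my_state;  // anything the plug-in wants
  // ... fill in the rest of SP_Platform and SP_PlatformFns here ...
}

static void my_create_device(const SP_Platform* platform,
                             SE_CreateDeviceParams* params, TF_Status* status) {
  MyPluginState* state = (MyPluginState*)platform->ext;
  // Use state->driver_version while setting up the device.
  (void)state;
}
```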

penpornk

comment created time in 17 days

Pull request review comment annarev/community

Updates based on recent changes and discussion on RFC PR #262

```diff
 typedef struct SP_StreamExecutor {
                            TF_Status* status);
 
   // Causes the host code to synchronously wait for the event to complete.
-  void (*block_host_for_event)(
-      SP_Device* executor, SP_Event event, TF_Status* status);
+  void (*block_host_for_event)(const SP_Device* device, SP_Event event,
+                               TF_Status* status);
+
+  // [Optional]
+  // Causes the host code to synchronously wait for operations entrained onto
+  // stream to complete. Effectively a join on the asynchronous device
+  // operations enqueued on the stream before this program point.
+  // If not set, then corresponding functionality will be implemented
+  // by registering an event on the `stream` and waiting for it using
+  // `block_host_for_event`.
+  void (*block_host_until_done)(const SP_Device* device, SP_Stream stream,
+                                TF_Status* status);
 
   // Synchronizes all activity occurring in the StreamExecutor's context (most
   // likely a whole device).
-  TF_BOOL (*synchronize_all_activity)(SP_Device* executor);
-
-  // Obtains metadata about the underlying device.
-  void (*fill_device_description)(SP_Device* executor,
-                                  SP_DeviceDescription* description,
-                                  TF_Status* status);
+  void (*synchronize_all_activity)(const SP_Device* device, TF_Status* status);
 
   // Enqueues on a stream a user-specified function to be run on the host.
-  TF_BOOL (*host_callback)(SP_Device* executor, SP_Stream* stream,
-                           TF_StatusCallbackFn callback_fn, void* ctx);
+  // `callback_arg` should be passed as the first argument to `callback_fn`.
```

This is an instruction for the plugin author. Therefore "should" or "must" seems better.

penpornk

comment created time in 17 days


Pull request review comment tensorflow/community

RFC: TFRT Kernel Fallback

+# TFRT Kernel Fallback++| Status        | Proposed                                                |+| :------------ | :------------------------------------------------------ |+| **RFC #**     | [266](https://github.com/tensorflow/community/pull/266) |+| **Author(s)** | Anna Revinskaya (annarev@google.com), Jeremy Lau (lauj@google.com) |+| **Sponsor**   | Jeremy Lau (lauj@google.com)                            |+| **Updated**   | 2020-07-16                                              |++## Objective++This proposal focuses on getting a majority of "well-behaved" TensorFlow ops running efficiently on+mobile devices by removing the need to execute them via the TensorFlow eager runtime, instead+calling kernels directly from the new [TFRT](https://github.com/tensorflow/runtime) TensorFlow runtime.++Note that there is an effort to call existing kernels by delegating to+TensorFlow eager runtime instead. This approach is called Runtime Fallback and+corresponding RFC will be published soon. The goals of the two fallback+mechanisms are as follows:++*   Runtime Fallback aims to reuse all current TensorFlow kernels in TFRT.+*   Kernel Fallback (focus of this document) aims to get a large number of+    existing kernels working in TFRT while reducing binary size to support+    mobile devices.++| Runtime Fallback        |  Kernel Fallback |+:------------------------:|:-----------------:+<img src="https://github.com/annarev/community/blob/tfrt_kernel_fallback_rfc/rfcs/20200712-tfrt-kernel-fallback/RuntimeFallbackHighLevelDiagram.png" alt="Runtime Fallback high-level diagram." width="150px"> | <img src="https://github.com/annarev/community/blob/tfrt_kernel_fallback_rfc/rfcs/20200712-tfrt-kernel-fallback/KernelFallbackHighLevelDiagram.png" alt="Kernel Fallback high-level diagram." width="150px">++## Goals++High level goals of the project:++*   Call existing kernels from new TensorFlow runtime+*   Reduce size and overhead to make this a feasible option for mobile++We address the first goal by implementing a new fallback mechanism that directly+calls TensorFlow kernels without going through Eager runtime first. We plan to+address the second high level goal by trimming down dependencies, switching to+more compact proto representation, etc.++### Op Coverage Goals++First of all, we plan to target all the easier-to-support ops that don’t require+implementing extensive pieces of infrastructure, but at the same time provide+the most value to the TF Lite team.++We analysed how many kernels we can support in the future and include our+findings in the following spreadsheets. As we describe in+[Design Proposal](#design-proposal) below, Kernel Fallback depends on+customizing+[OpKernelConstruction](https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/core/framework/op_kernel.h;l=256?q=OpKernelConstruction)+and+[OpKernelContext](https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/core/framework/op_kernel.h;l=584?q=OpKernelContext&ss=tensorflow%2Ftensorflow)+classes. Number of supported kernels will depend on the surface we manage to+customize. (Note that I have already started prototyping the implementation that+includes a few common methods such as `input`, `output`. 
The spreadsheet below+considers these methods to be already *supported*).++*   List of kernels and `OpKernelConstruction`/`OpKernelContext` methods they+    require:+    [here](https://github.com/annarev/community/blob/tfrt_kernel_fallback_rfc/rfcs/20200712-tfrt-kernel-fallback/support_priority.md)+*   Proposed implementation order for these methods:+    [here](https://github.com/annarev/community/blob/tfrt_kernel_fallback_rfc/rfcs/20200712-tfrt-kernel-fallback/kernel_to_unsupported.md)++Based on these estimates, we can support >= 423 kernels. Note that this number+is just based on the `OpKernelConstruction`/`OpKernelContext` coverage that we+can provide. It doesn't take into consideration other issues we might face.++### TFRT Integration Goals++We want to support executing a [BEF](https://github.com/tensorflow/runtime/blob/master/documents/binary_executable_format.md ) file+on mobile device that calls kernels using Kernel Fallback mechanism. Users will+be able to generate a BEF file based on a saved model and we will provide a+script to create it.++We might also want to support running ops using TFRT eager mode (that is, add a+custom+[OpHandler](https://github.com/tensorflow/runtime/blob/3c7a1ea02c87325f1b47aebb24b3ca6e84e7e7e7/include/tfrt/core_runtime/op_handler.h#L47)).++## Non-goals++*   Supporting all existing ops. `OpKernelContext` surface is quite large and+    implementing all of it would require a significant amount of time. Instead,+    we will start by adding most common and easy functionality. If certain+    functionality is only used by a handful of kernels, it might make sense to+    implement TFRT native kernels or rely on runtime fallback instead. One notable example is+    [ResourceMgr](https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/core/framework/resource_mgr.h;l=152?q=ResourceMgr).+    We might support it later, but it is definitely not first priority due to+    extra effort required.+*   Gradients would not be supported by the first iteration of Kernel Fallback,+    but we might revisit it later.+*   Exact details of TFRT integration are still being worked out by TFRT and TF+    Lite teams. Since these teams might change the plan, exact details are not a+    part of this doc. The take away is that we will integrate kernel fallback+    following the approach they decide on.++## Motivation++Currently, [TF Lite](https://www.tensorflow.org/lite) supports a+[limited set of ops](https://www.tensorflow.org/lite/guide/ops_compatibility).+As the range and variety of applications grows, it becomes essential to grow the+pool of available ops on mobile devices, ideally supporting everything that fully-fledged+TensorFlow supports now.++However, supporting TensorFlow ops on mobile devices presents some challenges.+Specifically, binary size on mobile platforms should be restricted. TF Lite team+provided us with the following *ideal* numbers:++*   100-200k overhead to call TF kernels+*   20k / kernel marginal size++To get closer to the size restrictions we plan to define a call path from TFRT+to TensorFlow kernels that minimizes the amount of generated code.++## User Benefit++Running more kernels on mobile devices would allow TensorFlow users to implement+a wider range of models for mobile devices. 
Reduced binary size will also benefit users that currently use TensorFlow Lite's experimental [TensorFlow Select ops] (https://www.tensorflow.org/lite/guide/ops_select), or users that do not use the experimental feature because of that reason.++## Design Proposal++We propose to call the kernel’s Compute method directly from+[TFRT](https://github.com/tensorflow/runtime) without going through TensorFlow+Eager C API first. We introduce kernel context and registration implementation that support core+kernel functionality with minimal dependencies.++High-level diagram of the proposed design:++<img src="https://github.com/annarev/community/blob/tfrt_kernel_fallback_rfc/rfcs/20200712-tfrt-kernel-fallback/KernelFallbackDiagram.png" alt="Kernel Fallback design diagram." width="400px">++## Kernel registration++We will use a separate registry for kernels supported by TFRT forwarding. To do+so, we will define a `TFRTOpKernelFactories` class that would keep a map from+kernel name to a list of registrations.++```cpp+class TFRTOpKernelFactories {+ public:+  TFRTOpKernelFactories();+  void RegisterFactory(StringPiece kernel_class_name,+                       TFRTOpKernelReg kernel_info);++  // Creates a kernel with the given name and passes op_kernel_construction+  // to kernel constructor.+  // Returns the constructed kernel on success.+  // In case of failure, returns a nullptr. Kernel creation can fail in one+  // of the following cases:+  //   1. Kernel with the given name is not found.+  //   2. Attributes in op_kernel_construction don't match type constraints+  //      for any of the kernels with this name.+  //      Note that we consider a constraint to be "not matched" if the attribute+  //      it applies to is not in op_kernel_construction.+  std::unique_ptr<TFRTOpKernel> CreateKernel(+      StringPiece kernel_class_name,+      TFRTOpKernelConstruction* op_kernel_construction) const;++ private:+  llvm::StringMap<std::vector<TFRTOpKernelReg>> factories_;+};++extern llvm::ManagedStatic<TFRTOpKernelFactories> fallback_kernel_factories;+```++Similar to the current TensorFlow kernel registration, we will introduce a+registration macro that adds a kernel to `TFRTOpKernelFactories`.++```cpp+#define REGISTER_FALLBACK_KERNEL(name, ...) \+  REGISTER_FALLBACK_KERNEL_UNIQ_HELPER(__COUNTER__, name, __VA_ARGS__)++#define REGISTER_FALLBACK_KERNEL_UNIQ_HELPER(ctr, name, ...) \+  REGISTER_FALLBACK_KERNEL_UNIQ(ctr, name, __VA_ARGS__)++#define REGISTER_FALLBACK_KERNEL_UNIQ(ctr, name, ...)             \+  static bool global_fallback_kernel_##ctr##_registered_ = []() { \+    ::tensorflow::fallback_kernel_factories->RegisterFactory(     \+        name, TFRTOpKernelReg([](TFRTOpKernelConstruction* construction) \+                                  -> std::unique_ptr<TFRTOpKernel> {     \+          return std::make_unique<__VA_ARGS__>(construction);            \+        }));                                                             \+    return true;                                                         \+  }();+```++## Op registration++To support type specification, we will also provide a minimal Op registry and+corresponding macro `REGISTER_KERNEL_FALLBACK_OP`. 
Sample implementation:++```cpp+// TFRTOpMetaBuilder class will provide ways to set input, output and+// attribute specifications.+class TFRTOpMetaBuilder {+ public:+  explicit TFRTOpMetaBuilder(StringPiece op_name);+  TFRTOpMetaBuilder& Output(StringPiece output_spec);+  ...+};++// Registration will add the op to a static map.+class TFRTOpRegisterer {+ public:+  TFRTOpRegisterer(const TFRTOpMetaBuilder& op_builder);+};++#define REGISTER_KERNEL_FALLBACK_OP(name) \+  REGISTER_KERNEL_FALLBACK_OP_UNIQ(__COUNTER__, name)++#define REGISTER_KERNEL_FALLBACK_OP_UNIQ(ctr, name)                         \+  static TFRTOpRegisterer global_fallback_op_meta_builder_##ctr##_ = \+      TFRTOpMetaBuilder(name)+```++Usage example:+```cpp+REGISTER_KERNEL_FALLBACK_OP("AddN").Output("out: int32");+```++## Kernel implementation++TensorFlow kernels inherit from the+[OpKernel](https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/core/framework/op_kernel.h;l=82?q=opkernel)+class and depend on two key classes:+[OpKernelConstruction](https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/core/framework/op_kernel.h;l=256?q=opkernel)+and+[OpKernelContext](https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/core/framework/op_kernel.h;l=584?q=opkernel).+We want to provide custom implementations of these two classes in terms of data+we get from TFRT (for e.g. inputs, attributes).++There are two main approaches to customize class implementations:++*   Use inheritance and define common interfaces.+*   Use templates.++We ran multiple benchmarks to get an idea of the trade offs between inheritance+and templating approaches. Key findings are summarized below:++*   Time difference negligible for full model benchmarks.+*   A simple scalar op benchmark with Kernel Fallback (runs scalar+    multiplication, division, addition) was only 0.3% slower on mobile with+    inheritance compared to templates. The benchmark was run on a real device (Pixel 3) with ABI: arm64-v8a and SDK version: 29.+*   [basic\_ops\_benchmark](https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/core/kernels/basic_ops_benchmark_test.cc?q=basic_ops_benchmark_test)+    with inheritance is significantly slower: ~7% (median) or ~19% (mean)+    (running on Linux). Note that this difference was measured *without* Kernel+    Fallback. Adding inheritance would impact all existing TensorFlow kernels+    even those that don't support Kernel Fallback.+*   Binary size increase when using templates compared to inheritance is+    estimated at 2.6% (based on adding `AddN` op).+    +Right now, we are leaning towards using inheritance. Seems like time increase is+only significant for running many scalar ops in a sequence - probably a rare use+case in the real world. (See more details in [Appendix 2](#appendix-2-extension-options))++To use inheritance, we will define `OpKernelConstructionInterface` and+`OpKernelContextInterface` interfaces. Ideally, these interfaces should be pure+virtual. However, we will have one exception - templated `eigen_device` method+that calls per-device pure-virtual implementations.++We will then introduce `TFRTOpKernelConstruction` and `TFRTOpKernelContext`+subclasses that implement `OpKernelConstructionInterface` and+`OpKernelContextInterface` in terms of TFRT data structures. 
Example how+`TFRTOpKernelConstruction` might look like:++```cpp+class TFRTOpKernelConstruction : public OpKernelConstructionInterface {+ public:+  explicit TFRTOpKernelConstruction(AttrMap attributes);+  ~TFRTOpKernelConstruction() override {};++  Status GetAttr(StringPiece attr_name, int32* value) const override;+  Status GetAttr(StringPiece attr_name, DataType* value) const override;++  void CtxFailure(const Status& s);+  void CtxFailureWithWarning(const Status& s);+  void CtxFailure(const char* file, int line, const Status& s);+  void CtxFailureWithWarning(const char* file, int line, const Status& s);+  ...+};+```++When running Kernel Fallback, we instantiate the kernel interfaces with TFRT’s lightweight+OpKernel definitions, rather than TensorFlow’s+[heavyweight OpKernel definitions](https://cs.opensource.google/android/platform/superproject/+/master:external/tensorflow/tensorflow/core/framework/op_kernel.h;l=612?q=opkernelcontext)+for example.++Example `AddN` kernel implementation using these new interfaces:++```cpp+class AddNOp : public OpKernelBase {+ public:+   explicit AddNOp(OpKernelConstructionInterface* construction) :+       OpKernelBase(construction) {}++   void Compute(OpKernelContextInterface* ctx) override {+     if(!ctx->ValidateInputsAreSameShape(this)) return;+     ...+```++Here, `OpKernelBase` implementation will be minimal:++```cpp+class OpKernelBase {+ public:+  explicit OpKernelBase(OpKernelConstructionInterface* context) {+  }+  virtual ~OpKernelBase() {}+  virtual void Compute(OpKernelContextInterface* context) = 0;+};+```++(For details how extending from `OpKernelBase` instead of `OpKernel` would work+with current TensorFlow runtime see [Appendix 1](#appendix-1-kernel-wrapper))++Corresponding .cc file then registers the kernel using the correct kernel and+context classes. For example, this is how we register `AddN` kernel with TFRT:++```cpp+REGISTER_FALLBACK_KERNEL( "AddN", AddNOp<CPUDevice, int32>);+```++## Calling kernel++We add a new TFRT BEF kernel called `tfrt_fallback.kernel_fallback`. This kernel directly+calls a TF kernel’s `Compute` method by creating `TFRTOpKernel*` data structures+that forward to corresponding TFRT concepts. For example, the following code+accesses an input in `llvm::ArrayRef<tfrt::RCReference<tfrt::AsyncValue>>` which+we get from TFRT:++```cpp+const Tensor& TFRTOpKernelContext::input(int index) {+  return inputs_[index]->get<Tensor>();+}+```++Simplified definition of `tfrt_fallback.kernel_fallback`:++```cpp+// Instantiate a kernel. This would be a TensorFlow kernel converted to inherit+// from `OpKernelBase` instead of `OpKernel`.+std::unique_ptr<OpKernelBase> op  =  …;++// Create TFRTOpKernelContext. The variable exec_ctx here is the tfrt::ExecutionContext passed to the kernel handler.+TFRTOpKernelContext op_kernel_context(inputs, outputs.size(), op_meta, exec_ctx.host());++// Directly invoke the TF kernel's Compute() method.+op->Compute(&op_kernel_context);+```++## tfrt\_fallback.kernel\_fallback call structure++We will be using the following conventions (essentially, these are based on+Runtime Fallback work that will probably have RFC coming soon):++*   Attributes are passed as key-value pairs, where both key and value are+    represented as strings.+*   Types have a specific string representation. We are trying to use names+    consistent with BEF syntax as much as possible (for e.g. `f32` represents+    `float`).+*   Inputs and outputs have type `tensorflow::Tensor`. 
(Review diff context: rfcs/20200712-tfrt-kernel-fallback.md, from the "tfrt_fallback.kernel_fallback call structure" section through "Engineering Impact".)

I removed the reference to TF Lite and instead changed it to: "At the same time one of the goals of Kernel Fallback is to provide a lower-binary-size way to run existing TensorFlow kernels on mobile." Specifically, this document only looks at the TFRT -> TF kernel interaction on mobile. Whether this path will be triggered purely from TFRT, or whether a similar approach would be triggered from TF Lite Select, is outside its scope.

annarev

comment created time in 17 days

PullRequestReviewEvent

push eventannarev/community

Anna Revinskaya

commit sha 8ff7ddf56ecd32ed0f02979ffc1ccb1111161529

mobile --> mobile platforms

view details

push time in 17 days

Pull request review commenttensorflow/community

RFC: TFRT Kernel Fallback

(Review diff context: rfcs/20200712-tfrt-kernel-fallback.md, from the "Objective" section through "TFRT integration".)

I added a clarification in Goals that this document focuses on running kernels from TFRT while running on mobile. Exact details of how TFRT itself is executed on mobile are outside the scope here. I also removed the reference to TF Lite kernels in this paragraph.

annarev

comment created time in 17 days

PullRequestReviewEvent

push eventannarev/community

Anna Revinskaya

commit sha d193cdad30c2283ad45ae67939a7b34e8c54e82e

Clarify "mobile" + a few other changes

view details

push time in 17 days

PullRequestReviewEvent

Pull request review commenttensorflow/community

RFC: TFRT Kernel Fallback

(Review diff context: an earlier revision of rfcs/20200712-tfrt-kernel-fallback.md, from the "Objective" section through the attribute-ordering note in "tfrt_fallback.kernel_fallback call structure".)

Based on discussion at the design review. This is true but was not previously documented.

annarev

comment created time in 17 days

push eventannarev/community

Anna Revinskaya

commit sha d1a32624c4201447028ab3bce8709b9ef6917bc7

Benchmark update, wording update: `final` keyword removes previously observed large regression

view details

push time in 17 days

PullRequestReviewEvent

Pull request review commentannarev/community

Updated the RFC based on recent changes and discussion on RFC PR #262

(Review diff context: StreamExecutor C API Versioning Strategy — "Updating Guidelines" and "Conventions" sections, up to the bullet on plug-ins validating deprecated fields for 0 and NULL.)

I feel like it's a bit different since TF is responsible for deprecation in both cases.

  • If a field is set by TF, then the plug-in needs to check whether it is 0 before using it.
  • If a field is set by the plug-in and TF deprecates it, then TF would never use it anyway, so "validate for 0 before consuming" is a no-op. Instead, it makes more sense for TF to check that the field is non-zero and print a warning.
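To make the distinction concrete, here is a minimal sketch of the two checks, using a made-up struct and field names purely for illustration (not the actual API):

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical struct: `legacy_ordinal` stands in for a deprecated field. */
typedef struct SE_ExampleDevice {
  size_t struct_size;
  int legacy_ordinal;  /* Deprecated: always zero-initialized going forward. */
  const char* name;
} SE_ExampleDevice;

/* Case 1: field set by TF. The plug-in checks for 0 before consuming it. */
void plugin_use_device(const SE_ExampleDevice* device) {
  if (device->legacy_ordinal != 0) {
    printf("Using ordinal %d\n", device->legacy_ordinal);
  }  /* A zero value means TF deprecated (or never populated) the field. */
}

/* Case 2: field set by the plug-in but deprecated by TF. TF never reads the
 * value, so the useful check is a warning when the plug-in still fills it in. */
void tf_warn_on_deprecated(const SE_ExampleDevice* device) {
  if (device->legacy_ordinal != 0) {
    fprintf(stderr, "Warning: SE_ExampleDevice::legacy_ordinal is deprecated "
                    "and will be ignored.\n");
  }
}
```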
penpornk

comment created time in 17 days

PullRequestReviewEvent

Pull request review commentannarev/community

Updated the RFC based on recent changes and discussion on RFC PR #262

(Review diff context: the same "Conventions" section of the StreamExecutor C API Versioning Strategy as above.)

Should we have another bullet point for fields set by the plug-in? This case would be more common since most structs we have are populated by the plug-in and not TensorFlow. In this case it should be something like:

TensorFlow will validate whether any field that should be deprecated (by setting it to 0 or NULL) is set to a different value, and print a warning if it is.
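A rough sketch of how such a field might be marked in the header so plug-in authors know to leave it unset (the struct and field names here are invented for illustration):

```c
#include <stddef.h>
#include <stdint.h>

typedef struct SP_ExampleParams {
  size_t struct_size;
  /* DEPRECATED: plug-ins should set this to 0. TensorFlow no longer reads the
   * value and will log a warning if it finds anything else here. */
  int64_t legacy_bytes_reserved;
  /* Fields below are still in use. */
  int64_t bytes_in_use;
} SP_ExampleParams;
```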

penpornk

comment created time in 17 days

Pull request review commentannarev/community

Updated the RFC based on recent changes and discussion on RFC PR #262

(Review diff context: "Conventions" section of the StreamExecutor C API Versioning Strategy, bullet about explicitly marking deprecated fields with comments.)

nit. non --> none

penpornk

comment created time in 17 days

PullRequestReviewEvent
PullRequestReviewEvent

push eventannarev/community

Edward Loper

commit sha 5b4e4847dda7f631a40ee36b4c4f33bf5dcafa26

RFC: StructTensor

view details

Edward Loper

commit sha bca8fa6223037708e3265d10907aa07e288c6002

markup fixes

view details

Edward Loper

commit sha 89506c0463bff39521418def4ce04ae35e31e145

Renamed StructTensor->StructuredTensor

view details

Edward Loper

commit sha 138c9c45145d6f644c153db1fc9ea8d74950cf2a

Updated Objective section to explicitly call out the tabular special case.

view details

Edward Loper

commit sha b9e805230020586ca61e9eba2686ec3a010d7788

Set status to accepted; Added section on how StructuredTensor relates to variables.

view details

Sami Kama

commit sha 8ae88c1f8ac844a75dc901c017d696e1f7fd7c16

Introducing Transactions extension to Modular Filesystems RFC

view details

Penporn Koanantakool

commit sha afa647822f4ce17cf388257f80642daa3d6ada2a

Add CSRSparseMatrix RFC.

view details

Penporn Koanantakool

commit sha 35744585850231e89b3127da0fa1850f407c2096

Adjust image sizes.

view details

Penporn Koanantakool

commit sha 4fc8600a53f11476a1384951a249b225ac5736dd

Fix an image's path.

view details

Penporn Koanantakool

commit sha 8f0975cd315da1615b9005139af8c63e1933838b

Add a discussion point about CompositeTensor.

view details

Penporn Koanantakool

commit sha d497090aa2bd6be6ce90f955e8705aadd024f684

Update the RFC #.

view details

Sami Kama

commit sha bab620fb1885f4ef96ef615cf37596e8299b4f29

Added updates from discussions

view details

Sean Morgan

commit sha 6a8f78967c37c5ebf27d5105233ccd99b25cd6bf

* gelu migration RFC

view details

Sean Morgan

commit sha 8fc2dd35cf0aa291c96ff519a5110bb6310b9f80

typos

view details

Sean Morgan

commit sha a3b8b6d8547c30ed1a6d51f3b6a4085cd43ea885

Spacing

view details

Sean Morgan

commit sha 8f1af584c6ce7f6ba6edc42e28413b50420ccfee

* Update transition

view details

Penporn Koanantakool

commit sha 67c0e2ac23ecb8aafc24be68877633e52bd7103d

Add information about documentation and tutorial.

view details

Penporn Koanantakool

commit sha 89f6244d77076e9737e6838f38cb38ef642e4ca6

Clarify why COO is not efficient with common TF sparsities

view details

Sean Morgan

commit sha 0a2db5dee61be987f740687e59eb1172785bdb06

Add Keras activations

view details

Sean Morgan

commit sha 1c2bca8fbc7f70f5763b946961147d9eb04c83a0

Add RFC number

view details

push time in 17 days

pull request commenttensorflow/community

RFC: Adding Pluggable Device For TensorFlow

thanks @annarev , I think unified_memory_allocate/unified_memory_deallocate is still need, AllocatorHandle is a higher level api, which allows plugin authors to implement a specific cached allocator, which is counter part of BFCAllocator. But for those plugin authors who want to reuse existing tensorflow BFC allocator, I think the low level allocation api in StreamExecutor(AllocateArray/UnifiedMemoryAllocate) is still need, We can make it: if plugin register a allocator handle, then proper will use allocate handle as its allocator, it not , BFC Allocator will be the default Allocator and it will be forward to StreamExecutor Allocation API through sub-allocator.

I wonder if it will get confusing to have two separate ways to set up allocator functionality. We need to think of a clear way to annotate, in the API, which one should be used and when.
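One rough sketch of how the API could make the choice explicit — an optional allocator handle on a hypothetical registration struct, with the fallback behavior spelled out in the comment (names here are illustrative, not the actual API):

```c
#include <stddef.h>

/* Hypothetical opaque handle for a plug-in-provided allocator. */
typedef struct SP_AllocatorHandle SP_AllocatorHandle;

typedef struct SP_ExampleDeviceRegistrationParams {
  size_t struct_size;
  /* Optional. If the plug-in sets this, TensorFlow uses the handle as the
   * device allocator directly. If it is left NULL, TensorFlow creates its own
   * BFC allocator and routes allocations to the StreamExecutor
   * allocate/deallocate callbacks through a sub-allocator. */
  SP_AllocatorHandle* allocator;
} SP_ExampleDeviceRegistrationParams;
```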

jzhoulon

comment created time in 18 days

pull request commenttensorflow/community

RFC: StreamExecutor C API

I am planning to add the following SP_DeviceDescription struct. I made some fields optional and some required. @wchao1115, @kulinseth, @jzhoulon let me know if this looks reasonable (e.g., whether some other fields, such as pci_bus_id, should be optional as well, or whether some optional fields should be required).

typedef struct SP_DeviceDescription {
  size_t struct_size;
  
  // Device hardware name. Used for printing.
  // Must be null-terminated.
  char* name;
  
  // Device vendor name. Used for printing.
  // Must be null-terminated.
  char* device_vendor;

  // Returns the PCI bus identifier for this device, of the form
  // [domain]:[bus]:[device].[function]
  // Used for printing.
  // Must be null-terminated.
  char* pci_bus_id;

  TF_Bool has_numa_node;  // True if `numa_node` is set.
  // Returns the NUMA node associated with this device, for use in
  // determining socket locality.
  int numa_node;

  TF_Bool has_memory_bandwidth;  // True if `memory_bandwidth` is set.
  // Device's memory bandwidth in bytes/sec.  (This is for reads/writes to/from
  // the device's own memory, not for transfers between the host and device.)
  int64_t memory_bandwidth;

  TF_Bool has_gflops;  // True if `gflops` field is set.
  // Estimate of floating point operations per second for this device * 10e-9.
  double gflops;
} SP_DeviceDescription;
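
For illustration, a plug-in could fill this in roughly as follows (a sketch only; the helper name, the SP_DEVICE_DESCRIPTION_STRUCT_SIZE macro, and all values are placeholders, with the macro assumed by analogy with SP_DEVICE_STRUCT_SIZE):

// Hypothetical plug-in code; every name and value here is a placeholder.
void my_plugin_fill_device_description(SP_DeviceDescription* desc) {
  static char name[] = "MyAccelerator";
  static char vendor[] = "MyVendor";
  static char bus_id[] = "0000:65:00.0";
  desc->struct_size = SP_DEVICE_DESCRIPTION_STRUCT_SIZE;  // assumed macro
  desc->name = name;
  desc->device_vendor = vendor;
  desc->pci_bus_id = bus_id;
  // Optional fields: set the has_* flag only when the value is actually filled.
  desc->has_numa_node = 1;
  desc->numa_node = 0;
  desc->has_memory_bandwidth = 0;  // unknown for this device
  desc->has_gflops = 1;
  desc->gflops = 12.5;
}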
annarev

comment created time in 18 days

Pull request review commenttensorflow/community

RFC: StreamExecutor C API

+# StreamExecutor C API++| Status        | Proposed                                                |+| :------------ | :------------------------------------------------------ |+| **RFC #**     | [257](https://github.com/tensorflow/community/pull/257) |+| **Author(s)** | Anna Revinskaya (annarev@google.com), Penporn Koanantakool (penporn@google.com), Yi Situ (yisitu@google.com), Russell Power (power@google.com) |+| **Sponsor**   | Gunhan Gulsoy (gunan@google.com)                        |+| **Updated**   | 2020-07-15                                              |++# Objective++Provide basic device management C API to allow new devices to modularly connect+to the current TensorFlow runtime.++## Goals++*   C API wrapper of a subset of methods in StreamExecutorInterface.+*   Best-effort API and ABI stability after an initial experimental phase.++## Non-goals++*   Compatibility with the+    [new TensorFlow runtime stack](https://blog.tensorflow.org/2020/04/tfrt-new-tensorflow-runtime.html).+*   APIs that will expose all device-specific capabilities. ++# Motivation++Current device support in TensorFlow adds code directly into the+[main TensorFlow repository](http://github.com/tensorflow/tensorflow). This+approach is+[not scalable](https://github.com/tensorflow/community/blob/master/rfcs/20190305-modular-tensorflow.md#adding-support-for-new-hardware-is-very-difficult-and-not-scalable)+because it adds complexity to the build dependency and tool chains, takes longer+time to build, and requires the TensorFlow team’s review. To handle the surge in+new hardware accelerators and programming paradigms, TensorFlow must allow+device addition in a modular manner: contributors code outside of the TensorFlow+repository and distribute a binary module which would connect to TensorFlow at+runtime through a stable application binary interface (ABI).++The new TensorFlow stack, based on+[TFRT](https://blog.tensorflow.org/2020/04/tfrt-new-tensorflow-runtime.html) and+[MLIR](https://www.tensorflow.org/mlir), is designed with this in mind. However,+it is still in an active development phase and is not ready for third-party+device integration until later this year. (For device support expecting to land+in 2021 or later, we highly recommend waiting to integrate with the new stack,+since it is fundamentally different from the current stack and cannot guarantee+code reuse.)++In the meantime, we plan to provide limited device integration support for the+current TensorFlow stack through+[Modular TensorFlow](https://github.com/tensorflow/community/blob/master/rfcs/20190305-modular-tensorflow.md).+We anticipate three basic functionalities within a device plugin module:++*   Device registration: Addressed in a different RFC, [Adding Pluggable Device for TensorFlow](https://github.com/tensorflow/community/pull/262).+*   Device management: The focus of this RFC.+*   Kernel and op registration and implementation:+    [RFC Accepted](https://github.com/tensorflow/community/blob/master/rfcs/20190814-kernel-and-op-registration.md). [C API implemented](https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/c/).++[StreamExecutor](https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/stream_executor/stream_executor_pimpl.h;l=73) is TensorFlow's main device manager, responsible for work execution and memory management. 
It provides a set of methods+(such as+[Memcpy](https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/stream_executor/stream_executor_internal.h;l=240))+that can be customized for a particular device.++We propose a C API wrapper of a subset of methods in+[StreamExecutorInterface](https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/stream_executor/stream_executor_internal.h;l=166?q=StreamExecutorinterface)+as an ABI-stable way to register a custom StreamExecutor platform.++# User Benefits++A decoupled way to add a new device to TensorFlow.++*   Simpler process: Does not have to add a new build toolchain to TensorFlow+*   Faster time-to-solution: Does not need code review from the TensorFlow team.+*   Lower maintenance efforts: Only C-API-related changes could break the+    integration. Unrelated TensorFlow changes would not break the code.+       *    The C APIs may be changed during the initial experimental phase based +            on developer experience and feedback. When the APIs become more mature, +            we will try to keep them stable (in a best-effort manner) until the new +            TensorFlow stack is available.++# Design Proposal++## StreamExecutorInterface++[StreamExecutorInterface](https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/stream_executor/stream_executor_internal.h;l=166?q=StreamExecutorinterface)+is quite large and some of its methods are only sporadically used. Therefore, we+plan to wrap only a subset of key StreamExecutorInterface functionality. We decided on this subset based on the PluggableDevice usecase as well as potential future devices such as TPUs.++Implementation conventions:++* Structs include `struct_size` parameter. This parameter should be filled in both by TensorFlow and the plugin and can be checked to determine which struct fields are available for current version of TensorFlow.+* Struct name prefixes indicates which side of the API is responsible for populating the struct:+  * `SE_` prefix: filled by TensorFlow.+  * `SP_` prefix: filled by plugins, except `struct_size` which is also filled by TensorFlow when TensorFlow passes it to a callback.++See proposed C API below:++```cpp+#include <stddef.h>+#include <stdint.h>++#define SE_MAJOR 0+#define SE_MINOR 0+#define SE_REVISION 1++#ifdef __cplusplus+extern "C" {+#endif++typedef SP_Stream_st* SP_Stream;+typedef SP_Event_st* SP_Event;+typedef SP_Timer_st* SP_Timer;+typedef TF_Status* (*TF_StatusCallbackFn)(void*);++#ifndef TF_BOOL_DEFINED+#define TF_BOOL unsigned char+#endif // TF_BOOL_DEFINED++#ifndef TF_OFFSET_OF_END+#define TF_OFFSET_OF_END(TYPE, MEMBER) (offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))+#endif // TF_OFFSET_OF_END++typedef struct SP_TimerFns {+  size_t struct_size;+  void* ext;+  uint64_t (*nanoseconds)(SE_Timer timer);+  uint64_t (*microseconds)(SE_Timer timer);+} SP_TimerFns;++#define SP_TIMER_FNS_STRUCT_SIZE TF_OFFSET_OF_END(SP_TimerFns, microseconds)++typedef struct SP_AllocatorStats {+  size_t struct_size;+  void* ext;+  int64_t num_allocs;+  int64_t bytes_in_use;+  int64_t peak_bytes_in_use;+  int64_t largest_alloc_size;++  int8_t has_bytes_limit;+  int64_t bytes_limit;++  int64_t bytes_reserved;+  int64_t peak_bytes_reserved;++  int8_t has_bytes_reservable_limit;+  int64_t bytes_reservable_limit;++  int64_t largest_free_block_bytes;+} SP_AllocatorStats;++#define SP_ALLOCATORSTATS_STRUCT_SIZE TF_OFFSET_OF_END(SP_AllocatorStats, largest_free_block_bytes)++typedef enum SE_EventStatus {+  SE_EVENT_UNKNOWN,+  
SE_EVENT_ERROR,+  SE_EVENT_PENDING,+  SE_EVENT_COMPLETE,+} SE_EventStatus;++typedef struct SE_Options {+  size_t struct_size;+  void* ext;+  int32_t ordinal;+} SE_Options;++#define SE_OPTIONS_STRUCT_SIZE TF_OFFSET_OF_END(SE_Options, ordinal)++typedef struct SE_DeviceMemoryBase {+  size_t struct_size;+  void* ext;+  void* opaque;+  uint64_t size;+  uint64_t payload;+} SE_DeviceMemoryBase;++#define SE_DEVICE_MEMORY_BASE_STRUCT_SIZE TF_OFFSET_OF_END(SE_DeviceMemoryBase, payload)++typedef struct SP_Device {+  size_t struct_size;+  void* ext;  // free-form field filled by plugin+  const char* name;+  size_t name_len;++  // Device vendor can store handle to their device representation+  // here.+  void* device_handle;+} SP_Device;++#define SP_DEVICE_STRUCT_SIZE TF_OFFSET_OF_END(SP_Device, device_handle)++typedef struct SP_StreamExecutor {+  size_t struct_size;+  void* ext;++  /*** ALLOCATION CALLBACKS ***/+  // Synchronously allocates size bytes on the underlying platform and returns+  // a DeviceMemoryBase representing that allocation. In the case of failure,+  // nullptr is returned.+  // memory_space is reserved for a potential future usage and should be set+  // to 0.+  TF_DeviceMemoryBase* (*allocate)(+      SP_Device* se, uint64_t size, int64_t memory_space);+++  // Deallocate the device memory previously allocated via this interface.+  // Deallocation of a nullptr-representative value is permitted.+  void (*deallocate)(+      SP_Device* se, SE_DeviceMemoryBase* memory);+++  // Fill SP_AllocatorStats with allocator statistics.+  TF_BOOL (*get_allocator_stats)(SP_Device* executor,+                                 SP_AllocatorStats* stats);+  // Returns the underlying device memory usage information, if it is available.+  // If it is not available (false is returned), free/total may not be+  // initialized.+  TF_BOOL (*device_memory_usage)(+      SP_Device* executor, int64_t* free, int64_t* total);+  +  // Allocate host memory.+  void* (*host_memory_allocate)(TF_Device* device, uint64_t size);+  +  // Deallocate host memory.+  void (*host_memory_deallocate)(TF_Device* device, void *mem);++  // Allocates unified memory space of the given size, if supported. Support+  // should be added by setting `supports_unified_memory` field in+  // `DeviceDescription`.+  void* (*unified_memory_allocate)(TF_Device* device, uint64_t bytes);+  +  // Deallocates unified memory space previously allocated with+  // `unified_memory_allocate`.+  void (*unified_memory_deallocate)(TF_Device* device, void* location);++  /*** STREAM CALLBACKS ***/+  // Creates SE_Stream. This call should also Allocate stream+  // resources on the underlying platform and initializes its+  // internals.+  void (*create_stream)(SP_Device* executor, SP_Stream*, TF_Status*);++  // Destroys SE_Stream and deallocates any underlying resources.+  void (*destroy_stream)(SP_Device* executor, SP_Stream stream);++  // Causes dependent to not begin execution until other has finished its+  // last-enqueued work.+  TF_BOOL (*create_stream_dependency)(+      SP_Device* executor, SP_Stream dependent,+      SP_Stream other);++  // Without blocking the device, retrieve the current stream status.+  void (*get_status)(SP_Device* executor, SP_Stream stream,+                     TF_Status* status);++  /*** EVENT CALLBACKS ***/+  // Create SP_Event. 
Performs platform-specific allocation and initialization of an event.+  void (*create_event)(+      SP_Device* executor, SP_Event* event, TF_Status* status);++  // Destroy SE_Event and perform any platform-specific deallocation and cleanup of an event.+  void (*destroy_event)(+      SP_Device* executor, SP_Event event, TF_Status* status);++  // Requests the current status of the event from the underlying platform.+  SE_EventStatus (*poll_for_event_status)(+      SP_Device* executor, SP_Event event);+  // Inserts the specified event at the end of the specified stream.+  void (*record_event)(+      SP_Device* executor, SP_Stream stream,+      SP_Event event, TF_Status* status);++  // Wait for the specified event at the end of the specified stream.+  void (*wait_for_event)(+      SP_Device* executor, SP_Stream stream,+      SP_Event event, TF_Status* status);++  /*** TIMER CALLBACKS ***/+  // Creates TF_Timer. Allocates timer resources on the underlying platform and initializes its+  // internals, setting `timer` output variable. Sets values in `timer_fns` struct.+  void (*create_timer)(SP_Device* executor, SP_Timer* timer, SP_TimerFns* timer_fns, TF_Status* status);++  // Destroy timer and deallocates timer resources on the underlying platform.+  void (*destroy_timer)(SP_Device* executor, SP_Timer timer, SP_TimerFns* timer_fns);++  // Records a start event for an interval timer.+  TF_BOOL (*start_timer)(+      SP_Device* executor, SP_Stream stream, SP_Timer timer);+++  // Records a stop event for an interval timer.+  TF_BOOL (*stop_timer)(+      SP_Device* executor, SP_Stream stream, SP_Timer timer);++  /*** MEMCPY CALLBACKS ***/+  // Enqueues a memcpy operation onto stream, with a host destination location+  // host_dst and a device memory source, with target size size.+  TF_BOOL (*memcpy_dtoh)(+      SP_Device* executor, SP_Stream stream,+      void* host_dst,+      const SE_DeviceMemoryBase* device_src,+      uint64_t size);++  // Enqueues a memcpy operation onto stream, with a device destination location+  // and a host memory source, with target size size+  TF_BOOL (*memcpy_htod)(+      SP_Device* executor, SP_Stream stream,+      SE_DeviceMemoryBase* device_dst,+      const void* host_src, uint64_t size);+      +  // Enqueues a memcpy operation onto stream, with a device destination+  // location and a device memory source, with target size `size`.+  void (*memcpy_dtod)(const SP_Device* device, SP_Stream stream,+                      SP_DeviceMemoryBase* device_dst,+                      const SP_DeviceMemoryBase* device_src, uint64_t size,+                      TF_Status* status);+      +  // Blocks the caller while a data segment of the given size is+  // copied from the device source to the host destination.+  TF_BOOL (*sync_memcpy_dtoh)(+      SP_Device* executor,+      void* host_dst,+      const SE_DeviceMemoryBase* device_src,+      uint64_t size);++  // Blocks the caller while a data segment of the given size is+  // copied from the host source to the device destination.+  TF_BOOL (*sync_memcpy_htod)(+      SP_Device* executor,+      SE_DeviceMemoryBase* device_dst,+      const void* host_src, uint64_t size);+      +  // Blocks the caller while a data segment of the given size is copied from the+  // device source to the device destination.+  void (*sync_memcpy_dtod)(const SP_Device* device,+                           SP_DeviceMemoryBase* device_dst,+                           const SP_DeviceMemoryBase* device_src, uint64_t size,+                           TF_Status* status);++ 
 // Causes the host code to synchronously wait for the event to complete.+  void (*block_host_for_event)(

Added here now: https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/c/experimental/stream_executor/stream_executor.h;l=331?q=stream_executor.h

annarev

comment created time in 18 days


issue commenttensorflow/tensorflow

convert tesnorflow::ops::cast to tensorflow::tensor

One note though: it seems like you are using Session, which doesn't seem to be part of our public API. A better way to load a graph is using a saved model: https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/cc/saved_model/loader.h

Adding @bmzhao, who is working on the new saved model implementation, in case he has a better suggestion for your use case.

naserpiltan

comment created time in 21 days

issue commenttensorflow/tensorflow

convert tesnorflow::ops::cast to tensorflow::tensor

Probably a faster way is to add Cast to the original GraphDef. Do you have the code that generated the original GraphDef you were loading? Can you add a cast there instead?

You can also create a script that reads the original GraphDef, adds a Cast to it, and outputs it in serialized format. Then you can load this serialized GraphDef in the code you have above, and it would already output a cast tensor. Using the C API, that script could do something like:

#include "tensorflow/c/c_api.h"
#include "tensorflow/c/c_api_experimental.h"
#include "tensorflow/c/tf_status.h"
...

// Read original graph from input_graph_file
std::ifstream input_stream(input_graph_file);
std::stringstream buffer;
buffer << input_stream.rdbuf();
std::string graph_proto = buffer.str();

// Load GraphDef
TF_Graph* graph = TF_NewGraph();
TF_Buffer* serialized_graph_def =
      TF_NewBufferFromString(graph_proto.c_str(), graph_proto.size());
TF_Status* status = TF_NewStatus();
TF_ImportGraphDefOptions* options = TF_NewImportGraphDefOptions();
TF_GraphImportGraphDef(graph, serialized_graph_def, options, status);
TF_DeleteBuffer(serialized_graph_def);
CHECK(TF_GetCode(status) == TF_OK);

// Add Cast
// Note: TF_GraphOperationByName expects the node name, without the ":0" output suffix.
TF_Operation* operation = TF_GraphOperationByName(graph, "SemanticPredictions");
CHECK(operation != nullptr);
TF_Output input = {operation, 0};
TF_OperationDescription* op = TF_NewOperation(graph, "Cast", "SemanticPredictionsInt64Cast");
TF_AddInput(op, input);
// DstT comes from here: https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/core/ops/math_ops.cc;l=169;drc=02327c53328aa04e590b21adc55142778b1e90da
TF_SetAttrType(op, "DstT", TF_INT64);
TF_FinishOperation(op, status);
CHECK(TF_GetCode(status) == TF_OK);

// Output updated graph to output_graph_file
TF_Buffer* output_graph_def = TF_NewBuffer();
TF_GraphToGraphDef(graph, output_graph_def, status);
CHECK(TF_GetCode(status) == TF_OK);
std::ofstream output;
output.open(output_graph_file);
std::string output_graph_def_str(static_cast<const char*>(output_graph_def->data), output_graph_def->length);
output << output_graph_def_str;
output.close();

TF_DeleteBuffer(output_graph_def);
TF_DeleteGraph(graph);
TF_DeleteImportGraphDefOptions(options);
TF_DeleteStatus(status);
naserpiltan

comment created time in 21 days

pull request commenttensorflow/community

RFC: Adding Pluggable Device For TensorFlow

The reason I brought this up was that BFC allocation strategy might not play best with Metal. And I was hoping if we can just register custom allocator.

@annarev do you have insight on this?

Registering a custom allocator sounds good to me, and the proposed implementation sounds good (thank you for coming up with it!). I guess we can treat the allocator handle similar to the way we treat, say, a stream. We can typedef SP_AllocatorHandle_st* SP_AllocatorHandle. Then we can add the following to SP_StreamExecutor:

void* (*allocate_raw)(SP_AllocatorHandle allocator_handle, size_t alignment, size_t num_bytes);
void (*deallocate_raw)(SP_AllocatorHandle allocator_handle, void* ptr);
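
A rough plug-in-side sketch of how these callbacks could be implemented (the SP_AllocatorHandle_st contents and the my_* helpers below are hypothetical, just for illustration):

// Hypothetical plug-in code; my_pool_alloc/my_pool_free stand in for the
// plug-in's own (e.g. Metal) allocator and are not part of the proposal.
struct SP_AllocatorHandle_st {
  void* pool;  // opaque handle to the plug-in's internal allocator
};

void* my_pool_alloc(void* pool, size_t alignment, size_t num_bytes);  // hypothetical
void my_pool_free(void* pool, void* ptr);                             // hypothetical

void* my_allocate_raw(SP_AllocatorHandle allocator_handle, size_t alignment,
                      size_t num_bytes) {
  // Must return memory aligned to `alignment` (TF currently expects 64 bytes).
  return my_pool_alloc(allocator_handle->pool, alignment, num_bytes);
}

void my_deallocate_raw(SP_AllocatorHandle allocator_handle, void* ptr) {
  my_pool_free(allocator_handle->pool, ptr);
}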
jzhoulon

comment created time in 22 days

issue commenttensorflow/tensorflow

convert tesnorflow::ops::cast to tensorflow::tensor

Seems like Cast creates a graph node, so you need to run it in a session to get an output Tensor. The following works for me:

  auto root = tensorflow::Scope::NewRootScope();
  tensorflow::ClientSession session(root);
  std::vector<tensorflow::Tensor> int_outputs;
  auto int64_caster_output =
      Cast(root.WithOpName("int64_caster"), outputs[0], tensorflow::DT_INT32);
  tensorflow::Status status = session.Run({int64_caster_output}, &int_outputs);
  if (!status.ok()) {
    std::cerr << "Casting failed" << std::endl;
    return 1;
  }
  std::cout << int_outputs[0].DebugString() << std::endl;
naserpiltan

comment created time in 23 days

pull request commenttensorflow/community

RFC: Adding Pluggable Device For TensorFlow

@kulinseth thanks for the comment.

Thanks for the response.

Tensorflow has 64 bytes alignment requirement, for the backend, I think the alignment depends on the plugin implementation though it may still need to keep at least 64 bytes align. See the case of GPUMemAllocator, streamExecutor Allocation API did not accept the alignment parameter. backend implementation can have its own alignment, 64, 128. 256...

  void* Alloc(size_t alignment, size_t num_bytes) override {
    void* ptr = nullptr;
    if (num_bytes > 0) {
      if (use_unified_memory_) {
        ptr = stream_exec_->UnifiedMemoryAllocate(num_bytes);
      } else {
        ptr = stream_exec_->AllocateArray<char>(num_bytes).opaque();
      }
      VisitAlloc(ptr, gpu_id_.value(), num_bytes);
    }
    return ptr;
  }

Right I think the 64 byte alignment is in conjunction with EIGEN_MAX_ALIGN_BYTES which is also 64 I believe. This seems reasonable. Although, I noticed https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/core/common_runtime/gpu/gpu_device.cc;l=146 It probably is not applicable in the Pluggable scenario, I just want to make sure it doesn't break some assumptions which are made in Tensorflow core.

The allocate API you list is a member EigenGpuStreamDevice, which is an allocator for EigenGPUDevice , PluggableDevice does not bind to Eigen, plugin itself will choose eigen or other math library. So I don't think PluggableDevice will have this issue.

I prefer keeping the 64 byte requirement for initial implementation since it is used in multiple central places in TF. We could always revisit and look into allowing custom alignment later on.
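
To make that concrete, a minimal sketch of a plug-in honoring the 64-byte requirement for host buffers (assuming C11 aligned_alloc is available and that the host_memory_allocate callback ends up taking an SP_Device*, like the other callbacks; the my_* names are illustrative):

#include <stdint.h>
#include <stdlib.h>

// Illustrative only: keep host allocations 64-byte aligned to match
// TensorFlow's expectation (EIGEN_MAX_ALIGN_BYTES == 64).
static void* my_host_memory_allocate(const SP_Device* device, uint64_t size) {
  // aligned_alloc requires the size to be a multiple of the alignment.
  uint64_t rounded = (size + 63) & ~UINT64_C(63);
  return aligned_alloc(64, rounded);
}

static void my_host_memory_deallocate(const SP_Device* device, void* mem) {
  free(mem);
}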

jzhoulon

comment created time in 24 days

issue commenttensorflow/tensorflow

Protobuf MergeFrom errors when tensorflow and python protobuf share a libprotobuf copy

Thank you for the detailed instructions and bisecting! I can repro the issue and see that it happens due to the pywrap_tensorflow import removal. I will send a change to re-add it.

sclarkson

comment created time in a month

issue closedtensorflow/tensorflow

TF 2.0: API symbol renames design review.

TensorFlow 1.x has over 2000 endpoints total, including over 500 endpoints in the root namespace. As the number of symbols grows, it is important to maintain a clear structure to aid discoverability.

Certain API symbol placements could be improved:

  • Some namespaces were created recently and might not contain all the corresponding symbols. For example, the tf.math namespace was added recently; symbols such as tf.round are not in tf.math even though logically they belong in that namespace.
  • Some symbols are included in the root namespace even though they are rarely used (e.g., tf.zeta).
  • Some symbols currently start with a prefix that could really be replaced by introducing a subnamespace (e.g., tf.string_strip vs. tf.strings.strip, tf.sparse_maximum vs. tf.sparse.maximum).
  • Certain deep hierarchies seem redundant and could be flattened (e.g., tf.saved_model.signature_constants.CLASSIFY_INPUTS could be moved to tf.saved_model.CLASSIFY_INPUTS).
  • To keep a clear structure and reduce duplication, we want to collect all layers, losses, and metrics under the tf.keras namespace.
  • In general, we want to balance flatness and browsability. Flat hierarchies allow for shorter endpoint names that are easy to remember (e.g., tf.add vs. tf.math.add). At the same time, subnamespaces support easier browsability (e.g., the tf.math namespace would contain all math functions, making it easier to discover available symbols).

Additional information about API symbol renames in TensorFlow 2.0 can be found in the RFC here.

closed time in a month

dynamicwebpaige

issue commenttensorflow/tensorflow

TF 2.0: API symbol renames design review.

This was done a while back. RFC: https://github.com/tensorflow/community/pull/16

dynamicwebpaige

comment created time in a month

issue commenttensorflow/tensorflow

TF 2.0: Release binaries for Python 3.5 and 3.7.

I checked https://pypi.org/project/tensorflow/#files so I will assume this issue went away. Feel free to open a new bug if the issue resurfaces.

dynamicwebpaige

comment created time in a month

issue commenttensorflow/tensorflow

make tf.image ops support tf.float16 in tensorflow 2.0

Reassigning to @hyeygit since she has been working on image APIs recently.

mttbx

comment created time in a month

issue commenttensorflow/tensorflow

Cannot confine TensorFlow C API to use one and only one core

@penpornk do you know if there is a way to use just one thread? Feel free to reassign if you are not the right owner for this issue.

fisakhan

comment created time in a month


Pull request review commenttensorflow/community

RFC: StreamExecutor C API

+# StreamExecutor C API++| Status        | Proposed                                                |+| :------------ | :------------------------------------------------------ |+| **RFC #**     | [257](https://github.com/tensorflow/community/pull/257) |+| **Author(s)** | Anna Revinskaya (annarev@google.com), Penporn Koanantakool (penporn@google.com), Yi Situ (yisitu@google.com), Russell Power (power@google.com) |+| **Sponsor**   | Gunhan Gulsoy (gunan@google.com)                        |+| **Updated**   | 2020-07-15                                              |++# Objective++Provide basic device management C API to allow new devices to modularly connect+to the current TensorFlow runtime.++## Goals++*   C API wrapper of a subset of methods in StreamExecutorInterface.+*   Best-effort API and ABI stability after an initial experimental phase.++## Non-goals++*   Compatibility with the+    [new TensorFlow runtime stack](https://blog.tensorflow.org/2020/04/tfrt-new-tensorflow-runtime.html).+*   APIs that will expose all device-specific capabilities. ++# Motivation++Current device support in TensorFlow adds code directly into the+[main TensorFlow repository](http://github.com/tensorflow/tensorflow). This+approach is+[not scalable](https://github.com/tensorflow/community/blob/master/rfcs/20190305-modular-tensorflow.md#adding-support-for-new-hardware-is-very-difficult-and-not-scalable)+because it adds complexity to the build dependency and tool chains, takes longer+time to build, and requires the TensorFlow team’s review. To handle the surge in+new hardware accelerators and programming paradigms, TensorFlow must allow+device addition in a modular manner: contributors code outside of the TensorFlow+repository and distribute a binary module which would connect to TensorFlow at+runtime through a stable application binary interface (ABI).++The new TensorFlow stack, based on+[TFRT](https://blog.tensorflow.org/2020/04/tfrt-new-tensorflow-runtime.html) and+[MLIR](https://www.tensorflow.org/mlir), is designed with this in mind. However,+it is still in an active development phase and is not ready for third-party+device integration until later this year. (For device support expecting to land+in 2021 or later, we highly recommend waiting to integrate with the new stack,+since it is fundamentally different from the current stack and cannot guarantee+code reuse.)++In the meantime, we plan to provide limited device integration support for the+current TensorFlow stack through+[Modular TensorFlow](https://github.com/tensorflow/community/blob/master/rfcs/20190305-modular-tensorflow.md).+We anticipate three basic functionalities within a device plugin module:++*   Device registration: Addressed in a different RFC, [Adding Pluggable Device for TensorFlow](https://github.com/tensorflow/community/pull/262).+*   Device management: The focus of this RFC.+*   Kernel and op registration and implementation:+    [RFC Accepted](https://github.com/tensorflow/community/blob/master/rfcs/20190814-kernel-and-op-registration.md). [C API implemented](https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/c/).++[StreamExecutor](https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/stream_executor/stream_executor_pimpl.h;l=73) is TensorFlow's main device manager, responsible for work execution and memory management. 
It provides a set of methods+(such as+[Memcpy](https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/stream_executor/stream_executor_internal.h;l=240))+that can be customized for a particular device.++We propose a C API wrapper of a subset of methods in+[StreamExecutorInterface](https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/stream_executor/stream_executor_internal.h;l=166?q=StreamExecutorinterface)+as an ABI-stable way to register a custom StreamExecutor platform.++# User Benefits++A decoupled way to add a new device to TensorFlow.++*   Simpler process: Does not have to add a new build toolchain to TensorFlow+*   Faster time-to-solution: Does not need code review from the TensorFlow team.+*   Lower maintenance efforts: Only C-API-related changes could break the+    integration. Unrelated TensorFlow changes would not break the code.+       *    The C APIs may be changed during the initial experimental phase based +            on developer experience and feedback. When the APIs become more mature, +            we will try to keep them stable (in a best-effort manner) until the new +            TensorFlow stack is available.++# Design Proposal++## StreamExecutorInterface++[StreamExecutorInterface](https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/stream_executor/stream_executor_internal.h;l=166?q=StreamExecutorinterface)+is quite large and some of its methods are only sporadically used. Therefore, we+plan to wrap only a subset of key StreamExecutorInterface functionality. We decided on this subset based on the PluggableDevice usecase as well as potential future devices such as TPUs.++Implementation conventions:++* Structs include `struct_size` parameter. This parameter should be filled in both by TensorFlow and the plugin and can be checked to determine which struct fields are available for current version of TensorFlow.+* Struct name prefixes indicates which side of the API is responsible for populating the struct:+  * `SE_` prefix: filled by TensorFlow.+  * `SP_` prefix: filled by plugins, except `struct_size` which is also filled by TensorFlow when TensorFlow passes it to a callback.++See proposed C API below:++```cpp+#include <stddef.h>+#include <stdint.h>++#define SE_MAJOR 0+#define SE_MINOR 0+#define SE_REVISION 1++#ifdef __cplusplus+extern "C" {+#endif++typedef SP_Stream_st* SP_Stream;+typedef SP_Event_st* SP_Event;+typedef SP_Timer_st* SP_Timer;+typedef TF_Status* (*TF_StatusCallbackFn)(void*);++#ifndef TF_BOOL_DEFINED+#define TF_BOOL unsigned char+#endif // TF_BOOL_DEFINED++#ifndef TF_OFFSET_OF_END+#define TF_OFFSET_OF_END(TYPE, MEMBER) (offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))+#endif // TF_OFFSET_OF_END++typedef struct SP_TimerFns {+  size_t struct_size;+  void* ext;+  uint64_t (*nanoseconds)(SE_Timer timer);+  uint64_t (*microseconds)(SE_Timer timer);+} SP_TimerFns;++#define SP_TIMER_FNS_STRUCT_SIZE TF_OFFSET_OF_END(SP_TimerFns, microseconds)++typedef struct SP_AllocatorStats {+  size_t struct_size;+  void* ext;+  int64_t num_allocs;+  int64_t bytes_in_use;+  int64_t peak_bytes_in_use;+  int64_t largest_alloc_size;++  int8_t has_bytes_limit;+  int64_t bytes_limit;++  int64_t bytes_reserved;+  int64_t peak_bytes_reserved;++  int8_t has_bytes_reservable_limit;+  int64_t bytes_reservable_limit;++  int64_t largest_free_block_bytes;+} SP_AllocatorStats;++#define SP_ALLOCATORSTATS_STRUCT_SIZE TF_OFFSET_OF_END(SP_AllocatorStats, largest_free_block_bytes)++typedef enum SE_EventStatus {+  SE_EVENT_UNKNOWN,+  
SE_EVENT_ERROR,+  SE_EVENT_PENDING,+  SE_EVENT_COMPLETE,+} SE_EventStatus;++typedef struct SE_Options {+  size_t struct_size;+  void* ext;+  int32_t ordinal;+} SE_Options;++#define SE_OPTIONS_STRUCT_SIZE TF_OFFSET_OF_END(SE_Options, ordinal)++typedef struct SE_DeviceMemoryBase {+  size_t struct_size;+  void* ext;+  void* opaque;+  uint64_t size;+  uint64_t payload;+} SE_DeviceMemoryBase;++#define SE_DEVICE_MEMORY_BASE_STRUCT_SIZE TF_OFFSET_OF_END(SE_DeviceMemoryBase, payload)++typedef struct SP_Device {+  size_t struct_size;+  void* ext;+  const char* name;+  size_t name_len;++  // Device vendor can store handle to their device representation+  // here.+  void* device_handle;++  // Any kind of data that plugin device might want to store.+  void* ext;++} SP_Device;++#define SP_DEVICE_STRUCT_SIZE TF_OFFSET_OF_END(SP_Device, data)++typedef struct SP_StreamExecutor {+  size_t struct_size;+  void* ext;++  /*** ALLOCATION CALLBACKS ***/+  // Synchronously allocates size bytes on the underlying platform and returns+  // a DeviceMemoryBase representing that allocation. In the case of failure,+  // nullptr is returned.+  // memory_space is reserved for a potential future usage and should be set+  // to 0.+  TF_DeviceMemoryBase* (*allocate)(

The checked-in code here has been updated to include the allocate_unified_memory and deallocate_unified_memory functions. Also, supports_unified_memory is added to SP_Platform (+ some refactoring). https://github.com/tensorflow/tensorflow/commit/b6cc708f01ffd7690da74456a230a6cdc64db8d6

I haven't propagated supports_unified_memory to DeviceDescription yet, probably in next change.
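
For reference, the additions follow the shape proposed earlier in this RFC; roughly (the linked header is the authoritative source for the exact signatures):

// In SP_StreamExecutor (paraphrased; see the linked header for exact signatures):
void* (*unified_memory_allocate)(const SP_Device* device, uint64_t bytes);
void (*unified_memory_deallocate)(const SP_Device* device, void* location);

// And in SP_Platform:
TF_Bool supports_unified_memory;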

annarev

comment created time in a month


Pull request review commenttensorflow/tensorflow

Summary op bm

 tf_cc_test(
     ],
 )
+tf_cc_test(
+    name = "summary_op_benchmark_test",
+    size = "small",
+    srcs = ["summary_op_benchmark_test.cc"],
+    deps = [
+        ":summary_op",
+        "//tensorflow/c:kernels",

Looks like you added it to summary_op_test but it should be added to summary_op_benchmark_test.

dnguyen28061

comment created time in a month

Pull request review commenttensorflow/tensorflow

Summary op bm

+/* Copyright 2020 The TensorFlow Authors. All Rights Reserved.++Licensed under the Apache License, Version 2.0 (the "License");+you may not use this file except in compliance with the License.+You may obtain a copy of the License at++    http://www.apache.org/licenses/LICENSE-2.0++Unless required by applicable law or agreed to in writing, software+distributed under the License is distributed on an "AS IS" BASIS,+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.+See the License for the specific language governing permissions and+limitations under the License.+==============================================================================*/++#include <string>++#include "tensorflow/core/common_runtime/kernel_benchmark_testlib.h"+#include "tensorflow/core/framework/tensor.h"+#include "tensorflow/core/graph/node_builder.h"+#include "tensorflow/core/platform/test.h"+#include "tensorflow/core/platform/test_benchmark.h"+#include "tensorflow/core/framework/tensor_shape.h"++namespace tensorflow {+namespace {++Graph* BM_ScalarSummaryOp(TensorShape shape, std::string tag, +                          float value) {+  Graph* g = new Graph(OpRegistry::Global());+  Tensor tags(DT_STRING, shape);+  Tensor values(DT_FLOAT, shape);+  for (int i = 0; i < tags.NumElements(); ++i){ +    tags.flat<tstring>()(i) = tag; +    values.flat<float>()(i) = value; +  } +  Node* ret;+  TF_CHECK_OK(NodeBuilder(g->NewName("dummy"), "ScalarSummary")+                  .Input(test::graph::Constant(g, tags))+                  .Input(test::graph::Constant(g, values))+                  .Attr("T", DT_FLOAT)+                  .Finalize(g, &ret));+  return g;+}++// Macro used to parse initializer list for tensorshape +#define DIMARGS(...) {__VA_ARGS__}+// Random parameters for testing+constexpr char longTagParam[] = "LONGTAG____________________________"; +constexpr float largeValueParam = 2352352.2623433; ++#define BM_ScalarSummaryDev(device, dims, name, tag, value)           \+  void BM_ScalarSummary##name##device(int iters) {                    \+    testing::StopTiming();                                            \+    TensorShape tensorshape(DIMARGS(dims));                           \+    auto g = BM_ScalarSummaryOp(tensorshape, #tag, value);            \+    testing::StartTiming();                                           \+    test::Benchmark("cpu", g).Run(iters);                             \+  }                                                                   \+  BENCHMARK(BM_ScalarSummary##name##device); ++BM_ScalarSummaryDev(Cpu, (5, 10, 100), Base, Tag, 5.2);+// Benchmark for large shapes +BM_ScalarSummaryDev(Cpu, (500, 1000, 10000), LargeShape, Tag, 5.2);

Can you use a smaller shape? This runs out of memory on my machine. Maybe 500, 100, 100?

dnguyen28061

comment created time in a month

Pull request review commenttensorflow/tensorflow

Summary op bm

+/* Copyright 2020 The TensorFlow Authors. All Rights Reserved.++Licensed under the Apache License, Version 2.0 (the "License");+you may not use this file except in compliance with the License.+You may obtain a copy of the License at++    http://www.apache.org/licenses/LICENSE-2.0++Unless required by applicable law or agreed to in writing, software+distributed under the License is distributed on an "AS IS" BASIS,+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.+See the License for the specific language governing permissions and+limitations under the License.+==============================================================================*/++#include <string>++#include "tensorflow/core/common_runtime/kernel_benchmark_testlib.h"+#include "tensorflow/core/framework/tensor.h"+#include "tensorflow/core/graph/node_builder.h"+#include "tensorflow/core/platform/test.h"+#include "tensorflow/core/platform/test_benchmark.h"+#include "tensorflow/core/framework/tensor_shape.h"++namespace tensorflow {+namespace {++Graph* BM_ScalarSummaryOp(TensorShape shape, std::string tag, +                          float value) {+  Graph* g = new Graph(OpRegistry::Global());+  Tensor tags(DT_STRING, shape);+  Tensor values(DT_FLOAT, shape);+  for (int i = 0; i < tags.NumElements(); ++i){ +    tags.flat<tstring>()(i) = tag; +    values.flat<float>()(i) = value; +  } +  Node* ret;+  TF_CHECK_OK(NodeBuilder(g->NewName("dummy"), "ScalarSummary")+                  .Input(test::graph::Constant(g, tags))+                  .Input(test::graph::Constant(g, values))+                  .Attr("T", DT_FLOAT)+                  .Finalize(g, &ret));+  return g;+}++// Macro used to parse initializer list for tensorshape +#define DIMARGS(...) {__VA_ARGS__}+// Random parameters for testing+constexpr char longTagParam[] = "LONGTAG____________________________"; +constexpr float largeValueParam = 2352352.2623433; ++#define BM_ScalarSummaryDev(device, dims, name, tag, value)           \+  void BM_ScalarSummary##name##device(int iters) {                    \+    testing::StopTiming();                                            \+    TensorShape tensorshape(DIMARGS(dims));                           \+    auto g = BM_ScalarSummaryOp(tensorshape, #tag, value);            \+    testing::StartTiming();                                           \+    test::Benchmark("cpu", g).Run(iters);                             \+  }                                                                   \+  BENCHMARK(BM_ScalarSummary##name##device); ++BM_ScalarSummaryDev(Cpu, (5, 10, 100), Base, Tag, 5.2);+// Benchmark for large shapes +BM_ScalarSummaryDev(Cpu, (500, 1000, 10000), LargeShape, Tag, 5.2);+// Benchmark for large tag tstring +BM_ScalarSummaryDev(Cpu, (5, 10, 100), LongTag, LongTagParam, 5.2);+// Benchmark for large values +BM_ScalarSummaryDev(Cpu, (500, 1000, 10000), LargeValue, Tag, largeValueParam);

Use a smaller shape here.

dnguyen28061

comment created time in a month

Pull request review commenttensorflow/tensorflow

Summary op bm

+/* Copyright 2020 The TensorFlow Authors. All Rights Reserved.++Licensed under the Apache License, Version 2.0 (the "License");+you may not use this file except in compliance with the License.+You may obtain a copy of the License at++    http://www.apache.org/licenses/LICENSE-2.0++Unless required by applicable law or agreed to in writing, software+distributed under the License is distributed on an "AS IS" BASIS,+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.+See the License for the specific language governing permissions and+limitations under the License.+==============================================================================*/++#include <string>++#include "tensorflow/core/common_runtime/kernel_benchmark_testlib.h"+#include "tensorflow/core/framework/tensor.h"+#include "tensorflow/core/graph/node_builder.h"+#include "tensorflow/core/platform/test.h"+#include "tensorflow/core/platform/test_benchmark.h"+#include "tensorflow/core/framework/tensor_shape.h"++namespace tensorflow {+namespace {++Graph* BM_ScalarSummaryOp(TensorShape shape, std::string tag, +                          float value) {+  Graph* g = new Graph(OpRegistry::Global());+  Tensor tags(DT_STRING, shape);+  Tensor values(DT_FLOAT, shape);+  for (int i = 0; i < tags.NumElements(); ++i){ +    tags.flat<tstring>()(i) = tag; +    values.flat<float>()(i) = value; +  } +  Node* ret;+  TF_CHECK_OK(NodeBuilder(g->NewName("dummy"), "ScalarSummary")+                  .Input(test::graph::Constant(g, tags))+                  .Input(test::graph::Constant(g, values))+                  .Attr("T", DT_FLOAT)+                  .Finalize(g, &ret));+  return g;+}++// Macro used to parse initializer list for tensorshape +#define DIMARGS(...) {__VA_ARGS__}+// Random parameters for testing+constexpr char longTagParam[] = "LONGTAG____________________________"; +constexpr float largeValueParam = 2352352.2623433; ++#define BM_ScalarSummaryDev(device, dims, name, tag, value)           \+  void BM_ScalarSummary##name##device(int iters) {                    \+    testing::StopTiming();                                            \+    TensorShape tensorshape(DIMARGS(dims));                           \

I am seeing an "expression result unused" warning. But if I rewrite it as DIMARGS dims, then it works. Can you change it?

dnguyen28061

comment created time in a month

Pull request review commenttensorflow/tensorflow

Summary op bm

+/* Copyright 2020 The TensorFlow Authors. All Rights Reserved.++Licensed under the Apache License, Version 2.0 (the "License");+you may not use this file except in compliance with the License.+You may obtain a copy of the License at++    http://www.apache.org/licenses/LICENSE-2.0++Unless required by applicable law or agreed to in writing, software+distributed under the License is distributed on an "AS IS" BASIS,+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.+See the License for the specific language governing permissions and+limitations under the License.+==============================================================================*/++#include <string>++#include "tensorflow/core/common_runtime/kernel_benchmark_testlib.h"+#include "tensorflow/core/framework/tensor.h"+#include "tensorflow/core/graph/node_builder.h"+#include "tensorflow/core/platform/test.h"+#include "tensorflow/core/platform/test_benchmark.h"+#include "tensorflow/core/framework/tensor_shape.h"++namespace tensorflow {+namespace {++Graph* BM_ScalarSummaryOp(TensorShape shape, std::string tag, +                          float value) {+  Graph* g = new Graph(OpRegistry::Global());+  Tensor tags(DT_STRING, shape);+  Tensor values(DT_FLOAT, shape);+  for (int i = 0; i < tags.NumElements(); ++i){ +    tags.flat<tstring>()(i) = tag; +    values.flat<float>()(i) = value; +  } +  Node* ret;+  TF_CHECK_OK(NodeBuilder(g->NewName("dummy"), "ScalarSummary")+                  .Input(test::graph::Constant(g, tags))+                  .Input(test::graph::Constant(g, values))+                  .Attr("T", DT_FLOAT)+                  .Finalize(g, &ret));+  return g;+}++// Macro used to parse initializer list for tensorshape +#define DIMARGS(...) {__VA_ARGS__}+// Random parameters for testing+constexpr char longTagParam[] = "LONGTAG____________________________"; 

This variable doesn't seem to be used.

dnguyen28061

comment created time in a month


Pull request review commenttensorflow/tensorflow

Summary op bm

 tf_cc_test(
     ],
 )
+tf_cc_test(
+    name = "summary_op_benchmark_test",
+    size = "small",
+    srcs = ["summary_op_benchmark_test.cc"],
+    deps = [
+        ":summary_op",
+        "//tensorflow/c:kernels",

Add "//tensorflow/core:core_cpu", Looks like it is needed for node_builder.h include.

dnguyen28061

comment created time in a month


Pull request review commenttensorflow/tensorflow

Summary op bm

+/* Copyright 2020 The TensorFlow Authors. All Rights Reserved.++Licensed under the Apache License, Version 2.0 (the "License");+you may not use this file except in compliance with the License.+You may obtain a copy of the License at++    http://www.apache.org/licenses/LICENSE-2.0++Unless required by applicable law or agreed to in writing, software+distributed under the License is distributed on an "AS IS" BASIS,+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.+See the License for the specific language governing permissions and+limitations under the License.+==============================================================================*/++#include <string>++#include "tensorflow/core/common_runtime/kernel_benchmark_testlib.h"+#include "tensorflow/core/framework/tensor.h"+#include "tensorflow/core/graph/node_builder.h"+#include "tensorflow/core/platform/test.h"+#include "tensorflow/core/platform/test_benchmark.h"+#include "tensorflow/core/framework/tensor_shape.h"++namespace tensorflow {+namespace {++Graph* BM_ScalarSummaryOp(TensorShape shape, std::string tag, +                          float value) {+  Graph* g = new Graph(OpRegistry::Global());+  Tensor tags(DT_STRING, shape);+  Tensor values(DT_FLOAT, shape);+  for (int i = 0; i < tags.NumElements(); ++i){ +  	tags.flat<tstring>()(i) = tag; +  	values.flat<float>()(i) = value; +  } +  Node* ret;+  TF_CHECK_OK(NodeBuilder(g->NewName("dummy"), "ScalarSummary")+                  .Input(test::graph::Constant(g, tags))+                  .Input(test::graph::Constant(g, values))+                  .Attr("T", DT_FLOAT)+                  .Finalize(g, &ret));+  return g;+}++// Macro used to parse initializer list for tensorshape +#define DIMARGS(...) {__VA_ARGS__}+// Random parameters for testing+constexpr char longTagParam[] = "LONGTAG____________________________"; +constexpr float largeValueParam = 2352352.2623433; ++#define BM_ScalarSummaryDev(device, dims, name, tag, value)       		\+	void BM_ScalarSummary##name##device(int iters) { 	                  \

Some parts of this macro are not well-aligned. For example, the function body should be indented by 2 spaces relative to the function header. Also, I can't tell if it

Also, can you call testing::StopTiming(); at the start of this function and testing::StartTiming() right before the test::Benchmark call? Basically, we just want to measure how long it takes to run test::Benchmark. That's why we want to stop timing as soon as we can and start it right before the test::Benchmark call.
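
Concretely, something along these lines is what I have in mind (a sketch only; it also uses DIMARGS dims so the shape macro expands correctly):

#define BM_ScalarSummaryDev(device, dims, name, tag, value)           \
  void BM_ScalarSummary##name##device(int iters) {                    \
    testing::StopTiming();                                            \
    TensorShape tensorshape(DIMARGS dims);                            \
    auto g = BM_ScalarSummaryOp(tensorshape, #tag, value);            \
    testing::StartTiming();                                           \
    test::Benchmark("cpu", g).Run(iters);                             \
  }                                                                   \
  BENCHMARK(BM_ScalarSummary##name##device);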

dnguyen28061

comment created time in a month


Pull request review commenttensorflow/community

RFC: StreamExecutor C API

+# StreamExecutor C API++| Status        | Proposed                                                |+| :------------ | :------------------------------------------------------ |+| **RFC #**     | [257](https://github.com/tensorflow/community/pull/257) |+| **Author(s)** | Anna Revinskaya (annarev@google.com), Penporn Koanantakool (penporn@google.com), Yi Situ (yisitu@google.com), Russell Power (power@google.com) |+| **Sponsor**   | Gunhan Gulsoy (gunan@google.com)                        |+| **Updated**   | 2020-07-15                                              |++# Objective++Provide basic device management C API to allow new devices to modularly connect+to the current TensorFlow runtime.++## Goals++*   C API wrapper of a subset of methods in StreamExecutorInterface.+*   Best-effort API and ABI stability after an initial experimental phase.++## Non-goals++*   Compatibility with the+    [new TensorFlow runtime stack](https://blog.tensorflow.org/2020/04/tfrt-new-tensorflow-runtime.html).+*   APIs that will expose all device-specific capabilities. ++# Motivation++Current device support in TensorFlow adds code directly into the+[main TensorFlow repository](http://github.com/tensorflow/tensorflow). This+approach is+[not scalable](https://github.com/tensorflow/community/blob/master/rfcs/20190305-modular-tensorflow.md#adding-support-for-new-hardware-is-very-difficult-and-not-scalable)+because it adds complexity to the build dependency and tool chains, takes longer+time to build, and requires the TensorFlow team’s review. To handle the surge in+new hardware accelerators and programming paradigms, TensorFlow must allow+device addition in a modular manner: contributors code outside of the TensorFlow+repository and distribute a binary module which would connect to TensorFlow at+runtime through a stable application binary interface (ABI).++The new TensorFlow stack, based on+[TFRT](https://blog.tensorflow.org/2020/04/tfrt-new-tensorflow-runtime.html) and+[MLIR](https://www.tensorflow.org/mlir), is designed with this in mind. However,+it is still in an active development phase and is not ready for third-party+device integration until later this year. (For device support expecting to land+in 2021 or later, we highly recommend waiting to integrate with the new stack,+since it is fundamentally different from the current stack and cannot guarantee+code reuse.)++In the meantime, we plan to provide limited device integration support for the+current TensorFlow stack through+[Modular TensorFlow](https://github.com/tensorflow/community/blob/master/rfcs/20190305-modular-tensorflow.md).+We anticipate three basic functionalities within a device plugin module:++*   Device registration: Addressed in a different RFC, [Adding Pluggable Device for TensorFlow](https://github.com/tensorflow/community/pull/262).+*   Device management: The focus of this RFC.+*   Kernel and op registration and implementation:+    [RFC Accepted](https://github.com/tensorflow/community/blob/master/rfcs/20190814-kernel-and-op-registration.md). [C API implemented](https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/c/).++[StreamExecutor](https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/stream_executor/stream_executor_pimpl.h;l=73) is TensorFlow's main device manager, responsible for work execution and memory management. 
It provides a set of methods+(such as+[Memcpy](https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/stream_executor/stream_executor_internal.h;l=240))+that can be customized for a particular device.++We propose a C API wrapper of a subset of methods in+[StreamExecutorInterface](https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/stream_executor/stream_executor_internal.h;l=166?q=StreamExecutorinterface)+as an ABI-stable way to register a custom StreamExecutor platform.++# User Benefits++A decoupled way to add a new device to TensorFlow.++*   Simpler process: Does not have to add a new build toolchain to TensorFlow+*   Faster time-to-solution: Does not need code review from the TensorFlow team.+*   Lower maintenance efforts: Only C-API-related changes could break the+    integration. Unrelated TensorFlow changes would not break the code.+       *    The C APIs may be changed during the initial experimental phase based +            on developer experience and feedback. When the APIs become more mature, +            we will try to keep them stable (in a best-effort manner) until the new +            TensorFlow stack is available.++# Design Proposal++## StreamExecutorInterface++[StreamExecutorInterface](https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/stream_executor/stream_executor_internal.h;l=166?q=StreamExecutorinterface)+is quite large and some of its methods are only sporadically used. Therefore, we+plan to wrap only a subset of key StreamExecutorInterface functionality. We decided on this subset based on the PluggableDevice usecase as well as potential future devices such as TPUs.++Implementation conventions:++* Structs include `struct_size` parameter. This parameter should be filled in both by TensorFlow and the plugin and can be checked to determine which struct fields are available for current version of TensorFlow.+* Struct name prefixes indicates which side of the API is responsible for populating the struct:+  * `SE_` prefix: filled by TensorFlow.+  * `SP_` prefix: filled by plugins, except `struct_size` which is also filled by TensorFlow when TensorFlow passes it to a callback.++See proposed C API below:++```cpp+#include <stddef.h>+#include <stdint.h>++#define SE_MAJOR 0+#define SE_MINOR 0+#define SE_REVISION 1++#ifdef __cplusplus+extern "C" {+#endif++typedef SP_Stream_st* SP_Stream;+typedef SP_Event_st* SP_Event;+typedef SP_Timer_st* SP_Timer;+typedef TF_Status* (*TF_StatusCallbackFn)(void*);++#ifndef TF_BOOL_DEFINED+#define TF_BOOL unsigned char+#endif // TF_BOOL_DEFINED++#ifndef TF_OFFSET_OF_END+#define TF_OFFSET_OF_END(TYPE, MEMBER) (offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))+#endif // TF_OFFSET_OF_END++typedef struct SP_TimerFns {+  size_t struct_size;+  void* ext;+  uint64_t (*nanoseconds)(SE_Timer timer);+  uint64_t (*microseconds)(SE_Timer timer);+} SP_TimerFns;++#define SP_TIMER_FNS_STRUCT_SIZE TF_OFFSET_OF_END(SP_TimerFns, microseconds)++typedef struct SP_AllocatorStats {+  size_t struct_size;+  void* ext;+  int64_t num_allocs;+  int64_t bytes_in_use;+  int64_t peak_bytes_in_use;+  int64_t largest_alloc_size;++  int8_t has_bytes_limit;+  int64_t bytes_limit;++  int64_t bytes_reserved;+  int64_t peak_bytes_reserved;++  int8_t has_bytes_reservable_limit;+  int64_t bytes_reservable_limit;++  int64_t largest_free_block_bytes;+} SP_AllocatorStats;++#define SP_ALLOCATORSTATS_STRUCT_SIZE TF_OFFSET_OF_END(SP_AllocatorStats, largest_free_block_bytes)++typedef enum SE_EventStatus {+  SE_EVENT_UNKNOWN,+  
SE_EVENT_ERROR,+  SE_EVENT_PENDING,+  SE_EVENT_COMPLETE,+} SE_EventStatus;++typedef struct SE_Options {+  size_t struct_size;+  void* ext;+  int32_t ordinal;+} SE_Options;++#define SE_OPTIONS_STRUCT_SIZE TF_OFFSET_OF_END(SE_Options, ordinal)++typedef struct SE_DeviceMemoryBase {+  size_t struct_size;+  void* ext;+  void* opaque;+  uint64_t size;+  uint64_t payload;+} SE_DeviceMemoryBase;++#define SE_DEVICE_MEMORY_BASE_STRUCT_SIZE TF_OFFSET_OF_END(SE_DeviceMemoryBase, payload)++typedef struct SP_Device {+  size_t struct_size;+  void* ext;+  const char* name;+  size_t name_len;++  // Device vendor can store handle to their device representation+  // here.+  void* device_handle;++  // Any kind of data that plugin device might want to store.+  void* ext;++} SP_Device;++#define SP_DEVICE_STRUCT_SIZE TF_OFFSET_OF_END(SP_Device, data)++typedef struct SP_StreamExecutor {+  size_t struct_size;+  void* ext;++  /*** ALLOCATION CALLBACKS ***/+  // Synchronously allocates size bytes on the underlying platform and returns+  // a DeviceMemoryBase representing that allocation. In the case of failure,+  // nullptr is returned.+  // memory_space is reserved for a potential future usage and should be set+  // to 0.+  TF_DeviceMemoryBase* (*allocate)(

Just to make sure, is unified memory a memory address space that can be accessed by any device?

Unrelated to this specific question, but so far I haven't added DeviceDescription to the API. I wasn't sure if it is useful. Do you think any of the other fields in DeviceDescription would be useful for pluggable devices or just support_unified_memory: https://cs.opensource.google/tensorflow/tensorflow/+/master:tensorflow/stream_executor/device_description.h;l=40?q=DeviceDescription%20streamexecutor

annarev

comment created time in a month


push eventannarev/community

Anna Revinskaya

commit sha b42f06e689786262e96a932973732c1f4ec696c7

Added unified memory allocate/deallocate

view details

push time in a month


Pull request review commenttensorflow/tensorflow

Merge summary kernel with C API

+/* Copyright 2020 The TensorFlow Authors. All Rights Reserved.
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+    http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+==============================================================================*/
+
+#include <sstream>
+#include <unordered_set>
+#include <memory>
+
+#include "tensorflow/c/kernels.h"
+#include "tensorflow/c/tf_tensor.h"
+#include "tensorflow/c/tf_status.h"
+#include "tensorflow/core/framework/selective_registration.h"
+#include "tensorflow/core/framework/summary.pb.h"
+#include "tensorflow/core/framework/types.h"
+#include "tensorflow/core/platform/macros.h"
+#include "tensorflow/core/platform/protobuf.h"
+#include "tensorflow/core/platform/tstring.h"
+#include "tensorflow/core/platform/default/logging.h"

Change #include "tensorflow/core/platform/default/logging.h" to #include "tensorflow/core/platform/logging.h". Code should include the portable header rather than the platform/default/ implementation directly.

dnguyen28061

comment created time in a month

Pull request review comment: tensorflow/tensorflow

Merge summary kernel with C API

[Quoted diff hunk: BUILD file, a blank line added after a tf_kernel_library target.]

Remove the blank line here. Basically, keep exactly one blank line between targets.

dnguyen28061

comment created time in a month

Pull request review comment: tensorflow/tensorflow

Merge summary kernel with C API

[Quoted diff hunk: BUILD file, a blank line removed after a tf_kernel_library target.]

Add blank line here.

dnguyen28061

comment created time in a month

PullRequestReviewEvent

Pull request review comment: tensorflow/tensorflow

Added TF_ForwardInputOrAllocateOutput to Kernel C API

[Quoted diff hunk: the TestForwardInputOrAllocateOutput test; the REGISTER_OP(op_name) chain ends with a doubled semicolon.]

Nit: remove the extra semicolon.

dnguyen28061

comment created time in a month

Pull request review comment: tensorflow/tensorflow

Added TF_ForwardInputOrAllocateOutput to Kernel C API

[Quoted diff hunk: the TestForwardInputOrAllocateOutput compute function; it deletes the TF_Status but not the output tensor returned by TF_ForwardInputOrAllocateOutput.]

Also, call TF_DeleteTensor(output);
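For example, a sketch of how the end of the compute lambda could read, with `output` being the tensor returned by TF_ForwardInputOrAllocateOutput and `s` the TF_Status created at the top of the lambda:

```cpp
    EXPECT_EQ(TF_FLOAT, TF_TensorType(output));
    EXPECT_EQ(0, TF_NumDims(output));
    // Release everything this compute function created: the forwarded (or
    // newly allocated) output tensor and the status object.
    TF_DeleteTensor(output);
    TF_DeleteStatus(s);
```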

dnguyen28061

comment created time in a month

PullRequestReviewEvent

Pull request review comment: tensorflow/community

RFC: StreamExecutor C API

[Quoted diff hunk from the StreamExecutor C API RFC, from the title and motivation through the proposed header, ending at the SP_Device struct, which declares `void* ext` twice.]

good eye

annarev

comment created time in a month

PullRequestReviewEvent

push event: annarev/community

Anna Revinskaya

commit sha 31de234a153eb36c84de3a9045ed57e3b8501f05

Add device-to-device memcopies, removed extra void* ext


push time in a month

Pull request review comment: tensorflow/community

RFC: StreamExecutor C API

[Quoted diff hunk from the StreamExecutor C API RFC (same hunk as above), ending at SP_StreamExecutor's allocate callback.]

Sounds good. Do the names unified_memory_allocate/unified_memory_deallocate still sound good, or should we use different ones?

annarev

comment created time in a month

PullRequestReviewEvent

Pull request review comment: tensorflow/community

RFC: StreamExecutor C API

[Quoted diff hunk from the StreamExecutor C API RFC, covering SP_StreamExecutor's stream, event, timer, and memcpy callbacks (memcpy_to_host/memcpy_from_host and their synchronous variants); no device-to-device copy is defined yet.]

Yep, I actually just submitted the implementation with device-to-device today: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/c/experimental/stream_executor/stream_executor.h
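For reference, the device-to-device pair roughly mirrors the host-copy callbacks in the RFC; the sketch below is paraphrased, so treat the linked header as the authoritative source for the exact names and signatures:

```cpp
  // Sketch of device-to-device copy callbacks in SP_StreamExecutor.
  // Enqueues an asynchronous device-to-device copy of `size` bytes on `stream`.
  void (*memcpy_dtod)(const SP_Device* device, SP_Stream stream,
                      SP_DeviceMemoryBase* device_dst,
                      const SP_DeviceMemoryBase* device_src,
                      uint64_t size, TF_Status* status);

  // Blocks the caller while `size` bytes are copied between two device buffers.
  void (*sync_memcpy_dtod)(const SP_Device* device,
                           SP_DeviceMemoryBase* device_dst,
                           const SP_DeviceMemoryBase* device_src,
                           uint64_t size, TF_Status* status);
```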

annarev

comment created time in a month

PullRequestReviewEvent