@zhangxiao-stack

fix hip support

zhangxiao-stack and others added 3 commits January 5, 2023 18:03
* decompose disc-compiler

* update

Co-authored-by: Yan Xu <[email protected]>
* [xla][mlir][sparse] override sparse shape behavior for xla runtime path

PiperOrigin-RevId: 506126261

* Create a codec class for built-in `TypeSpec`s to register, to make `TypeSpec` classes follow the codec structure used by the rest of `nested_structure_coder.py`.

Also remove `nested_structure_coder.py`'s dependency on `dataset_ops.DatasetSpec`, `values.PerReplicaSpec`, `iterator_ops.IteratorSpec`, and `optional_ops.OptionalSpec`.

PiperOrigin-RevId: 506126332

* Fixes shape inference of LookupTableImportV2 to handle scalar values.

PiperOrigin-RevId: 506126405

* Update Android full build script

Use `configure` script instead of obsolete `configure_android_workspace`

PiperOrigin-RevId: 506130660

* Refactor keras/metrics to be modular.

PiperOrigin-RevId: 506144312

* Internal change to the ARM build.

PiperOrigin-RevId: 506145147

* gpu_delegate: Allow undefined symbol

PiperOrigin-RevId: 506148959

* opencl_wrapper: Update build rule to use opencl icd loader if necessary

PiperOrigin-RevId: 506152314

* Improve `experimental_get_compiler_ir` captured-input support for `TensorSpec`.

Major changes include:
* Enable `compiler_ir.from_concrete_function` to support `specialize_flat_input`.
* Improve `experimental_get_compiler_ir` functionality: support captured inputs.

PiperOrigin-RevId: 506158256

* [StableHLO to MHLO] Improve Python bindings for MHLO

StableHLO PR: https://github.com/openxla/stablehlo/pull/283.

PiperOrigin-RevId: 506161080

* Avoid unnecessary polling in EventMgr.

The TF EventMgr lets you enqueue a std::function to be run when an se::Stream
finishes all the work that's currently enqueued on it.

It does this by creating se::Event's on the streams and periodically polling
all of them to see if they're completed.  This poll loop is very expensive for
some clients.

If you have two se::Event's enqueued on the same se::Stream and the first event
has not been hit yet, then you can be sure that the second one also hasn't been
hit: A Stream's work runs in strict FIFO order.

Previously EventMgr would check all of the events on every stream, doing
unnecessary work.  This CL changes it so it stops after the first event on a
stream that hasn't been hit yet.  If there are often multiple events pending on
a particular stream, this should save significant CPU.

While we're here, we also cleaned up EventMgr.  Previously it had additional
functionality about freeing tensors, but this was ripped out a while ago.
Cleaning this up allows us to simplify the code somewhat.

PiperOrigin-RevId: 506161538
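
The early-stopping poll described above can be sketched as a toy model. This is illustrative Python, not the real C++ EventMgr; the class, dict-based events, and callback bookkeeping are all assumptions made for the sketch, but the FIFO early-stop logic matches the description.

```python
from collections import deque

class EventMgr:
    """Toy model of EventMgr's early-stopping poll loop (illustrative names)."""

    def __init__(self):
        self.streams = {}  # stream name -> deque of (event, callback)

    def enqueue(self, stream, event, callback):
        self.streams.setdefault(stream, deque()).append((event, callback))

    def poll(self):
        """Run callbacks for completed events. Because a stream's work runs
        in strict FIFO order, stop at the first pending event on each stream:
        later events on that stream cannot have completed either."""
        polls = 0
        for pending in self.streams.values():
            while pending:
                event, callback = pending[0]
                polls += 1  # one event-status query
                if not event["done"]:
                    break  # skip the rest of this stream's events
                pending.popleft()
                callback()
        return polls  # number of status queries issued this cycle
```

With three pending events on one stream, a poll cycle now issues a single status query instead of three, which is the CPU saving the change describes.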

* [StableHLO to MHLO] Relax dimension checks in TriangularSolveOp

StableHLO PR: https://github.com/openxla/stablehlo/pull/893.

PiperOrigin-RevId: 506162066

* [XLA] Use the async copy elapsed instead of prefetch interval picker to decide whether to disable end-of-program prefetch optimization.

The shape override introduced in cl/504951495 caused the heuristic that disables
end-of-program prefetch optimization to break since it was using the prefetch
interval picker to gauge how long the cross-program prefetch is going to be
live. This CL changes the logic to use the cost analysis directly.

PiperOrigin-RevId: 506172259

* Minor touch up in release notes for 2.12.

PiperOrigin-RevId: 506185475

* [StableHLO to MHLO] Handle bounds in the WhileOp shape function

PiperOrigin-RevId: 506186744

* [XLA] Fix HLO parser for attribute allow_spmd_sharding_propagation_to_output.

PiperOrigin-RevId: 506195622

* [StableHLO to MHLO] Remove AllShapesMatch from DynamicUpdateSliceOp

StableHLO PR: https://github.com/openxla/stablehlo/pull/892.

PiperOrigin-RevId: 506199864

* Implement functions for retrieving initializer functions in `tf_saved_model` dialect.

Retrieving initializer functions is a common operation done in TensorFlow graph transformation passes.
This change provides functions for this in the `tf_saved_model` dialect.
This also replaces initializer function retrieval codes with the new functions.

PiperOrigin-RevId: 506201497

* Removed `ParallelTensor` from `TensorWithLayout` and used `TensorHandlePtr`.

PiperOrigin-RevId: 506209442

* [xla:cpu] Add debug info to XLA CPU pipeline

This adds a pass that provides some debug info with which basic line number info can be generated.

Adapted from Flang's AddDebugFoundationPass.

PiperOrigin-RevId: 506213461

* update fuzztest dependency

PiperOrigin-RevId: 506217195

* Remove references to stream_executor/lib

PiperOrigin-RevId: 506225078

* [XLA:GPU] Handle device buffers more safely in run_hlo_module

This fixes double-free errors or memory leaks for example when the running of the HLO is unsuccessful.

The old code path is also left in place, as a lot of our code depends on the ability to run the same HLO multiple times without reallocating the input buffers.

PiperOrigin-RevId: 506238363

* compat: Update forward compatibility horizon to 2023-02-01

PiperOrigin-RevId: 506239134

* Update GraphDef version to 1394.

PiperOrigin-RevId: 506239156

* Fix a typo in the documentation in preemption_watcher.py

PiperOrigin-RevId: 506240202

* Rollback of PR #58763

PiperOrigin-RevId: 506243978

* [GmlSt] Group tiling passes for cpu, gpu and triton.

PiperOrigin-RevId: 506244287

* Propagate quantize_params in prepare_pass

PiperOrigin-RevId: 506252805

* [GmlSt] Remove bufferization test pass. Use hlo-one-shot-bufferize instead.

PiperOrigin-RevId: 506260068

* Fix build breakage for DPB.

PiperOrigin-RevId: 506261904

* Integrate LLVM at llvm/llvm-project@00ce96b02e87

Updates LLVM usage to match
[00ce96b02e87](https://github.com/llvm/llvm-project/commit/00ce96b02e87)

PiperOrigin-RevId: 506269173

* Implement clamping of dynamic{_update,}slice start indices.

PiperOrigin-RevId: 506270039
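
The clamping semantics for `dynamic_slice` / `dynamic_update_slice` start indices can be sketched in a few lines. This is a minimal model of the documented XLA behavior (each start index is clamped into `[0, operand_dim - slice_size]`); the helper name is illustrative.

```python
def clamp_start_indices(start_indices, operand_shape, slice_sizes):
    """Clamp each start index into [0, operand_dim - slice_size] so the
    slice always stays in bounds (sketch of the XLA semantics)."""
    return [min(max(s, 0), dim - size)
            for s, dim, size in zip(start_indices, operand_shape, slice_sizes)]
```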

* [xla:gpu] Add verbose logging to cuda graph library

Optionally print captured graphs dot files to help with debugging

PiperOrigin-RevId: 506270560

* Update TFRT dependency to use revision
http://github.com/tensorflow/runtime/commit/89dc2707c7195dc2b839c7a1a987309d91fc89c7.

PiperOrigin-RevId: 506270854

* [GmlSt] Split vectorization.cc into vectorize_copy/vectorize_for_cpu,gpu files.

PiperOrigin-RevId: 506273813

* Fix bounds checks.

- transfer_{read,write} was checking memory bounds incorrectly.
- check all buffer accesses.
- make invalid accesses interpreter failures instead of asserting.

PiperOrigin-RevId: 506286548
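
The in-bounds condition for a strided `transfer_read`/`transfer_write` can be modeled as: the largest element the access touches must lie inside the buffer. This is a simplified sketch (flat buffer, non-negative strides assumed; the function name is illustrative, not the interpreter's API).

```python
def access_in_bounds(buffer_size, offset, sizes, strides):
    """Check a strided access against a flat buffer of `buffer_size`
    elements. Assumes non-negative strides for simplicity."""
    if offset < 0 or any(n < 0 for n in sizes):
        return False
    if any(n == 0 for n in sizes):
        return True  # an empty access touches nothing
    # Index of the last element touched along all dimensions.
    last = offset + sum((n - 1) * st for n, st in zip(sizes, strides))
    return last < buffer_size
```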

* Manage snapshot streams assignments in tf.data service dispatcher.

Related changes:
- Added `DispatcherService::GetSnapshotStreams`, a new readonly API for seeing the state of snapshot stream assignments from the dispatcher's perspective.
- Made `DispatcherConfig.worker_timeout_ms` configurable.

PiperOrigin-RevId: 506287683

* Remove multiple defines of XlaLegalizeTFNoFallback

This occurred because xla_legalize_tf_passes.h.inc technically depends on all passes listed in the .td file being defined. However, the no-fallback pass is intentionally kept in a separate target. For now, depend on no-fallback so that xla_legalize_tf is correct; xla_legalize_tf_no_fallback should eventually be moved fully to a separate .td/.h file so it doesn't surface unsupported methods.

PiperOrigin-RevId: 506290313

* Skip invalid candidates, add flag for no canonicalization, bisect for errors.

Don't ask me how long it took me to realize that canonicalization goof while
debugging canonicalization.

PiperOrigin-RevId: 506291648

* Fix hybrid indy lstm by forwarding `recurrent_to-*` parameters to `ComputeRowSums`.

PiperOrigin-RevId: 506312178

* Fix a bug in which an invalidated reference to a hash table element is used after a potential rehash.

`emplace` can cause a rehash that invalidates references to elements in the hashtable.

PiperOrigin-RevId: 506313210

* Add path to snapshot-level done file in tf.data service snapshot on-disk state.

PiperOrigin-RevId: 506317430

* Identify the "file_prefix" tensor by matching the unique `tf_saved_model.index_path` attribute.

Currently the `file_prefix` tensor, which is used to identify the directory to the checkpoint file from which the variables are restored,
is identified by relying on the fact that it is used as an input to the restore op. Doing so makes some assumptions (the name of the restore op) and is prone to accidental conflict.
We can find the file_prefix tensor with more certainty by seeing whether the `tf_saved_model.index_path` attribute matches `__tf_file_prefix`.

PiperOrigin-RevId: 506318827

* Add abstract base types for common `dataset_ops` types.

The presently added types do not define any abstract methods,
attributes, properties etc. for their equivalent `dataset_ops`
concrete types. I.E., they do not currently define the "shape"
of the type and are primarily intended for use in `isinstance`
checks to avoid a direct dependency on the concrete type.

The types are currently only exported under the internal namespace.

PiperOrigin-RevId: 506320622

* TF Lite Runtime support for Python 3.10 under glibc 2.31

* Improve DPB documentation.

PiperOrigin-RevId: 506333596

* Update ANDROID_NDK_API_LEVEL default in configure.py

PiperOrigin-RevId: 506335783

* #tf-data Ramp down `stage_based_autotune` to do analysis based on the data collected.

PiperOrigin-RevId: 506340159

* Integrate LLVM at llvm/llvm-project@0ece2050da3e

Updates LLVM usage to match
[0ece2050da3e](https://github.com/llvm/llvm-project/commit/0ece2050da3e)

PiperOrigin-RevId: 506340480

* [xla][mlir][sparse] allow sparse shapes at construction time only

PiperOrigin-RevId: 506342697

* Update TFRT dependency to use revision
http://github.com/tensorflow/runtime/commit/afca233650bc0ce402e8a9a07787732b04bef7aa.

PiperOrigin-RevId: 506343516

* [GmlSt] Add dim(gml_st.parallel) and gml_st.parallel(tensor.cast) patterns.

Additional canonicalization patterns for gml_st.parallel loop.

PiperOrigin-RevId: 506344025

* Temporarily disable the flaky test for Kokoro build.

PiperOrigin-RevId: 506345945

* small cleanup of fuzz helper

PiperOrigin-RevId: 506350422

* #tf-data-service Clean up checkpoints after completing the snapshot.

PiperOrigin-RevId: 506355658

* Check that RefineDynamicShapes doesn't leave dynamic shapes around

It is expected that RefineDynamicShapes in the XlaCallModuleOp kernel fully specializes the StableHLO program to static shapes. However, we aren't checking that, so specialization failures may go unnoticed and manifest downstream in weird ways where they are harder to debug. This CL introduces an early check for this.

This is a second attempt at landing this CL. The first attempt broke some tests and got rolled back. Now the broken test is disabled because it was relying on wrong behavior that we started detecting thanks to the increased scrutiny implemented here.

PiperOrigin-RevId: 506356516

* Expand applicability of real_dynamic_slice canonicalizers

At the moment, RealDynamicSliceOp => SliceOp canonicalization only works when start_indices, limit_indices and strides are all of type arith::ConstantOp. This CL extends canonicalization to handle any kind of m_Constant ops.

Furthermore, this CL supersedes the RealDynamicSliceIsStatic C++ pattern with the RealDSliceToSlice TableGen pattern. I'm not sure why both of these patterns were enabled when they are doing roughly the same thing.

PiperOrigin-RevId: 506356645

* Add warning about assumed input_signatures

PiperOrigin-RevId: 506357398

* [GmlSt] Use upstream patterns to collapse extract/insert_slice.

PiperOrigin-RevId: 506358242

* Modify LiteralTestUtil to ensure dynamic dimensions are equivalent when checking equality. Previously, LiteralTestUtil would consider two dynamic literals equal as long as they had identical elements, even if their dynamic dimensions differed.

PiperOrigin-RevId: 506359222

* feat: update boringssl to fix aarch64 build failures

PiperOrigin-RevId: 506366004

* [TF:PJRT] Use PjRtDeviceContext in XlaDevice.

- Use AsyncValueAllocator as the allocator when PjRtDeviceContext is used.
- Update places that use XlaDeviceContext as signature to DeviceContext.
- Change GetXlaOpsCommonFlags to return XlaOpsCommonFlags* so that the flag tf_xla_use_device_api can be set in the test.
- Implement Name() in AsyncValueAllocator which is a virtual function.

PiperOrigin-RevId: 506369982

* Remove time (AKA time_fraction) field, since it's no longer used.

We now compute this in the frontend to avoid storing this redundant field in the protobuf.

PiperOrigin-RevId: 506372540

* Fix crash in simplifyDynamicGatherToGather

DynamicGatherOp's slice_sizes is 1DTensorOf<[HLO_DimensionValue]> where HLO_DimensionValue is AnyTypeOf<[Index, HLO_Int]>. However, GatherOp's slice_sizes is I64ElementsAttr. If there's a mismatch in element types, canonicalization from DynamicGatherOp to GatherOp will crash, so in that case we need to explicitly convert the elements.

PiperOrigin-RevId: 506374817

* Remove legacy references from `ops.py`.

This is done to eventually remove the lazy loads in `indexed_slices.py`.

PiperOrigin-RevId: 506375428

* Upgrade clang toolchain to use clang-16.

PiperOrigin-RevId: 506381712

* [xla:gpu] Remove check on XLA_FLAGS when doing deserialization

We no longer need this because XLA Runtime is enabled by default.

PiperOrigin-RevId: 506382068

* Adding profiler assertions for TPU.

PiperOrigin-RevId: 506382173

* Fix an "operand does not dominate" bug caused by tpu_extract_outside_compilation. tpu_extract_outside_compilation can create a _XlaRecvAtHostOp or a _XlaRecvAtHostV2Op op to receive on the host side. Its operand-identification function (GetStaticExternalOperands) avoids including operands that are already on the host by checking whether they are set by a recv, since a recv would already have been created. The bug was that only _XlaRecvAtHostV2Op was counted as a recv, not _XlaRecvAtHostOp.

PiperOrigin-RevId: 506383107

* Add copybara config tests.

PiperOrigin-RevId: 506396032

* Update TFRT dependency to use revision
http://github.com/tensorflow/runtime/commit/250ad8a0ccdab9d6882931d0dcdfa8fa73eceadf.

PiperOrigin-RevId: 506399106

* [IFRT] Add additional ArrayImpl tests with various host buffer semantics

Additional tests verify that the `Array` implementation meets the API contract
as defined by `HostBufferSemantics`. This change also adds a revised version of
the `PjRtBuffer::HostBufferSemantics` comment. It does not yet define a new
IFRT `HostBufferSemantics` type for JAX compatibility.

PiperOrigin-RevId: 506401836

* [GmlSt] Use the gml-st-cpu-tiling-pipeline to test transformations.

We used to test separate transformation patterns that don't include
vectorization. These tests check the transformation + vectorization.
Later there will be additional CHECKs in the same files for bufferization to
verify that we don't allocate inside the loops.

Reduce and matmul will be updated in a follow-up.

PiperOrigin-RevId: 506408069

* Silence some pytype errors.

PiperOrigin-RevId: 506409026

* Update TFRT dependency to use revision
http://github.com/tensorflow/runtime/commit/c6df8512943f31fbf1d2cf3fcdcbc6bc1aa747db.

PiperOrigin-RevId: 506409909

* Add a DistributedValue that is backed by DTensor instance.

This can be used as the input and output value for a strategy.run function.

PiperOrigin-RevId: 506414540

* Redirect usages of `convert_variables_to_constants` from `graph_util_impl.py` to `convert_to_constants.py` to remove a cycle.

PiperOrigin-RevId: 506414914

* Add out of bounds array check to dynamic_stitch_op.

PiperOrigin-RevId: 506418249
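
The kind of check being added can be sketched against `tf.dynamic_stitch`'s merge semantics: each index array scatters its data into the output, so every index must be validated before it is used as a write position. This is a minimal Python model (lists stand in for tensors; the real kernel is C++), not the TF implementation.

```python
def dynamic_stitch(indices_list, data_list):
    """Sketch of dynamic_stitch: merge data tensors into one output
    according to index arrays, rejecting out-of-range indices."""
    flat = [(i, d) for idx, dat in zip(indices_list, data_list)
            for i, d in zip(idx, dat)]
    size = max((i for i, _ in flat), default=-1) + 1
    out = [None] * size
    for i, d in flat:
        if not (0 <= i < size):  # the out-of-bounds check
            raise ValueError(f"index {i} out of range [0, {size})")
        out[i] = d  # later occurrences of an index win, as in TF
    return out
```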

* Partial rollforward of PR #59315. Bring back two following fixes for TF MacOS + Metal PluggableDevice:
- TensorList op exclusion for MacOS
- Temporary hack to avoid jit_compile on MacOS.

Eigen buffer alignment fix is not included in this rollforward and will be in a separate commit.
END_PUBLIC

*** Reason for rollback ***

Partial rollforward of PR #59315.

*** Original change description ***

Automated g4 rollback of changelist 504212615.

Rollback PR #59315. Breaks MacOS tests. For eg: tensorflow/core/framework:tensor_test

PiperOrigin-RevId: 506419803

* [XLA] Create skeleton for a partition assignment pass, which annotates the given module with (good) shardings, by adding:
- an HLO pass: PartitionAssignment
- a base class: PartitioningAlgorithm,
- a no-op derived class extending PartitioningAlgorithm: NoopPartitioning, and
- a flag to determine the algorithm (kind/type): xla_partitioning_algorithm.

PiperOrigin-RevId: 506423268

* [XLA:CPU] Add concat benchmarks

PiperOrigin-RevId: 506427653

* Fix memory corruption vulnerability in reverse_sequence_op.

PiperOrigin-RevId: 506433062

* [XLA] Support merging partially replicated dimension in complex resharding code.

PiperOrigin-RevId: 506433374

* Implement Tensorflow verification pass that ensures no TF dialect ops remain.

Required so that we can remove the allow_partial_conversion check in LegalizeTF, which is required to only call LegalizeTF once.

PiperOrigin-RevId: 506434351

* Delete SetPayload(absl::string_view, absl::string_view);

PiperOrigin-RevId: 506438247

* Integrate LLVM at llvm/llvm-project@dbd02002dd0c

Updates LLVM usage to match
[dbd02002dd0c](https://github.com/llvm/llvm-project/commit/dbd02002dd0c)

PiperOrigin-RevId: 506440373

* [Tensorflow] Fix security vulnerability with TensorListSplitOp

PiperOrigin-RevId: 506441188

* Remove unused code in cost analysis

PiperOrigin-RevId: 506441280

* Update TFRT dependency to use revision
http://github.com/tensorflow/runtime/commit/851d62673267a061aab673a33fa9ad37a5aa39fb.

PiperOrigin-RevId: 506442730

* Replace `error_message()` with `message()` since we have upgraded to a newer protobuf

PiperOrigin-RevId: 506443040

* #tf-data-service Add a test util for `choose_from_datasets`.

PiperOrigin-RevId: 506444429

* #tf-data-service Add a check for infinite datasets.

The next step is to support `repeat`, for example:

datasets = [tf.data.Dataset.from_tensors("a").repeat(10),
            tf.data.Dataset.from_tensors("b").repeat(10),
            tf.data.Dataset.from_tensors("c").repeat(10)]
choice_dataset = tf.data.Dataset.range(3).repeat()
dataset = tf.data.Dataset.choose_from_datasets(datasets, choice_dataset)

PiperOrigin-RevId: 506448078

* Recognize empty input_signatures with default value parameters

PiperOrigin-RevId: 506449019

* Limit the thread pool size of the TFE context used for constant folding

PiperOrigin-RevId: 506454804

* [jax] Skip compilation cache test for older jaxlibs

PiperOrigin-RevId: 506460144

* Add back CLANG_CUDA_COMPILER_PATH to gpu_clang.bazelrc.

PiperOrigin-RevId: 506468121

* Rollback the change to add GPU PJRT client.

PiperOrigin-RevId: 506477686

* - Add _cast() to TraceType
- Implement _cast() to default types and TensorSpec

PiperOrigin-RevId: 506479924

* Cast `status.message()` explicitly to `std::string`

PiperOrigin-RevId: 506502241

* Canonicalize RealDynamicSliceOp to DynamicSliceOp

We know how to canonicalize RealDynamicSliceOp to SliceOp (when all attributes are static), but there is one more case when RealDynamicSliceOp can be canonicalized to a simpler op

  // Before rewrite
  %slice_sizes = mhlo.constant ...
  %limit_indices = mhlo.add %start_indices, %slice_sizes
  %strides = mhlo.constant dense<1>
  %result = mhlo.real_dynamic_slice %operand, %start_indices, %limit_indices, %strides

  // After rewrite
  %result = "mhlo.dynamic_slice"(%operand, %start_indices0, ...) { slice_sizes = ... }

PiperOrigin-RevId: 506504799
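
The rewrite above is sound because, with unit strides and `limit_indices = start_indices + slice_sizes`, the two ops compute the same window (as long as the start indices are already in bounds; `dynamic_slice` additionally clamps them). A 1-D Python sketch of that equivalence, not the MLIR pattern itself:

```python
def real_dynamic_slice(operand, start, limit, stride):
    # mhlo.real_dynamic_slice along one dimension: explicit limit and stride.
    return operand[start:limit:stride]

def dynamic_slice(operand, start, size):
    # mhlo.dynamic_slice along one dimension: start index is clamped so the
    # fixed-size window stays in bounds.
    start = min(max(start, 0), len(operand) - size)
    return operand[start:start + size]
```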

* Disable `tensorflow/dtensor/python/tests:spmd_test` on Python 3.8

PiperOrigin-RevId: 506505212

* Disable `tensorflow/dtensor/python/tests:multi_client_test_nccl_local` on OSS

PiperOrigin-RevId: 506507742

* Add Python specific disable tags to the bazel configs

PiperOrigin-RevId: 506508975

* [Tensorflow] Fix security vulnerability with DenseBincountOp

PiperOrigin-RevId: 506514542

* Update Eigen to commit:3460f3558e7b469efb8a225894e21929c8c77629

CHANGELOG
=========
3460f3558 - Use VERIFY_IS_EQUAL to compare to zeros.
13a1f25da - Revert StlIterators edit from "Fix undefined behavior..."
fd2fd4870 - Update file ForwardDeclarations.h
37b2e9717 - Tweak special case handling in atan2.
a1cdcdb03 - Fix undefined behavior in Block access
4a58f30aa - Fix pre-POWER8_VECTOR bugs in pcmp_lt and pnegate and reactivate psqrt.
12ad99ce6 - Remove unused variables from GenericPacketMathFunctions.h
6987a200b - Fix stupid sparse bugs with outerSize == 0
0471e61b4 - Optimize various mathematical packet ops
1aa6dc200 - Fix sparse warnings
17ae83a96 - Fix bugs exposed by enabling GPU asserts.
ab8725d94 - Turn off vectorize version of rsqrt - doesn't match generic version
6d9f662a7 - Tweak atan2
6fc9de7d9 - Fix slowdown in bfloat16 MMA when rows is not a multiple of 8 or columns is not a multiple of 4.
6d4221af7 - Revert qr tests
7f58bc98b - Refactor sparse
576448572 - More fixes for __GNUC_PATCHLEVEL__.
164ddf75a - Use __GNUC_PATCHLEVEL__ rather than __GNUC_PATCH__, according to the documentation https://gcc.gnu.org/onlinedocs/cpp/Common-Predefined-Macros.html
5a7ca681d - Fix sparse insert
08c961e83 - Add custom ODR-safe assert.
3fe8c5110 - Replace the Deprecated `$<CONFIGURATION>` with `$<CONFIG>`
d70b4864d - issue #2581: review and cleanup of compiler version checks
b52312068 - [SYCL-2020 Support] Enabling Intel DPCPP Compiler support to Eigen
bae119bb7 - Support per-thread is_malloc_allowed() state
fa0bd2c34 - improve sparse permutations
2e61c0c6b - Add missing EIGEN_DEVICE_FUNC in a few places when called by asserts.
4aca06f63 - avoid move assignment in ColPivHouseholderQR
68082b822 - Fix QR, again
4d0576534 - Altivec fixes for Darwin: do not use unsupported VSX insns

PiperOrigin-RevId: 506525228

* Update TFRT dependency to use revision
http://github.com/tensorflow/runtime/commit/ec4a1c4d591c9c5be3ae207551452f2f667177c7.

PiperOrigin-RevId: 506526946

* Update GraphDef version to 1395.

PiperOrigin-RevId: 506547190

* compat: Update forward compatibility horizon to 2023-02-02

PiperOrigin-RevId: 506547933

* [XLA] Use wide accumulator for integer types in HloEvaluator.

Generally, this should not affect the operations, as the results are downcasted to ReturnT.
Some integer operations (SHR, CLZ, popcnt) were updated, as they didn't previously support cases where ReturnT != ElementwiseT.

For convolutions, clamp the result to the ReturnT range, since discarding the high bits doesn't make sense. This allows enabling convolution tests that would otherwise fail (cl/506267884).

PiperOrigin-RevId: 506548096
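
The accumulate-wide-then-clamp behavior for convolutions can be sketched with a 1-D int8 example. Python ints are arbitrary precision here, standing in for the wide ElementwiseT accumulator; the function is illustrative, not the HloEvaluator code.

```python
def conv1d_int8(xs, ws):
    """1-D valid convolution: accumulate in a wide integer, then clamp the
    result into int8 range instead of discarding the high bits."""
    out = []
    for i in range(len(xs) - len(ws) + 1):
        acc = sum(x * w for x, w in zip(xs[i:i + len(ws)], ws))  # wide accum
        out.append(max(-128, min(127, acc)))  # clamp to ReturnT = int8
    return out
```

Without the clamp, a window sum like 400 would wrap to an unrelated int8 value; clamping keeps the saturated 127 instead.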

* Add `diagonal_recurrent_tensors` attribute to UNIDIRECTIONAL_SEQUENCE_LSTM op.

PiperOrigin-RevId: 506553811

* Add test for tf.TensorScatterAdd

PiperOrigin-RevId: 506561719

* Reference the `benchmark_model` instructions from the delegate performance benchmark README.  Running `benchmark_model` can be useful for quick feedback during the early stages of development.

PiperOrigin-RevId: 506568968

* Add convolution tests for int8x32 cuDNN vectorized layout

PiperOrigin-RevId: 506573468

* [GmlSt] Split and clean-up codegen tests for matmul.

PiperOrigin-RevId: 506574062

* Also log the execution time in run_hlo_module.

replay_computation has this functionality, and the goal is to replace it with
run_hlo_module.

PiperOrigin-RevId: 506584404

* Integrate LLVM at llvm/llvm-project@7d3a181c8c18

Updates LLVM usage to match
[7d3a181c8c18](https://github.com/llvm/llvm-project/commit/7d3a181c8c18)

PiperOrigin-RevId: 506591519

* Update TFRT dependency to use revision
http://github.com/tensorflow/runtime/commit/49fef8924ba03c721f5f1125df217d902c42d1c3.

PiperOrigin-RevId: 506593472

* [GmlSt] Split and clean-up codegen tests for reduce.

PiperOrigin-RevId: 506601849

* [XLA:TPU] Avoids serializing large literals that can cause high compilation latencies.

PiperOrigin-RevId: 506622024

* hide extra devices instead of raising an error.

This change also relaxes DTensor's safety check in the NCCL path,
to confirm that TF's set-visible-devices setting doesn't affect the physical device list.

PiperOrigin-RevId: 506643035

* Cleanup: rename initializers_v2.py to initializers.py.

PiperOrigin-RevId: 506645119

* Suppress a noisy log line.

PiperOrigin-RevId: 506651054

* Provide a better error message in case of compilation failure

PiperOrigin-RevId: 506656750

* Register a custom codec for `extension_type.ExtensionTypeSpec` to remove `nested_structure_coder.py`'s dependency on `extension_type.py`.

PiperOrigin-RevId: 506657587

* Fix use-after-move in iterator_ops.cc

PiperOrigin-RevId: 506659126

* [XNNPACK] Fix some error logging in delegate

logging_context can be nullptr, so use a different macro for error logging

PiperOrigin-RevId: 506664135

* Register codecs for `row_partition.RowPartitionSpec` and `ragged_tensor.RaggedTensorSpec` to remove `nested_structure_coder.py`'s dependency on them.

PiperOrigin-RevId: 506666103

* Change how -Xcuda-fatbinary is passed depending on the compiler used.

PiperOrigin-RevId: 506667062

* Register a codec for `resource_variable_ops.VariableSpec` to remove `nested_structure_coder.py`'s dependency on `resource_variable_ops.py`.

PiperOrigin-RevId: 506676601

* [xla:cpu:next] Add remove-copies-to-out-params pass

To remove redundant allocations and subsequent copies to out parameters,
which come from buffer allocation.

The reason why these exist is that during bufferization we must allocate
a buffer for each returned result. It is only post-bufferization that we run
BufferResultsToOutParams, which inserts copies to those "out" buffers
from the allocated ones.

The pass added here detects this pattern and removes the allocation and
copy, using each output buffer directly.

Example input:
```
func.func @main(%arg0: tensor<1024xf64>) -> tensor<1024xf64> {
  %0 = mhlo.add %arg0, %arg0 : tensor<1024xf64>
  return %0 : tensor<1024xf64>
}
```

$ xla-opt -split-input-file -hlo-xla-runtime-pipeline %s

- Before:
```
module {
  func.func @main(%arg0: memref<1024xf64>, %arg1: memref<1024xf64>) {
    %c1024 = arith.constant 1024 : index
    %c0 = arith.constant 0 : index
    %c8 = arith.constant 8 : index
    %cst = arith.constant 0.000000e+00 : f64
    %alloc = memref.alloc() {alignment = 64 : i64} : memref<1024xf64>
    scf.parallel (%arg2) = (%c0) to (%c1024) step (%c8) {
      %subview = memref.subview %alloc[%arg2] [8] [1] : memref<1024xf64> to memref<8xf64, strided<[1], offset: ?>>
      %0 = vector.transfer_read %arg0[%arg2], %cst {in_bounds = [true]} : memref<1024xf64>, vector<8xf64>
      %1 = arith.addf %0, %0 : vector<8xf64>
      vector.transfer_write %1, %subview[%c0] {in_bounds = [true]} : vector<8xf64>, memref<8xf64, strided<[1], offset: ?>>
      scf.yield
    }
    memref.copy %alloc, %arg1 : memref<1024xf64> to memref<1024xf64>
    return
  }
}

```

- After:
```
module {
  func.func @main(%arg0: memref<1024xf64>, %arg1: memref<1024xf64>) {
    %c1024 = arith.constant 1024 : index
    %c0 = arith.constant 0 : index
    %c8 = arith.constant 8 : index
    %cst = arith.constant 0.000000e+00 : f64
    scf.parallel (%arg2) = (%c0) to (%c1024) step (%c8) {
      %subview = memref.subview %arg1[%arg2] [8] [1] : memref<1024xf64> to memref<8xf64, strided<[1], offset: ?>>
      %0 = vector.transfer_read %arg0[%arg2], %cst {in_bounds = [true]} : memref<1024xf64>, vector<8xf64>
      %1 = arith.addf %0, %0 : vector<8xf64>
      vector.transfer_write %1, %subview[%c0] {in_bounds = [true]} : vector<8xf64>, memref<8xf64, strided<[1], offset: ?>>
      scf.yield
    }
    return
  }
}
```

PiperOrigin-RevId: 506678216

* Register a codec for `tensor_array_ops.TensorArraySpec` to remove `nested_structure_coder.py`'s dependency on `tensor_array_ops.py`.

PiperOrigin-RevId: 506681769

* [mhlo] Remove the undefined AllReduceOp build().

PiperOrigin-RevId: 506683695

* Use graph export pipeline V2 in TPU Bridge

This new graph export pipeline avoids generating some unnecessary control dependencies, brings better performance, and makes the control dependencies more readable.

PiperOrigin-RevId: 506687026

* Move custom codecs for TensorSpec and BoundedTensorSpec to `tensor_spec.py`. Register a codec for `sparse_tensor.SparseTensorSpec`.

PiperOrigin-RevId: 506690720

* Factor out get_default_ops and make get_ops_from_nodedef a public method in TF selective_registration_header_lib.

PiperOrigin-RevId: 506697634

* Integrate LLVM at llvm/llvm-project@6dd84983d0c1

Updates LLVM usage to match
[6dd84983d0c1](https://github.com/llvm/llvm-project/commit/6dd84983d0c1)

PiperOrigin-RevId: 506708273

* Update TFRT dependency to use revision
http://github.com/tensorflow/runtime/commit/4487e42c7e3dc1f6d641bb1c98b01990fbbbc167.

PiperOrigin-RevId: 506711487

* Register a codec for `indexed_slices.IndexedSlicesSpec` to remove `nested_structure_coder.py`'s dependency on `indexed_slices.py`.

PiperOrigin-RevId: 506715268

* Disable tsan for distributed snapshot fault tolerance tests.

PiperOrigin-RevId: 506724161

* #tf-data-service Update the default protocol in DistributedSaveOp.

PiperOrigin-RevId: 506725241

* use common string for profiler lock contention detection.

PiperOrigin-RevId: 506726080

* gpu_delegate: Link nativewindow

PiperOrigin-RevId: 506727041

* Call XNNPACK Transpose from TFLite kernel.

PiperOrigin-RevId: 506737055

* Integrate LLVM at llvm/llvm-project@16c8709cf61b

Updates LLVM usage to match
[16c8709cf61b](https://github.com/llvm/llvm-project/commit/16c8709cf61b)

PiperOrigin-RevId: 506742350

* Fix dimension mismatch bug in MultinomialOp GPU implementation.

PiperOrigin-RevId: 506744108

* Update TFRT dependency to use revision
http://github.com/tensorflow/runtime/commit/6e37e534eaa88a022470d77d457722249235d331.

PiperOrigin-RevId: 506745467

* #tf-data-service Write tmp files in the same file system as the snapshot.

`rename` requires the source and destination files be in the same file
system.

The temp files are named similar to
https://github.com/tensorflow/tensorflow/blob/33722bc185e676c99f738790ef35db8479f2f7d4/tensorflow/core/data/snapshot_utils.cc#L950.

PiperOrigin-RevId: 506746696
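
The same-filesystem constraint on `rename` can be sketched with Python's `os.replace`, which has the same restriction as `rename(2)`. The `atomic_write` helper is illustrative, not the tf.data API: the temp file is created in the destination directory so the final rename is an atomic same-filesystem move.

```python
import os
import tempfile

def atomic_write(path, data):
    """Write `data` to `path` via a temp file in the *same* directory,
    so os.replace is an atomic same-filesystem rename."""
    dest_dir = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
        os.replace(tmp, path)  # would fail if tmp were on another filesystem
    except BaseException:
        os.unlink(tmp)  # clean up the temp file on failure
        raise
```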

* Add a metric in the eager function runtime to measure when a tf.function should be compiled.

This metric will cover all TF2 jit_compilation paths including TPU to give an accurate number
for the number of tf.functions that will be compiled per device type.

PiperOrigin-RevId: 506750567

* Integrate LLVM at llvm/llvm-project@10939d1d580b

Updates LLVM usage to match
[10939d1d580b](https://github.com/llvm/llvm-project/commit/10939d1d580b)

PiperOrigin-RevId: 506761372

* #tf-data-service Use absl::Duration in dispatcher_impl.

PiperOrigin-RevId: 506762070

* Update TFRT dependency to use revision
http://github.com/tensorflow/runtime/commit/91d765cad5599f9710973d3e34d4dc22583e2e79.

PiperOrigin-RevId: 506763472

* Fix for the "bfc_allocator_delay" metric being registered multiple times.

PiperOrigin-RevId: 506778911

* support big-endian for numpy type descriptor

* Patch llvm to fix Windows build.

PiperOrigin-RevId: 506800859

* Allow batch function to avoid padding the inputs.

PiperOrigin-RevId: 506820270

* [XLA] Speed up constant folding by optimizing and inlining a few simple index/shape/layout utility functions.

PiperOrigin-RevId: 506827782

* Integrate LLVM at llvm/llvm-project@9b7e57470155

Updates LLVM usage to match
[9b7e57470155](https://github.com/llvm/llvm-project/commit/9b7e57470155)

PiperOrigin-RevId: 506834755

* Update GraphDef version to 1396.

PiperOrigin-RevId: 506834807

* compat: Update forward compatibility horizon to 2023-02-03

PiperOrigin-RevId: 506834823

* Update TFRT dependency to use revision
http://github.com/tensorflow/runtime/commit/27d9da88424935b171d2f28b63658d3ee85bfe5c.

PiperOrigin-RevId: 506836991

* Move ReplicateTensorListInitOps pass before Canonicalizer pass to cleanup uninitialized tensor lists.

PiperOrigin-RevId: 506838420

* Move the `tf.AssignVariableOp(tf.VarHandleOp, tf.ConstOp)` removal pattern into a separate pass.

This was part of the `InsertRestoreOpPass`, which replaces variable initialization patterns that use consts with initialization that uses the RestoreV2 op.
To support future plans to insert the SaveV2 op and to make the passes more modular,
this change splits the pattern-removal part out into a separate pass named `RemoveVariableInitializationByConstPass`.

The new pass is put right after the `InsertRestoreOpPass` and the resulting exported model should be the same as without this change.

PiperOrigin-RevId: 506863386

* Use the new MLIR conversion for the 16x8 unidirectional LSTM operation

PiperOrigin-RevId: 506864247

* Add tests to rigorously check channel dimension attributes

Checks that all relevant attributes are properly changed for per_channel and per_tensor case.

PiperOrigin-RevId: 506873551

* Fix a bug in the validation_model build macro.

This build macro uses a temporary file named $(@D)/tmp, which is a problem because the macro is instantiated twice in the same package: both generated rules execute in parallel during the build, both try to write to the same tmp file, and then both try to remove it. One of the rules then fails with a file-not-found error, the file having already been removed by the other rule.

The fix is to instead use $(@D)/<name>.tflite.tmp as the name of the temporary file, where <name> is the name of the rule.  This is sufficient to ensure that the temporary file names used by the different instantiations of this build macro are distinct.

PiperOrigin-RevId: 506882948

* Fixes crashes due to fuzz input for ApproxTopK

PiperOrigin-RevId: 506898015

* [JITRT] Add scatter benchmark.

PiperOrigin-RevId: 506909548

* [XLA] Reduce some unnecessary string creations in xla::Printer callsites.

PiperOrigin-RevId: 506909676

* Always instrument at the module level regardless of whether tracing is enabled.

PiperOrigin-RevId: 506915862

* In ExportXlaFunctions, iterate all ops under the xla function instead of only the ones that inherit from SymbolUserOpInterface.

Ideally all ops referencing functions should inherit from SymbolUserOpInterface. But that would take some time to happen.

PiperOrigin-RevId: 506918556

* Move `_TensorShapeCodec` to `tensor_shape.py` to remove `nested_structure_coder.py`'s dependency on `tensor_shape.py`.

PiperOrigin-RevId: 506923585

* Adds a method to build datasets on workers without creating an iterator when doing ParameterServerStrategy training.

PiperOrigin-RevId: 506923786

* adding TensorShape fuzzers

PiperOrigin-RevId: 506927213

* Turn lazy load of `nested_structure_coder` in `type_spec.py` into a regular import.

PiperOrigin-RevId: 506927612

* Update TFRT dependency to use revision
http://github.com/tensorflow/runtime/commit/2796b3e7ea8a3e7614029f9e307c0113f8d6bb90.

PiperOrigin-RevId: 506929191

* Set shared_name to node name if it is empty in BatchFunctionFallback op

PiperOrigin-RevId: 506931118

* Add some missing BUILD dependencies of `structured_tensor.py`.

PiperOrigin-RevId: 506932682

* Remove dependency on indexed_slices.internal_convert_to_tensor_or_indexed_slices.

PiperOrigin-RevId: 506934163

* [XLA] Add way to allow propagation to output only to a subset of root instruction tuple shardings.

PiperOrigin-RevId: 506935285

* Support float64 under CPU/GPU CopyToMesh

PiperOrigin-RevId: 506939058

* Turn lazy loads related to `saved_model/nested_structure_coder.py` in `data/ops/` into regular imports.

PiperOrigin-RevId: 506940249

* Delete legacy objects from `ops.py`.

I have moved all references to these objects to reference their respective files.

PiperOrigin-RevId: 506945566

* Add a new replica context for DTensor related strategy.

Since DTensor operates in a global context, we mostly raise an explicit error for methods that are not available in that context. This will lead to a behavior discrepancy between the existing strategy and the new one. We should consider this carefully when rebasing the strategy on top of DTensor in the future (e.g. using replicated_run/spmd_run).

PiperOrigin-RevId: 506945673

* Replace `nest` usage in `FuncGraph.capture_call_time_value()` with `TraceType._cast()`

PiperOrigin-RevId: 506949659

* Add implementation for strategy.gather() for new MirroredStrategy.

PiperOrigin-RevId: 506952602

* [Tensorflow] Fix security vulnerability with UniqueV2. The bug is that the axis index should be canonicalized when it's negative.

PiperOrigin-RevId: 506966510
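The fix amounts to the standard canonicalization of a negative axis before it is used to index a shape. A minimal sketch (hypothetical helper, not the actual TF kernel code):

```python
def canonicalize_axis(axis: int, rank: int) -> int:
    # Map a possibly negative axis into [0, rank); reject anything
    # outside [-rank, rank) instead of indexing out of bounds.
    if not -rank <= axis < rank:
        raise ValueError(f"axis {axis} out of range for rank {rank}")
    return axis + rank if axis < 0 else axis

print(canonicalize_axis(-1, 3))  # → 2
```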

* [XLA:CPU] Scalarize scf.if op

PiperOrigin-RevId: 506969655

* [tflite-gpu] Fix OpenGL slice calculation bug.

PiperOrigin-RevId: 506971865

* Don't use XNNPACK in kernels when built with --define=tflite_with_xnnpack=false

PiperOrigin-RevId: 506975198

* [XLA:TPU] Speed up constant propagation by inlining a few methods in comparison_util.h.

PiperOrigin-RevId: 506977511

* Made `TensorWithLayout` an abstract class and defined `TensorWithLayoutTf` to hold the current implementation using TF runtime. Also used `ConstValueNode` to capture the information useful for TF runtime.

PiperOrigin-RevId: 506978870

* Skip license check for pybind11_abseil

PiperOrigin-RevId: 506979746

* Moving control_captures out of FuncGraph

PiperOrigin-RevId: 506985311

* Use gfile over the native open for an implementation based on TensorFlow's C++ FileSystem API.

PiperOrigin-RevId: 506985831

* Make the HloModule constructor that takes CompilationEnvironments public.

PiperOrigin-RevId: 506993058

* Add float16 and float64 input&output type support for TFLite operator 'cast'

Float16 and float64 input/output for the TensorFlow 'cast' operator are used in some Federated Learning models, so adding support for these types to the TFLite 'cast' op allows those operators to be converted to TFLite built-in ops instead of flex ops.

PiperOrigin-RevId: 506997479

* color adjust for timeline

PiperOrigin-RevId: 507002833

* [TF:PLUGIN] Fix a dependency issue.

PiperOrigin-RevId: 507003433

* Add isinstance check for eager execution.

PiperOrigin-RevId: 507003564

* - Add 4-bit support to depthwise_conv.cc and fully_connected.cc in TfLite, using the reference kernels' 4-bit functions for those ops, and add/change supporting functions to get the tests running in fully_connected_test.cc.

- Add a 4-bit test (Simple4bit3x3FilterTest) to depthwise_conv_test.cc in TfLite, ported from the existing Simple3x3FilterTest with per-channel quantization scales adjusted for 4-bit input.

- Add a 4-bit test (SimpleTestQuantizedInt4) to fully_connected_test.cc in TfLite, ported from the existing SimpleTestQuantizedInt8 with outputs adjusted for 4-bit.

PiperOrigin-RevId: 507003918

* Add Keras metrics FBetaScore and F1Score.

PiperOrigin-RevId: 507013486

* Update the TensorFlow RELEASE.md on master.
(We cut the branch for 2.12.0. Insert new blurb for new release notes TF 2.13.0)

PiperOrigin-RevId: 507017354

* Implement functional<->regional transformation for `CaseOp` and `CaseRegionOp`

Even if we already have `CaseRegionOp` as a region version of `CaseOp`, the associated transformations were missing in functional<->regional control flow transformation passes. This CL implements them.

PiperOrigin-RevId: 507017912

* In the code, we have some modules that are "based" off other modules but not exact clones. Change these code locations, so these modules retain the CompilationEnvironments from the original module.

PiperOrigin-RevId: 507020873

* Update the TensorFlow RELEASE.md on master.
(We cut the branch for 2.12.0. Insert new blurb for new release notes TF 2.13.0)

PiperOrigin-RevId: 507021191

* Break dependency between tensorflow/core/function/transform:transform and python/saved_model.

PiperOrigin-RevId: 507021697

* Handle snapshot and stream completion in tf.data service dispatcher.

PiperOrigin-RevId: 507023442

* Fix release notes.

Because the automation that updates the release notes in master after the release branch cut has been destroyed, and the step was not done manually in time, we have commits such as https://github.com/tensorflow/tensorflow/commit/9fbf1137044ac63e296ebf73c61b1e8513149b1c# and https://github.com/tensorflow/tensorflow/commit/ba1372a41ed90aba0efa5763b06350dd0ee7074b that write the wrong contents to the release notes.

PiperOrigin-RevId: 507025073

* Implement functional<->regional transformation for `CaseOp` and `CaseRegionOp`

Even if we already have `CaseRegionOp` as a region version of `CaseOp`, the associated transformations were missing in functional<->regional control flow transformation passes. This CL implements them.

PiperOrigin-RevId: 507030461

* #tf-data-service Wait for the snapshot DONE file in unit tests.

PiperOrigin-RevId: 507030943

* Expose TF2XLA MLIR pipeline for reuse

PiperOrigin-RevId: 507033096

* Update master version numbers to 2.13.0.
Branch for TF 2.12 releases has been cut. Switch to new version.

PiperOrigin-RevId: 507037877

* [PJRT C API] Add C API for PjRtCApiClient::LookupAddressableDevice.

wrapped_device_map_ and GetCApiDevice can be removed after LookupAddressableDevice is added.

PiperOrigin-RevId: 507060995

* Work around compiler bug returning an optional unique_ptr.

It looks like some compilers (e.g. gcc-7) don't like returning
a moveable value directly when the return type is `std::optional`
(i.e. it fails to treat the returned value as an r-value and
automatically construct an optional instance around it).
Explicitly creating the `std::optional` and returning _that_
seems to work around the issue.

PiperOrigin-RevId: 507062621

* Update GraphDef version to 1397.

PiperOrigin-RevId: 507090576

* compat: Update forward compatibility horizon to 2023-02-04

PiperOrigin-RevId: 507090630

* Update the version of Estimator nightly and Keras nightly used in TensorFlow after the corresponding nightly packages with the next version are released

PiperOrigin-RevId: 507168035

* Add tfstreamz for input spec mismatch cases.

PiperOrigin-RevId: 507180105

* Implement shape inference for `CaseOp` and `CaseRegionOp`

PiperOrigin-RevId: 507201660

* compat: Update forward compatibility horizon to 2023-02-05

PiperOrigin-RevId: 507239797

* Update GraphDef version to 1398.

PiperOrigin-RevId: 507239865

* #tf-data-service Use a mock dispatcher in the split provider test.

The original test works by manipulating the files. It makes the test
depend on the snapshot_manager's state. When the snapshot_manager
implementation changes, it could affect this test because of the
structure of the test not because of a bug. Decoupling it from the
dispatcher makes the test cleaner, more stable, and less likely to be
flaky.

PiperOrigin-RevId: 507312461

* Preserve the linear index when computing the operand of concatenate.

If the Concatenate op concatenates along the fastest varying dimension, we can
relatively cheaply preserve the linear index.

This is an HLO snippet where we see a 30% improvement with this change:

HloModule concatenate

ENTRY main {
  param = f32[100,11,12,13,160]{4,3,2,1,0} parameter(0)
  param2 = f32[27456000]{0} parameter(1)
  reshape = f32[100,11,12,13,160]{4,3,2,1,0} reshape(param2)
  ROOT concat = f32[100,11,12,13,320]{4,3,2,1,0} concatenate(param, reshape), dimensions={4}
}

PiperOrigin-RevId: 507391165
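When the concatenated dimension is the fastest varying one, an output linear index maps to an operand linear index with a single divmod and an offset subtraction. A rough Python sketch of the idea (not the XLA emitter itself):

```python
def concat_operand_linear_index(linear_idx, last_dims, out_last):
    # One divmod recovers the "row" shared by all operands; the column
    # then selects the operand and its local linear index.
    # last_dims: last-dimension sizes of each operand;
    # out_last: sum(last_dims), the output's last-dimension size.
    row, col = divmod(linear_idx, out_last)
    offset = 0
    for k, d in enumerate(last_dims):
        if col < offset + d:
            return k, row * d + (col - offset)
        offset += d
    raise IndexError(linear_idx)
```

For example, concatenating shapes (2,3) and (2,2) on the last dim, output linear index 3 lands in operand 1 at its linear index 0.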

* compat: Update forward compatibility horizon to 2023-02-06

PiperOrigin-RevId: 507406616

* Update GraphDef version to 1399.

PiperOrigin-RevId: 507406617

* Add math dialect.

PiperOrigin-RevId: 507412764

* Integrate LLVM at llvm/llvm-project@8c712296fb75

Updates LLVM usage to match
[8c712296fb75](https://github.com/llvm/llvm-project/commit/8c712296fb75)

PiperOrigin-RevId: 507429387

* Update TFRT dependency to use revision
http://github.com/tensorflow/runtime/commit/e94b53450349f6837d11cc39f614af86c825ef94.

PiperOrigin-RevId: 507431532

* Track memref allocations, deallocations and peak heap size.

PiperOrigin-RevId: 507449826

* [TF/MLIR] Supports lowering mhlo.reduce_window when there is reshape/broadcast in the divisor.

PiperOrigin-RevId: 507455712

* updated TF patch

PiperOrigin-RevId: 507456411

* [XLA:GPU] Do not expand Scatter ops that are deterministic when running with --xla_gpu_deterministic_ops.

Currently all Scatter ops are expanded when deterministic ops are enforced. However, scatter ops on unique indices cannot have data races irrespective of the implementation. Similarly, scatter ops with associative combiner functions will compute deterministic results irrespective of the order in which the combiner function is applied. In both cases, scatter will be deterministic and expanding it is thus not required. This reduces slowdowns due to the xla_gpu_deterministic_ops flag.

PiperOrigin-RevId: 507460239
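Why a scatter on unique indices needs no deterministic expansion: each destination slot is combined exactly once, so the result cannot depend on the order in which updates are applied. A small sketch illustrating this (hypothetical scatter helper, not the XLA implementation):

```python
def scatter(dst, indices, updates, combiner):
    # Apply updates in the given order. With unique indices, each slot
    # is touched exactly once, so any application order yields the
    # same result.
    out = list(dst)
    for i, u in zip(indices, updates):
        out[i] = combiner(out[i], u)
    return out

add = lambda a, b: a + b
r1 = scatter([0.0] * 4, [2, 0, 3, 1], [1.0, 2.0, 3.0, 4.0], add)
r2 = scatter([0.0] * 4, [1, 3, 0, 2], [4.0, 3.0, 2.0, 1.0], add)  # permuted order
assert r1 == r2 == [2.0, 4.0, 1.0, 3.0]
```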

* BladeDISC patch 20221101

1, Build related changes:
* Do not force -std=c++17 for CUDA.
  https://github.com/pai-disc/tensorflow/commit/af4d5a07589c1d30c14c76aba6592554210451a5

* Work around a compilation error on GCC 7.3.1 (#19),
  e.g.: undefined reference to `std::function<tensorflow::StatusOr<xla::Shape> (xla::Shape const&, bool, mlir::XlaLayoutPreference)>::function()'

* [to #35355928] fix build issue when enabling MLIR_GPU_TO_CUBIN_PASS_ENABLE

* disable `-Werror=unused-result`
* disable `noincompatible_remove_legacy_whole_archive`
* add missing dependency `//tensorflow/compiler/xla/stream_executor:dnn_proto_cc_impl`

2, hlo related changes:

* [to #35377611] feat: bufferize DotOp and DotGeneralOp.
  remove community DotGeneralOp bufferizer as well
  Link: https://code.aone.alibaba-inc.com/algo/tensorflow_google/codereview/5885662

* [to #35377611] feat: bufferize ConvOp and DynamicConvOp.
  Link: https://code.aone.alibaba-inc.com/algo/tensorflow_google/codereview/5910553

* [to #37276187] feat: bufferize mhlo.log1pOp
  Link: https://code.aone.alibaba-inc.com/algo/tensorflow_google/codereview/6675686

* [to #36574644] [MLIR] [DISC] Add reverse op in lmhlo
  Link: https://code.aone.alibaba-inc.com/algo/tensorflow_google/codereview/6364791

* support RoundNearestEvenOp (#20)
* support RoundOp (#21)
* add const folding support for mhlo.TransposeOp
* disable some static checks of mhlo.dot_general
  `tf.BatchMatmul(tensor<?x?x?xf32>, tensor<4x?x?xf32>)` is valid, but the `tf.BatchMatmul` tf2mhlo converter does not handle shape propagation between the lhs & rhs, leading to some of the static checks of `dot_general` failing. Disable the checks as a workaround.

* [to #35355928] fix missing Elementwise traits of some special ops
* [to #35355928] fix a bug in lmhlo structured interface
* enhance `maybeCastTo` to support casting between i64 and i32.
* cast to i64 only if index type in DynamicIotaBroadcast pattern.
* add a patch not to fold UnrealizedConversionCastOp with ui/si type

3, TF related changes:

* lower tf.GatherV2 op to mhlo in dynamic shape
* lower tf.DepthwiseConv2DNative op to mhlo in dynamic shape
* lower tf.StridedSlice op to mhlo in dynamic shape
* lower tf.DynamicStitchOp to mhlo in dynamic shape
* lower tf.BatchMatMulOp op to mhlo in dynamic shape
* lower tf.Conv2DBackpropInputOp/tf.Conv2DBackpropFilterOp  to mhlo in dynamic shape

* support tf.Range with negative stride
* support tf.StridedSlice with new_axis_mask
* add mhlo_disc dependency in xla legalize_tf
* legalize quantized tf const before lowering to mhlo

* add tf2mhlo support for tf.BatchMatMul
* bugfix: only handling non-const begin/end in ConvertStridedSliceOpDynamic
* bugfix: using tf.TileOp static tf2mhlo conversion only when all ins/outs have static shape
* bugfix: size of begin/end/strides < the rank of input
* bugfix: disable TF_RandomUniformOp in tf->mhlo
* fix a bug in tf.SigmoidGradOp legalize pattern
* fix a bug in ConvertSplitOp pattern
* fix a bug in ConvertUnpackOp pattern

* [to #36775150] feat: support multiple StreamExecutors using the stream as cache key
  In the original design of StreamExecutor, one StreamExecutor maps to
  each device ordinal and owns one cudnnHandle. This means that in
  multi-stream applications there is only one StreamExecutor per GPU
  device. We observed dramatic performance degradation due to the lock in
  the SE; refer to https://yuque.antfin-inc.com/pai/blade/irdw7g. This
  commit revises the executor cache so that there is one StreamExecutor
  per GPU stream.
  Link: https://code.aone.alibaba-inc.com/algo/tensorflow_google/codereview/6455216

* [to #36574492]feat: dynamic strided slice op supports negative strides
  Link: https://code.aone.alibaba-inc.com/algo/tensorflow_google/codereview/6425899

3, ROCM/DCU related changes:

* [to #37531008] feat: Support building for DCU
  Link: https://code.aone.alibaba-inc.com/algo/tensorflow_google/codereview/6897356

* [to #37531008] allow assigned stream in se for rocm backend
  Link: https://code.aone.alibaba-inc.com/algo/tensorflow_google/codereview/6918944

* [to #37383705] patch: Add shfl.sync.bfly lowering to ROCm
  Link: https://code.aone.alibaba-inc.com/algo/tensorflow_google/codereview/6937184

* [to #39814120] Fix config for Macro and support gfx90a

* [rocm] remove -fno-canonical-system-headers as hipcc does not support it

* [to #37531008] expose some more methods in gpu_backend_lib.h

4, others:

* update gitignore
* Add llvm patch to fix error GVN on shared memory load. (#4)

add folder for mhlo::ClampOp

import iree in workspace file

increase `kFoldOpEltLimit` to 64GB

decompose disc-compiler (#27)

* decompose disc-compiler

* update

* fix some compilation errors

* Fix dynamic shape reifyReturnTypeShapes

* fix dynamic shapes & mhlo op operands & proto_alls

* fix llvm patch

* remove iree

* reverse unsigned type lowering workaround (#30)

---------

Co-authored-by: Aart Bik <[email protected]>
Co-authored-by: Fiona Lang <[email protected]>
Co-authored-by: Peng Wang <[email protected]>
Co-authored-by: Terry Heo <[email protected]>
Co-authored-by: Francois Chollet <[email protected]>
Co-authored-by: A. Unique TensorFlower <[email protected]>
Co-authored-by: John QiangZhang <[email protected]>
Co-authored-by: Eugene Burmako <[email protected]>
Co-authored-by: Justin Lebar <[email protected]>
Co-authored-by: Berkin Ilbeyi <[email protected]>
Co-authored-by: Xin Zhou <[email protected]>
Co-authored-by: Ce Zheng <[email protected]>
Co-authored-by: Dan Suh <[email protected]>
Co-authored-by: Dateng Lin <[email protected]>
Co-authored-by: Son Tuan Vu <[email protected]>
Co-authored-by: David Dunleavy <[email protected]>
Co-authored-by: Alexander Belyaev <[email protected]>
Co-authored-by: Ian Hua <[email protected]>
Co-authored-by: Johannes Reifferscheid <[email protected]>
Co-authored-by: Eugene Zhulenev <[email protected]>
Co-authored-by: Matt Callanan <[email protected]>
Co-authored-by: Tres Popp <[email protected]>
Co-authored-by: samypr100 <[email protected]>
Co-authored-by: Wilsin Gosti <[email protected]>
Co-authored-by: Yang Chen <[email protected]>
Co-authored-by: Faizan Muhammad <[email protected]>
Co-authored-by: Vadym Matsishevskyi <[email protected]>
Co-authored-by: Jieying Luo <[email protected]>
Co-authored-by: Juan Martinez Castellanos <[email protected]>
Co-authored-by: Brian Wieder <[email protected]>
Co-authored-by: Anlun Xu <[email protected]>
Co-authored-by: Roshani Narasimhan <[email protected]>
Co-authored-by: Michael Delorimier <[email protected]>
Co-authored-by: Hyeontaek Lim <[email protected]>
Co-authored-by: Rebecca Chen <[email protected]>
Co-authored-by: Scott Zhu <[email protected]>
Co-authored-by: Mason Chang <[email protected]>
Co-authored-by: Penporn Koanantakool <[email protected]>
Co-authored-by: Frederik Gossen <[email protected]>
Co-authored-by: Matthias Kramm <[email protected]>
Co-authored-by: Marcello Maggioni <[email protected]>
Co-authored-by: Jean-Baptiste Lespiau <[email protected]>
Co-authored-by: Jorge Gorbe Moya <[email protected]>
Co-authored-by: Jian Cai <[email protected]>
Co-authored-by: Kuangyuan Chen <[email protected]>
Co-authored-by: Nitin Srinivasan <[email protected]>
Co-authored-by: Zhufeng Pan <[email protected]>
Co-authored-by: Sergey Kozub <[email protected]>
Co-authored-by: Aliia Khasanova <[email protected]>
Co-authored-by: Matt Kreileder <[email protected]>
Co-authored-by: Adrian Kuegel <[email protected]>
Co-authored-by: Andrew Audibert <[email protected]>
Co-authored-by: Zhi An Ng <[email protected]>
Co-authored-by: Emilio Cota <[email protected]>
Co-authored-by: Jared Junyoung Lim <[email protected]>
Co-authored-by: Jie Sun <[email protected]>
Co-authored-by: Grant Jensen <[email protected]>
Co-authored-by: Antonio Sanchez <[email protected]>
Co-authored-by: Ken Franko <[email protected]>
Co-authored-by: Fabien Hertschuh <[email protected]>
Co-authored-by: Kazuaki Ishizaki <[email protected]>
Co-authored-by: Tomás Longeri <[email protected]>
Co-authored-by: Bangda Zhou <[email protected]>
Co-authored-by: Fergus Henderson <[email protected]>
Co-authored-by: Felix Chern <[email protected]>
Co-authored-by: Yifan Jiang <[email protected]>
Co-authored-by: James Mullenbach <[email protected]>
Co-authored-by: Justin Szaday <[email protected]>
Co-authored-by: Youchuan Hu <[email protected]>
Co-authored-by: Chuanhao Zhuge <[email protected]>
Co-authored-by: Vinila Settem <[email protected]>
Co-authored-by: Mihai Maruseac <[email protected]>
Co-authored-by: Ashish Shenoy <[email protected]>
Co-authored-by: Yiming Zhang <[email protected]>
Co-authored-by: Thomas Joerg <[email protected]>
Co-authored-by: Wenyi Zhao <[email protected]>
Co-authored-by: TanyoKwok <[email protected]>

* [xla][mlir][sparse] override sparse shape behavior for xla runtime path

PiperOrigin-RevId: 506126261

* Create a codec class for built-in `TypeSpec`s to register, to make `TypeSpec` classes follow the codec structure used by the rest of `nested_structure_coder.py`.

Also remove `nested_structure_coder.py`'s dependency on `dataset_ops.DatasetSpec`, `values.PerReplicaSpec`, `iterator_ops.IteratorSpec`, and `optional_ops.OptionalSpec`.

PiperOrigin-RevId: 506126332

* Fixes shape inference of LookupTableImportV2 to handle scalar values.

PiperOrigin-RevId: 506126405

* Update Android full build script

Use `configure` script instead of obsolete `configure_android_workspace`

PiperOrigin-RevId: 506130660

* Refactor keras/metrics to be modular.

PiperOrigin-RevId: 506144312

* Internal change to the ARM build.

PiperOrigin-RevId: 506145147

* gpu_delegate: Allow undefined symbol

PiperOrigin-RevId: 506148959

* opencl_wrapper: Update build rule to use opencl icd loader if necessary

PiperOrigin-RevId: 506152314

* Improve captured_input support in TensorSpec experimental_get_compiler_ir.

Major changes include:
* Enable compiler_ir.from_concrete_function to support specialize_flat_input.
* Improve experimental_get_compiler_ir functionality: support captured_input

PiperOrigin-RevId: 506158256

* [StableHLO to MHLO] Improve Python bindings for MHLO

StableHLO PR: https://github.com/openxla/stablehlo/pull/283.

PiperOrigin-RevId: 506161080

* Avoid unnecessary polling in EventMgr.

The TF EventMgr lets you enqueue a std::function to be run when an se::Stream
finishes all the work that's currently enqueued on it.

It does this by creating se::Event's on the streams and periodically polling
all of them to see if they're completed.  This poll loop is very expensive for
some clients.

If you have two se::Event's enqueued on the same se::Stream and the first event
has not been hit yet, then you can be sure that the second one also hasn't been
hit: A Stream's work runs in strict FIFO order.

Previously EventMgr would check all of the events on every stream, doing
unnecessary work.  This CL changes it so it stops after the first event on a
stream that hasn't been hit yet.  If there are often multiple events pending on
a particular stream, this should save significant CPU.

While we're here, we also cleaned up EventMgr.  Previously it had additional
functionality about freeing tensors, but this was ripped out a while ago.
Cleaning this up allows us to simplify the code somewhat.

PiperOrigin-RevId: 506161538
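The FIFO argument above can be sketched as a per-stream poll loop that stops at the first incomplete event (a hypothetical Python model, not the real EventMgr):

```python
from collections import deque

class StreamEvents:
    """Hypothetical model of per-stream event polling."""

    def __init__(self):
        self.pending = deque()  # FIFO of (done_fn, callback) pairs

    def poll(self):
        # A stream runs its work in strict FIFO order, so once we hit an
        # event that is not done, no later event on this stream can be
        # done either: stop instead of checking the rest.
        polled = 0
        while self.pending:
            done_fn, callback = self.pending[0]
            polled += 1
            if not done_fn():
                break
            self.pending.popleft()
            callback()
        return polled
```

With events in states [done, pending, done], one `poll()` checks only the first two and never touches the third, which is where the CPU savings come from.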

* [StableHLO to MHLO] Relax dimension checks in TriangularSolveOp

StableHLO PR: https://github.com/openxla/stablehlo/pull/893.

PiperOrigin-RevId: 506162066

* [XLA] Use the async copy elapsed instead of prefetch interval picker to decide whether to disable end-of-program prefetch optimization.

The shape override introduced in cl/504951495 caused the heuristic that disables
end-of-program prefetch optimization to break since it was using the prefetch
interval picker to gauge how long the cross-program prefetch is going to be
live. This CL changes the logic to use the cost analysis directly.

PiperOrigin-RevId: 506172259

* Minor touch up in release notes for 2.12.

PiperOrigin-RevId: 506185475

* [StableHLO to MHLO] Handle bounds in the WhileOp shape function

PiperOrigin-RevId: 506186744

* [XLA] Fix HLO parser for attribute allow_spmd_sharding_propagation_to_output.

PiperOrigin-RevId: 506195622

* [StableHLO to MHLO] Remove AllShapesMatch from DynamicUpdateSliceOp

StableHLO PR: https://github.com/openxla/stablehlo/pull/892.

PiperOrigin-RevId: 506199864

* Implement functions for retrieving initializer functions in `tf_saved_model` dialect.

Retrieving initializer functions is a common operation done in TensorFlow graph transformation passes.
This change provides functions for this in the `tf_saved_model` dialect.
This also replaces initializer function retrieval codes with the new functions.

PiperOrigin-RevId: 506201497

* Removed `ParallelTensor` from `TensorWithLayout` and used `TensorHandlePtr`.

PiperOrigin-RevId: 506209442

* [xla:cpu] Add debug info to XLA CPU pipeline

This adds a pass that provides some debug info with which basic line number info can be generated.

Adapted from Flang's AddDebugFoundationPass.

PiperOrigin-RevId: 506213461

* update fuzztest dependency

PiperOrigin-RevId: 506217195

* Remove references to stream_executor/lib

PiperOrigin-RevId: 506225078

* [XLA:GPU] Handle device buffers more safely in run_hlo_module

This fixes double-free errors or memory leaks for example when the running of the HLO is unsuccessful.

The old code-path is also left there, as a lot of our code depends on the ability to run the same HLO multiple times without reallocating the input buffers.

PiperOrigin-RevId: 506238363

* compat: Update forward compatibility horizon to 2023-02-01

PiperOrigin-RevId: 506239134

* Update GraphDef version to 1394.

PiperOrigin-RevId: 506239156

* Fix a typo in the documentation in preemption_watcher.py

PiperOrigin-RevId: 506240202

* Rollback of PR #58763

PiperOrigin-RevId: 506243978

* [GmlSt] Group tiling passes for cpu, gpu and triton.

PiperOrigin-RevId: 506244287

* Propagate quantize_params in prepare_pass

PiperOrigin-RevId: 506252805

* [GmlSt] Remove bufferization test pass. Use hlo-one-shot-bufferize instead.

PiperOrigin-RevId: 506260068

* Fix build breakage for DPB.

PiperOrigin-RevId: 506261904

* Integrate LLVM at llvm/llvm-project@00ce96b02e87

Updates LLVM usage to match
[00ce96b02e87](https://github.com/llvm/llvm-project/commit/00ce96b02e87)

PiperOrigin-RevId: 506269173

* Implement clamping of dynamic{_update,}slice start indices.

PiperOrigin-RevId: 506270039
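Per HLO semantics, dynamic-slice start indices are clamped so the slice fits inside the operand, i.e. each start lands in [0, dim - slice_size]. A minimal sketch of that rule (hypothetical helper names):

```python
def clamp_start_indices(starts, operand_dims, slice_sizes):
    # Clamp each start index into [0, dim - slice_size] so the slice
    # [start, start + slice_size) stays inside the operand.
    return [max(0, min(s, d - z))
            for s, d, z in zip(starts, operand_dims, slice_sizes)]

assert clamp_start_indices([5, -2], [4, 4], [2, 2]) == [2, 0]
```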

* [xla:gpu] Add verbose logging to cuda graph library

Optionally print captured graphs dot files to help with debugging

PiperOrigin-RevId: 506270560

* Update TFRT dependency to use revision
http://github.com/tensorflow/runtime/commit/89dc2707c7195dc2b839c7a1a987309d91fc89c7.

PiperOrigin-RevId: 506270854

* [GmlSt] Split vectorization.cc into vectorize_copy/vectorize_for_cpu,gpu files.

PiperOrigin-RevId: 506273813

* Fix bounds checks.

- transfer_{read,write} was checking memory bounds incorrectly.
- check all buffer accesses.
- make invalid accesses interpreter failures instead of asserting.

PiperOrigin-RevId: 506286548
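A correct bounds check for an access of `length` elements at `offset` must validate both ends of the interval; checking only one end misses overflows. Sketched with hypothetical helper names:

```python
def access_in_bounds(offset: int, length: int, buffer_size: int) -> bool:
    # The access [offset, offset + length) is valid only if it lies
    # entirely within [0, buffer_size).
    return 0 <= offset and 0 <= length and offset + length <= buffer_size

assert access_in_bounds(2, 2, 4)
assert not access_in_bounds(3, 2, 4)   # runs past the end
assert not access_in_bounds(-1, 2, 4)  # starts before the buffer
```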

* Manage snapshot streams assignments in tf.data service dispatcher.

Related changes:
- Added `DispatcherService::GetSnapshotStreams`, a new readonly API for seeing the state of snapshot stream assignments from the dispatcher's perspective.
- Made `DispatcherConfig.worker_timeout_ms` configurable.

PiperOrigin-RevId: 506287683

* Remove multiple defines of XlaLegalizeTFNoFallback

This occurred because xla_legalize_tf_passes.h.inc technically depends on all passes listed in the .td file being defined. However, the no-fallback pass is intentionally supposed to be in a separate target. For now, depend on no-fallback, so xla_legalize_tf is correct, but xla_legalize_tf_no_fallback should be fully moved to a separate .td/.h file, so it doesn't surface unsupported methods.

PiperOrigin-RevId: 506290313

* Skip invalid candidates, add flag for no canonicalization, bisect for errors.

Don't ask me how long it took me to realize that canonicalization goof while
debugging canonicalization.

PiperOrigin-RevId: 506291648

* Fix hybrid indy lstm by forwarding `recurrent_to-*` parameters to `ComputeRowSums`.

PiperOrigin-RevId: 506312178

* Fix a bug in which an invalidated reference to a hash table element is used after a potential rehash.

`emplace` can cause a rehash that invalidates references to elements in the hashtable.

PiperOrigin-RevId: 506313210

* Add path to snapshot-level done file in tf.data service snapshot on-disk state.

PiperOrigin-RevId: 506317430

* Identify the "file_prefix" tensor by matching the unique `tf_saved_model.index_path` attribute.

Currently the `file_prefix` tensor, which is used to identify the directory to the checkpoint file from which the variables are restored,
is identified by relying on the fact that it is used as an input to the restore op. Doing so makes some assumptions (the name of the restore op) and is prone to accidental conflict.
We can find the file_prefix tensor with more certainty by seeing whether the `tf_saved_model.index_path` attribute matches `__tf_file_prefix`.

PiperOrigin-RevId: 506318827

* Add abstract base types for common `dataset_ops` types.

The presently added types do not define any abstract methods,
attributes, properties etc. for their equivalent `dataset_ops`
concrete types. I.e., they do not currently define the "shape"
of the type and are primarily intended for use in `isinstance`
checks to avoid a direct dependency on the concrete type.

The types are currently only exported under the internal namespace.

PiperOrigin-RevId: 506320622

* TF Lite Runtime support for Python 3.10 under glibc 2.31

* Improve DPB documentation.

PiperOrigin-RevId: 506333596

* Update ANDROID_NDK_API_LEVEL default in configure.py

PiperOrigin-RevId: 506335783

* #tf-data Ramp down `stage_based_autotune` to do analysis based on the data collected.

PiperOrigin-RevId: 506340159

* Integrate LLVM at llvm/llvm-project@0ece2050da3e

Updates LLVM usage to match
[0ece2050da3e](https://github.com/llvm/llvm-project/commit/0ece2050da3e)

PiperOrigin-RevId: 506340480

* [xla][mlir][sparse] allow sparse shapes at construction time only

PiperOrigin-RevId: 506342697

* Update TFRT dependency to use revision
http://github.com/tensorflow/runtime/commit/afca233650bc0ce402e8a9a07787732b04bef7aa.

PiperOrigin-RevId: 506343516

* [GmlSt] Add dim(gml_st.parallel) and gml_st.parallel(tensor.cast) patterns.

Additional canonicalization patterns for gml_st.parallel loop.

PiperOrigin-RevId: 506344025

* Temporarily disable the flaky test for Kokoro build.

PiperOrigin-RevId: 506345945

* small cleanup of fuzz helper

PiperOrigin-RevId: 506350422

* #tf-data-service Clean up checkpoints after completing the snapshot.

PiperOrigin-RevId: 506355658

* Check that RefineDynamicShapes doesn't leave dynamic shapes around

It is expected that RefineDynamicShapes in the XlaCallModuleOp kernel fully specializes the StableHLO program to static shapes. However, we aren't checking that, so specialization failures may go unnoticed and manifest downstream in weird ways where they are harder to debug. This CL introduces an early check for this.

This is a second attempt at landing this CL. The first attempt broke some tests and got rolled back. Now the broken test is disabled because it was relying on wrong behavior that we started detecting thanks to the increased scrutiny implemented here.

PiperOrigin-RevId: 506356516

* Expand applicability of real_dynamic_slice canonicalizers

At the moment, RealDynamicSliceOp => SliceOp canonicalization only works when start_indices, limit_indices and strides are all of type arith::ConstantOp. This CL extends canonicalization to handle any kind of m_Constant ops.

Furthermore, this CL supersedes the RealDynamicSliceIsStatic C++ pattern with the RealDSliceToSlice TableGen pattern. I'm not sure why both of these patterns were enabled when they are doing roughly the same thing.

PiperOrigin-RevId: 506356645

* Add warning about assumed input_signatures

PiperOrigin-RevId: 506357398

* [GmlSt] Use upstream patterns to collapse extract/insert_slice.

PiperOrigin-RevId: 506358242

* Modify LiteralTestUtil to ensure dynamic dimensions are equivalent when checking equality. Previously the LiteralTestUtil would consider two dynamic literals equal as long as they had identical elements (even if they had different dynamic dimensions).
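
The stricter check can be sketched in Python (the `Literal` type below is an invented stand-in for the XLA literal class, used only to illustrate the behavior change):

```python
from dataclasses import dataclass

# Illustrative sketch only: `Literal` is a hypothetical stand-in for the XLA
# literal type, showing the stricter equality check.
@dataclass
class Literal:
    data: tuple          # flattened elements
    dynamic_dims: tuple  # dynamic dimension sizes

def literals_equal(a: Literal, b: Literal) -> bool:
    # New behavior: elements AND dynamic dimensions must both match.
    return a.data == b.data and a.dynamic_dims == b.dynamic_dims

x = Literal(data=(1, 2, 3), dynamic_dims=(3,))
y = Literal(data=(1, 2, 3), dynamic_dims=(2,))
assert x.data == y.data          # the old element-only check would pass
assert not literals_equal(x, y)  # the new check catches the mismatch
```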

PiperOrigin-RevId: 506359222

* feat: update boringssl to fix aarch64 build failures

PiperOrigin-RevId: 506366004

* [TF:PJRT] Use PjRtDeviceContext in XlaDevice.

- Use AsyncValueAllocator as the allocator when PjRtDeviceContext is used.
- Update places that use XlaDeviceContext as signature to DeviceContext.
- Change GetXlaOpsCommonFlags to return XlaOpsCommonFlags* so that the flag tf_xla_use_device_api can be set in the test.
- Implement Name() in AsyncValueAllocator which is a virtual function.

PiperOrigin-RevId: 506369982

* Remove time (AKA time_fraction) field, since it's no longer used.

We now compute this in the frontend to avoid storing this redundant field in the protobuf.

PiperOrigin-RevId: 506372540

* Fix crash in simplifyDynamicGatherToGather

DynamicGatherOp's slice_sizes is 1DTensorOf<[HLO_DimensionValue]> where HLO_DimensionValue is AnyTypeOf<[Index, HLO_Int]>. However, GatherOp's slice_sizes is I64ElementsAttr. If there's a mismatch in element types, canonicalization from DynamicGatherOp to GatherOp will crash, so in that case we need to explicitly convert the elements.

PiperOrigin-RevId: 506374817

* Remove legacy references from `ops.py`.

This is done to eventually remove the lazy loads in `indexed_slices.py`.

PiperOrigin-RevId: 506375428

* Upgrade clang toolchain to use clang-16.

PiperOrigin-RevId: 506381712

* [xla:gpu] Remove check on XLA_FLAGS when doing deserialization

We no longer need this because XLA Runtime is enabled by default.

PiperOrigin-RevId: 506382068

* Adding profiler assertions for TPU.

PiperOrigin-RevId: 506382173

* Fix an "operand does not dominate" bug caused by tpu_extract_outside_compilation. This pass can create a _XlaRecvAtHostOp or a _XlaRecvAtHostV2Op to receive on the host side. Its operand identification function (GetStaticExternalOperands) avoids including operands that are already on the host by checking whether they are set by a recv, since a recv would already have been created. The bug was that only _XlaRecvAtHostV2Op was counted as a recv, not _XlaRecvAtHostOp.
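
The bug class can be sketched generically (the op names are real; the helper function and sets below are invented for illustration):

```python
# Hedged sketch: a membership check that recognizes only one of two
# equivalent recv ops misclassifies operands already received on the host.
RECV_OPS_BUGGY = {"_XlaRecvAtHostV2"}                    # misses the V1 op
RECV_OPS_FIXED = {"_XlaRecvAtHost", "_XlaRecvAtHostV2"}  # counts both

def is_set_by_recv(defining_op: str, recv_ops: set) -> bool:
    """True if the operand is already produced by a host-side recv."""
    return defining_op in recv_ops

# With the buggy set, a _XlaRecvAtHost result looks like it still needs a
# recv, which leads to the dominance violation described above.
assert not is_set_by_recv("_XlaRecvAtHost", RECV_OPS_BUGGY)
assert is_set_by_recv("_XlaRecvAtHost", RECV_OPS_FIXED)
```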

PiperOrigin-RevId: 506383107

* Add copybara config tests.

PiperOrigin-RevId: 506396032

* Update TFRT dependency to use revision
http://github.com/tensorflow/runtime/commit/250ad8a0ccdab9d6882931d0dcdfa8fa73eceadf.

PiperOrigin-RevId: 506399106

* [IFRT] Add additional ArrayImpl tests with various host buffer semantics

Additional tests verify that the `Array` implementation meets the API contract
as defined by `HostBufferSemantics`. This change also adds a revised version of
the `PjRtBuffer::HostBufferSemantics` comment. It does not yet define a new
IFRT `HostBufferSemantics` type for JAX compatibility.

PiperOrigin-RevId: 506401836

* [GmlSt] Use the gml-st-cpu-tiling-pipeline to test transformations.

We used to test separate transformation patterns that don't include
vectorization. These tests check the transformation + vectorization.
Later there will be additional CHECKs in the same files for bufferization to
verify that we don't allocate inside the loops.

Reduce and matmul will be updated in a follow-up.

PiperOrigin-RevId: 506408069

* Silence some pytype errors.

PiperOrigin-RevId: 506409026

* Update TFRT dependency to use revision
http://github.com/tensorflow/runtime/commit/c6df8512943f31fbf1d2cf3fcdcbc6bc1aa747db.

PiperOrigin-RevId: 506409909

* Add a DistributedValue that is backed by DTensor instance.

This can be used as the input and output value for a strategy.run function.

PiperOrigin-RevId: 506414540

* Redirect usages of `convert_variables_to_constants` from `graph_util_impl.py` to `convert_to_constants.py` to remove a cycle.

PiperOrigin-RevId: 506414914

* Add out of bounds array check to dynamic_stitch_op.

PiperOrigin-RevId: 506418249

* Partial rollforward of PR #59315. Bring back two following fixes for TF MacOS + Metal PluggableDevice:
- TensorList op exclusion for MacOS
- Temporary hack to avoid jit_compile on MacOS.

Eigen buffer alignment fix is not included in this rollforward and will be in a separate commit.
END_PUBLIC

*** Reason for rollback ***

Partial rollforward of PR #59315.

*** Original change description ***

Automated g4 rollback of changelist 504212615.

Rollback PR #59315. Breaks MacOS tests. For eg: tensorflow/core/framework:tensor_test

PiperOrigin-RevId: 506419803

* [XLA] Create skeleton for a partition assignment pass, which annotates the given module with (good) shardings, by adding:
- an HLO pass: PartitionAssignment,
- a base class: PartitioningAlgorithm,
- a no-op derived class extending PartitioningAlgorithm: NoopPartitioning, and
- a flag to determine the algorithm (kind/type): xla_partitioning_algorithm.

PiperOrigin-RevId: 506423268

* [XLA:CPU] Add concat benchmarks

PiperOrigin-RevId: 506427653

* Fix memory corruption vulnerability in reverse_sequence_op.

PiperOrigin-RevId: 506433062

* [XLA] Support merging partially replicated dimension in complex resharding code.

PiperOrigin-RevId: 506433374

* Implement Tensorflow verification pass that ensures no TF dialect ops remain.

Required so that we can remove the allow_partial_conversion check in LegalizeTF, which is required to only call LegalizeTF once.

PiperOrigin-RevId: 506434351

* Delete SetPayload(absl::string_view, absl::string_view);

PiperOrigin-RevId: 506438247

* Integrate LLVM at llvm/llvm-project@dbd02002dd0c

Updates LLVM usage to match
[dbd02002dd0c](https://github.com/llvm/llvm-project/commit/dbd02002dd0c)

PiperOrigin-RevId: 506440373

* [Tensorflow] Fix security vulnerability with TensorListSplitOp

PiperOrigin-RevId: 506441188

* Remove unused code in cost analysis

PiperOrigin-RevId: 506441280

* Update TFRT dependency to use revision
http://github.com/tensorflow/runtime/commit/851d62673267a061aab673a33fa9ad37a5aa39fb.

PiperOrigin-RevId: 506442730

* Replace `error_message()` with `message()` since we have upgraded to a newer protobuf

PiperOrigin-RevId: 506443040

* #tf-data-service Add a test util for `choose_from_datasets`.

PiperOrigin-RevId: 506444429

* #tf-data-service Add a check for infinite datasets.

The next step is to support `repeat`, for example:

datasets = [tf.data.Dataset.from_tensors("a").repeat(10),
            tf.data.Dataset.from_tensors("b").repeat(10),
            tf.data.Dataset.from_tensors("c").repeat(10)]
choice_dataset = tf.data.Dataset.range(3).repeat()
dataset = tf.data.Dataset.choose_from_datasets(datasets, choice_dataset)

PiperOrigin-RevId: 506448078

* Recognize empty input_signatures with default value parameters

PiperOrigin-RevId: 506449019

* Limit the thread pool size of the TFE context used for constant folding

PiperOrigin-RevId: 506454804

* [jax] Skip compilation cache test for older jaxlibs

PiperOrigin-RevId: 506460144

* Add back CLANG_CUDA_COMPILER_PATH to gpu_clang.bazelrc.

PiperOrigin-RevId: 506468121

* Rollback the change to add GPU PJRT client.

PiperOrigin-RevId: 506477686

* - Add _cast() to TraceType
- Implement _cast() for default types and TensorSpec

PiperOrigin-RevId: 506479924

* Cast `status.message()` explicitly to `std::string`

PiperOrigin-RevId: 506502241

* Canonicalize RealDynamicSliceOp to DynamicSliceOp

We know how to canonicalize RealDynamicSliceOp to SliceOp (when all attributes are static), but there is one more case when RealDynamicSliceOp can be canonicalized to a simpler op

  // Before rewrite
  %slice_sizes = mhlo.constant ...
  %limit_indices = mhlo.add %start_indices, %slice_sizes
  %strides = mhlo.constant dense<1>
  %result = mhlo.real_dynamic_slice %operand, %start_indices, %limit_indices, %strides

  // After rewrite
  %result = "mhlo.dynamic_slice"(%operand, %start_indices0, ...) { slice_sizes = ... }

PiperOrigin-RevId: 506504799

* Disable `tensorflow/dtensor/python/tests:spmd_test` on Python 3.8

PiperOrigin-RevId: 506505212

* Disable `tensorflow/dtensor/python/tests:multi_client_test_nccl_local` on OSS

PiperOrigin-RevId: 506507742

* Add Python specific disable tags to the bazel configs

PiperOrigin-RevId: 506508975

* [Tensorflow] Fix security vulnerability with DenseBincountOp

PiperOrigin-RevId: 506514542

* Update Eigen to commit:3460f3558e7b469efb8a225894e21929c8c77629

CHANGELOG
=========
3460f3558 - Use VERIFY_IS_EQUAL to compare to zeros.
13a1f25da - Revert StlIterators edit from "Fix undefined behavior..."
fd2fd4870 - Update file ForwardDeclarations.h
37b2e9717 - Tweak special case handling in atan2.
a1cdcdb03 - Fix undefined behavior in Block access
4a58f30aa - Fix pre-POWER8_VECTOR bugs in pcmp_lt and pnegate and reactivate psqrt.
12ad99ce6 - Remove unused variables from GenericPacketMathFunctions.h
6987a200b - Fix stupid sparse bugs with outerSize == 0
0471e61b4 - Optimize various mathematical packet ops
1aa6dc200 - Fix sparse warnings
17ae83a96 - Fix bugs exposed by enabling GPU asserts.
ab8725d94 - Turn off vectorize version of rsqrt - doesn't match generic version
6d9f662a7 - Tweak atan2
6fc9de7d9 - Fix slowdown in bfloat16 MMA when rows is not a multiple of 8 or columns is not a multiple of 4.
6d4221af7 - Revert qr tests
7f58bc98b - Refactor sparse
576448572 - More fixes for __GNUC_PATCHLEVEL__.
164ddf75a - Use __GNUC_PATCHLEVEL__ rather than __GNUC_PATCH__, according to the documentation https://gcc.gnu.org/onlinedocs/cpp/Common-Predefined-Macros.html
5a7ca681d - Fix sparse insert
08c961e83 - Add custom ODR-safe assert.
3fe8c5110 - Replace the Deprecated `$<CONFIGURATION>` with `$<CONFIG>`
d70b4864d - issue #2581: review and cleanup of compiler version checks
b52312068 - [SYCL-2020 Support] Enabling Intel DPCPP Compiler support to Eigen
bae119bb7 - Support per-thread is_malloc_allowed() state
fa0bd2c34 - improve sparse permutations
2e61c0c6b - Add missing EIGEN_DEVICE_FUNC in a few places when called by asserts.
4aca06f63 - avoid move assignment in ColPivHouseholderQR
68082b822 - Fix QR, again
4d0576534 - Altivec fixes for Darwin: do not use unsupported VSX insns

PiperOrigin-RevId: 506525228

* Update TFRT dependency to use revision
http://github.com/tensorflow/runtime/commit/ec4a1c4d591c9c5be3ae207551452f2f667177c7.

PiperOrigin-RevId: 506526946

* Update GraphDef version to 1395.

PiperOrigin-RevId: 506547190

* compat: Update forward compatibility horizon to 2023-02-02

PiperOrigin-RevId: 506547933

* [XLA] Use wide accumulator for integer types in HloEvaluator.

Generally, this should not affect the operations, as the results are downcast to ReturnT.
Some integer operations (SHR, CLZ, popcnt) were updated, as they didn't previously support cases where ReturnT != ElementwiseT.

For convolutions, clamp the result to the ReturnT range, as discarding the high bits doesn't make sense. This allows enabling convolution tests that would otherwise fail (cl/506267884).
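
The clamping behavior can be illustrated with a small NumPy sketch (not the HloEvaluator code):

```python
import numpy as np

# Sum narrow integer values in a wide int64 accumulator, then clamp to the
# narrow result range instead of letting a plain downcast discard high bits.
def accumulate_and_clamp(values, result_dtype=np.int8):
    wide = np.sum(np.asarray(values, dtype=np.int64))  # wide accumulator
    info = np.iinfo(result_dtype)
    return result_dtype(np.clip(wide, info.min, info.max))

# 100 + 100 = 200 overflows int8; clamping saturates to 127 instead of
# producing a wrapped-around value.
assert accumulate_and_clamp([100, 100]) == 127
assert accumulate_and_clamp([-100, -100]) == -128
```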

PiperOrigin-RevId: 506548096

* Add `diagonal_recurrent_tensors` attribute to UNIDIRECTIONAL_SEQUENCE_LSTM op.

PiperOrigin-RevId: 506553811

* Add test for tf.TensorScatterAdd

PiperOrigin-RevId: 506561719

* Reference the `benchmark_model` instructions from the delegate performance benchmark README.  Running `benchmark_model` can be useful for quick feedback during the early stages of development.

PiperOrigin-RevId: 506568968

* Add convolution tests for int8x32 cuDNN vectorized layout

PiperOrigin-RevId: 506573468

* [GmlSt] Split and clean-up codegen tests for matmul.

PiperOrigin-RevId: 506574062

* Also log the execution time in run_hlo_module.

replay_computation has this functionality, and the goal is to replace it with
run_hlo_module.

PiperOrigin-RevId: 506584404

* Integrate LLVM at llvm/llvm-project@7d3a181c8c18

Updates LLVM usage to match
[7d3a181c8c18](https://github.com/llvm/llvm-project/commit/7d3a181c8c18)

PiperOrigin-RevId: 506591519

* Update TFRT dependency to use revision
http://github.com/tensorflow/runtime/commit/49fef8924ba03c721f5f1125df217d902c42d1c3.

PiperOrigin-RevId: 506593472

* [GmlSt] Split and clean-up codegen tests for reduce.

PiperOrigin-RevId: 506601849

* [XLA:TPU] Avoids serializing large literals that can cause high compilation latencies.

PiperOrigin-RevId: 506622024

* Hide extra devices instead of raising an error.

This change also relaxes DTensor's safety check in the NCCL path,
to confirm that TF's set-visible-devices doesn't affect the physical device list.

PiperOrigin-RevId: 506643035

* Cleanup: rename initializers_v2.py to initializers.py.

PiperOrigin-RevId: 506645119

* Suppress a noisy log line.

PiperOrigin-RevId: 506651054

* Provide a better error message in case of compilation failure

PiperOrigin-RevId: 506656750

* Register a custom codec for `extension_type.ExtensionTypeSpec` to remove `nested_structure_coder.py`'s dependency on `extension_type.py`.

PiperOrigin-RevId: 506657587

* Fix use-after-move in iterator_ops.cc

PiperOrigin-RevId: 506659126

* [XNNPACK] Fix some error logging in delegate

logging_context can be nullptr, so use a different macro for error logging

PiperOrigin-RevId: 506664135

* Register codecs for `row_partition.RowPartitionSpec` and `ragged_tensor.RaggedTensorSpec` to remove `nested_structure_coder.py`'s dependency on them.

PiperOrigin-RevId: 506666103

* Change how -Xcuda-fatbinary is passed depending on the compiler used.

PiperOrigin-RevId: 506667062

* Register a codec for `resource_variable_ops.VariableSpec` to remove `nested_structure_coder.py`'s dependency on `resource_variable_ops.py`.

PiperOrigin-RevId: 506676601

* [xla:cpu:next] Add remove-copies-to-out-params pass

To remove redundant allocations and subsequent copies to out parameters,
which come from buffer allocation.

The reason why these exist is that during bufferization we must allocate
a buffer for each returned result. It is only post-bufferization that we run
BufferResultsToOutParams, which inserts copies to those "out" buffers
from the allocated ones.

The pass added here detects this pattern and removes the allocation and
copy, using each output buffer directly.

Example input:
```
func.func @main(%arg0: tensor<1024xf64>) -> tensor<1024xf64> {
  %0 = mhlo.add %arg0, %arg0 : tensor<1024xf64>
  return %0 : tensor<1024xf64>
}
```

$ xla-opt -split-input-file -hlo-xla-runtime-pipeline %s

- Before:
```
module {
  func.func @main(%arg0: memref<1024xf64>, %arg1: memref<1024xf64>) {
    %c1024 = arith.constant 1024 : index
    %c0 = arith.constant 0 : index
    %c8 = arith.constant 8 : index
    %cst = arith.constant 0.000000e+00 : f64
    %alloc = memref.alloc() {alignment = 64 : i64} : memref<1024xf64>
    scf.parallel (%arg2) = (%c0) to (%c1024) step (%c8) {
      %subview = memref.subview %alloc[%arg2] [8] [1] : memref<1024xf64> to memref<8xf64, strided<[1], offset: ?>>
      %0 = vector.transfer_read %arg0[%arg2], %cst {in_bounds = [true]} : memref<1024xf64>, vector<8xf64>
      %1 = arith.addf %0, %0 : vector<8xf64>
      vector.transfer_write %1, %subview[%c0] {in_bounds = [true]} : vector<8xf64>, memref<8xf64, strided<[1], offset: ?>>
      scf.yield
    }
    memref.copy %alloc, %arg1 : memref<1024xf64> to memref<1024xf64>
    return
  }
}

```

- After:
```
module {
  func.func @main(%arg0: memref<1024xf64>, %arg1: memref<1024xf64>) {
    %c1024 = arith.constant 1024 : index
    %c0 = arith.constant 0 : index
    %c8 = arith.constant 8 : index
    %cst = arith.constant 0.000000e+00 : f64
    scf.parallel (%arg2) = (%c0) to (%c1024) step (%c8) {
      %subview = memref.subview %arg1[%arg2] [8] [1] : memref<1024xf64> to memref<8xf64, strided<[1], offset: ?>>
      %0 = vector.transfer_read %arg0[%arg2], %cst {in_bounds = [true]} : memref<1024xf64>, vector<8xf64>
      %1 = arith.addf %0, %0 : vector<8xf64>
      vector.transfer_write %1, %subview[%c0] {in_bounds = [true]} : vector<8xf64>, memref<8xf64, strided<[1], offset: ?>>
      scf.yield
    }
    return
  }
}
```

PiperOrigin-RevId: 506678216

* Register a codec for `tensor_array_ops.TensorArraySpec` to remove `nested_structure_coder.py`'s dependency on `tensor_array_ops.py`.

PiperOrigin-RevId: 506681769

* [mhlo] Remove the undefined AllReduceOp build().

PiperOrigin-RevId: 506683695

* Use graph export pipeline V2 in TPU Bridge

This new graph export pipeline avoids generating some unnecessary control dependencies, brings better performance, and makes the control dependencies more readable.

PiperOrigin-RevId: 506687026

* Move custom codecs for TensorSpec and BoundedTensorSpec to `tensor_spec.py`. Register a codec for `sparse_tensor.SparseTensorSpec`.

PiperOrigin-RevId: 506690720

* Factor out get_default_ops and make get_ops_from_nodedef a public method in TF selective_registration_header_lib.

PiperOrigin-RevId: 506697634

* Integrate LLVM at llvm/llvm-project@6dd84983d0c1

Updates LLVM usage to match
[6dd84983d0c1](https://github.com/llvm/llvm-project/commit/6dd84983d0c1)

PiperOrigin-RevId: 506708273

* Update TFRT dependency to use revision
http://github.com/tensorflow/runtime/commit/4487e42c7e3dc1f6d641bb1c98b01990fbbbc167.

PiperOrigin-RevId: 506711487

* Register a codec for `indexed_slices.IndexedSlicesSpec` to remove `nested_structure_coder.py`'s dependency on `indexed_slices.py`.

PiperOrigin-RevId: 506715268

* Disable tsan for distributed snapshot fault tolerance tests.

PiperOrigin-RevId: 506724161

* #tf-data-service Update the default protocol in DistributedSaveOp.

PiperOrigin-RevId: 506725241

* use common string for profiler lock contention detection.

PiperOrigin-RevId: 506726080

* gpu_delegate: Link nativewindow

PiperOrigin-RevId: 506727041

* Call XNNPACK Transpose from TFLite kernel.

PiperOrigin-RevId: 506737055

* Integrate LLVM at llvm/llvm-project@16c8709cf61b

Updates LLVM usage to match
[16c8709cf61b](https://github.com/llvm/llvm-project/commit/16c8709cf61b)

PiperOrigin-RevId: 506742350

* Fix dimension mismatch bug in MultinomialOp GPU implementation.

PiperOrigin-RevId: 506744108

* Update TFRT dependency to use revision
http://github.com/tensorflow/runtime/commit/6e37e534eaa88a022470d77d457722249235d331.

PiperOrigin-RevId: 506745467

* #tf-data-service Write tmp files in the same file system as the snapshot.

`rename` requires the source and destination files to be in the same file
system.

The temp files are named similar to
https://github.com/tensorflow/tensorflow/blob/33722bc185e676c99f738790ef35db8479f2f7d4/tensorflow/core/data/snapshot_utils.cc#L950.
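
The same-file-system constraint can be sketched with a generic atomic-write pattern (illustrative Python, not the tf.data service code):

```python
import os
import tempfile

# Write to a temp file in the same directory (hence same file system) as the
# target, then atomically rename it into place. A temp file on a different
# file system would make the rename fail or lose atomicity.
def atomic_write(path: str, data: bytes) -> None:
    dir_name = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp_path, path)  # same file system -> atomic rename
    except BaseException:
        os.unlink(tmp_path)
        raise

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "snapshot.bin")
    atomic_write(target, b"chunk-0")
    with open(target, "rb") as f:
        assert f.read() == b"chunk-0"
```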

PiperOrigin-RevId: 506746696

* Add a metric in the eager function runtime to measure when a tf.function should be compiled.

This metric will cover all TF2 jit_compilation paths, including TPU, to give an accurate
count of the tf.functions that will be compiled per device type.

PiperOrigin-RevId: 506750567

* Integrate LLVM at llvm/llvm-project@10939d1d580b

Updates LLVM usage to match
[10939d1d580b](https://github.com/llvm/llvm-project/commit/10939d1d580b)

PiperOrigin-RevId: 506761372

* #tf-data-service Use absl::Duration in dispatcher_impl.

PiperOrigin-RevId: 506762070

* Update TFRT dependency to use revision
http://github.com/tensorflow/runtime/commit/91d765cad5599f9710973d3e34d4dc22583e2e79.

PiperOrigin-RevId: 506763472

* Fix for the "bfc_allocator_delay" metric being registered multiple times.

PiperOrigin-RevId: 506778911

* support big-endian for numpy type descriptor

* Patch llvm to fix Windows build.

PiperOrigin-RevId: 506800859

* Allow batch function to avoid padding the inputs.

PiperOrigin-RevId: 506820270

* [XLA] Speed up constant folding by optimizing and inlining a few simple index/shape/layout utility functions.

PiperOrigin-RevId: 506827782

* Integrate LLVM at llvm/llvm-project@9b7e57470155

Updates LLVM usage to match
[9b7e57470155](https://github.com/llvm/llvm-project/commit/9b7e57470155)

PiperOrigin-RevId: 506834755

* Update GraphDef version to 1396.

PiperOrigin-RevId: 506834807

* compat: Update forward compatibility horizon to 2023-02-03

PiperOrigin-RevId: 506834823

* Update TFRT dependency to use revision
http://github.com/tensorflow/runtime/commit/27d9da88424935b171d2f28b63658d3ee85bfe5c.

PiperOrigin-RevId: 506836991

* Move ReplicateTensorListInitOps pass before Canonicalizer pass to cleanup uninitialized tensor lists.

PiperOrigin-RevId: 506838420

* Implement `tf.AssignVariableOp(tf.VarHandleOp, tf.ConstOp)` removal pattern into a separate pass.

This was part of `InsertRestoreOpPass`, which replaces the variable initialization patterns that use consts with initialization that uses the RestoreV2 op.
In order to support future plans to insert the SaveV2 op and to make the passes more modular,
this change splits the pattern-removal part out into a separate pass named `RemoveVariableInitializationByConstPass`.

The new pass runs right after `InsertRestoreOpPass`, and the resulting exported model should be the same as without this change.

PiperOrigin-RevId: 506863386

* Use the new MLIR lowering for the 16x8 unidirectional LSTM operation

PiperOrigin-RevId: 506864247

* Add tests to rigorously check channel dimension attributes

Checks that all relevant attributes are properly changed for per_channel and per_tensor case.

PiperOrigin-RevId: 506873551

* Fix a bug in the validation_model build macro.

This build macro uses a temporary file named $(@D)/tmp. That is a problem because the macro is instantiated twice in the same package, so the two generated rules run in parallel during the build, both write to the same tmp file, and both then try to remove it. One of the rules therefore fails with a file-not-found error, the file having already been removed by the other rule.

The fix is to instead use $(@D)/<name>.tflite.tmp as the name of the temporary file, where <name> is the name of the rule. This is sufficient to ensure that the temporary file names used by the different instantiations of this build macro are distinct.

PiperOrigin-RevId: 506882948

* Fixes crashes due to fuzz input for ApproxTopK

PiperOrigin-RevId: 506898015

* [JITRT] Add scatter benchmark.

PiperOrigin-RevId: 506909548

* [XLA] Reduce some unnecessary string creations in xla::Printer callsites.

PiperOrigin-RevId: 506909676

* Always instrument at module level, regardless of whether tracing is enabled.

PiperOrigin-RevId: 506915862

* In ExportXlaFunctions, iterate all ops under the xla function instead of only the ones that inherit from SymbolUserOpInterface.

Ideally all ops referencing functions should inherit from SymbolUserOpInterface. But that would take some time to happen.

PiperOrigin-RevId: 506918556

* Move `_TensorShapeCodec` to `tensor_shape.py` to remove `nested_structure_coder.py`'s dependency on `tensor_shape.py`.

PiperOrigin-RevId: 506923585

* Adds a method to build datasets on workers without creating an iterator when doing ParameterServerStrategy training.

PiperOrigin-RevId: 506923786

* adding TensorShape fuzzers

PiperOrigin-RevId: 506927213

* Turn lazy load of `nested_structure_coder` in `type_spec.py` into a regular import.

PiperOrigin-RevId: 506927612

* Update TFRT dependency to use revision
http://github.com/tensorflow/runtime/commit/2796b3e7ea8a3e7614029f9e307c0113f8d6bb90.

PiperOrigin-RevId: 506929191

* Set shared_name to node name if it is empty in BatchFunctionFallback op

PiperOrigin-RevId: 506931118

* Add some missing BUILD dependencies of `structured_tensor.py`.

PiperOrigin-RevId: 506932682

* Remove dependency on indexed_slices.internal_convert_to_tensor_or_indexed_slices.

PiperOrigin-RevId: 506934163

* [XLA] Add way to allow propagation to output only to a subset of root instruction tuple shardings.

PiperOrigin-RevId: 506935285

* Support float64 under CPU/GPU CopyToMesh

PiperOrigin-RevId: 506939058

* Turn lazy loads related to `saved_model/nested_structure_coder.py` in `data/ops/` into regular imports.

PiperOrigin-RevId: 506940249

* Delete legacy objects from `ops.py`.

All references to these objects now point at their respective source files.

PiperOrigin-RevId: 506945566

* Add a new replica context for DTensor related strategy.

Since DTensor operates in a global context, we mostly raise an explicit error stating that the methods are not available in this context. This leads to a behavior discrepancy between the existing strategies and the new one. We should consider this carefully when planning to rebase the strategy on top of DTensor (e.g. using replicated_run/spmd_run).

PiperOrigin-RevId: 506945673

* Replace `nest` usage in `FuncGraph.capture_call_time_value()` with `TraceType._cast()`

PiperOrigin-RevId: 506949659

* Add implementation for strategy.gather() for new MirroredStrategy.

PiperOrigin-RevId: 506952602

* [Tensorflow] Fix security vulnerability with UniqueV2. The bug is that the axis index should be canonicalized when it's negative.
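
The general fix pattern can be sketched as follows (generic code, not the actual TF kernel):

```python
# Map a negative axis into [0, rank) and bounds-check it before using it to
# index shape arrays; an uncanonicalized negative axis would index out of
# bounds.
def canonicalize_axis(axis: int, rank: int) -> int:
    if axis < 0:
        axis += rank
    if not 0 <= axis < rank:
        raise ValueError(f"axis out of range for rank {rank}")
    return axis

assert canonicalize_axis(-1, 3) == 2  # -1 means the last dimension
assert canonicalize_axis(0, 3) == 0
```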

PiperOrigin-RevId: 506966510

* [XLA:CPU] Scalarize scf.if op

PiperOrigin-RevId: 506969655

* [tflite-gpu] Fix OpenGL slice calculation bug.

PiperOrigin-RevId: 506971865

* Don't use xnnpack in kernel with the flag --define=tflite_with_xnnpack=false

PiperOrigin-RevId: 506975198

* [XLA:TPU] Speed up constant propagation by inlining a few methods in comparison_util.h.

PiperOrigin-RevId: 506977511

* Made `TensorWithLayout` an abstract class and defined `TensorWithLayoutTf` to hold the current implementation using TF runtime. Also used `ConstValueNode` to capture the information useful for TF runtime.

PiperOrigin-RevId: 506978870

* Skip license check for pybind11_abseil

PiperOrigin-RevId: 506979746

* Moving control_captures out of FuncGraph

PiperOrigin-RevId: 506985311

* Use gfile over the native open for an implementation based on TensorFlow's C++ FileSystem API.

PiperOrigin-RevId: 506985831

* Make the HloModule constructor that takes CompilationEnvironments public.

PiperOrigin-RevId: 506993058

* Add float16 and float64 input&output type support for TFLite operator 'cast'

Float16 and float64 input/output types for the TensorFlow 'cast' operator are used in some Federated Learning models; adding support for these types to the TFLite 'cast' op lets such operators be converted to TFLite built-in ops instead of flex ops.

PiperOrigin-RevId: 506997479

* color adjust for timeline

PiperOrigin-RevId: 507002833

* [TF:PLUGIN] Fix a dependency issue.

PiperOrigin-RevId: 507003433

* Add isinstance check for eager execution.

PiperOrigin-RevId: 507003564

- Add 4-bit support to depthwise_conv.cc and fully_connected.cc in TFLite, using the reference kernels' 4-bit functions for those ops, and add/change supporting functions to get the tests to run in fully_connected_test.cc.

- Add a 4-bit test (Simple4bit3x3FilterTest) to depthwise_conv_test.cc, ported from the existing Simple3x3FilterTest with per-channel quantization scales adjusted for 4-bit input.

- Add a 4-bit test (SimpleTestQuantizedInt4) to fully_connected_test.cc, ported from the existing SimpleTestQuantizedInt8 with outputs adjusted for 4-bit.

PiperOrigin-RevId: 507003918

* Add Keras metrics FBetaScore and F1Score.

PiperOrigin-RevId: 507013486

* Update the TensorFlow RELEASE.md on master.
(We cut the branch for 2.12.0. Insert new blurb for new release notes TF 2.13.0)

PiperOrigin-RevId: 507017354

* Implement functional<->regional transformation for `CaseOp` and `CaseRegionOp`

Even if we already have `CaseRegionOp` as a region version of `CaseOp`, the associated transformations were missing in functional<->regional control flow transformation passes. This CL implements them.

PiperOrigin-RevId: 507017912

In the code, we have some modules that are "based" off other modules but are not exact clones. Change these code locations so that these modules retain the CompilationEnvironments from the original module.

PiperOrigin-RevId: 507020873

* Update the TensorFlow RELEASE.md on master.
(We cut the branch for 2.12.0. Insert new blurb for new release notes TF 2.13.0)

PiperOrigin-RevId: 507021191

* Break dependency between tensorflow/core/function/transform:transform and python/saved_model.

PiperOrigin-RevId: 507021697

* Handle snapshot and stream completion in tf.data service dispatcher.

PiperOrigin-RevId: 507023442

* Fix release notes.

Because the automation that updates the release notes in master after the release branch cut has been destroyed, and the step was not done manually in time, we have commits such as https://github.com/tensorflow/tensorflow/commit/9fbf1137044ac63e296ebf73c61b1e8513149b1c# and https://github.com/tensorflow/tensorflow/commit/ba1372a41ed90aba0efa5763b06350dd0ee7074b that write the wrong contents to the release notes.

PiperOrigin-RevId: 507025073

* Implement functional<->regional transformation for `CaseOp` and `CaseRegionOp`

Even if we already have `CaseRegionOp` as a region version of `CaseOp`, the associated transformations were missing in functional<->regional control flow transformation passes. This CL implements them.

PiperOrigin-RevId: 507030461

* #tf-data-service Wait for the snapshot DONE file in unit tests.

PiperOrigin-RevId: 507030943

* Expose TF2XLA MLIR pipeline for reuse

PiperOrigin-RevId: 507033096

* Update master version numbers to 2.13.0.
Branch for TF 2.12 releases has been cut. Switch to new version.

PiperOrigin-RevId: 507037877

* [PJRT C API] Add C API for PjRtCApiClient::LookupAddressableDevice.

wrapped_device_map_ and GetCApiDevice can be removed after LookupAddressableDevice is added.

PiperOrigin-RevId: 507060995

* Work around compiler bug returning an optional unique_ptr.

It looks like some compilers (e.g. gcc-7) don't like returning
a moveable value directly when the return type is `std::optional`
(i.e. it fails to treat the returned value as an r-value and
automatically construct an optional instance around it).
Explicitly creating the `std::optional` and returning _that_
seems to work around the issue.

PiperOrigin-RevId: 507062621

* Update GraphDef version to 1397.

PiperOrigin-RevId: 507090576

* compat: Update forward compatibility horizon to 2023-02-04

PiperOrigin-RevId: 507090630

* Update the version of Estimator nightly and Keras nightly used in TensorFlow after the corresponding nightly packages with the next version are released

PiperOrigin-RevId: 507168035

* Add tfstreamz for input spec mismatch cases.

PiperOrigin-RevId: 507180105

* Implement shape inference for `CaseOp` and `CaseRegionOp`

PiperOrigin-RevId: 507201660

* compat: Update forward compatibility horizon to 2023-02-05

PiperOrigin-RevId: 507239797

* Update GraphDef version to 1398.

PiperOrigin-RevId: 507239865

* #tf-data-service Use a mock dispatcher in the split provider test.

The original test works by manipulating files, which makes it depend on
the snapshot_manager's state. When the snapshot_manager implementation
changes, it could break this test because of the structure of the test,
not because of a bug. Decoupling it from the dispatcher makes the test
cleaner, more stable, and less likely to be flaky.

PiperOrigin-RevId: 507312461

* Preserve the linear index when computing the operand of concatenate.

If the Concatenate op concatenates the fastest varying dimension, we can
relatively cheaply preserve the linear index.

This is a HLO snippet where we see a 30% improvement with this change:

```
HloModule concatenate

ENTRY main {
  param = f32[100,11,12,13,160]{4,3,2,1,0} parameter(0)
  param2 = f32[27456000]{0} parameter(1)
  reshape = f32[100,11,12,13,160]{4,3,2,1,0} reshape(param2)
  ROOT concat = f32[100,11,12,13,320]{4,3,2,1,0} concatenate(param, reshape), dimensions={4}
}
```
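
The index arithmetic behind this optimization can be sketched as follows (an illustrative standalone function, not the XLA implementation): when the concatenated dimension is the minor-most, fastest-varying one, a linear index into the result maps to an operand and a linear index into that operand with one division and one remainder.

```cpp
#include <cstdint>

struct OperandIndex {
  int operand;     // which concatenate operand (0 or 1)
  int64_t linear;  // linear index into that operand
};

// Map a linear index into the concat result back to an operand index,
// assuming the concat dimension is minor-most. dim0/dim1 are the sizes
// of the concatenated dimension in the two operands (160 and 160 in the
// HLO example above, with 320 in the result).
OperandIndex MapConcatIndex(int64_t result_linear, int64_t dim0, int64_t dim1) {
  const int64_t total = dim0 + dim1;            // concat dim size in result
  const int64_t outer = result_linear / total;  // index over all other dims
  const int64_t inner = result_linear % total;  // index along the concat dim
  if (inner < dim0) return {0, outer * dim0 + inner};
  return {1, outer * dim1 + (inner - dim0)};
}
```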

PiperOrigin-RevId: 507391165

* compat: Update forward compatibility horizon to 2023-02-06

PiperOrigin-RevId: 507406616

* Update GraphDef version to 1399.

PiperOrigin-RevId: 507406617

* Add math dialect.

PiperOrigin-RevId: 507412764

* Integrate LLVM at llvm/llvm-project@8c712296fb75

Updates LLVM usage to match
[8c712296fb75](https://github.com/llvm/llvm-project/commit/8c712296fb75)

PiperOrigin-RevId: 507429387

* Update TFRT dependency to use revision
http://github.com/tensorflow/runtime/commit/e94b53450349f6837d11cc39f614af86c825ef94.

PiperOrigin-RevId: 507431532

* Track memref allocations, deallocations and peak heap size.
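
The bookkeeping this implies can be sketched as follows (a hypothetical minimal tracker, not the actual implementation): count live bytes across allocations and deallocations and record the high-water mark.

```cpp
#include <cstddef>

// Hypothetical sketch: track live bytes across memref allocations and
// deallocations, recording the peak heap size seen so far.
struct AllocStats {
  std::size_t live_bytes = 0;
  std::size_t peak_bytes = 0;
  void OnAlloc(std::size_t bytes) {
    live_bytes += bytes;
    if (live_bytes > peak_bytes) peak_bytes = live_bytes;
  }
  void OnDealloc(std::size_t bytes) { live_bytes -= bytes; }
};
```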

PiperOrigin-RevId: 507449826

* [TF/MLIR] Supports lowering mhlo.reduce_window when there is reshape/broadcast in the divisor.

PiperOrigin-RevId: 507455712

* updated TF patch

PiperOrigin-RevId: 507456411

* [XLA:GPU] Do not expand Scatter ops that are deterministic when running with --xla_gpu_deterministic_ops.

Currently all Scatter ops are expanded when deterministic ops are enforced. However, scatter ops on unique indices cannot have data races irrespective of the implementation. Similarly, scatter ops with associative combiner functions will compute deterministic results irrespective of the order in which the combiner function is applied. In both cases, scatter will be deterministic and expanding it is thus not required. This reduces slowdowns due to the xla_gpu_deterministic_ops flag.
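
The decision described above can be sketched as a predicate (illustrative names, not the actual XLA API): with deterministic ops enforced, expand a scatter only when neither property guarantees determinism.

```cpp
// Hypothetical sketch of the expansion decision.
struct ScatterProperties {
  bool unique_indices;        // no duplicate indices: no data races possible
  bool associative_combiner;  // result independent of application order
};

bool ShouldExpandScatter(const ScatterProperties& s, bool deterministic_ops) {
  if (!deterministic_ops) return false;  // flag off: no expansion needed here
  // Either property alone already guarantees a deterministic result.
  return !(s.unique_indices || s.associative_combiner);
}
```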

PiperOrigin-RevId: 507460239

* BladeDISC patch 20221101

1, Build related changes:
* Do not force `-std=c++17` for CUDA.
  https://github.com/pai-disc/tensorflow/commit/af4d5a07589c1d30c14c76aba6592554210451a5

* workaround compilation error on GCC 7.3.1: (#19)
  like: undefined reference to `std::function<tensorflow::StatusOr<xla::Shape> (xla::Shape const&, bool, mlir::XlaLayoutPreference)>::function()'

* [to #35355928] fix build issue when enabling MLIR_GPU_TO_CUBIN_PASS_ENABLE

* disable `-Werror=unused-result`
* disable `noincompatible_remove_legacy_whole_archive`
* add missing dependency `//tensorflow/compiler/xla/stream_executor:dnn_proto_cc_impl`

2, hlo related changes:

* [to #35377611] feat: bufferize DotOp and DotGeneralOp.
  remove community DotGeneralOp bufferizer as well
  Link: https://code.aone.alibaba-inc.com/algo/tensorflow_google/codereview/5885662

* [to #35377611] feat: bufferize ConvOp and DynamicConvOp.
  Link: https://code.aone.alibaba-inc.com/algo/tensorflow_google/codereview/5910553

* [to #37276187] feat: bufferize mhlo.log1pOp
  Link: https://code.aone.alibaba-inc.com/algo/tensorflow_google/codereview/6675686

* [to #36574644] [MLIR] [DISC] Add reverse op in lmhlo
  Link: https://code.aone.alibaba-inc.com/algo/tensorflow_google/codereview/6364791

* support RoundNearestEvenOp (#20)
* support RoundOp (#21)
* add const folding support for mhlo.TransposeOp
* disable some static checks of mhlo.dot_general
  `tf.BatchMatmul(tensor<?x?x?xf32>, tensor<4x?x?xf32>)` is valid, but the `tf.BatchMatmul` tf2mhlo converter does not handle shape propagation between the lhs & rhs, causing some of the static checks of `dot_general` to fail. Disable the checks as a workaround.

* [to #35355928] fix missing Elementwise traits of some special ops
* [to #35355928] fix a bug in lmhlo structured interface
* enhance `maybeCastTo` to support casting between i64 and i32.
* cast to i64 only if index type in DynamicIotaBroadcast pattern.
* add a patch not to fold UnrealizedConversionCastOp with ui/si type

3, TF related changes:

* lower tf.GatherV2 op to mhlo in dynamic shape
* lower tf.DepthwiseConv2DNative op to mhlo in dynamic shape
* lower tf.StridedSlice op to mhlo in dynamic shape
* lower tf.DynamicStitchOp to mhlo in dynamic shape
* lower tf.BatchMatMulOp op to mhlo in dynamic shape
* lower tf.Conv2DBackpropInputOp/tf.Conv2DBackpropFilterOp to mhlo in dynamic shape

* support tf.Range with negative stride
* support tf.StridedSlice with new_axis_mask
* add mhlo_disc dependency in xla legalize_tf
* legalize quantized tf const before lowering to mhlo

* add tf2mhlo support for tf.BatchMatMul
* bugfix: only handling non-const begin/end in ConvertStridedSliceOpDynamic
* bugfix: using tf.TileOp static tf2mhlo conversion only when all ins/outs have static shape
* bugfix: size of begin/end/strides < the rank of input
* bugfix: disable TF_RandomUniformOp in tf->mhlo
* fix a bug in tf.SigmoidGradOp legalize pattern
* fix a bug in ConvertSplitOp pattern
* fix a bug in ConvertUnpackOp pattern

* [to #36775150] feat: support multiple StreamExecutors, keyed by stream in the cache
  In the original design of StreamExecutor, one StreamExecutor maps to
  each device ordinal and owns one cudnnHandle. This means that in
  multi-stream applications there is only one StreamExecutor per GPU
  device. We observed dramatic performance degradation due to the lock
  in the SE; refer to https://yuque.antfin-inc.com/pai/blade/irdw7g.
  This commit revises the executor cache so that there is one
  StreamExecutor per GPU stream.
  Link: https://code.aone.alibaba-inc.com/algo/tensorflow_google/codereview/6455216

* [to #36574492]feat: dynamic strided slice op supports negative strides
  Link: https://code.aone.alibaba-inc.com/algo/tensorflow_google/codereview/6425899
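
The multi-stream executor cache change above can be sketched as follows (hypothetical names and key type, not the actual BladeDISC code): key the cache by (device ordinal, stream) instead of device ordinal alone, so each stream gets its own executor.

```cpp
#include <map>
#include <memory>
#include <utility>

// Stand-in for the real StreamExecutor (which owns one cudnnHandle).
struct StreamExecutor {};

// Hypothetical sketch of an executor cache keyed by (device, stream).
class ExecutorCache {
 public:
  StreamExecutor* GetOrCreate(int device_ordinal, void* stream) {
    auto key = std::make_pair(device_ordinal, stream);
    auto it = cache_.find(key);
    if (it == cache_.end()) {
      it = cache_.emplace(key, std::make_unique<StreamExecutor>()).first;
    }
    return it->second.get();
  }

 private:
  std::map<std::pair<int, void*>, std::unique_ptr<StreamExecutor>> cache_;
};
```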

4, ROCM/DCU related changes:

* [to #37531008] feat: Support building for DCU
  Link: https://code.aone.alibaba-inc.com/algo/tensorflow_google/codereview/6897356

* [to #37531008] allow assigned stream in SE for ROCm backend
  Link: https://code.aone.alibaba-inc.com/algo/tensorflow_google/codereview/6918944

* [to #37383705] patch: Add shfl.sync.bfly lowering to ROCm
  Link: https://code.aone.alibaba-inc.com/algo/tensorflow_google/codereview/6937184

* [to #39814120] Fix config for Macro and support gfx90a

* [rocm] remove -fno-canonical-system-headers as hipcc does not support it

* [to #37531008] expose some more methods in gpu_backend_lib.h

5, others:

* update gitignore
* Add llvm patch to fix error GVN on shared memory load. (#4)

add folder for mhlo::ClampOp

import iree in workspace file

increase `kFoldOpEltLimit` to 64GB

decompose disc-compiler (#27)

* decompose disc-compiler

* update

* fix some compilation errors

* Fix dynamic shape reifyReturnTypeShapes

* fix dynamic shapes & mhlo op operands & proto_alls

* fix llvm patch

* remove iree

* reverse unsigned type lowering workaround (#30)

---------

Co-authored-by: Aart Bik <[email protected]>
Co-authored-by: Fiona Lang <[email protected]>
Co-authored-by: Peng Wang <[email protected]>
Co-authored-by: Terry Heo <[email protected]>
Co-authored-by: Francois Chollet <[email protected]>
Co-authored-by: A. Unique TensorFlower <[email protected]>
Co-authored-by: John QiangZhang <[email protected]>
Co-authored-by: Eugene Burmako <[email protected]>
Co-authored-by: Justin Lebar <[email protected]>
Co-authored-by: Berkin Ilbeyi <[email protected]>
Co-authored-by: Xin Zhou <[email protected]>
Co-authored-by: Ce Zheng <[email protected]>
Co-authored-by: Dan Suh <[email protected]>
Co-authored-by: Dateng Lin <[email protected]>
Co-authored-by: Son Tuan Vu <[email protected]>
Co-authored-by: David Dunleavy <[email protected]>
Co-authored-by: Alexander Belyaev <[email protected]>
Co-authored-by: Ian Hua <[email protected]>
Co-authored-by: Johannes Reifferscheid <[email protected]>
Co-authored-by: Eugene Zhulenev <[email protected]>
Co-authored-by: Matt Callanan <[email protected]>
Co-authored-by: Tres Popp <[email protected]>
Co-authored-by: samypr100 <[email protected]>
Co-authored-by: Wilsin Gosti <[email protected]>
Co-authored-by: Yang Chen <[email protected]>
Co-authored-by: Faizan Muhammad <[email protected]>
Co-authored-by: Vadym Matsishevskyi <[email protected]>
Co-authored-by: Jieying Luo <[email protected]>
Co-authored-by: Juan Martinez Castellanos <[email protected]>
Co-authored-by: Brian Wieder <[email protected]>
Co-authored-by: Anlun Xu <[email protected]>
Co-authored-by: Roshani Narasimhan <[email protected]>
Co-authored-by: Michael Delorimier <[email protected]>
Co-authored-by: Hyeontaek Lim <[email protected]>
Co-authored-by: Rebecca Chen <[email protected]>
Co-authored-by: Scott Zhu <[email protected]>
Co-authored-by: Mason Chang <[email protected]>
Co-authored-by: Penporn Koanantakool <[email protected]>
Co-authored-by: Frederik Gossen <[email protected]>
Co-authored-by: Matthias Kramm <[email protected]>
Co-authored-by: Marcello Maggioni <[email protected]>
Co-authored-by: Jean-Baptiste Lespiau <[email protected]>
Co-authored-by: Jorge Gorbe Moya <[email protected]>
Co-authored-by: Jian Cai <[email protected]>
Co-authored-by: Kuangyuan Chen <[email protected]>
Co-authored-by: Nitin Srinivasan <[email protected]>
Co-authored-by: Zhufeng Pan <[email protected]>
Co-authored-by: Sergey Kozub <[email protected]>
Co-authored-by: Aliia Khasanova <[email protected]>
Co-authored-by: Matt Kreileder <[email protected]>
Co-authored-by: Adrian Kuegel <[email protected]>
Co-authored-by: Andrew Audibert <[email protected]>
Co-authored-by: Zhi An Ng <[email protected]>
Co-authored-by: Emilio Cota <[email protected]>
Co-authored-by: Jared Junyoung Lim <[email protected]>
Co-authored-by: Jie Sun <[email protected]>
Co-authored-by: Grant Jensen <[email protected]>
Co-authored-by: Antonio Sanchez <[email protected]>
Co-authored-by: Ken Franko <[email protected]>
Co-authored-by: Fabien Hertschuh <[email protected]>
Co-authored-by: Kazuaki Ishizaki <[email protected]>
Co-authored-by: Tomás Longeri <[email protected]>
Co-authored-by: Bangda Zhou <[email protected]>
Co-authored-by: Fergus Henderson <[email protected]>
Co-authored-by: Felix Chern <[email protected]>
Co-authored-by: Yifan Jiang <[email protected]>
Co-authored-by: James Mullenbach <[email protected]>
Co-authored-by: Justin Szaday <[email protected]>
Co-authored-by: Youchuan Hu <[email protected]>
Co-authored-by: Chuanhao Zhuge <[email protected]>
Co-authored-by: Vinila Settem <[email protected]>
Co-authored-by: Mihai Maruseac <[email protected]>
Co-authored-by: Ashish Shenoy <[email protected]>
Co-authored-by: Yiming Zhang <[email protected]>
Co-authored-by: Thomas Joerg <[email protected]>
Co-authored-by: Wenyi Zhao <[email protected]>
Co-authored-by: TanyoKwok <[email protected]>