Skip to content
Merged
1 change: 1 addition & 0 deletions cmssw-tool-conf.spec
Original file line number Diff line number Diff line change
Expand Up @@ -183,6 +183,7 @@ Requires: xtl
Requires: xgboost
Requires: pytorch

## INCLUDE tfaot-models
## INCLUDE cmssw-vectorization
## INCLUDE cmssw-drop-tools
## INCLUDE scram-tool-conf
1 change: 1 addition & 0 deletions pip/cms-tfaot.file
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Requires: py3-PyYAML py3-cmsml
1 change: 1 addition & 0 deletions pip/requirements.txt
Original file line number Diff line number Diff line change
Expand Up @@ -57,6 +57,7 @@ cleo==2.0.1
click==8.1.3
clikit==0.6.2
cmsml==0.2.2
cms-tfaot==1.0.0
contourpy==1.0.7
correctionlib==2.2.2
crashtest==0.4.1
Expand Down
22 changes: 22 additions & 0 deletions py3-cms-tfaot.spec
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
### RPM external py3-cms-tfaot 1.0.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@riga , for pip based packages it is batter to use the default way of building ( i.e. using https://github.com/cms-sw/cmsdist/blob/IB/CMSSW_14_1_X/master/build-with-pip.file ). So please drop this spec and add cmsdist/pip/cms-tfaot.file with following contents

Requires: py3-cmsml py3-tensorflow
%define github_user riga
%define tag a2bbd06cbed0efa1fa191cc2f50a6b03067e59d2
%define branch master
%define source0 git+https://github.com/%{github_user}/cms-tfaot.git?obj=%{branch}/%{tag}&export=%{n}-%{realversion}&output=/%{n}-%{realversion}-%{tag}.tgz 

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Changed in 7838448.

## INITENV +PATH PYTHON3PATH %i/${PYTHON3_LIB_SITE_PACKAGES}

%define github_user riga
%define tag a2bbd06cbed0efa1fa191cc2f50a6b03067e59d2
%define branch master
Source: git+https://github.com/%{github_user}/cms-tfaot.git?obj=%{branch}/%{tag}&export=%{n}-%{realversion}&output=/%{n}-%{realversion}-%{tag}.tgz

BuildRequires: py3-pip py3-setuptools py3-wheel
Requires: py3-cmsml py3-tensorflow

%prep
%setup -n %{n}-%{realversion}

%build

%install
pip3 install . --prefix=%{i} --no-deps

# copy test models
mkdir -p %{i}/share
cp -r test_models %{i}/share
1 change: 1 addition & 0 deletions python_tools.spec
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@ Requires: py3-keras
Requires: py3-scikit-learn
#save for the end
Requires: py3-tensorflow
Requires: py3-cms-tfaot
Requires: py3-cmsml
Requires: py3-law
Requires: py3-protobuf
Expand Down
Original file line number Diff line number Diff line change
@@ -1,9 +1,11 @@
<tool name="tensorflow-xla-runtime" version="@TOOL_VERSION@">
<client>
<environment name="TENSORFLOW_XLA_RUNTIME_BASE" default="@TOOL_ROOT@"/>
<environment name="LIBDIR" default="$TENSORFLOW_XLA_RUNTIME_BASE/lib/archive"/>
<environment name="LIBDIR" default="$TENSORFLOW_XLA_RUNTIME_BASE/lib"/>
</client>
<lib name="tf_xla_runtime-static"/>
<lib name="tf_xla_runtime"/>
<flags LDFLAGS="-Wl,--unresolved-symbols=ignore-in-shared-libs"/>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@riga , we do not need these ld flags any more .. right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah ok, I now have read your #9093 (comment) . I will check what else we can do to avoid the missing symbols. We should not add this flag here otherwise scram will not fail for missing symbols for every cmssw shared lib which depend on tensorflow-xml-runtime (directly or in-directly)

Copy link
Contributor Author

@riga riga Mar 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this already felt like a dangerous thing to do. Do you think there is any other option besides a) fetching the missing source files to tensorflow/tsl before compiling, or b) installing the full tsl as a dependency (still seems a bit like work in progress)?

Copy link
Contributor

@smuzaffar smuzaffar Mar 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@riga , all the missing TSL symbols are coming from https://github.com/tensorflow/tensorflow/blob/v2.12.1/tensorflow/compiler/xla/service/cpu/runtime_fork_join.cc e.g various macros calls VLOG, CHECK_* and tsl::BlockingCounter. If you think we can run without these missing symbols ( e.g. we never call these) so why not just then drop the runtime_fork_join.cc from the compilation?

Copy link
Contributor

@smuzaffar smuzaffar Mar 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can also add -Wl,-z,defs to cmake CXXFLAGS to make sure that the shared library has no missing symbols. I mean at link time make will fail if there are any missing symbols.

Copy link
Contributor Author

@riga riga Mar 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you think we can run without these missing symbols ( e.g. we never call these) so why not just then drop the runtime_fork_join.cc from the compilation?

Ok, so when we used the static lib before, the corresponding object files were likely dropped already, makes sense. And since we will potentially never use eigen thread pools for inference, we can skip it right away. Then I guess we are lucky that the tsl dependence only affects that part of the xla runtime source that we anyway don't need (well, at least for now).

We just verified that everything still works 👍


<use name="eigen"/>
<use name="tensorflow-includes"/>
</tool>
9 changes: 8 additions & 1 deletion tensorflow-xla-runtime.spec
Original file line number Diff line number Diff line change
Expand Up @@ -4,14 +4,15 @@

Source99: scram-tools.file/tools/eigen/env

Requires: eigen py3-tensorflow
Requires: eigen py3-tensorflow abseil-cpp
BuildRequires: cmake

%prep

cp -r ${PY3_TENSORFLOW_ROOT}/lib/python%{cms_python3_major_minor_version}/site-packages/tensorflow .

%build

source %{_sourcedir}/env
export CPATH="${CPATH}:${EIGEN_ROOT}/include/eigen3"

Expand All @@ -21,11 +22,17 @@ CXXFLAGS="-fPIC %{arch_build_flags} ${CMS_EIGEN_CXX_FLAGS}"
%endif

pushd tensorflow/xla_aot_runtime_src
# the CMakelists.txt file is not intended to produce a shared library so create a static one first
cmake . -DCMAKE_CXX_FLAGS="${CXXFLAGS}" -DCMAKE_CXX_STANDARD=%{cms_cxx_standard} -DBUILD_SHARED_LIBS=OFF
make %{makeprocesses}

# build the shared library, linking missing symbols from abseil
gcc -shared -o libtf_xla_runtime.so -Wl,--whole-archive libtf_xla_runtime.a -Wl,--no-whole-archive \
-L${ABSEIL_CPP_ROOT}/lib -labsl_strings -labsl_str_format_internal
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you understand why there are missing symbols? May be it is not picking up our abseil-cpp properly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@riga, I changed the cmake file (CMakeLists.txt) to have

add_library(tf_xla_runtime SHARED
  $<TARGET_OBJECTS:tf_xla_runtime_objects>
)

and then looking at the missing abseil symbols, I get

                 U _ZN4absl12lts_2022062319str_format_internal10FormatPackB5cxx11ENS1_21UntypedFormatSpecImplENS0_4SpanIKNS1_13FormatArgImplEEE
                 U _ZN4absl12lts_2022062319str_format_internal13FormatArgImpl8DispatchIiEEbNS2_4DataENS1_24FormatConversionSpecImplEPv
                 U _ZN4absl12lts_2022062319str_format_internal13FormatArgImpl8DispatchISt17basic_string_viewIcSt11char_traitsIcEEEEbNS2_4DataENS1_24FormatConversionSpecImplEPv
                 U _ZN4absl12lts_202206239StrAppendEPNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNS0_8AlphaNumE

all of these are defined in the absl_strings library. We can patch [a] xla_aot_runtime_src/CMakeLists.txt and then it should generate the shared library (properly linked to absl::strings)

[a]

--- xla_aot_runtime_src/CMakeLists.txt    2024-03-24 08:28:34.000000000 +0100
+++ xla_aot_runtime_src/CMakeLists.txt      2024-03-25 11:17:58.108587945 +0100
@@ -14,6 +14,8 @@
   -Wno-sign-compare
 )
 
-add_library(tf_xla_runtime STATIC
+find_package(absl REQUIRED)
+add_library(tf_xla_runtime SHARED
   $<TARGET_OBJECTS:tf_xla_runtime_objects>
 )
+target_link_libraries(tf_xla_runtime absl::strings)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you understand why there are missing symbols? May be it is not picking up our abseil-cpp properly?

Only partially. Indeed it seems that it didn't find the abseil-cpp libraries which is now fixed. For the missing symbols of the tsl library I think the reason is that the tensorflow_xla_runtime sources (shipped with py3-tensorflow) only provide a subset of the implementations defined by headers.

tsl

These symbols aren't needed anyway - unlike the absl ones - hence the -Wl,--unresolved-symbols=ignore-in-shared-libs in the toolfile. (I guess one could also fetch and compile the missing sources, though at the risk of inflating the runtime with unnecessary symbols.)


I added a patch in 7838448 and removed the initial static build step.

popd

%install

mkdir -p %{i}/lib/archive
mv tensorflow/xla_aot_runtime_src/libtf_xla_runtime.a %{i}/lib/archive/libtf_xla_runtime-static.a
mv tensorflow/xla_aot_runtime_src/libtf_xla_runtime.so %{i}/lib/
33 changes: 33 additions & 0 deletions tfaot-compile.file
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
## tfaot common compilation and requirement file
## specs including this file should provide:
## 1. a variable %{aot_config}, pointing to the aot config file of the model to compile (required)
## 2. a variable %{aot_source}, referring to a fetched source to unpack during %prep (optional)
## (in this case a "Source" should be defined and %{aot_source} is likely %{n}-%{realversion})

BuildRequires: py3-cms-tfaot
Requires: tensorflow-xla-runtime

%prep
%if "%{?aot_source}"
%setup -n %{aot_source}
%endif

%build
cms_tfaot_compile \
--aot-config "%{aot_config}" \
--tool-name "%{n}" \
--tool-base "%{i}" \
--output-directory compiled_model

%install
mkdir -p %{i}/lib
mv compiled_model/*.o %{i}/lib/

mkdir -p %{i}/include/%{n}
mv compiled_model/*.h %{i}/include/%{n}

mkdir -p %{i}/etc/scram.d
mv compiled_model/%{n}.xml %{i}/etc/scram.d/

%post
%{relocateConfig}etc/scram.d/%{n}.xml
Copy link
Contributor

@smuzaffar smuzaffar Mar 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@riga, there is hard-coded build path in tfaot-model-*/version/include/tfaot-model-test-multi/test_multi.h. Though it is in comment section but it is better to run relocation on these generated header files. I would suggest to add the following line here to relocated all generated header files

%relocateConfigAll include/%{n} *.h

[a]

e.g.

#ifndef TFAOT_MODEL_TEST_MULTI_H
#define TFAOT_MODEL_TEST_MULTI_H

/*
 * Auto-generated AOT wrapper for
 *   model path  : /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc12/external/py3-cms-tfaot/1.0.0-b8bfd2f94522378f26f348963d4ff4ac/share/test_models/multi/saved_model
 *   prefix      : test_multi
 *   namespace   : tfaot_model
 *   class name  : test_multi
 *   batch sizes : 1, 2, 4
 */

Copy link
Contributor Author

@riga riga Mar 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added 👍
Plus, the last commit includes some final changes to the aot python tools.
The accompanying cmssw PR is also updated with adjustments to the unit tests of the dev workflow which failed tonight.

5 changes: 5 additions & 0 deletions tfaot-model-test-multi.spec
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
### RPM external tfaot-model-test-multi 1.0.0

%define aot_config $PY3_CMS_TFAOT_ROOT/share/test_models/multi/aot_config.yaml

## INCLUDE tfaot-compile
5 changes: 5 additions & 0 deletions tfaot-model-test-simple.spec
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
### RPM external tfaot-model-test-simple 1.0.0

%define aot_config $PY3_CMS_TFAOT_ROOT/share/test_models/simple/aot_config.yaml

## INCLUDE tfaot-compile
3 changes: 3 additions & 0 deletions tfaot-models.file
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
# test models needed by unit tests in PhysicsTools/TensorFlowAOT
Requires: tfaot-model-test-simple
Requires: tfaot-model-test-multi