[GSoC] Add e2e test for tune api with LLM hyperparameter optimization #2420

Open
wants to merge 58 commits into
base: master
6be7f29
add e2e test for tune api
helenxie-bit Sep 3, 2024
1a1f119
upgrade training-operator sdk
helenxie-bit Sep 3, 2024
8461a49
specify the version of training operator sdk
helenxie-bit Sep 3, 2024
c860238
fix num_labels error and update the version of training operator cont…
helenxie-bit Sep 3, 2024
216ebd9
check the version of training operator
helenxie-bit Sep 3, 2024
f6b96f5
debug
helenxie-bit Sep 3, 2024
c636493
check import path of HuggingFaceModelParams
helenxie-bit Sep 3, 2024
8180422
update the version of training operator sdk
helenxie-bit Sep 5, 2024
6101489
update the name of experiment
helenxie-bit Sep 5, 2024
d67a1b8
add step of checking pod
helenxie-bit Sep 5, 2024
295abb6
check the logs of pod
helenxie-bit Sep 5, 2024
e0a1b6d
add check
helenxie-bit Sep 5, 2024
1df7df9
check reason for imagepullbackoff
helenxie-bit Sep 5, 2024
d1e1311
revert timeout limit
helenxie-bit Sep 5, 2024
0cc319f
fix format
helenxie-bit Sep 5, 2024
0383932
extend timeout limit
helenxie-bit Sep 13, 2024
08c8634
update training operator sdk version
helenxie-bit Sep 13, 2024
7a98a00
check the logs of pod
helenxie-bit Sep 13, 2024
8862d79
rerun tests
helenxie-bit Sep 13, 2024
e4f614d
update the function of getting logs
helenxie-bit Sep 14, 2024
0385eea
add the step of describing pod
helenxie-bit Sep 14, 2024
e0c5170
check disk space
helenxie-bit Sep 14, 2024
0286f70
change work directory
helenxie-bit Sep 17, 2024
f6e5ed5
change work directory
helenxie-bit Sep 17, 2024
7ea7e43
increase timeout limit
helenxie-bit Sep 17, 2024
25d99b1
check the logs of controller and events
helenxie-bit Sep 17, 2024
fcd64fa
change work directory
helenxie-bit Sep 18, 2024
122c611
change work directory
helenxie-bit Sep 18, 2024
c1fde09
change work directory
helenxie-bit Sep 18, 2024
8ff6864
check the logs of kubelet
helenxie-bit Sep 18, 2024
da3c298
check the logs of kubelet
helenxie-bit Sep 18, 2024
a1bff26
increase cpu
helenxie-bit Sep 19, 2024
bbae57b
check the logs of training operator
helenxie-bit Sep 19, 2024
e45ceac
check the use of resources
helenxie-bit Sep 19, 2024
4ae11ed
check the logs of container 'pytorch' and 'storage_initializer'
helenxie-bit Sep 20, 2024
bedab36
fix error of checking use of resources
helenxie-bit Sep 20, 2024
7bfb3cc
add other checks to find the error reason
helenxie-bit Sep 20, 2024
efffdc2
set 'storage_config'
helenxie-bit Sep 21, 2024
2a18b17
reduce the number of tests
helenxie-bit Sep 22, 2024
c6c964b
Check container runtime logs
helenxie-bit Sep 22, 2024
28ffb96
set the driver of minikube as docker
helenxie-bit Sep 22, 2024
dc684e3
set the driver of minikube to none
helenxie-bit Sep 22, 2024
a12034c
check logs of pod
helenxie-bit Sep 24, 2024
b088815
check memory usage
helenxie-bit Sep 29, 2024
e468b27
increase 'termination_grace_period_seconds' in podspec
helenxie-bit Sep 29, 2024
64d8fef
fix annotations error
helenxie-bit Sep 29, 2024
45db42e
restart docker
helenxie-bit Sep 30, 2024
c6e91cd
delete restarting docker
helenxie-bit Sep 30, 2024
b1a2390
use original docker data directory
helenxie-bit Oct 22, 2024
e5bf840
update installation of Katib SDK with extra requires
helenxie-bit Jan 23, 2025
fca94ae
test trainer image built with cpu
helenxie-bit Jan 23, 2025
b5cae0d
Merge remote-tracking branch 'upstream/master' into e2e-test-tune-api
helenxie-bit Jan 24, 2025
a785d35
add action of free up disk space (including move docker data directory)
helenxie-bit Jan 24, 2025
865379e
delete unnecessary checks and update the part of fetching pod descrip…
helenxie-bit Jan 24, 2025
d1ea629
delete fetching pod logs
helenxie-bit Jan 25, 2025
5e2e44f
add blank line at the end of free-up-disk-space yaml file
helenxie-bit Jan 27, 2025
982e268
update experiment name
helenxie-bit Jan 27, 2025
55c404d
update test function name to be consistent with experiment name
helenxie-bit Jan 27, 2025
check the logs of container 'pytorch' and 'storage_initializer'
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
helenxie-bit committed Sep 20, 2024
commit 4ae11edbe725c52005587091b39e3f84816641fb
1 change: 0 additions & 1 deletion .github/workflows/e2e-test-tune-api.yaml
@@ -52,7 +52,6 @@ jobs:
run: |
kubectl get pods -n default
POD_NAME=$(kubectl get pods -n default --no-headers -o custom-columns=":metadata.name" | grep tune-example-2 | grep master)
echo "Fetching logs for pod: $POD_NAME"
kubectl describe pod $POD_NAME -n default
kubectl top pods $POD_NAME
kubectl get events -n default | grep "tune-example-2"
16 changes: 14 additions & 2 deletions test/e2e/v1beta1/scripts/gh-actions/run-e2e-tune-api.py
@@ -31,12 +31,24 @@ def get_experiment_pods_logs(katib_client: KatibClient, exp_name: str, exp_names
         logging.info(f"Fetching logs for pod: {pod.metadata.name}")
         try:
             # Specify the container name when retrieving logs
-            pod_logs = v1.read_namespaced_pod_log(
+            pod_logs1 = v1.read_namespaced_pod_log(
                 name=pod.metadata.name,
                 namespace=exp_namespace,
                 container="metrics-logger-and-collector" # Specify the desired container
             )
-            logging.info(f"Logs for pod {pod.metadata.name}:\n{pod_logs}")
+            logging.info(f"Logs for pod {pod.metadata.name}:\n{pod_logs1}")
+            pod_logs2 = v1.read_namespaced_pod_log(
+                name=pod.metadata.name,
+                namespace=exp_namespace,
+                container="pytorch"
+            )
+            logging.info(f"Logs for pod {pod.metadata.name}:\n{pod_logs2}")
+            pod_logs3 = v1.read_namespaced_pod_log(
+                name=pod.metadata.name,
+                namespace=exp_namespace,
+                container="storage-initializer"
+            )
+            logging.info(f"Logs for pod {pod.metadata.name}:\n{pod_logs3}")
         except Exception as e:
             logging.error(f"Failed to get logs for pod {pod.metadata.name}: {str(e)}")
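
Note on the change above: the three `read_namespaced_pod_log` calls share one `try` block, so if the first container's logs cannot be read, the `pytorch` and `storage-initializer` logs are never fetched. A minimal sketch of one alternative follows, assuming `v1` is the same `CoreV1Api` client already used in the script. The names `fetch_container_logs` and `CONTAINERS` are hypothetical and not part of this PR; the container names are copied from the diff.

```python
import logging
from typing import Dict, Iterable, Optional

# Container names taken from the diff above.
CONTAINERS = ("metrics-logger-and-collector", "pytorch", "storage-initializer")


def fetch_container_logs(
    v1,
    pod_name: str,
    namespace: str,
    containers: Iterable[str] = CONTAINERS,
) -> Dict[str, Optional[str]]:
    """Read logs for each container, isolating failures per container.

    `v1` is expected to expose `read_namespaced_pod_log(name, namespace,
    container)`, as kubernetes.client.CoreV1Api does. Hypothetical helper.
    """
    logs: Dict[str, Optional[str]] = {}
    for container in containers:
        try:
            logs[container] = v1.read_namespaced_pod_log(
                name=pod_name,
                namespace=namespace,
                container=container,
            )
            logging.info(
                f"Logs for pod {pod_name}, container {container}:\n{logs[container]}"
            )
        except Exception as e:
            # Record the failure and keep going with the remaining containers.
            logs[container] = None
            logging.error(
                f"Failed to get logs for pod {pod_name}, container {container}: {e}"
            )
    return logs
```

Inside the existing loop this could be called as `fetch_container_logs(v1, pod.metadata.name, exp_namespace)`. Including the container name in each log message also makes the three outputs distinguishable, which the current messages are not.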