
# GSoC 2024: Summary of LLM Hyperparameter Optimization API Project #154

Open · wants to merge 13 commits into master
Conversation

@helenxie-bit:

This PR adds a detailed summary of my GSoC 2024 Project 4: Developing the LLM Hyperparameter Optimization API in Kubeflow's Katib. It highlights the motivation, goals, my contributions, and key lessons learned from the project.

@helenxie-bit (Author):

Ref: kubeflow/katib#2339

@helenxie-bit (Author):

Please review when you have time and any suggestions are welcome! Thanks! @andreyvelich @johnugeorge @terrytangyuan

Signed-off-by: helenxie-bit <[email protected]>
@andreyvelich (Member) left a comment:

Thank you for working on this @helenxie-bit, and sorry for the late reply!

/assign @varodrig @hbelmiro @franciscojavierarceo @kubeflow/wg-training-leads @Electronic-Waste
Can you please help us with the review so we can merge this great blog post?

@franciscojavierarceo (Contributor):

This is awesome! We'll make sure to review these sooner going forward :)

@franciscojavierarceo (Contributor) left a comment:

/lgtm /approve


Hyperparameter optimization is a crucial but time-consuming task in fine-tuning machine learning models, especially for LLMs that involve billions of parameters. This API aims to streamline this process by abstracting the complexity of Kubernetes infrastructure, enabling data scientists to focus on model performance instead of system configuration.

![Design of API](../images/2024-09-19-gsoc-2024-llm-hyperparameter-optimization-api/design_tune_api.jpg)
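To make concrete what the API automates, here is a minimal, self-contained random-search sketch in plain Python. The search space, parameter names, and mock objective below are illustrative assumptions, not the Katib API; in Katib, each trial would run as a distributed Kubernetes job rather than a local function call.

```python
import random

# Illustrative search space for LLM fine-tuning hyperparameters (assumed values).
SEARCH_SPACE = {
    "learning_rate": (1e-5, 5e-5),        # continuous range
    "per_device_batch_size": [4, 8, 16],  # categorical choices
    "lora_rank": [4, 8, 16],
}

def sample_trial(rng):
    """Draw one hyperparameter configuration from the search space."""
    lo, hi = SEARCH_SPACE["learning_rate"]
    return {
        "learning_rate": rng.uniform(lo, hi),
        "per_device_batch_size": rng.choice(SEARCH_SPACE["per_device_batch_size"]),
        "lora_rank": rng.choice(SEARCH_SPACE["lora_rank"]),
    }

def run_trial(config):
    """Stand-in for a fine-tuning job; returns a mock validation loss.
    With the Katib API, each trial would instead launch distributed workers."""
    return abs(config["learning_rate"] - 3e-5) * 1e4 + 1.0 / config["lora_rank"]

def random_search(n_trials=12, seed=0):
    """Run n_trials random configurations and keep the best (lowest loss)."""
    rng = random.Random(seed)
    results = [(run_trial(c), c) for c in (sample_trial(rng) for _ in range(n_trials))]
    return min(results, key=lambda t: t[0])

if __name__ == "__main__":
    loss, config = random_search()
    print(f"best loss={loss:.4f} config={config}")
```

The API described in this post performs this trial-sampling loop on Kubernetes, so users declare only the search space and objective instead of writing the loop and the infrastructure around it.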
Member:

It would be nice to share a little bit about the feature and why it is useful for Kubeflow Katib end-users.
Maybe we can take something from your proposal or documentation PR: kubeflow/website#3952

Author:

@andreyvelich I've added another paragraph explaining the features of this API—hope it's clear! Please take a look when you have time.
I also included a link to the user guide, but since it hasn't been merged yet, I'm unsure how to link it properly. The link I'm using now seems to be temporary. Could you provide instructions on how to link it?

@andreyvelich (Member), Mar 7, 2025:

@helenxie-bit @mahdikhashan Should we just merge this website PR and you can address my remaining comments in the followup PR: kubeflow/website#3952 (comment) ?

Author:

Sure, that sounds great.

Member:

@helenxie-bit Please can you create an issue in kubeflow/katib to track followup updates in the website PR?

Author:

@andreyvelich Sure! I've created an issue. Please have a look.


Member:

Let's also cross-reference the docs for this feature, since we will merge this PR soon: kubeflow/website#3952

@terrytangyuan (Member) left a comment:

Let's get this one merged soon


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: franciscojavierarceo, terrytangyuan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [franciscojavierarceo,terrytangyuan]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@franciscojavierarceo (Contributor):

> Let's get this one merged soon

+1. I'd like us to get a GenAI page as soon as possible. :)

I'm happy to cut the draft PR.

@varodrig left a comment:

@helenxie-bit Fantastic blog! I enjoyed reading it.
Thank you for working on this and I'm so glad you had such a great experience.

I added a few suggestions and recommendations.


## Motivation

The rapid advancements and rising popularity of LLMs, such as GPT and BERT, have created a growing demand for efficient LLMOps in Kubernetes. To address this, we have developed a [train API](https://www.kubeflow.org/docs/components/training/user-guides/fine-tuning/) within the Training Python SDK, simplifying the process of fine-tuning LLMs using distributed PyTorchJob workers. However, hyperparameter optimization remains a crucial yet labor-intensive task for enhancing model performance.
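To illustrate why manual hyperparameter tuning is labor-intensive, consider a hypothetical grid over a few common fine-tuning hyperparameters (the names and values below are illustrative assumptions, not from the project). The number of full fine-tuning runs multiplies quickly, and for LLMs each run may itself be a multi-GPU distributed job:

```python
from itertools import product

# Hypothetical grid over common LLM fine-tuning hyperparameters.
grid = {
    "learning_rate": [1e-5, 3e-5, 5e-5],
    "batch_size": [8, 16],
    "lora_rank": [4, 8, 16],
    "warmup_ratio": [0.0, 0.1],
}

# Every combination is one full fine-tuning run.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))  # 3 * 2 * 3 * 2 = 36 runs
```

Automated search strategies in Katib, such as random search or Bayesian optimization, avoid exhausting a grid like this by sampling configurations instead of enumerating every combination.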

I have a suggestion about the link to the fine-tuning documentation: when I access that page, it says "Old Version: This page is about Kubeflow Training Operator V1; for the latest information check the Kubeflow Trainer V2 documentation." I'd add a disclaimer in your blog that the documentation is currently being updated for the new Trainer, but the functionality is still valid/current.

cc @andreyvelich for comments

Member:

This functionality is not part of Kubeflow Trainer V2, since we use other methods for Fine-Tuning.

Author:

Yeah, as Andrey mentioned, Kubeflow Trainer V2 is still a work in progress, and this API is not part of it. Do you think I should remove the link to avoid any confusion?

Member:

Let's keep it for legacy docs for now, it's ok

- **Stage 1**: Writing the project proposal and converting it into a Kubeflow Enhancement Proposal (KEP).
- **Stage 2**: Developing and implementing the high-level API.
- **Stage 3**: Implementing unit tests and end-to-end tests for the API.
- **Stage 4**: Creating documentation and presenting the work to the Kubeflow community.

Did you contribute to the API design as well? If you did, I'd include it as a separate stage.

Author:

Yeah, I contributed to the API design as well, and I see it as part of the work involved in writing the proposal. I've added it and refined the wording—please take a look when you have time!
