Commit 3407aee

update src reorg
1 parent b8dc579 commit 3407aee

20 files changed, +46 -66 lines changed

README-ja-JP.md

+1 -1
@@ -99,7 +99,7 @@ data = dict(

 On slurm, using 2 nodes and 16 cards, the command is as follows:
 ```bash
-$ srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python -m internlm.launcher.launch --config ./configs/7B_sft.py
+$ srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python -m internlm.launch.launcher --config ./configs/7B_sft.py
 ```

 On torch, using 1 node and 8 cards, the command is as follows:

README-zh-Hans.md

+1 -1
@@ -99,7 +99,7 @@ data = dict(

 In a slurm environment, with 2 nodes and 16 cards, the command to start training is as follows:
 ```bash
-$ srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python -m internlm.launcher.launch --config ./configs/7B_sft.py
+$ srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python -m internlm.launch.launcher --config ./configs/7B_sft.py
 ```

 In a torch environment, with 1 node and 8 cards, the command to start training is as follows:

README.md

+1 -1
@@ -99,7 +99,7 @@ Training can be started on slurm or torch distributed environment.

 On slurm, using 2 nodes and 16 cards, the command is as follows:
 ```bash
-$ srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python -m internlm.launcher.launch --config ./configs/7B_sft.py
+$ srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python -m internlm.launch.launcher --config ./configs/7B_sft.py
 ```

 On torch, using 1 node and 8 cards, the command is as follows:

doc/code-docs/source/example/20B_demo.rst

+1 -1
@@ -167,7 +167,7 @@

 .. code-block:: bash

-    srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python -m internlm.launcher.launch --config ./configs/20B_sft.py
+    srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python -m internlm.launch.launcher --config ./configs/20B_sft.py

 Training Results
 ----------------

doc/code-docs/source/example/7B_demo.rst

+1 -1
@@ -165,7 +165,7 @@

 .. code-block:: bash

-    srun -p internllm -N 1 -n 8 --ntasks-per-node=8 --gpus-per-task=1 python -m internlm.launcher.launch --config ./configs/7B_sft.py
+    srun -p internllm -N 1 -n 8 --ntasks-per-node=8 --gpus-per-task=1 python -m internlm.launch.launcher --config ./configs/7B_sft.py

 Training Results
 ----------------

doc/en/usage.md

+1 -1
@@ -407,7 +407,7 @@ After completing the data preparation and relevant training configurations menti
 If you want to start distributed training on slurm with 16 GPUs across multiple nodes, use the following command:

 ```bash
-$ srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python -m internlm.launcher.launch --config ./configs/7B_sft.py
+$ srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python -m internlm.launch.launcher --config ./configs/7B_sft.py
 ```

 If you want to start distributed training on torch with 8 GPUs on a single node, use the following command:

doc/usage.md

+1 -1
@@ -453,7 +453,7 @@ parallel = dict(

 To start a distributed run on slurm, the command for 16 cards across multiple nodes is as follows:
 ```bash
-$ srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python -m internlm.launcher.launch --config ./configs/7B_sft.py
+$ srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python -m internlm.launch.launcher --config ./configs/7B_sft.py
 ```

 To start a distributed run on torch, the command for 8 cards on a single node is as follows:

generate.py

+1 -1
@@ -15,7 +15,7 @@
 from tqdm import tqdm

 from internlm.accelerator import get_accelerator
-from internlm.apis.inference import SequenceGenerator
+from internlm.inference import SequenceGenerator
 from internlm.core.context import global_context as gpc
 from internlm.data import build_generation_loader_with_data_type
 from internlm.initialize import initialize_launcher

internlm/core/scheduler/base_scheduler.py

+1 -1
@@ -8,7 +8,7 @@

 import torch

-from internlm.apis import InferenceParams
+from internlm.inference import InferenceParams
 from internlm.core.engine import Engine


@@ -1,6 +1,9 @@
 from .inference_utils import InferenceParams, process_parallel_output
+from .inference import SequenceGenerator, batch_tokenize

 __all__ = [
     "InferenceParams",
     "process_parallel_output",
+    "SequenceGenerator",
+    "batch_tokenize",
 ]

internlm/apis/inference.py → internlm/inference/inference.py

+1 -22
@@ -3,33 +3,12 @@

 import torch
 import torch.nn.functional as F
-from torch import nn

-from internlm.apis import InferenceParams, process_parallel_output
+from internlm.inference import InferenceParams, process_parallel_output
 from internlm.core.context import ParallelMode  # noqa: E402
 from internlm.core.context import global_context as gpc  # noqa: E402
 from internlm.core.trainer import Trainer

-__all__ = ["SequenceGenerator"]
-
-
-def _get_model_device(model):
-    """
-    obtain the device of an nn.Module.model
-
-    Args:
-        model: nn.Module
-
-    Return: torch.device. if None, the parameters of this model is None.
-    """
-    assert isinstance(model, nn.Module)
-
-    parameters = list(model.parameters())
-    if len(parameters) == 0:
-        return None
-    else:
-        return parameters[0].device
-

 class SequenceGenerator:
     """
File renamed without changes.
File renamed without changes.

internlm/launcher/launch.py → internlm/launch/launcher.py

+1 -1
@@ -2,7 +2,7 @@
 # -*- encoding: utf-8 -*-

 from internlm.core.context import global_context as gpc
-from internlm.core.trainer_builder import TrainerBuilder
+from internlm.launch.trainer_builder import TrainerBuilder
 from internlm.data import (
     build_train_loader_with_data_type,
     build_valid_loader_with_data_type,
File renamed without changes.

internlm/monitor/monitor.py

+29 -1
@@ -1,3 +1,4 @@
+from datetime import datetime
 import fcntl
 import logging
 import os
@@ -15,7 +16,6 @@
 from internlm.monitor import send_feishu_msg_with_webhook
 from internlm.utils.common import SingletonMeta, set_env_var

-from .utils import get_job_key

 logger = logging.getLogger(__file__)
 internlm_accelerator = get_accelerator()
@@ -55,6 +55,34 @@ def execute_with_exception_handling(func, *args, **kwargs):
     return decorator


+def now_time():
+    return datetime.now().strftime("%b%d_%H-%M-%S")
+
+
+def get_job_id():
+    job_id = "none"
+    if os.getenv("SLURM_JOB_ID") is not None:
+        job_id = os.getenv("SLURM_JOB_ID")
+    elif os.getenv("KUBERNETES_POD_NAME") is not None:
+        job_id = os.getenv("KUBERNETES_POD_NAME").split("-")[0]
+    elif os.getenv("MLP_TASK_INSTANCE_ID") is not None:
+        job_id = os.getenv("MLP_TASK_ID")
+
+    return job_id
+
+
+def get_job_name():
+    job_name = f"unknown-{now_time()}"
+    if os.getenv("JOB_NAME") is not None:
+        job_name = os.getenv("JOB_NAME")
+
+    return job_name
+
+
+def get_job_key():
+    return f"{get_job_id()}_{get_job_name()}"
+
+
 def send_alert_message(address: str = None, title: str = None, message: str = None):
     """
     Send alert messages to the given Feishu webhook address in log rank.

internlm/monitor/utils.py

-30
This file was deleted.
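
The job-key helpers that this commit moves from `internlm/monitor/utils.py` into `internlm/monitor/monitor.py` can be exercised on their own. The following is a minimal sketch, not the repository's code: it passes the environment in as an explicit `env` mapping and accepts an optional `now` timestamp (both are assumptions made here so the behaviour can be checked without touching `os.environ` or the clock). It also falls back to `"none"` when `MLP_TASK_ID` is unset, whereas the code in the diff would return `None` in that case.

```python
import os
from datetime import datetime
from typing import Mapping, Optional


def get_job_id(env: Mapping[str, str]) -> str:
    """Pick a job id from the scheduler-specific variables, in the diff's order."""
    if env.get("SLURM_JOB_ID") is not None:
        return env["SLURM_JOB_ID"]
    if env.get("KUBERNETES_POD_NAME") is not None:
        # Keep only the part of the pod name before the first dash.
        return env["KUBERNETES_POD_NAME"].split("-")[0]
    if env.get("MLP_TASK_INSTANCE_ID") is not None:
        # The diff checks MLP_TASK_INSTANCE_ID but reads MLP_TASK_ID.
        # Assumption in this sketch: fall back to "none" if it is missing.
        return env.get("MLP_TASK_ID", "none")
    return "none"


def get_job_name(env: Mapping[str, str], now: Optional[datetime] = None) -> str:
    """Use JOB_NAME if set, otherwise a timestamped placeholder."""
    if env.get("JOB_NAME") is not None:
        return env["JOB_NAME"]
    stamp = (now or datetime.now()).strftime("%b%d_%H-%M-%S")
    return f"unknown-{stamp}"


def get_job_key(env: Mapping[str, str], now: Optional[datetime] = None) -> str:
    """Combine job id and job name into the '<id>_<name>' key."""
    return f"{get_job_id(env)}_{get_job_name(env, now)}"


if __name__ == "__main__":
    # Real usage would pass the process environment.
    print(get_job_key(os.environ))
```

With `{"SLURM_JOB_ID": "123", "JOB_NAME": "sft"}` as the environment, the key is `123_sft`; with an empty environment it starts with `none_unknown-` followed by the timestamp.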

tests/test_infer/test_generate.py

+1 -1
@@ -4,7 +4,7 @@
 import torch
 from sentencepiece import SentencePieceProcessor

-from internlm.apis.inference import SequenceGenerator, batch_tokenize
+from internlm.inference import SequenceGenerator, batch_tokenize
 from internlm.initialize import initialize_launcher  # noqa: E402
 from internlm.initialize.initialize_model import (
     initialize_model_and_parallel_communicator,

tests/test_infer/test_trainer_generate.py

+1 -1
@@ -3,7 +3,7 @@
 import pytest
 from sentencepiece import SentencePieceProcessor

-from internlm.apis.inference import SequenceGenerator, batch_tokenize
+from internlm.inference import SequenceGenerator, batch_tokenize
 from internlm.checkpoint import CheckpointManager  # noqa: E402
 from internlm.core.context import global_context as gpc  # noqa: E402
 from internlm.core.trainer import Trainer, TrainState  # noqa: E402

tools/load_internlm2_model.py

+1 -1
@@ -7,7 +7,7 @@

 import torch

-from internlm.apis.inference import SequenceGenerator
+from internlm.inference import SequenceGenerator
 from internlm.core.context import ParallelMode
 from internlm.core.context import global_context as gpc
 from internlm.initialize import initialize_launcher
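
Every import change above follows the same pattern: `internlm.apis.inference` becomes `internlm.inference`. For out-of-tree scripts that may need to run against checkouts on either side of this commit, one possible approach is a small fallback import. The helper below is a hypothetical sketch (the name `import_attr` is not part of InternLM); it relies only on the standard-library `importlib.import_module`.

```python
import importlib
from typing import Any, Sequence


def import_attr(candidates: Sequence[str], attr: str) -> Any:
    """Return `attr` from the first module in `candidates` that provides it."""
    for module_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # Module path does not exist in this checkout; try the next one.
            continue
        if hasattr(module, attr):
            return getattr(module, attr)
    raise ImportError(f"{attr!r} not found in any of: {', '.join(candidates)}")


# Intended usage (commented out so the sketch runs without internlm installed):
# SequenceGenerator = import_attr(
#     ["internlm.inference", "internlm.apis.inference"], "SequenceGenerator"
# )
```

Listing the new path first means a post-commit checkout never touches the old module name.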
