Commit 841125e

update src reorg
1 parent b8dc579 commit 841125e

40 files changed, +331 -390 lines changed
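
The commit is a source-tree reorganization. As a quick orientation before the per-file diffs, the sketch below collects the import-path moves visible in the hunks of this commit; it summarizes only what is shown here and is not an exhaustive migration guide.

```python
# Module moves visible in this commit's hunks (old path -> new path):
#   internlm.launcher.launch            -> internlm.launch.launcher      (training entrypoint)
#   internlm.apis.inference             -> internlm.inference            (SequenceGenerator)
#   internlm.apis (InferenceParams)     -> internlm.inference
#   internlm.core.trainer (Trainer)     -> internlm.train.trainer
#   internlm.core.trainer (TrainState)  -> internlm.train.train_state
#   internlm.data.train_state           -> deleted (get_train_state replaced by TrainState(...))

# Downstream code would update its imports along these lines:
from internlm.inference import InferenceParams, SequenceGenerator
from internlm.train.train_state import TrainState
from internlm.train.trainer import Trainer
```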

README-ja-JP.md
+1 -1

@@ -99,7 +99,7 @@ data = dict(
 
 When using 2 nodes and 16 cards in a Slurm environment, the command is as follows:
 ```bash
-$ srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python -m internlm.launcher.launch --config ./configs/7B_sft.py
+$ srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python -m internlm.launch.launcher --config ./configs/7B_sft.py
 ```
 
 When running on 1 node with 8 cards using torch, the command is as follows:

README-zh-Hans.md
+1 -1

@@ -99,7 +99,7 @@ data = dict(
 
 In a slurm environment, with 2 nodes and 16 cards, the command to start training is as follows:
 ```bash
-$ srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python -m internlm.launcher.launch --config ./configs/7B_sft.py
+$ srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python -m internlm.launch.launcher --config ./configs/7B_sft.py
 ```
 
 In a torch environment, with a single node and 8 cards, the command to start training is as follows:

README.md
+1 -1

@@ -99,7 +99,7 @@ Training can be started on slurm or torch distributed environment.
 
 On slurm, using 2 nodes and 16 cards, the command is as follows:
 ```bash
-$ srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python -m internlm.launcher.launch --config ./configs/7B_sft.py
+$ srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python -m internlm.launch.launcher --config ./configs/7B_sft.py
 ```
 
 On torch, using 1 node and 8 cards, the command is as follows:

ci_scripts/train/load_ckpt.sh
+1 -1

@@ -22,7 +22,7 @@ if [[ ! -f ${file} ]]; then
     exit_code=$(($exit_code + 1))
 fi
 
-srun -p ${SLURM_PARTITION} --kill-on-bad-exit=1 --exclusive --job-name=$2 -n 8 --ntasks-per-node=8 --gpus-per-task=1 python internlm/launcher/launch.py --config ${file}
+srun -p ${SLURM_PARTITION} --kill-on-bad-exit=1 --exclusive --job-name=$2 -n 8 --ntasks-per-node=8 --gpus-per-task=1 python internlm/launch/launcher.py --config ${file}
 [[ $? -ne 0 ]] && { echo "test slurm training failed."; exit_code=$(($exit_code + 1)); }

ci_scripts/train/slurm_train.sh
+1 -1

@@ -22,7 +22,7 @@ if [[ -d ${CKPTS20_PATH} ]]; then
     fi
 fi
 
-srun -p ${SLURM_PARTITION} --kill-on-bad-exit=1 --exclusive --job-name=$1 -n 8 --ntasks-per-node=8 --gpus-per-task=1 python internlm/launcher/launch.py --config ./ci_scripts/train/ci_7B_sft.py
+srun -p ${SLURM_PARTITION} --kill-on-bad-exit=1 --exclusive --job-name=$1 -n 8 --ntasks-per-node=8 --gpus-per-task=1 python internlm/launch/launcher.py --config ./ci_scripts/train/ci_7B_sft.py
 [[ $? -ne 0 ]] && { echo "test slurm training failed."; exit_code=$(($exit_code + 1)); }
 
 num=$(num_files "${CKPTS20_OUTPUT}")

ci_scripts/train/torchrun.sh
+1 -1

@@ -22,7 +22,7 @@ if [[ -d ${CKPTS20_PATH} ]]; then
     fi
 fi
 
-srun -p ${SLURM_PARTITION} --kill-on-bad-exit=1 --exclusive --job-name=$1 -N 1 torchrun --nnodes=1 --nproc_per_node=8 --master_port=29501 internlm/launcher/launch.py --config ./ci_scripts/train/ci_7B_sft.py --launcher torch
+srun -p ${SLURM_PARTITION} --kill-on-bad-exit=1 --exclusive --job-name=$1 -N 1 torchrun --nnodes=1 --nproc_per_node=8 --master_port=29501 internlm/launch/launcher.py --config ./ci_scripts/train/ci_7B_sft.py --launcher torch
 [[ $? -ne 0 ]] && { echo "test torch training failed."; exit_code=$(($exit_code + 1)); }
 
 num=$(num_files "${CKPTS_OUTPUT}")

doc/code-docs/source/example/20B_demo.rst
+1 -1

@@ -167,7 +167,7 @@
 
 .. code-block:: bash
 
-    srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python -m internlm.launcher.launch --config ./configs/20B_sft.py
+    srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python -m internlm.launch.launcher --config ./configs/20B_sft.py
 
 Training Results
 ----------------

doc/code-docs/source/example/7B_demo.rst
+1 -1

@@ -165,7 +165,7 @@
 
 .. code-block:: bash
 
-    srun -p internllm -N 1 -n 8 --ntasks-per-node=8 --gpus-per-task=1 python -m internlm.launcher.launch --config ./configs/7B_sft.py
+    srun -p internllm -N 1 -n 8 --ntasks-per-node=8 --gpus-per-task=1 python -m internlm.launch.launcher --config ./configs/7B_sft.py
 
 Training Results
 ----------------

doc/en/usage.md
+1 -1

@@ -407,7 +407,7 @@ After completing the data preparation and relevant training configurations menti
 If you want to start distributed training on slurm with 16 GPUs across multiple nodes, use the following command:
 
 ```bash
-$ srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python -m internlm.launcher.launch --config ./configs/7B_sft.py
+$ srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python -m internlm.launch.launcher --config ./configs/7B_sft.py
 ```
 
 If you want to start distributed training on torch with 8 GPUs on a single node, use the following command:

doc/usage.md
+1 -1

@@ -453,7 +453,7 @@ parallel = dict(
 
 To launch the distributed environment on slurm with 16 cards across multiple nodes, the command is as follows:
 ```bash
-$ srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python -m internlm.launcher.launch --config ./configs/7B_sft.py
+$ srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python -m internlm.launch.launcher --config ./configs/7B_sft.py
 ```
 
 To launch the distributed environment on torch with 8 cards on a single node, the command is as follows:

generate.py
+1 -1

@@ -15,7 +15,7 @@
 from tqdm import tqdm
 
 from internlm.accelerator import get_accelerator
-from internlm.apis.inference import SequenceGenerator
+from internlm.inference import SequenceGenerator
 from internlm.core.context import global_context as gpc
 from internlm.data import build_generation_loader_with_data_type
 from internlm.initialize import initialize_launcher

internlm/checkpoint/checkpoint_manager.py
+1 -1

@@ -10,7 +10,7 @@
 from internlm.accelerator import get_accelerator
 from internlm.core.context import ParallelMode
 from internlm.core.context import global_context as gpc
-from internlm.core.trainer import TrainState
+from internlm.train.train_state import TrainState
 from internlm.model.model_implementations.registry import model_initializer
 from internlm.model.model_implementations.transformers.base_model import (
     BaseTransformerModel,

internlm/checkpoint/components.py
+1 -1

@@ -8,7 +8,7 @@
 from internlm.accelerator import get_accelerator
 from internlm.core.context import ParallelMode
 from internlm.core.context import global_context as gpc
-from internlm.core.trainer import TrainState
+from internlm.train.train_state import TrainState
 from internlm.model.model_ops.moe import MoE
 from internlm.solver.optimizer import HybridZeroOptimizer, HybridZeroOptimizer_v2
 from internlm.utils.common import get_current_device
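
Both checkpoint modules now import TrainState from its new home. For context, the launcher hunk later in this commit constructs the relocated class directly; the snippet below is a minimal hedged sketch of that pattern (the constructor arguments are taken from that hunk, not from separate documentation).

```python
from torch.utils.data import DataLoader

from internlm.core.context import global_context as gpc
from internlm.train.train_state import TrainState


def build_train_state(train_dl: DataLoader) -> TrainState:
    # Replaces the deleted internlm.data.train_state.get_train_state(train_dl) helper:
    # the relocated TrainState is constructed from the global config and the
    # loader's batch sampler, as shown in the internlm/launch/launcher.py hunk.
    return TrainState(gpc.config, train_dl.batch_sampler)
```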

internlm/core/__init__.py
+1 -1

@@ -1,6 +1,6 @@
 from .engine import Engine
 from .naive_amp import NaiveAMPModel
-from .trainer import Trainer
+from ..train.trainer import Trainer
 
 __all__ = [
     "NaiveAMPModel",

internlm/core/scheduler/base_scheduler.py
+1 -1

@@ -8,8 +8,8 @@
 
 import torch
 
-from internlm.apis import InferenceParams
 from internlm.core.engine import Engine
+from internlm.inference import InferenceParams
 
 
 class BaseScheduler(ABC):

internlm/data/train_state.py

-19
This file was deleted.

@@ -1,6 +1,9 @@
+from .inference import SequenceGenerator, batch_tokenize
 from .inference_utils import InferenceParams, process_parallel_output
 
 __all__ = [
     "InferenceParams",
     "process_parallel_output",
+    "SequenceGenerator",
+    "batch_tokenize",
 ]
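
This hunk promotes the generation helpers to package-level re-exports. Assuming it is the `__init__.py` of the new `internlm.inference` package (the imports added elsewhere in this commit resolve through it), consumers can pull everything from one place:

```python
# These names are re-exported at package level by the hunk above (assumption:
# the hunk belongs to the new internlm.inference package's __init__.py).
from internlm.inference import (
    InferenceParams,
    SequenceGenerator,
    batch_tokenize,
    process_parallel_output,
)
```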

internlm/apis/inference.py → internlm/inference/inference.py
+2 -23

@@ -3,32 +3,11 @@
 
 import torch
 import torch.nn.functional as F
-from torch import nn
 
-from internlm.apis import InferenceParams, process_parallel_output
 from internlm.core.context import ParallelMode # noqa: E402
 from internlm.core.context import global_context as gpc # noqa: E402
-from internlm.core.trainer import Trainer
-
-__all__ = ["SequenceGenerator"]
-
-
-def _get_model_device(model):
-    """
-    obtain the device of an nn.Module.model
-
-    Args:
-        model: nn.Module
-
-    Return: torch.device. if None, the parameters of this model is None.
-    """
-    assert isinstance(model, nn.Module)
-
-    parameters = list(model.parameters())
-    if len(parameters) == 0:
-        return None
-    else:
-        return parameters[0].device
+from internlm.train.trainer import Trainer
+from internlm.inference import InferenceParams, process_parallel_output
 
 
 class SequenceGenerator:
File renamed without changes.

internlm/initialize/initialize_trainer.py
+1 -1

@@ -24,7 +24,7 @@
     ZeroBubblePipelineVShapeScheduler,
 )
 from internlm.core.scheduler.pipeline_scheduler_1f1b import get_tensor_shape
-from internlm.core.trainer import Trainer
+from internlm.train.trainer import Trainer
 from internlm.data.utils import packed_data_normalizer, unpack_data
 from internlm.solver.optimizer import BaseOptimizer
 from internlm.solver.schedulers import Beta2Scheduler
File renamed without changes.

internlm/core/trainer_builder.py → internlm/launch/launcher.py
+51 -33

@@ -1,43 +1,33 @@
-import gc
-import logging
-import time
-from functools import partial
-from typing import Dict, List, Optional, Union
+#!/usr/bin/env python
+# -*- encoding: utf-8 -*-
 
-import torch
+from internlm.checkpoint.checkpoint_manager import CheckpointManager
 import torch.distributed as dist
 from torch.utils.data import DataLoader
-
-from internlm.checkpoint.checkpoint_manager import CheckpointManager
-from internlm.core.context import ParallelMode
-from internlm.core.context import global_context as gpc
+from functools import partial
+from typing import Dict, List
+from internlm.core.context import ParallelMode, global_context as gpc
 from internlm.core.parallel.comm import initialize_offload_manager
-from internlm.core.trainer import (
-    Trainer,
-    get_scheduler_hooks,
-    load_new_batch,
-    record_current_batch_training_metrics,
+from internlm.train.utils import get_scheduler_hooks, load_new_batch, record_current_batch_training_metrics
+from internlm.data import (
+    build_train_loader_with_data_type,
+    build_valid_loader_with_data_type,
 )
 from internlm.data.streaming.utils import streaming_simple_resume
-from internlm.data.train_state import get_train_state
 from internlm.eval import evaluate_on_val_dls
-from internlm.initialize import initialize_trainer
-from internlm.initialize.initialize_model import (
-    initialize_model_and_parallel_communicator,
-)
+from internlm.initialize import initialize_launcher, initialize_trainer
+from internlm.initialize.initialize_model import initialize_model_and_parallel_communicator
 from internlm.initialize.initialize_optimizer import initialize_optimizer
 from internlm.initialize.initialize_profiler import initialize_llm_profile
+from internlm.launch.trainer_builder import logger
+from internlm.model.model_implementations.builder import create_model
+from internlm.model.model_implementations.registry import register_model_initializer
 from internlm.model.model_ops.losses.ce_loss import InternLoss
 from internlm.model.model_ops.metrics import AccPerplex
-from internlm.monitor import send_alert_message
-from internlm.utils.common import (
-    BatchSkipper,
-    check_cuda_env,
-    enable_pytorch_expandable_segments,
-    get_current_device,
-    get_megatron_flops,
-    launch_time,
-)
+from internlm.monitor import internevo_monitor, send_alert_message
+from internlm.train.train_state import TrainState
+from internlm.train.trainer import Trainer
+from internlm.utils.common import BatchSkipper, check_cuda_env, enable_pytorch_expandable_segments, get_current_device, get_megatron_flops, launch_time, parse_args
 from internlm.utils.gputest import empty_cache_and_diag
 from internlm.utils.logger import get_logger
 from internlm.utils.megatron_timers import megatron_timer as timer

@@ -46,9 +36,6 @@
 from internlm.utils.utils import DataType
 from internlm.utils.writer import Writer
 
-# global llm logger
-logger = logging.getLogger(__file__)
-
 
 class TrainerBuilder(Trainer):
     """

@@ -117,7 +104,7 @@ def __init__(
         initialize_offload_manager(gpc.config.get("selective_checkpoint_offload", False))
 
         # initialize train state
-        train_state = get_train_state(train_dl)
+        train_state = TrainState(gpc.config, train_dl.batch_sampler)
 
         # initialize optimizer
         optimizer, beta2_scheduler, lr_scheduler = initialize_optimizer(model, isp_communicator)

@@ -385,3 +372,34 @@ def _update_profilers(self, batch_count: int, prof):
             self.memory_profiler.step()
         if batch_count % 2 == 0:
             prof.step()
+
+
+@internevo_monitor(feishu_alert=True, clean_run=True)
+def main(args):
+    # initialize model
+    register_model_initializer()
+    model = create_model()
+
+    # initialize train dataloader
+    train_dl, dataset_types = build_train_loader_with_data_type()
+
+    # initialize validation dataloader
+    val_dls = build_valid_loader_with_data_type()
+
+    # build trainer
+    merged_args = {**vars(args), "dataset_types": dataset_types}
+    trainer = TrainerBuilder(model, train_dl, val_dls, **merged_args)
+
+    # training
+    trainer.fit()
+
+
+if __name__ == "__main__":
+    args = parse_args()
+
+    # Initialize distributed environment
+    initialize_launcher(config=args.config, launcher=args.launcher, distributed_port=args.port, seed=args.seed)
+    assert hasattr(gpc, "config") and gpc.config is not None
+
+    # Run the main function with parsed arguments
+    main(args)
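
With the entrypoint relocated, the README and CI hunks in this commit invoke it as `python -m internlm.launch.launcher` under slurm, or via `torchrun ... internlm/launch/launcher.py --launcher torch`. The sketch below is a hypothetical programmatic driver assembled only from names visible in the hunk above; it mirrors the new `main()` flow rather than documenting a separate public API.

```python
# Hypothetical driver mirroring the new __main__ block of internlm/launch/launcher.py;
# parse_args and initialize_launcher's keyword names are taken from the hunk above.
from internlm.core.context import global_context as gpc
from internlm.initialize import initialize_launcher
from internlm.launch.launcher import main
from internlm.utils.common import parse_args

if __name__ == "__main__":
    args = parse_args()

    # Set up the distributed environment exactly as the relocated launcher does.
    initialize_launcher(config=args.config, launcher=args.launcher,
                        distributed_port=args.port, seed=args.seed)
    assert hasattr(gpc, "config") and gpc.config is not None

    # Build the model, dataloaders, and TrainerBuilder, then run training.
    main(args)
```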

internlm/launcher/launch.py

-45
This file was deleted.
