JavaMigration (SDFeedback) is a library for code migration with LLMs. It improves efficacy by giving the LLM feedback that is as specific as possible, an approach motivated by Teaching Large Language Models to Self-Debug.
1.1 MigrationBench: Datasets and Evaluation Framework
- 🤗 MigrationBench is a large-scale code migration benchmark dataset at the repository level, across multiple programming languages.
  - The current and initial release includes `java 8` repositories with the `maven` build system, as of May 2025.
  - See more details in 2. 🤗 MigrationBench Datasets
- MigrationBench is the evaluation framework to assess code migration success, from `java 8` to `17` or any other long-term support version.
1.2 JavaMigration (SDFeedback): Migration with LLMs
JavaMigration (SDFeedback) (the current package) performs code migration with LLMs as a baseline solution, and it relies on the MigrationBench package for the final evaluation.
- It builds an ECR image and then
- It runs both code migration and final evaluation with AWS Elastic Map Reduce Serverless (EMRS) in a scalable way.
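The feedback-driven idea described above can be pictured as a small loop: build, collect the concrete errors, hand them to the LLM, and retry. The sketch below is illustrative only. `self_debug`, `toy_build`, and `toy_llm` are hypothetical names, not part of the JavaMigration (SDFeedback) API, and the toy stubs stand in for a real Maven build and a real LLM call.

```python
from typing import Callable, List, Tuple

# Hypothetical type aliases for illustration; not the package's real interfaces.
BuildFn = Callable[[str], Tuple[bool, List[str]]]  # source -> (success, error messages)
LlmFn = Callable[[str, List[str]], str]            # (source, feedback) -> revised source


def self_debug(
    source: str, build: BuildFn, llm: LlmFn, max_iterations: int = 3
) -> Tuple[str, bool, int]:
    """Repeat build -> feedback -> revision until the build passes or the budget runs out.

    Returns the final source, whether it builds, and how many revisions were made.
    """
    for iteration in range(max_iterations):
        ok, errors = build(source)
        if ok:
            return source, True, iteration
        # Pass the specific build errors back instead of a generic "it failed".
        source = llm(source, errors)
    ok, _ = build(source)
    return source, ok, max_iterations


def toy_build(src: str) -> Tuple[bool, List[str]]:
    """Stand-in for a real build: fails while the old package name is present."""
    if "javax.annotation" in src:
        return False, ["package javax.annotation does not exist"]
    return True, []


def toy_llm(src: str, feedback: List[str]) -> str:
    """Stand-in for an LLM call: applies one hard-coded fix."""
    return src.replace("javax.annotation", "jakarta.annotation")


result = self_debug("import javax.annotation.PostConstruct;", toy_build, toy_llm)
```

With these stubs, one revision is enough for the toy build to pass. In a real setting the build step and the LLM call would be far more involved, and the iteration budget plays the role of `--max_iterations`.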
2. 🤗 MigrationBench Datasets
There are three datasets in 🤗 MigrationBench:
- All repositories included in the datasets are available on GitHub, under the `MIT` or `Apache-2.0` license.
| Index | Dataset | Size | Notes |
|---|---|---|---|
| 1 | 🤗 AmazonScience/migration-bench-java-full | 5,102 | Each repo has a test directory or at least one test case |
| 2 | 🤗 AmazonScience/migration-bench-java-selected | 300 | A subset of 🤗 migration-bench-java-full |
| 3 | 🤗 AmazonScience/migration-bench-java-utg | 4,814 | The unit test generation (utg) dataset, disjoint with 🤗 migration-bench-java-full |
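As a rough sketch of how the datasets in the table above might be pulled locally, the snippet below maps short names to the listed dataset IDs and delegates to `load_dataset` from the Hugging Face `datasets` library. The helper `load_migration_bench` and the short names are hypothetical, and nothing is assumed about split names or column schemas; inspect the returned object to see what it contains.

```python
from typing import Any, Callable, Optional

# Dataset IDs and sizes as listed in the table above.
DATASETS = {
    "full": ("AmazonScience/migration-bench-java-full", 5102),
    "selected": ("AmazonScience/migration-bench-java-selected", 300),
    "utg": ("AmazonScience/migration-bench-java-utg", 4814),
}


def load_migration_bench(
    name: str, loader: Optional[Callable[[str], Any]] = None
) -> Any:
    """Load one of the three datasets by short name (hypothetical helper).

    `loader` is injectable so the function can be exercised without network
    access; by default it uses `datasets.load_dataset`.
    """
    repo_id, _expected_size = DATASETS[name]
    if loader is None:
        # Imported lazily; requires `pip install datasets` and network access.
        from datasets import load_dataset

        loader = load_dataset
    return loader(repo_id)
```

Calling `load_migration_bench("selected")` downloads the 300-repository subset, which is a reasonable starting point before moving to the full dataset.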
We support running code migration for MigrationBench in two modes:
- Single job mode: For a single repository and
- Batch job mode: For multiple repositories with EMRS
- TL;DR: To run batch mode, one can skip to 3.2.2 EMRS Run directly.
To get started with LLM-based code migration from `java 8` to `17`,
under either minimal migration or maximal migration
(see the arXiv paper for the definitions):
Verify you have java 17, maven 3.9.6 and conda (optional) locally:
# java
~ $ java --version
openjdk 17.0.15 2025-04-15 LTS
OpenJDK Runtime Environment Corretto-17.0.15.6.1 (build 17.0.15+6-LTS)
OpenJDK 64-Bit Server VM Corretto-17.0.15.6.1 (build 17.0.15+6-LTS, mixed mode, sharing)
# maven
~ $ mvn --version
Apache Maven 3.9.6 (bc0240f3c744dd6b6ec2920b3cd08dcc295161ae)
Maven home: /usr/local/bin/apache-maven-3.9.6
Java version: 17.0.15, vendor: Amazon.com Inc., runtime: /usr/lib/jvm/java-17-amazon-corretto.x86_64
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "5.10.236-208.928.amzn2int.x86_64", arch: "amd64", family: "unix"
If you haven't done it yet, follow the instructions in MigrationBench to install Maven.
# conda (Optional)
$ conda --version
conda 25.1.1
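The manual checks above can also be scripted. The following is a minimal sketch, assuming the output formats shown in the sample session; the helper names are hypothetical, and other JDK or Maven distributions may print their first line differently.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Extract the major version from the first line of `java --version`."""
    match = re.search(r"^\S+\s+(\d+)(?:\.\d+)*", version_output.strip())
    return int(match.group(1)) if match else None


def parse_maven_version(version_output: str) -> Optional[str]:
    """Extract `X.Y.Z` from the `Apache Maven X.Y.Z (...)` line."""
    match = re.search(r"Apache Maven (\d+\.\d+\.\d+)", version_output)
    return match.group(1) if match else None


def tool_output(*argv: str) -> Optional[str]:
    """Run a command and return its stdout, or None if the tool is not on PATH."""
    if shutil.which(argv[0]) is None:
        return None
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    return result.stdout


def check_prerequisites() -> dict:
    """Report the detected java major version and maven version (None if missing)."""
    java = tool_output("java", "--version")
    mvn = tool_output("mvn", "--version")
    return {
        "java_major": parse_java_major(java) if java else None,
        "maven": parse_maven_version(mvn) if mvn else None,
    }
```

Calling `check_prerequisites()` on a machine set up as described should report java major version 17 and a maven version string.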
3.1.2 Install JavaMigration (SDFeedback)
Option A: Using uv (Recommended)
git clone https://github.com/amazon-science/JavaMigration.git
cd JavaMigration/self_debug
# Create and activate virtual environment with uv
uv venv --python 3.9
source .venv/bin/activate
# Install package (MigrationBench is installed automatically as a dependency)
uv pip install -e .
# Or with dev dependencies
uv pip install -e ".[dev]"
Option B: Using conda
git clone https://github.com/amazon-science/JavaMigration.git
cd JavaMigration/self_debug
# Optional: create a conda env
# conda create -n sd-feedback python=3.9
# conda activate sd-feedback
# Install package (MigrationBench is installed automatically as a dependency)
pip install -r requirements.txt -e .
# conda deactivate
To run code migration for a single repository:
# cd .../JavaMigration/self_debug/
cd src/self_debug/
# An explicit `--max_iterations` overrides the value in the `config_file`
python run_self_debugging.py --config_file configs/java_config.pbtxt # --max_iterations 3
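When the single-repository run needs to be launched from another script, the command line above can be assembled programmatically. This is only a sketch: `build_single_run_command` is a hypothetical helper, and it uses just the two flags shown above (`--config_file` and `--max_iterations`).

```python
from typing import List, Optional


def build_single_run_command(
    config_file: str = "configs/java_config.pbtxt",
    max_iterations: Optional[int] = None,
) -> List[str]:
    """Assemble the argv list for `run_self_debugging.py` (hypothetical helper)."""
    cmd = ["python", "run_self_debugging.py", "--config_file", config_file]
    if max_iterations is not None:
        # Only pass the flag when set, so the config file value is used otherwise.
        cmd += ["--max_iterations", str(max_iterations)]
    return cmd


# To launch it from JavaMigration/self_debug/, one could run:
#   import subprocess
#   subprocess.run(build_single_run_command(max_iterations=3),
#                  cwd="src/self_debug", check=True)
```

Passing an argv list rather than a shell string avoids quoting problems if the config path contains spaces.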
To run code migration in batch mode for multiple repositories,
one can run it either locally or through EMRS.
TL;DR: A local run of a batch job is typically for debugging and integration testing, and it is NOT recommended.
See relevant spark scripts for reference:
- `src/self_debug/batch/spark_build.py`
- `src/self_debug/batch/spark_debug.py`
Before submitting a job to EMRS, make sure you have the following ready:
- Set up IAM roles, network, security groups, etc. correctly
- Set up ECR repository
- Set up SES (optional)
- Build an ECR image
# cd .../JavaMigration/self_debug/
cd src/self_debug/container
# To build ECR image: 552793110740.dkr.ecr.us-east-1.amazonaws.com/$USER:java
./image.sh java $USER 1 docker/java.Dockerfile # 999999999999.dkr.ecr.us-west-2.amazonaws.com
- Submit a spark job to EMRS
Note that security keys might be subject to a 12h timeout.
# cd .../JavaMigration/self_debug/
cd src/self_debug/batch
# Update config file as needed for `emrs.py`, e.g. use the right ECR image in step `#1`
CONFIG=...
export APPLICATION=emrs-dbg-{user}--{date}--run00
export SCRIPT=debugger
python emrs.py --config_file=$CONFIG --application=$APPLICATION --script=$SCRIPT --user=$USER # --dry_run=1
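The submission step above can be wrapped in a small helper that fills in the application name template and assembles the `emrs.py` arguments. This is a sketch under assumptions: both function names are hypothetical, the `YYYYMMDD` date format is a guess (the template only shows `{date}`), and the config path is left for the caller to supply.

```python
from datetime import date
from typing import List


def application_name(user: str, day: date, run: int = 0) -> str:
    """Fill in the `emrs-dbg-{user}--{date}--runNN` template.

    The YYYYMMDD date format is an assumption; the source only shows `{date}`.
    """
    return f"emrs-dbg-{user}--{day.strftime('%Y%m%d')}--run{run:02d}"


def build_emrs_command(
    config: str,
    application: str,
    user: str,
    script: str = "debugger",
    dry_run: bool = False,
) -> List[str]:
    """Assemble the argv list for `emrs.py` using only the flags shown above."""
    cmd = [
        "python",
        "emrs.py",
        f"--config_file={config}",
        f"--application={application}",
        f"--script={script}",
        f"--user={user}",
    ]
    if dry_run:
        cmd.append("--dry_run=1")
    return cmd
```

Building the command with `dry_run=True` first is a cheap way to review the arguments before a real submission.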
@misc{liu2025migrationbenchrepositorylevelcodemigration,
title={MigrationBench: Repository-Level Code Migration Benchmark from Java 8},
author={Linbo Liu and Xinle Liu and Qiang Zhou and Lin Chen and Yihan Liu and Hoan Nguyen and Behrooz Omidvar-Tehrani and Xi Shen and Jun Huan and Omer Tripp and Anoop Deoras},
year={2025},
eprint={2505.09569},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2505.09569},
}