[Feat.] Refactor llm_inference/run.py to use ParallelInferenceManager with batch inference#59
Conversation
yl231
commented
Jan 16, 2026
- Replace sequential processing with parallel batch inference infrastructure
- Maintain compatibility with origin/main: same CLI interface and functionality
- Add --num-runs parameter support (default: 1) to match batch_inference.py
- Use ParallelInferenceManager for efficient parallel processing with workers
- Add README_PARALLEL.md documentation
- Replace sequential processing with parallel batch inference infrastructure - Maintain compatibility with origin/main: same CLI interface and functionality - Add --num-runs parameter support (default: 1) to match batch_inference.py - Use ParallelInferenceManager for efficient parallel processing with workers - Fix mypy type annotations: Tuple return types and Optional parameters - Add README_PARALLEL.md documentation
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a significant and well-executed refactoring of the LLM inference process, replacing the previous sequential logic with a robust parallel batch inference system. The new ParallelInferenceManager and batch_inference.py script are excellent additions that will greatly improve performance and maintainability. The changes are well-structured, and the new documentation is very helpful. My review includes suggestions to enhance cross-platform compatibility, simplify some of the concurrency logic for better clarity, and improve error handling and efficiency in a few areas.
- Replace sequential processing with parallel batch inference infrastructure - Maintain compatibility with origin/main: same CLI interface and functionality - Add --num-runs parameter support (default: 1) to match batch_inference.py - Use ParallelInferenceManager for efficient parallel processing with workers - Fix mypy type annotations: Tuple return types and Optional parameters - Add README_PARALLEL.md documentation Code quality improvements: - Replace fcntl with cross-platform filelock for Windows compatibility - Use threading.local() to cache ModelInference instances per thread - Remove redundant periodic cache consolidation (save_single_result handles it) - Remove os.chdir() in batch_inference.py, use absolute paths instead - Add logging for exceptions when converting model names (no silent failures) - Document clear_failed_entries as utility function (not auto-called)
- Call clear_failed_entries automatically before processing each model - Clear failed entries before processing so they can be retried - Update function documentation to reflect automatic usage - Log number of failed entries cleared for visibility
jiarong0907
left a comment
There was a problem hiding this comment.
lgtm. One comment: You can directly call it README since it appears in a different folder.