DTensor implementation #3117
                  
                    
YuliangLiu0306 started this conversation in Development | Core
            Replies: 1 comment
> What's the status of this work, and can I participate in it? 😊
  
Proposal
We have investigated the current DTensor implementations in PyTorch and TensorFlow. Inspired by them, we propose a new design for DTensor.
Motivation
Supply a uniform way of checkpointing: automatic parallelism and other flexible distributed training paradigms need to save and load checkpoints in a flexible, fine-grained way.
DTensor serves as a tensor abstraction carrying distributed information. It is a key component to support both SPMD automatic parallelism and Gemini.
Refactor related components such as CommSpec, ShardingSpec, LayoutConverter, and DeviceMesh. These components were tightly coupled with the automatic parallelism feature, which makes them hard to reuse elsewhere.
Design
We design several components for API refactoring.

Possible class definition (pseudo-code)
DTensor
Layout
ShardingSpec
CommSpec
LayoutConverter
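To make the relationships between these components concrete, here is a minimal, hypothetical sketch in plain Python. The field names (`dim_partition`, `pattern`, etc.) and the resharding rules inside `LayoutConverter` are illustrative assumptions, not the actual proposed API.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class DeviceMesh:
    """A logical multi-dimensional grid of devices, e.g. (2, 4) for 8 GPUs."""
    shape: Tuple[int, ...]

@dataclass(frozen=True)
class ShardingSpec:
    """Maps tensor dims to mesh dims; unmapped tensor dims are replicated.
    (Field name `dim_partition` is an assumption for illustration.)"""
    dim_partition: Dict[int, int]  # tensor dim -> mesh dim

@dataclass(frozen=True)
class CommSpec:
    """Describes one collective needed to change layouts (hypothetical)."""
    pattern: str      # e.g. "all-gather", "split"
    tensor_dim: int

@dataclass(frozen=True)
class Layout:
    """Distributed layout: a device mesh plus a sharding spec."""
    mesh: DeviceMesh
    spec: ShardingSpec

class DTensor:
    """Tensor abstraction carrying distributed information (metadata only here)."""
    def __init__(self, global_shape: Tuple[int, ...], layout: Layout):
        self.global_shape = tuple(global_shape)
        self.layout = layout

    @property
    def local_shape(self) -> Tuple[int, ...]:
        # Each sharded dim is divided by the size of its mesh dim.
        out = []
        for i, d in enumerate(self.global_shape):
            mesh_dim = self.layout.spec.dim_partition.get(i)
            out.append(d if mesh_dim is None else d // self.layout.mesh.shape[mesh_dim])
        return tuple(out)

class LayoutConverter:
    """Plans the collectives needed to move a DTensor between layouts (sketch)."""
    def convert(self, t: DTensor, target: Layout) -> Tuple[DTensor, List[CommSpec]]:
        assert t.layout.mesh == target.mesh, "resharding keeps the mesh fixed"
        comms: List[CommSpec] = []
        # Dims sharded now but not in the target need an all-gather.
        for dim in t.layout.spec.dim_partition:
            if dim not in target.spec.dim_partition:
                comms.append(CommSpec("all-gather", dim))
        # Dims sharded only in the target need a local split.
        for dim in target.spec.dim_partition:
            if dim not in t.layout.spec.dim_partition:
                comms.append(CommSpec("split", dim))
        return DTensor(t.global_shape, target), comms

# Demo: an 8x16 tensor on a 2x4 mesh, resharded from row- to column-sharding.
mesh = DeviceMesh((2, 4))
row = Layout(mesh, ShardingSpec({0: 0}))   # shard tensor dim 0 over mesh dim 0
col = Layout(mesh, ShardingSpec({1: 1}))   # shard tensor dim 1 over mesh dim 1
t = DTensor((8, 16), row)
t2, comms = LayoutConverter().convert(t, col)
```

In this sketch the row-sharded tensor has local shape (4, 16), and converting to the column-sharded layout yields local shape (8, 4) via an all-gather on dim 0 followed by a split on dim 1; the real design would of course move data, not just metadata.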
Future work
After refactoring/implementing the features above, we could use them to build a new abstraction called DProxy, which would serve as a proxy for the real tensor in the automatic parallelism context. It would carry the information needed to estimate the memory and compute overhead of distributed operations.