From 5930560ea8cbbbb17189de4f492c5fb6e3cf0441 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sat, 14 Sep 2024 20:25:10 +0300
Subject: [PATCH 001/201] add dataloader-echonomic

---
 RFC-0000-dataloader-echonomic.md | 180 +++++++++++++++++++++++++++++++
 1 file changed, 180 insertions(+)
 create mode 100644 RFC-0000-dataloader-echonomic.md

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
new file mode 100644
index 00000000..05cae75c
--- /dev/null
+++ b/RFC-0000-dataloader-echonomic.md
@@ -0,0 +1,180 @@
<details>
<summary>Instructions - click to expand</summary>

- Fork the rfcs repo: https://github.com/pytorch/rfcs
- Copy `RFC-0000-template.md` to `RFC-00xx-my-feature.md`, or write your own open-ended proposal. Put care into the details.
- Submit a pull request titled `RFC-00xx-my-feature`.
  - Assign the `draft` label while composing the RFC. You may find it easier to use a WYSIWYG editor (like Google Docs) when working with a few close collaborators; feel free to use whatever platform you like. Ideally this document is publicly visible and is linked to from the PR.
  - When opening the RFC for general discussion, copy your document into the `RFC-00xx-my-feature.md` file on the PR and assign the `commenting` label.
- Build consensus for your proposal, integrate feedback and revise it as needed, and summarize the outcome of the discussion via a [resolution template](https://github.com/pytorch/rfcs/blob/master/RFC-0000-template.md#resolution).
  - If the RFC is idle here (no activity for 2 weeks), assign the label `stalled` to the PR.
- Once the discussion has settled, assign a new label based on the level of support:
  - `accepted` if a decision has been made in the RFC
  - `draft` if the author needs to rework the RFC’s proposal
  - `shelved` if there are no plans to move ahead with the current RFC’s proposal. We want neither to think about evaluating the proposal nor about implementing the described feature until some time in the future.
- A state of `accepted` means that the core team has agreed in principle to the proposal, and it is ready for implementation.
- The author (or any interested developer) should next open a tracking issue on Github corresponding to the RFC.
  - This tracking issue should contain the implementation next steps. Link to this tracking issue on the RFC (in the Resolution > Next Steps section)
- Once all relevant PRs are merged, the RFC’s status label can be finally updated to `closed`.

</details>

# Pytorch-DataLoader-Economic

**Authors:**
* @yoadbs

## **Summary**
A new pytorch dataloader multiprocessing design is suggested. It is designed to significantly reduce random-access-memory (RAM) usage, without any significant reduction in throughput (TPT).

## **Motivation**
Model input batch may require significant amounts of RAM. For example, in video processing or in 3D graphics applications.

By current design, dataloader multiprocessing workers simultaneously prepare batches and send them into shared memory, using a queue.
In practice, about [num_workers] batches are simultaneously stored in shared memory, shortly after epoch start.
At most, [num_workers * prefetch_factor] can be stored in shared memory at the same time.
The main process operates in parallel, to extracts one batch after another, and inject it into the model for training/validation/test.

Storing about [num_workers] batches in shared memory, at the same time, imposes a limit over [num_workers]:\
[num_workers] < [SERVER_RAM_AVAILABLE_BYTES] / [BATCH_SIZE_BYTES]\
This limitation can produce a bottleneck over training time, not allowing to increase num_workers, due to server's RAM limitations.
Alternatively, servers with more RAM can be used, increasing server cost.

A new dataloader multiprocessing pipeline is suggested.
In this pipeline the amount of batches sent into shared memory is not dependent on [num_workers].
This decoupling allows increasing [num_workers] without any significant increase in RAM consumption.
As in the current implementation, workers keep generating new data during the epoch (without entering an idle state), to avoid TPT reduction.
The new flow is designed to reduce RAM-related bottlenecks / requirements, and improve training cost-effectiveness.

## **Proposed Implementation**
### **Definitions**

| symbol               | description |
|----------------------|:------------|
| iw                   | items_worker (there are num_workers workers) |
| bw                   | batch worker |
| index_queue[iw]      | a queue for each items_worker - used to send the item's index (and metadata) to items_workers. Main process is putting data, and items_worker[iw] is getting data |
| item_queue[bw]       | one queue for each batch_worker - used to retrieve items from items_workers. All items workers are putting data, batch_worker[bw] is getting data |
| worker_result_queue  | one queue - used to send prepared batches back to main process. All batches workers are putting data, main process is getting data |
| item_idx             | item serial number (from epoch start) |
| batch_idx            | batch serial number (from epoch start) |
| item_index           | item's index, as in dataset.__getitem__(index) |
| iw_idx               | item_worker index (which item_worker is designated to process the item) |
| bw_idx               | batch_worker index (which batch_worker is designated to process the item) |

By current design, the class _MultiProcessingDataLoaderIter has one level of [num_workers] workers.
The main process sends [prefetch_factor] batches to each worker, by index_queue (separate queue for each worker).
Each worker prepares the batch, and sends it back to the main process through _worker_result_queue.
Whenever a batch is retrieved by the main process, another batch is sent to the appropriate worker.
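
To make this baseline concrete, here is a minimal sketch of the current one-level flow (an illustration only: `dataset_getitem` and `collate` are hypothetical stand-ins, dispatch is simplified to round-robin, and this is not the actual `_MultiProcessingDataLoaderIter` source):

```python
# Sketch of the CURRENT one-level design: each worker builds whole batches,
# so roughly [num_workers] prepared batches sit in shared memory at once.
import multiprocessing as mp

def dataset_getitem(index):        # hypothetical stand-in for dataset.__getitem__
    return index

def collate(items):                # hypothetical stand-in for collate_fn
    return items

def worker_loop(index_queue, result_queue):
    while True:
        job = index_queue.get()
        if job is None:            # sentinel -> shut down
            return
        batch_idx, indices = job
        # One worker prepares the WHOLE batch before sending it back.
        result_queue.put((batch_idx, collate([dataset_getitem(i) for i in indices])))

if __name__ == "__main__":
    num_workers, prefetch_factor, batch_size, num_batches = 2, 2, 4, 6
    result_queue = mp.Queue()
    index_queues = [mp.Queue() for _ in range(num_workers)]
    workers = [mp.Process(target=worker_loop, args=(q, result_queue)) for q in index_queues]
    for w in workers:
        w.start()
    sent = 0
    # Prime each worker with prefetch_factor batches ...
    while sent < min(num_workers * prefetch_factor, num_batches):
        index_queues[sent % num_workers].put(
            (sent, list(range(sent * batch_size, (sent + 1) * batch_size))))
        sent += 1
    for _ in range(num_batches):
        batch_idx, batch = result_queue.get()
        print("retrieved batch", batch_idx, batch)
        if sent < num_batches:     # ... then refill one batch per batch retrieved
            index_queues[sent % num_workers].put(
                (sent, list(range(sent * batch_size, (sent + 1) * batch_size))))
            sent += 1
    for q in index_queues:
        q.put(None)
    for w in workers:
        w.join()
```

Because every in-flight job is a full batch, the steady-state shared-memory footprint grows with num_workers, which is exactly the limitation quantified by the formula above.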

A new design for MultiProcessingDataLoaderIter class is suggested \
In the suggested design, there are 2 levels of workers:
* items_workers - designated to generate one item at a time (by running dataset __getitem__ function), and send to shared memory
  * This worker is similar to current design workers, but receiving and sending one item at a time (and not one batch at a time)
* batchs_workers - designated to get items from shared memory, collect batch items, run collate function, and send the prepared batch back to shared memory

By the new design, data flow will run as follows: \
main_process -> items_workers -> batch_workers -> main_process

### **main process high-level flow**
* Send one item at a time to items_workers (using index_queues)
  * Each item should include (item_idx, batch_idx, item_index, iw_idx, bw_idx):
  * Track number of items at work ("work-load") at each worker.
  * A different iw_idx should be selected for each item
    * Select iw_idx of the items_worker with the minimal work-load
  * An identical bw_idx should be selected for all items in the batch
    * Select bw_idx of the batches_worker with the minimal work-load
  * Make sure that the sum of items_workers work-load is always <= [prefetch_factor] * [batch_size]
    * Stop sending items when reaching this limit
* Retrieve and store prepared batches from batches_workers (by worker_result_queue)
  * Make sure to reduce work-load for the relevant batch_worker and for each relevant item_worker when retrieving the batch
* Once the next required batch is retrieved, return the batch to the caller function

### **items_worker main-loop flow**
* get item from index_queue
* run dataset __getitem__(item_index)
* send item to the designated batch_worker (by item's bw_idx), through a designated queue (item_queue)

### **batches_worker main-loop flow**
* get items from all items_workers through item_queue
* Once all items of a given batch are received, run collate_fn and send the prepared batch to worker_result_queue

### **Notes**
* A new parameter for num_batches_workers should be introduced
  * This parameter can be set by default to prefetch_factor. There is no reason to use larger value. However, smaller value may be considered, if collate_fn is very fast

## **Metrics**
What are the main metrics to measure the value of this feature?

## **Drawbacks**
Are there any reasons why we should not do this? Here we aim to evaluate risk and check ourselves.

Please consider:
* is it a breaking change?
* Impact on UX
* implementation cost, both in terms of code size and complexity
* integration of this feature with other existing and planned features

## **Alternatives**
What other designs have been considered? What is the impact of not doing this?

## **Prior Art**
Discuss prior art (both good and bad) in relation to this proposal:
* Does this feature exist in other libraries? What experience has their community had?
* What lessons can be learned from other implementations of this feature?
* Published papers or great posts that discuss this

## **How we teach this**
* What names and terminology work best for these concepts and why? How is this idea best presented?
* Would the acceptance of this proposal mean the PyTorch documentation must be re-organized or altered?
* How should this feature be taught to existing PyTorch users?

## **Unresolved questions**
* What parts of the design do you expect to resolve through the RFC process before this gets merged?
* What parts of the design do you expect to resolve through the implementation of this feature before stabilization?
* What related issues do you consider out of scope for this RFC that could be addressed in the future independently of the solution that comes out of this RFC?

## Resolution
We decided to do it. X% of the engineering team actively approved of this change.

### Level of Support
Choose one of the following:
* 1: Overwhelming positive feedback.
* 2: Positive feedback.
* 3: Majority Acceptance, with conflicting Feedback.
* 4: Acceptance, with Little Feedback.
* 5: Unclear Resolution.
* 6: RFC Rejected.
* 7: RFC Rejected, with Conflicting Feedback.

#### Additional Context
Some people were in favor of it, but some people didn’t want it for project X.

### Next Steps
Will implement it.

#### Tracking issue

#### Exceptions
Not implementing on project X now. Will revisit the decision in 1 year.

From 89076927cb06c117368e69264cdd61365d1b3412 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sat, 14 Sep 2024 20:30:52 +0300
Subject: [PATCH 002/201] aa

---
 RFC-0000-dataloader-echonomic.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 05cae75c..76041b6e 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -32,12 +32,12 @@
 * @yoadbs

 ## **Summary**
-A new pytorch dataloader multiprocessing design is suggested. It is designed to significantly reduce random-access-memory (RAM) usage, without any significant reduction in throughput (TPT).
+A new pytorch dataloader multiprocessing pipeline is suggested. This pipeline is designed to significantly reduce random-access-memory (RAM) usage, without any significant reduction in throughput (TPT).

 ## **Motivation**
 Model input batch may require significant amounts of RAM. For example, in video processing or in 3D graphics applications.

-By current design, dataloader multiprocessing workers simultaneously prepare batches and send them into shared memory, using a queue.
+By current dataloader multiprocessing pipeline design, workers simultaneously prepare batches and send them into shared memory, using a queue.
 In practice, about [num_workers] batches are simultaneously stored in shared memory, shortly after epoch start.
 At most, [num_workers * prefetch_factor] can be stored in shared memory at the same time.
 The main process operates in parallel, to extracts one batch after another, and inject it into the model for training/validation/test.

From 52711e4b000d5dec8abe3d0f64747968ba0a1541 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sat, 14 Sep 2024 20:32:07 +0300
Subject: [PATCH 003/201] aa

---
 RFC-0000-dataloader-echonomic.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 76041b6e..01ae96eb 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -43,7 +43,7 @@ At most, [num_workers * prefetch_factor] can be stored in shared memory at the s
 The main process operates in parallel, to extracts one batch after another, and inject it into the model for training/validation/test.

 Storing about [num_workers] batches in shared memory, at the same time, imposes a limit over [num_workers]:\
-[num_workers] < [SERVER_RAM_AVAILABLE_BYTES] / [BATCH_SIZE_BYTES]\
+[num_workers < SERVER_RAM_AVAILABLE_BYTES / BATCH_SIZE_BYTES]\
 This limitation can produce a bottleneck over training time, not allowing to increase num_workers, due to server's RAM limitations.
 Alternatively, servers with more RAM can be used, increasing server cost.

From 50ad21983d835d0e2d94c6c33a7f28e9be1bf0cf Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sat, 14 Sep 2024 20:34:01 +0300
Subject: [PATCH 004/201] aa

---
 RFC-0000-dataloader-echonomic.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 01ae96eb..483f9bb0 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -39,12 +39,12 @@ Model input batch may require significant amounts of RAM. For example, in video

 By current dataloader multiprocessing pipeline design, workers simultaneously prepare batches and send them into shared memory, using a queue.
 In practice, about [num_workers] batches are simultaneously stored in shared memory, shortly after epoch start.
-At most, [num_workers * prefetch_factor] can be stored in shared memory at the same time.
-The main process operates in parallel, to extracts one batch after another, and inject it into the model for training/validation/test.
+At most, [num_workers * prefetch_factor] may be stored in shared memory at the same time.
+The main process operates in parallel, to extract one batch after another, and inject it into the model for training/validation/test.

 Storing about [num_workers] batches in shared memory, at the same time, imposes a limit over [num_workers]:\
 [num_workers < SERVER_RAM_AVAILABLE_BYTES / BATCH_SIZE_BYTES]\
-This limitation can produce a bottleneck over training time, not allowing to increase num_workers, due to server's RAM limitations.
+This limitation can produce a bottleneck over training TPT, not allowing to increase num_workers, due to server's RAM limitations.
 Alternatively, servers with more RAM can be used, increasing server cost.

From 669e55885a8579068e2665d0bcad7650889238ae Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sat, 14 Sep 2024 20:34:30 +0300
Subject: [PATCH 005/201] aa

---
 RFC-0000-dataloader-echonomic.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 483f9bb0..5714f133 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -45,7 +45,7 @@ The main process operates in parallel, to extract one batch after another, and i
 Storing about [num_workers] batches in shared memory, at the same time, imposes a limit over [num_workers]:\
 [num_workers < SERVER_RAM_AVAILABLE_BYTES / BATCH_SIZE_BYTES]\
 This limitation can produce a bottleneck over training TPT, not allowing to increase num_workers, due to server's RAM limitations.
-Alternatively, servers with more RAM can be used, increasing server cost.
+Alternatively, servers with more RAM can be used, to increase num_workers, increasing server cost.

 A new dataloader multiprocessing pipeline is suggested.
 In this pipeline the amount of batches sent into shared memory is not dependent on [num_workers].
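
The RAM constraint that these revisions refine can be checked with quick arithmetic. The following is a worked example with invented, illustrative numbers (a video batch and an assumed 64 GiB shared-memory budget):

```python
# Worked example of [num_workers < SERVER_RAM_AVAILABLE_BYTES / BATCH_SIZE_BYTES]
# (all figures below are assumed for illustration only).
bytes_per_frame = 3 * 512 * 512 * 4                 # one CHW float32 frame
batch_size_bytes = bytes_per_frame * 64 * 8         # 64 frames/clip, 8 clips/batch
server_ram_available_bytes = 64 * 2**30             # assumed shared-memory budget

print(f"batch size: {batch_size_bytes / 2**30:.1f} GiB")                   # 1.5 GiB
print("max num_workers:", server_ram_available_bytes // batch_size_bytes)  # 42
# With prefetch_factor = 2, up to num_workers * 2 batches can be queued,
# so the practical ceiling on num_workers is roughly half of that again.
```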
From 8410ac4abe2c6769c91a50fc1fd213e06cafda68 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sat, 14 Sep 2024 20:37:58 +0300
Subject: [PATCH 006/201] aa

---
 RFC-0000-dataloader-echonomic.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 5714f133..93669efa 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -50,7 +50,7 @@ Alternatively, servers with more RAM can be used, to increase num_workers, increa
 A new dataloader multiprocessing pipeline is suggested.
 In this pipeline the amount of batches sent into shared memory is not dependent on [num_workers].
 This decoupling allows increasing [num_workers] without any significant increase in RAM consumption.
-As in the current implementation, workers keep generating new data during the epoch (without entering an idle state), to avoid TPT reduction.
+As in the current implementation, workers are not expected to enter an idle state during the epoch, hence no TPT reduction is expected for the same num_workers.
 The new flow is designed to reduce RAM-related bottlenecks / requirements, and improve training cost-effectiveness.

From ad91f492137df332fa484a4eb64fcc088e4942fa Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sat, 14 Sep 2024 20:38:30 +0300
Subject: [PATCH 007/201] aa

---
 RFC-0000-dataloader-echonomic.md | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 93669efa..4eba9b63 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -51,9 +51,7 @@ A new dataloader multiprocessing pipeline is suggested.
 In this pipeline the amount of batches sent into shared memory is not dependent on [num_workers].
 This decoupling allows increasing [num_workers] without any significant increase in RAM consumption.
 As in the current implementation, workers are not expected to enter an idle state during the epoch, hence no TPT reduction is expected for the same num_workers.
-The new flow is designed to reduce RAM-related bottlenecks / requirements, and improve training cost-effectiveness.
-
-
+The new flow is designed to reduce RAM-related bottlenecks and/or requirements, and improve training cost-effectiveness.

 ## **Proposed Implementation**
 ### **Definitions**

From 27f832dfde9151e4fe5ed834b8175c35d7d13264 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sat, 14 Sep 2024 20:40:50 +0300
Subject: [PATCH 008/201] aa

---
 RFC-0000-dataloader-echonomic.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 4eba9b63..366bf429 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -71,9 +71,9 @@ The new flow is designed to reduce RAM-related bottlenecks and/or requirements
 By current design, the class _MultiProcessingDataLoaderIter has one level of [num_workers] workers.
-The main process sends [prefetch_factor] batches to each worker, by index_queue (separate queue for each worker).
-Each worker prepares the batch, and sends it back to the main process through _worker_result_queue.
-Whenever a batch is retrieved by the main process, another batch is sent to the appropriate worker.
+The main process sends [prefetch_factor] batches to each worker, by index_queues.
+Each worker prepares the batch, and sends it back to the main process through worker_result_queue.
+After a batch is retrieved by the main process, another batch is sent to the appropriate worker.

 A new design for MultiProcessingDataLoaderIter class is suggested \
 In the suggested design, there are 2 levels of workers:
 * items_workers - designated to generate one item at a time (by running dataset __getitem__ function), and send to shared memory

From 8630a0acbc0406f1efb975258b9f0256f402cd82 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sat, 14 Sep 2024 20:42:11 +0300
Subject: [PATCH 009/201] aa

---
 RFC-0000-dataloader-echonomic.md | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 366bf429..2c294c20 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -60,8 +60,8 @@ The new flow is designed to reduce RAM-related bottlenecks and/or requirements
 |----------------------|:------------|
 | iw                   | items_worker (there are num_workers workers) |
 | bw                   | batch worker |
-| index_queue[iw]      | a queue for each items_worker - used to send the item's index (and metadata) to items_workers. Main process is putting data, and items_worker[iw] is getting data |
-| item_queue[bw]       | one queue for each batch_worker - used to retrieve items from items_workers. All items workers are putting data, batch_worker[bw] is getting data |
+| index_queue[iw]      | a queue for each items_worker - used to send the item's index (and metadata) to item_workers. Main process is putting data, and items_worker[iw] is getting data |
+| item_queue[bw]       | one queue for each batch_worker - used to retrieve items from item_workers. All items workers are putting data, batch_worker[bw] is getting data |
 | worker_result_queue  | one queue - used to send prepared batches back to main process. All batches workers are putting data, main process is getting data |
 | item_idx             | item serial number (from epoch start) |
 | batch_idx            | batch serial number (from epoch start) |
@@ -75,26 +75,25 @@ Each worker prepares the batch, and sends it back to the main process through wor
 After a batch is retrieved by the main process, another batch is sent to the appropriate worker.

-A new design for MultiProcessingDataLoaderIter class is suggested \
-In the suggested design, there are 2 levels of workers:
-* items_workers - designated to generate one item at a time (by running dataset __getitem__ function), and send to shared memory
+A new design for MultiProcessingDataLoaderIter class is suggested. In the suggested design, there are 2 levels of workers:
+* item_workers - designated to generate one item at a time (by running dataset __getitem__ function), and send to shared memory
  * This worker is similar to current design workers, but receiving and sending one item at a time (and not one batch at a time)
 * batchs_workers - designated to get items from shared memory, collect batch items, run collate function, and send the prepared batch back to shared memory

 By the new design, data flow will run as follows: \
-main_process -> items_workers -> batch_workers -> main_process
+main_process -> item_workers -> batch_workers -> main_process

 ### **main process high-level flow**
-* Send one item at a time to items_workers (using index_queues)
+* Send one item at a time to item_workers (using index_queues)
  * Each item should include (item_idx, batch_idx, item_index, iw_idx, bw_idx):
  * Track number of items at work ("work-load") at each worker.
  * A different iw_idx should be selected for each item
   * Select iw_idx of the items_worker with the minimal work-load
  * An identical bw_idx should be selected for all items in the batch
   * Select bw_idx of the batches_worker with the minimal work-load
- * Make sure that the sum of items_workers work-load is always <= [prefetch_factor] * [batch_size]
+ * Make sure that the sum of item_workers work-load is always <= [prefetch_factor] * [batch_size]
   * Stop sending items when reaching this limit
-* Retrieve and store prepared batches from batches_workers (by worker_result_queue)
+* Retrieve and store prepared batches from batch_workers (by worker_result_queue)
  * Make sure to reduce work-load for the relevant batch_worker and for each relevant item_worker when retrieving the batch
 * Once the next required batch is retrieved, return the batch to the caller function

 ### **items_worker main-loop flow**
 * get item from index_queue
 * run dataset __getitem__(item_index)
 * send item to the designated batch_worker (by item's bw_idx), through a designated queue (item_queue)

 ### **batches_worker main-loop flow**
-* get items from all items_workers through item_queue
+* get items from all item_workers through item_queue
 * Once all items of a given batch are received, run collate_fn and send the prepared batch to worker_result_queue

 ### **Notes**
-* A new parameter for num_batches_workers should be introduced
+* A new parameter for num_batch_workers should be introduced
  * This parameter can be set by default to prefetch_factor. There is no reason to use larger value. However, smaller value may be considered, if collate_fn is very fast

## **Metrics**

From 12609bca040a9ca6630fd53c2deadfc316f8590c Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sat, 14 Sep 2024 20:47:18 +0300
Subject: [PATCH 010/201] aa

---
 RFC-0000-dataloader-echonomic.md | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 2c294c20..e51ad543 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -56,18 +56,18 @@ The new flow is designed to reduce RAM-related bottlenecks and/or requirements
-| symbol | description |
-|----------------------|:------------|
-| iw | items_worker (there are num_workers workers) |
-| bw | batch worker |
-| index_queue[iw] | a queue for each items_worker - used to send the item's index (and metadata) to item_workers. Main process is putting data, and items_worker[iw] is getting data |
-| item_queue[bw] | one queue for each batch_worker - used to retrieve items from item_workers. All items workers are putting data, batch_worker[bw] is getting data |
-| worker_result_queue | one queue - used to send prepared batches back to main process. All batches workers are putting data, main process is getting data |
-| item_idx | item serial number (from epoch start) |
-| batch_idx | batch serial number (from epoch start) |
-| item_index | item's index, as in dataset.__getitem__(index) |
-| iw_idx | item_worker index (which item_worker is designated to process the item) |
-| bw_idx | batch_worker index (which batch_worker is designated to process the item) |
+| symbol               | description |
+|----------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| iw                   | items_worker (there are num_workers workers) |
+| bw                   | batch_worker |
+| index_queue[iw]      | a queue for each items_worker - used to send the item's index (and metadata) to item_workers. Main process is putting data, and items_worker[iw] is getting data |
+| item_queue[bw]       | one queue for each batch_worker - used to retrieve items from item_workers. All items workers are putting data, batch_worker[bw] is getting data |
+| worker_result_queue  | one queue - used to send prepared batches back to main process. All batches workers are putting data, main process is getting data |
+| item_idx             | item serial number (from epoch start) |
+| batch_idx            | batch serial number (from epoch start) |
+| item_index           | item's index, as in dataset.__getitem__(index) |
+| iw_idx               | item_worker index (which item_worker is designated to process the item) |
+| bw_idx               | batch_worker index (which batch_worker is designated to process the item) |

 By current design, the class _MultiProcessingDataLoaderIter has one level of [num_workers] workers.
 The main process sends [prefetch_factor] batches to each worker, by index_queues.

From 4aa041b0c49bb7c6684c69b0c5810871d3c674b1 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sat, 14 Sep 2024 20:48:30 +0300
Subject: [PATCH 011/201] aa

---
 RFC-0000-dataloader-echonomic.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index e51ad543..0655f411 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -76,7 +76,7 @@ Each worker prepares the batch, and sends it back to the main process through wor
 After a batch is retrieved by the main process, another batch is sent to the appropriate worker.

 A new design for MultiProcessingDataLoaderIter class is suggested. In the suggested design, there are 2 levels of workers:
-* item_workers - designated to generate one item at a time (by running dataset __getitem__ function), and send to shared memory
+* item_workers - designated to generate one item at a time (by running dataset \_\_getitem__ function), and send to shared memory
  * This worker is similar to current design workers, but receiving and sending one item at a time (and not one batch at a time)
 * batchs_workers - designated to get items from shared memory, collect batch items, run collate function, and send the prepared batch back to shared memory

From b81aaaa254c24423e9cc659d66554a6e27e3b997 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sat, 14 Sep 2024 20:50:05 +0300
Subject: [PATCH 012/201] aa

---
 RFC-0000-dataloader-echonomic.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 0655f411..cba29cf8 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -76,9 +76,9 @@ Each worker prepares the batch, and sends it back to the main process through wor
 A new design for MultiProcessingDataLoaderIter class is suggested. In the suggested design, there are 2 levels of workers:
-* item_workers - designated to generate one item at a time (by running dataset \_\_getitem__ function), and send to shared memory
- * This worker is similar to current design workers, but receiving and sending one item at a time (and not one batch at a time)
-* batchs_workers - designated to get items from shared memory, collect batch items, run collate function, and send the prepared batch back to shared memory
+* item_workers - designated to generate one item at a time (by running dataset \_\_getitem__ function), and send it to shared memory
+ * This worker is similar to the workers of the current design, but it receives and sends one item at a time (and not one batch at a time)
+* batch_workers - designated to get items from shared memory, collect batch items, run collate function, and send the prepared batch back to shared memory

From 701d622d6a2638459cc37aecbe433ea7a2e76c77 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sat, 14 Sep 2024 20:51:51 +0300
Subject: [PATCH 013/201] aa

---
 RFC-0000-dataloader-echonomic.md | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index cba29cf8..62f7cc75 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -76,12 +76,11 @@ Each worker prepares the batch, and sends it back to the main process through wor
 A new design for MultiProcessingDataLoaderIter class is suggested. In the suggested design, there are 2 levels of workers:
-* item_workers - designated to generate one item at a time (by running dataset \_\_getitem__ function), and send it to shared memory
+* item_workers - Designated to generate one item at a time (by running dataset \_\_getitem__ function), and send it to shared memory
  * This worker is similar to the workers of the current design, but it receives and sends one item at a time (and not one batch at a time)
-* batch_workers - designated to get items from shared memory, collect batch items, run collate function, and send the prepared batch back to shared memory
+* batch_workers - Designated to get items from shared memory, collect [batch_size] items, run collate function, and send the prepared batch back to shared memory

-By the new design, data flow will run as follows: \
-main_process -> item_workers -> batch_workers -> main_process
+By the new design, data will flow by the following order: main_process -> item_workers -> batch_workers -> main_process

From aaed9cc8df74f8977bd2660da819593e1d2b64c6 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sat, 14 Sep 2024 20:54:04 +0300
Subject: [PATCH 014/201] aa

---
 RFC-0000-dataloader-echonomic.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 62f7cc75..57c41683 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -83,12 +83,12 @@ A new design for MultiProcessingDataLoaderIter class is suggested. In the sugges
 By the new design, data will flow by the following order: main_process -> item_workers -> batch_workers -> main_process

 ### **main process high-level flow**
-* Send one item at a time to item_workers (using index_queues)
- * Each item should include (item_idx, batch_idx, item_index, iw_idx, bw_idx):
- * Track number of items at work ("work-load") at each worker.
- * A different iw_idx should be selected for each item
+* Send one item at a time to item_workers (by index_queues)
+ * Each item should include the following data: (item_idx, batch_idx, item_index, iw_idx, bw_idx):
+ * Track number of items at work ("work-load") for each worker.
+ * A different iw_idx should be assigned to each item
   * Select iw_idx of the items_worker with the minimal work-load
- * An identical bw_idx should be selected for all items in the batch
+ * An identical bw_idx should be assigned to all items in the same batch
   * Select bw_idx of the batches_worker with the minimal work-load

From 254218362682852fabc06c551bb8c0bd923fe15a Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sat, 14 Sep 2024 20:58:06 +0300
Subject: [PATCH 015/201] aa

---
 RFC-0000-dataloader-echonomic.md | 27 +++++++++++++--------------
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 57c41683..ae9987fa 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -56,18 +56,18 @@ The new flow is designed to reduce RAM-related bottlenecks and/or requirements
-| symbol               | description |
-|----------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| iw                   | items_worker (there are num_workers workers) |
-| bw                   | batch_worker |
-| index_queue[iw]      | a queue for each items_worker - used to send the item's index (and metadata) to item_workers. Main process is putting data, and items_worker[iw] is getting data |
-| item_queue[bw]       | one queue for each batch_worker - used to retrieve items from item_workers. All items workers are putting data, batch_worker[bw] is getting data |
-| worker_result_queue  | one queue - used to send prepared batches back to main process. All batches workers are putting data, main process is getting data |
-| item_idx             | item serial number (from epoch start) |
-| batch_idx            | batch serial number (from epoch start) |
-| item_index           | item's index, as in dataset.__getitem__(index) |
-| iw_idx               | item_worker index (which item_worker is designated to process the item) |
-| bw_idx               | batch_worker index (which batch_worker is designated to process the item) |
+| symbol               | description |
+|----------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| iw                   | items_worker (there are num_workers workers) |
+| bw                   | batch_worker |
+| index_queue[iw]      | A queue for each items_worker - used to send the item's index (and metadata) to item_workers. Main process is putting data, and items_worker[iw] is getting data |
+| item_queue[bw]       | One queue for each batch_worker - used to retrieve items from item_workers. All items workers are putting data, batch_worker[bw] is getting data |
+| worker_result_queue  | One queue - used to send prepared batches back to main process. All batches workers are putting data, main process is getting data |
+| item_idx             | Item serial index from epoch start (0 for first item, 1 for next item, etc) |
+| batch_idx            | Batch serial index from epoch start (0 for first batch, 1 for next batch, etc) |
+| item_index           | Item's dataset index, as in dataset.__getitem__(index) |
+| iw_idx               | Item_worker index (which item_worker is designated to process the item) |
+| bw_idx               | Batch_worker index (which batch_worker is designated to process the item) |
@@ -90,8 +90,7 @@ By the new design, data will flow by the following order: main_process -> item_w
   * Select bw_idx of the batches_worker with the minimal work-load
- * Make sure that the sum of item_workers work-load is always <= [prefetch_factor] * [batch_size]
-  * Stop sending items when reaching this limit
+ * Make sure that the sum of item_workers work-load is always <= [prefetch_factor] * [batch_size]. Stop sending items when reaching this limit.
 * Retrieve and store prepared batches from batch_workers (by worker_result_queue)
  * Make sure to reduce work-load for the relevant batch_worker and for each relevant item_worker when retrieving the batch
 * Once the next required batch is retrieved, return the batch to the caller function

From 7e6974a92bebdf309b835b24cb1d6932f667cc53 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sat, 14 Sep 2024 21:04:13 +0300
Subject: [PATCH 016/201] aa

---
 RFC-0000-dataloader-echonomic.md | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index ae9987fa..d88155aa 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -97,16 +97,17 @@ By the new design, data will flow by the following order: main_process -> item_w
 ### **items_worker main-loop flow**
 * get item from index_queue
-* run dataset __getitem__(item_index)
-* send item to the designated batch_worker (by item's bw_idx), through a designated queue (item_queue)
+* run dataset.\_\_getitem__(item_index)
+* send item to batch_worker by item_queue[bw_idx]

 ### **batches_worker main-loop flow**
-* get items from all item_workers through item_queue
+* get items from item_queue
 * Once all items of a given batch are received, run collate_fn and send the prepared batch to worker_result_queue

 ### **Notes**
-* A new parameter for num_batch_workers should be introduced
- * This parameter can be set by default to prefetch_factor. There is no reason to use larger value. However, smaller value may be considered, if collate_fn is very fast
+* A new dataloader parameter: num_batch_workers should be introduced
+ * By default, this parameter should be set to prefetch_factor.
+ * There is no reason to use a larger value than prefetch_factor. However, smaller value may be considered by the user, if collate_fn is very fast

 ## **Metrics**
 What are the main metrics to measure the value of this feature?
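
At this point the two-level pipeline is fully specified. The following is a minimal runnable sketch of that flow, under simplifying assumptions (toy `dataset_getitem`/`collate_fn` stand-ins, a sequential sampler so item_index == item_idx, round-robin batch_worker selection instead of least-loaded, and grow-only work-load counters); it is an illustration, not the proposed production implementation:

```python
# Sketch of the PROPOSED two-level flow: items travel main_process ->
# item_workers -> batch_workers -> main_process, and at most
# prefetch_factor * batch_size items are in flight regardless of num_workers.
import multiprocessing as mp

def dataset_getitem(item_index):     # hypothetical stand-in for dataset.__getitem__
    return item_index * 10

def collate_fn(items):               # hypothetical stand-in for the collate function
    return sorted(items)

def items_worker(index_queue, item_queues):
    while True:
        msg = index_queue.get()
        if msg is None:              # sentinel -> shut down
            return
        item_idx, batch_idx, item_index, bw_idx = msg
        # One ITEM at a time, forwarded to the batch_worker chosen by the main process.
        item_queues[bw_idx].put((batch_idx, dataset_getitem(item_index)))

def batches_worker(item_queue, result_queue, batch_size):
    pending = {}                     # batch_idx -> items collected so far
    while True:
        msg = item_queue.get()
        if msg is None:
            return
        batch_idx, item = msg
        pending.setdefault(batch_idx, []).append(item)
        if len(pending[batch_idx]) == batch_size:    # batch complete -> collate
            result_queue.put((batch_idx, collate_fn(pending.pop(batch_idx))))

if __name__ == "__main__":
    num_workers, num_batch_workers, batch_size, prefetch_factor = 4, 2, 4, 2
    num_batches = 5
    result_queue = mp.Queue()
    index_queues = [mp.Queue() for _ in range(num_workers)]
    item_queues = [mp.Queue() for _ in range(num_batch_workers)]
    iws = [mp.Process(target=items_worker, args=(q, item_queues)) for q in index_queues]
    bws = [mp.Process(target=batches_worker, args=(q, result_queue, batch_size))
           for q in item_queues]
    for p in iws + bws:
        p.start()

    sent_per_iw = [0] * num_workers          # simplified work-load bookkeeping: the
    in_flight, item_idx = 0, 0               # counters only grow here; the RFC also
    limit = prefetch_factor * batch_size     # decrements them per retrieved batch
    done, next_needed = {}, 0
    total_items = num_batches * batch_size
    while next_needed < num_batches:
        # keep sending items while under the global in-flight limit
        while in_flight < limit and item_idx < total_items:
            batch_idx = item_idx // batch_size
            bw_idx = batch_idx % num_batch_workers        # same bw_idx for the whole batch
            iw_idx = sent_per_iw.index(min(sent_per_iw))  # least-loaded items_worker
            index_queues[iw_idx].put((item_idx, batch_idx, item_idx, bw_idx))
            sent_per_iw[iw_idx] += 1
            in_flight += 1
            item_idx += 1
        batch_idx, batch = result_queue.get()
        in_flight -= batch_size
        done[batch_idx] = batch
        while next_needed in done:                        # hand batches back in order
            print("batch", next_needed, done.pop(next_needed))
            next_needed += 1
    for q in index_queues + item_queues:
        q.put(None)
    for p in iws + bws:
        p.join()
```

Note how `limit = prefetch_factor * batch_size` caps the in-flight items globally: this is the decoupling from num_workers that the Motivation section argues for.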
From ce2ed98402c826f7047aeb7436d6ed1b6682b73c Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 14 Sep 2024 21:08:58 +0300 Subject: [PATCH 017/201] aa --- RFC-0000-dataloader-echonomic.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index d88155aa..f3a578bf 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -110,8 +110,10 @@ By the new design, data will flow by the following order: main_process -> item_w * There is no reason to use a larger value than prefetch_factor. However, smaller value may be considered by the user, if collate_fn is very fast ## **Metrics ** -What are the main metrics to measure the value of this feature? - +For similar configuration, the new flow should require significantly less shared memory, while preserving TPT. +To monitor shared memory usage, type in linux server terminal: \ +$ monitor -n0.1 df -h \ +and review /dev/shm "used" column ## **Drawbacks** Are there any reasons why we should not do this? Here we aim to evaluate risk and check ourselves. From 7a10f005d8e38d6905d7914b402ceac7de3570f3 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 14 Sep 2024 23:59:21 +0300 Subject: [PATCH 018/201] aa --- RFC-0000-dataloader-echonomic.md | 59 ++++++++++---------------------- 1 file changed, 18 insertions(+), 41 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index f3a578bf..d3cfbfa2 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -37,21 +37,21 @@ A new pytorch dataloader multiprocessing pipline is suggested. This pipline is d ## **Motivation** Model input batch may require significant amounts of RAM. For example, in video processing or in 3D graphics applications. -By current dataloader multiprocessing pipline design, workers simultaneously prepere batches and send them into shared memory, using a queue. +By current dataloader multiprocessing pipline design, workers simultaneously prepere batches and send them into shared memory, by a queue. In practice, about [num_workers] batches are simultenously stored in shared memory, nearly after epoch start. At most, [num_workers * prefetch_factor] may be stored in shared memory at the same time. -The main process operates in parallel, to extract one batch after another, and inject it into the model for training/validation/test. +The main process operates in parallel to the workers, to extract one batch after another, from shared memory, and inject it into the model for training/validation/test. Storing about [num_workers] batches in shared memory, at the same time, imposes a limit over [num_workers]:\ -[num_workers < SERVER_RAM_AVAILABLE_BYTES / BATCH_SIZE_BYTES]\ -This limitation can produce a bottleneck over training TPT, not allowing to increase num_workers, due to server's RAM limitations. -Alternatively, severs with more RAM can be used, to increase num_workers, increaseing severs cost. +[num_workers < servers_total_available_ram_in_bytes / batch_size_in_bytes]\ +This limitation can produce a bottleneck over training TPT, by not allowing to increase num_workers, due to server's RAM limitations. +Alternatively, in order to increase num_workers, a severs with more RAM can be used, increaseing sever cost. A new dataloader multiprocessing pipeline is suggested. -In this pipline the amount of batches sent into shared memory is not dependant in [num_workers]. 
-This decoupling, allowes to increase [num_workers] without any significant increase in RAM consumption. -As in current implemnation, workers are not expected to enter idle state during the epoch, hence no TPT reduction is expected for the same num_workers. -The new flow is designated to reduce RAM related bottelnecks and/or requirements, and improve training costeffectiveness. +In this pipline, only up to [prefetch_factor] batches are simultenously processed by all the workers together. +This decoupling from [num_workers], allowes to increase [num_workers], without any significant increase in shared memory consumption. +As in current implemnation, workers are constantly recieving items, and are not expected to enter idle state. Hence no TPT reduction is expected. +By introducing this improvement, the new flow is designated to reduce RAM related bottelnecks and/or requirements, and improve training costeffectiveness. ## **Proposed Implementation** ### **Definitions** @@ -110,44 +110,21 @@ By the new design, data will flow by the following order: main_process -> item_w * There is no reason to use a larger value than prefetch_factor. However, smaller value may be considered by the user, if collate_fn is very fast ## **Metrics ** -For similar configuration, the new flow should require significantly less shared memory, while preserving TPT. +The new flow should require significantly less shared memory, while preserving TPT. \ To monitor shared memory usage, type in linux server terminal: \ $ monitor -n0.1 df -h \ -and review /dev/shm "used" column +and review /dev/shm "used" column. ## **Drawbacks** -Are there any reasons why we should not do this? Here we aim to evaluate risk and check ourselves. - -Please consider: -* is it a breaking change? -* Impact on UX -* implementation cost, both in terms of code size and complexity -* integration of this feature with other existing and planned features - - -## **Alternatives** -What other designs have been considered? What is the impact of not doing this? - - -## **Prior Art** -Discuss prior art (both good and bad) in relation to this proposal: -* Does this feature exist in other libraries? What experience has their community had? -* What lessons can be learned from other implementations of this feature? -* Published papers or great posts that discuss this - +In the suggested implementation, the prefetch_factor becomes more prominent. +It determines the total number of items sent simultenously to all workers, and (by default) also determines num_workers_batches. +Hence, this parameter should be set with more attention by the user. Especially when collate_fn is TPT consuming ## **How we teach this** -* What names and terminology work best for these concepts and why? How is this idea best presented? -* Would the acceptance of this proposal mean the PyTorch documentation must be re-organized or altered? -* How should this feature be taught to existing PyTorch users? - - -## **Unresolved questions** -* What parts of the design do you expect to resolve through the RFC process before this gets merged? -* What parts of the design do you expect to resolve through the implementation of this feature before stabilization? -* What related issues do you consider out of scope for this RFC that could be addressed in the future independently of the solution that comes out of this RFC? 
- - +* dataloader documentation should be updated to include: + * Add a new parameter: num_batch_workers + * Revise prefetch_factor parameter description + ## Resolution We decided to do it. X% of the engineering team actively approved of this change. From 6f4f3e8cd59c6165927bf31b0d2838be4c7ab99b Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sun, 15 Sep 2024 00:03:58 +0300 Subject: [PATCH 019/201] aa --- RFC-0000-dataloader-echonomic.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index d3cfbfa2..45334194 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -110,7 +110,7 @@ By the new design, data will flow by the following order: main_process -> item_w * There is no reason to use a larger value than prefetch_factor. However, smaller value may be considered by the user, if collate_fn is very fast ## **Metrics ** -The new flow should require significantly less shared memory, while preserving TPT. \ +The new flow should require significantly less shared memory, while preserving TPT (for the same num_workers, and using a large enough prefetch_factor). \ To monitor shared memory usage, type in linux server terminal: \ $ monitor -n0.1 df -h \ and review /dev/shm "used" column. @@ -118,7 +118,8 @@ and review /dev/shm "used" column. ## **Drawbacks** In the suggested implementation, the prefetch_factor becomes more prominent. It determines the total number of items sent simultenously to all workers, and (by default) also determines num_workers_batches. -Hence, this parameter should be set with more attention by the user. Especially when collate_fn is TPT consuming +Hence, this parameter should be set with more attention by the user. Especially when collate_fn is TPT consuming. +A larger default value for prefetch_factor, may be considered (for example 3 instead of 2). ## **How we teach this** * dataloader documentation should be updated to include: From 67d3e6ac84a051514b2b88ccc655c75eb122ea7e Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sun, 15 Sep 2024 00:10:11 +0300 Subject: [PATCH 020/201] aa --- RFC-0000-dataloader-echonomic.md | 1 + 1 file changed, 1 insertion(+) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 45334194..2b47fc90 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -120,6 +120,7 @@ In the suggested implementation, the prefetch_factor becomes more prominent. It determines the total number of items sent simultenously to all workers, and (by default) also determines num_workers_batches. Hence, this parameter should be set with more attention by the user. Especially when collate_fn is TPT consuming. A larger default value for prefetch_factor, may be considered (for example 3 instead of 2). +Additionally, number of workers required for the same TPT will increase by num_batch_workers. ## **How we teach this** * dataloader documentation should be updated to include: From efc757d65f085a9b7c67158f2e4b4d6f61f62ed5 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sun, 15 Sep 2024 00:11:32 +0300 Subject: [PATCH 021/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 2b47fc90..059e637b 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -39,7 +39,7 @@ Model input batch may require significant amounts of RAM. 
For example, in video By current dataloader multiprocessing pipline design, workers simultaneously prepere batches and send them into shared memory, by a queue. In practice, about [num_workers] batches are simultenously stored in shared memory, nearly after epoch start. -At most, [num_workers * prefetch_factor] may be stored in shared memory at the same time. +At most, [num_workers * prefetch_factor] may be simultenously stored in shared memory. The main process operates in parallel to the workers, to extract one batch after another, from shared memory, and inject it into the model for training/validation/test. Storing about [num_workers] batches in shared memory, at the same time, imposes a limit over [num_workers]:\ From dce9bb9dfff575bf4a5317248f48d56bea72b8c4 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sun, 15 Sep 2024 00:12:32 +0300 Subject: [PATCH 022/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 059e637b..1109c5f5 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -42,7 +42,7 @@ In practice, about [num_workers] batches are simultenously stored in shared memo At most, [num_workers * prefetch_factor] may be simultenously stored in shared memory. The main process operates in parallel to the workers, to extract one batch after another, from shared memory, and inject it into the model for training/validation/test. -Storing about [num_workers] batches in shared memory, at the same time, imposes a limit over [num_workers]:\ +Simultenously storing about [num_workers] batches in shared memory, imposes a limit over [num_workers]:\ [num_workers < servers_total_available_ram_in_bytes / batch_size_in_bytes]\ This limitation can produce a bottleneck over training TPT, by not allowing to increase num_workers, due to server's RAM limitations. Alternatively, in order to increase num_workers, a severs with more RAM can be used, increaseing sever cost. From a67e58e2fa0aba1d6ba459b001c44ab46559e3c2 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sun, 15 Sep 2024 00:13:27 +0300 Subject: [PATCH 023/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 1109c5f5..bf6640a1 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -44,7 +44,7 @@ The main process operates in parallel to the workers, to extract one batch after Simultenously storing about [num_workers] batches in shared memory, imposes a limit over [num_workers]:\ [num_workers < servers_total_available_ram_in_bytes / batch_size_in_bytes]\ -This limitation can produce a bottleneck over training TPT, by not allowing to increase num_workers, due to server's RAM limitations. +This limitation can produce a bottleneck over training TPT, not allowing to increase num_workers, due to server's RAM limitations. Alternatively, in order to increase num_workers, a severs with more RAM can be used, increaseing sever cost. A new dataloader multiprocessing pipeline is suggested. 
From 03526952bf5a3d5449620cea0d46a7811283bdb3 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sun, 15 Sep 2024 00:14:55 +0300 Subject: [PATCH 024/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index bf6640a1..ef5374d4 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -50,7 +50,7 @@ Alternatively, in order to increase num_workers, a severs with more RAM can be u A new dataloader multiprocessing pipeline is suggested. In this pipline, only up to [prefetch_factor] batches are simultenously processed by all the workers together. This decoupling from [num_workers], allowes to increase [num_workers], without any significant increase in shared memory consumption. -As in current implemnation, workers are constantly recieving items, and are not expected to enter idle state. Hence no TPT reduction is expected. +As in current implemnation, workers are not expected to enter idle state. Hence no TPT reduction is expected. By introducing this improvement, the new flow is designated to reduce RAM related bottelnecks and/or requirements, and improve training costeffectiveness. ## **Proposed Implementation** From 907f5e4fc50a363572308dfb53bdd9585392eb83 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sun, 15 Sep 2024 00:15:20 +0300 Subject: [PATCH 025/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index ef5374d4..57169917 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -51,7 +51,7 @@ A new dataloader multiprocessing pipeline is suggested. In this pipline, only up to [prefetch_factor] batches are simultenously processed by all the workers together. This decoupling from [num_workers], allowes to increase [num_workers], without any significant increase in shared memory consumption. As in current implemnation, workers are not expected to enter idle state. Hence no TPT reduction is expected. -By introducing this improvement, the new flow is designated to reduce RAM related bottelnecks and/or requirements, and improve training costeffectiveness. +The new flow is designated to reduce RAM related bottelnecks and/or requirements, and improve training costeffectiveness. ## **Proposed Implementation** ### **Definitions** From 693c4d4be0934e1137d7cbf2ec3691096b0c8d3c Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sun, 15 Sep 2024 00:15:50 +0300 Subject: [PATCH 026/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 57169917..ed1ac4b7 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -51,7 +51,7 @@ A new dataloader multiprocessing pipeline is suggested. In this pipline, only up to [prefetch_factor] batches are simultenously processed by all the workers together. This decoupling from [num_workers], allowes to increase [num_workers], without any significant increase in shared memory consumption. As in current implemnation, workers are not expected to enter idle state. Hence no TPT reduction is expected. -The new flow is designated to reduce RAM related bottelnecks and/or requirements, and improve training costeffectiveness. 
+The suggested flow is designated to reduce RAM-related bottlenecks and/or requirements, and improve training cost-effectiveness.

## **Proposed Implementation**
### **Definitions**

From f59538c316593283d5a3e094daced1f80bd672e8 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sun, 15 Sep 2024 00:25:50 +0300
Subject: [PATCH 027/201] aa

---
 RFC-0000-dataloader-echonomic.md | 22 ++++++++++------------
 1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index ed1ac4b7..7f25c296 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -56,18 +56,16 @@ The suggested flow is designated to reduce RAM-related bottlenecks and/or requir
## **Proposed Implementation**
### **Definitions**

-| symbol | description |
-|----------------------|:-------------------------------------------------------------------------------------------------------|
-| iw | items_worker (there are num_workers workers) |
-| bw | batch_worker |
-| index_queue[iw] | A queue for each items_worker - used to send the item's index (and metadata) to item_workers. The main process is putting data, and items_worker[iw] is getting data |
-| item_queue[ib] | One queue for each batch_worker - used to retrieve items from item_workers. All items_workers are putting data, batch_worker[ib] is getting data |
-| worker_result_queue | One queue - used to send prepared batches back to the main process. All batch_workers are putting data, the main process is getting data |
-| item_idx | Item serial index from epoch start (0 for first item, 1 for next item, etc) |
-| batch_idx | Batch serial index from epoch start (0 for first batch, 1 for next batch, etc) |
-| item_index | Item's dataset index, as in dataset.__getitem__(index) |
-| iw_idx | Item_worker index (which item_worker is designated to process the item) |
-| bw_idx | Batch_worker index (which batch_worker is designated to process the item) |
+| symbol | description |
+|-----------------------|:-----------------------------------------------------------------------------------------------------|
+| index_queue | A queue used to send the item's index and metadata from the main process to an item_worker. There is a separate queue for each item_worker. |
+| item_queue | A queue used to send an item from an item_worker to a batch_worker. There is a separate queue for each batch_worker. |
+| worker_result_queue | A queue used to send prepared batches from batch_workers to the main process. |
+| item_idx | Item serial index from epoch start (0 for first item, 1 for next item, etc) |
+| batch_idx | Batch serial index from epoch start (0 for first batch, 1 for next batch, etc) |
+| item_index | Item's dataset index, as in dataset.__getitem__(index) |
+| iw_idx | Item_worker index (which item_worker is designated to process the item) |
+| bw_idx | Batch_worker index (which batch_worker is designated to process the item) |

By current design, the class _MultiProcessingDataLoaderIter has one level of [num_workers] workers.
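As a reading aid for the definitions above, the following minimal Python sketch shows how the three queue types could be wired together with multiprocessing; the worker counts and variable names are illustrative assumptions, not part of the proposal's API:

```python
import multiprocessing as mp

num_item_workers = 8    # assumed; plays the role of num_workers
num_batch_workers = 2   # assumed; matches prefetch_factor by default

# one index_queue per item_worker: main process -> item_workers
index_queues = [mp.Queue() for _ in range(num_item_workers)]
# one item_queue per batch_worker: item_workers -> batch_workers
item_queues = [mp.Queue() for _ in range(num_batch_workers)]
# a single result queue: batch_workers -> main process
worker_result_queue = mp.Queue()
```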
From 81455e4c087b67546d5bdf519b8e2ff3c6d23974 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sun, 15 Sep 2024 00:29:44 +0300
Subject: [PATCH 028/201] aa

---
 RFC-0000-dataloader-echonomic.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 7f25c296..81ee45fd 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -108,7 +108,7 @@ By the new design, data will flow in the following order: main_process -> item_w
* There is no reason to use a larger value than prefetch_factor. However, a smaller value may be considered by the user, if collate_fn is very fast

## **Metrics**
-The new flow should require significantly less shared memory, while preserving TPT (for the same num_workers, and using a large enough prefetch_factor). \
+The new flow should require significantly less shared memory, while preserving TPT, using similar configurations. \
To monitor shared memory usage, type in a Linux server terminal: \
$ watch -n0.1 df -h \
and review the /dev/shm "used" column.
@@ -118,7 +118,8 @@ In the suggested implementation, the prefetch_factor becomes more prominent.
It determines the total number of items sent simultaneously to all workers, and (by default) also determines num_batch_workers.
Hence, this parameter should be set with more attention by the user, especially when collate_fn is TPT consuming.
A larger default value for prefetch_factor may be considered (for example 3 instead of 2).
-Additionally, the number of workers required for the same TPT will increase by num_batch_workers.
+
+Additionally, the number of workers required for the same TPT increases by num_batch_workers.

From efe328e24e2ce14212ffeff3bc98cb4a38f73c4c Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sun, 15 Sep 2024 00:30:29 +0300
Subject: [PATCH 029/201] aa

---
 RFC-0000-dataloader-echonomic.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 81ee45fd..a1dd7b1c 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -124,7 +124,7 @@ Additionally, the number of workers required for the same TPT increases by num_batch
## **How we teach this**
* dataloader documentation should be updated to include:
  * Add a new parameter: num_batch_workers
-  * Revise prefetch_factor parameter description
+  * Revise parameter description: prefetch_factor

## Resolution
We decided to do it. X% of the engineering team actively approved of this change.

From f1c86c4d588396f0912a3d9ece75dc9d7c116fde Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sun, 15 Sep 2024 00:32:51 +0300
Subject: [PATCH 030/201] aa

---
 RFC-0000-dataloader-echonomic.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index a1dd7b1c..afdbb6de 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -116,8 +116,7 @@ and review the /dev/shm "used" column.

## **Drawbacks**
In the suggested implementation, the prefetch_factor becomes more prominent.
It determines the total number of items sent simultaneously to all workers, and (by default) also determines num_batch_workers.
-Hence, this parameter should be set with more attention by the user, especially when collate_fn is TPT consuming.
-A larger default value for prefetch_factor may be considered (for example 3 instead of 2).
+Hence, this parameter should be set with more attention. Additionally, a larger default value may be considered (possibly 3 instead of 2).

Additionally, the number of workers required for the same TPT increases by num_batch_workers.

From c2681282835561ebc20b65233124cdf84b4a0849 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sun, 15 Sep 2024 00:33:31 +0300
Subject: [PATCH 031/201] aa

---
 RFC-0000-dataloader-echonomic.md | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index afdbb6de..4bf317b2 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -114,11 +114,10 @@ $ watch -n0.1 df -h \
and review the /dev/shm "used" column.

## **Drawbacks**
-In the suggested implementation, the prefetch_factor becomes more prominent.
+* In the suggested implementation, the prefetch_factor becomes more prominent.
It determines the total number of items sent simultaneously to all workers, and (by default) also determines num_batch_workers.
Hence, this parameter should be set with more attention. Additionally, a larger default value may be considered (possibly 3 instead of 2).
-
-Additionally, the number of workers required for the same TPT increases by num_batch_workers.
+* The number of workers required for the same TPT increases by num_batch_workers.

From c733ba010884bcb957ffdb2c928783137f01c408 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sun, 15 Sep 2024 00:36:30 +0300
Subject: [PATCH 032/201] aa

---
 RFC-0000-dataloader-echonomic.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 4bf317b2..33083e88 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -50,7 +50,7 @@ Alternatively, in order to increase num_workers, a server with more RAM can be u
A new dataloader multiprocessing pipeline is suggested.
In this pipeline, only up to [prefetch_factor] batches are simultaneously processed by all the workers together.
This decoupling from [num_workers] allows increasing [num_workers] without any significant increase in shared memory consumption.
-As in the current implementation, workers are not expected to enter idle state. Hence no TPT reduction is expected.
+As in the current implementation, workers are not expected to enter idle state, hence no TPT reduction is expected.
The suggested flow is designated to reduce RAM-related bottlenecks and/or requirements, and improve training cost-effectiveness.

From e176e3bbd4e927e18cc73b5846eec8f0ef458665 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sun, 15 Sep 2024 00:37:02 +0300
Subject: [PATCH 033/201] aa

---
 RFC-0000-dataloader-echonomic.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 33083e88..ca0d9a12 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -51,6 +51,7 @@ A new dataloader multiprocessing pipeline is suggested.
In this pipeline, only up to [prefetch_factor] batches are simultaneously processed by all the workers together.
This decoupling from [num_workers] allows increasing [num_workers] without any significant increase in shared memory consumption.
As in the current implementation, workers are not expected to enter idle state, hence no TPT reduction is expected.
+
The suggested flow is designated to reduce RAM-related bottlenecks and/or requirements, and improve training cost-effectiveness.

From d14bbb44f23076c42589e0ccee752a543837db69 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sun, 15 Sep 2024 00:38:11 +0300
Subject: [PATCH 034/201] aa

---
 RFC-0000-dataloader-echonomic.md | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index ca0d9a12..2fd640d3 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -57,16 +57,16 @@ The suggested flow is designated to reduce RAM-related bottlenecks and/or requir
## **Proposed Implementation**
### **Definitions**

-| symbol | description |
-|-----------------------|:-----------------------------------------------------------------------------------------------------|
-| index_queue | A queue used to send the item's index and metadata from the main process to an item_worker. There is a separate queue for each item_worker. |
-| item_queue | A queue used to send an item from an item_worker to a batch_worker. There is a separate queue for each batch_worker. |
-| worker_result_queue | A queue used to send prepared batches from batch_workers to the main process. |
-| item_idx | Item serial index from epoch start (0 for first item, 1 for next item, etc) |
-| batch_idx | Batch serial index from epoch start (0 for first batch, 1 for next batch, etc) |
-| item_index | Item's dataset index, as in dataset.__getitem__(index) |
-| iw_idx | Item_worker index (which item_worker is designated to process the item) |
-| bw_idx | Batch_worker index (which batch_worker is designated to process the item) |
+| symbol | description |
+|---------------------|:------------------------------------------------------------------------------------------------------|
+| index_queue | A queue used to send the item's index and metadata from the main process to an item_worker. There is a separate queue for each item_worker. |
+| item_queue | A queue used to send items from item_workers to a batch_worker. There is a separate queue for each batch_worker. |
+| worker_result_queue | A queue used to send prepared batches from batch_workers to the main process. |
+| item_idx | Item serial index from epoch start (0 for first item, 1 for next item, etc) |
+| batch_idx | Batch serial index from epoch start (0 for first batch, 1 for next batch, etc) |
+| item_index | Item's dataset index, as in dataset.__getitem__(index) |
+| iw_idx | Item_worker index (which item_worker is designated to process the item) |
+| bw_idx | Batch_worker index (which batch_worker is designated to process the item) |

By current design, the class _MultiProcessingDataLoaderIter has one level of [num_workers] workers.
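The per-item metadata defined in the table above can be pictured as a small record traveling through index_queue and item_queue; the ItemMsg name below is a hypothetical illustration, not part of the proposal:

```python
from typing import Any, NamedTuple

class ItemMsg(NamedTuple):
    item_idx: int    # serial index of the item within the epoch
    batch_idx: int   # serial index of the batch the item belongs to
    item_index: Any  # dataset index, as passed to dataset.__getitem__
    iw_idx: int      # item_worker designated to load the item
    bw_idx: int      # batch_worker designated to collate the item
```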
From 461077c36f848cc7106fc7da04c607f2798ed632 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sun, 15 Sep 2024 00:39:20 +0300
Subject: [PATCH 035/201] aa

---
 RFC-0000-dataloader-echonomic.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 2fd640d3..e6e3cf69 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -65,8 +65,8 @@ The suggested flow is designated to reduce RAM-related bottlenecks and/or requir
| item_idx | Item serial index from epoch start (0 for first item, 1 for next item, etc) |
| batch_idx | Batch serial index from epoch start (0 for first batch, 1 for next batch, etc) |
| item_index | Item's dataset index, as in dataset.__getitem__(index) |
-| iw_idx | Item_worker index (which item_worker is designated to process the item) |
-| bw_idx | Batch_worker index (which batch_worker is designated to process the item) |
+| iw_idx | Item_worker index
+| bw_idx | Batch_worker index

From 297a2efe715af0260d5af58b764292000bb63498 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sun, 15 Sep 2024 00:39:58 +0300
Subject: [PATCH 036/201] aa

---
 RFC-0000-dataloader-echonomic.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index e6e3cf69..152a8e3a 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -69,7 +69,7 @@ The suggested flow is designated to reduce RAM-related bottlenecks and/or requir
| bw_idx | Batch_worker index

-By current design, the class _MultiProcessingDataLoaderIter has one level of [num_workers] workers.
+By current design, one level of [num_workers] workers is used.
The main process sends [prefetch_factor] batches to each worker, by index_queues.
Each worker prepares the batch, and sends it back to the main process through worker_result_queue.
After a batch is retrieved by the main process, another batch is sent to the appropriate worker.

From a0edc2afec569bf4375acea996db8a8c820ef89c Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sun, 15 Sep 2024 00:40:16 +0300
Subject: [PATCH 037/201] aa

---
 RFC-0000-dataloader-echonomic.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 152a8e3a..5c964547 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -69,7 +69,7 @@ The suggested flow is designated to reduce RAM-related bottlenecks and/or requir
| bw_idx | Batch_worker index

-By current design, one level of [num_workers] workers is used.
+By current design, one level of workers is used.
The main process sends [prefetch_factor] batches to each worker, by index_queues.
Each worker prepares the batch, and sends it back to the main process through worker_result_queue.
After a batch is retrieved by the main process, another batch is sent to the appropriate worker.
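For contrast with the proposal, here is a rough sketch of the current single-level round trip described above, under simplifying assumptions (one queue payload per batch, None as a shutdown sentinel); the real _MultiProcessingDataLoaderIter in torch/utils/data is considerably more involved:

```python
def worker_loop(dataset, index_queue, worker_result_queue, collate_fn):
    # Each job describes a whole batch: (batch_idx, [dataset indices]).
    while True:
        job = index_queue.get()
        if job is None:  # assumed shutdown sentinel
            break
        batch_idx, indices = job
        batch = collate_fn([dataset[i] for i in indices])  # one full batch per worker
        worker_result_queue.put((batch_idx, batch))        # whole batch into shared memory
```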
From 018436bfe31137cbc4dcf4be0d83bb792ba89bf9 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sun, 15 Sep 2024 00:42:36 +0300
Subject: [PATCH 038/201] aa

---
 RFC-0000-dataloader-echonomic.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 5c964547..5578af0a 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -70,8 +70,8 @@ The suggested flow is designated to reduce RAM-related bottlenecks and/or requir

By current design, one level of workers is used.
-The main process sends [prefetch_factor] batches to each worker, by index_queues.
-Each worker prepares the batch, and sends it back to the main process through worker_result_queue.
+The main process sends [prefetch_factor] batches to each worker, by the worker's index_queue.
+Each worker prepares one batch at a time, and sends it back to the main process by worker_result_queue.
After a batch is retrieved by the main process, another batch is sent to the appropriate worker.

A new design for the MultiProcessingDataLoaderIter class is suggested. In the suggested design, there are 2 levels of workers:

From 9ee0122d1c893d4638b27c95e911fd003b1f83eb Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sun, 15 Sep 2024 00:43:32 +0300
Subject: [PATCH 039/201] aa

---
 RFC-0000-dataloader-echonomic.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 5578af0a..f129dd0a 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -74,7 +74,7 @@ The main process sends [prefetch_factor] batches to each worker, by the worker's
Each worker prepares one batch at a time, and sends it back to the main process by worker_result_queue.
After a batch is retrieved by the main process, another batch is sent to the appropriate worker.

-A new design for the MultiProcessingDataLoaderIter class is suggested. In the suggested design, there are 2 levels of workers:
+A new multiprocessing pipeline is suggested. In the suggested pipeline, there are 2 levels of workers:
* item_workers - Designated to generate one item at a time (by running the dataset \_\_getitem__ function), and send it to shared memory
  * This worker is similar to the workers of the current design, but it receives and sends one item at a time (and not one batch at a time)
* batch_workers - Designated to get items from shared memory, collect [batch_size] items, run the collate function, and send the prepared batch back to shared memory

From 62ea09ab84856f932e9c30ef70b1b0234572fce0 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sun, 15 Sep 2024 00:44:29 +0300
Subject: [PATCH 040/201] aa

---
 RFC-0000-dataloader-echonomic.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index f129dd0a..3a36722b 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -69,7 +69,7 @@ The suggested flow is designated to reduce RAM-related bottlenecks and/or requir
| bw_idx | Batch_worker index

-By current design, one level of workers is used.
+By the current multiprocessing pipeline, a single level of workers is used.
The main process sends [prefetch_factor] batches to each worker, by the worker's index_queue.
Each worker prepares one batch at a time, and sends it back to the main process by worker_result_queue.
After a batch is retrieved by the main process, another batch is sent to the appropriate worker.

From 1b6d0fba16a185fbb56521a3452cfa20218d7d42 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sun, 15 Sep 2024 00:45:59 +0300
Subject: [PATCH 041/201] aa

---
 RFC-0000-dataloader-echonomic.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 3a36722b..25a903e7 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -75,7 +75,7 @@ Each worker prepares one batch at a time, and sends it back to the main process b
After a batch is retrieved by the main process, another batch is sent to the appropriate worker.

A new multiprocessing pipeline is suggested. In the suggested pipeline, there are 2 levels of workers:
-* item_workers - Designated to generate one item at a time (by running the dataset \_\_getitem__ function), and send it to shared memory
+* item_workers - Designated to generate one item at a time (by running the dataset \_\_getitem__ function), and send it to shared memory by item_queue
  * This worker is similar to the workers of the current design, but it receives and sends one item at a time (and not one batch at a time)
* batch_workers - Designated to get items from shared memory, collect [batch_size] items, run the collate function, and send the prepared batch back to shared memory

From b6eef2a0b3159d2b74f1c816c5a6978519a8f810 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sun, 15 Sep 2024 00:46:36 +0300
Subject: [PATCH 042/201] aa

---
 RFC-0000-dataloader-echonomic.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 25a903e7..a55e7940 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -77,7 +77,7 @@ After a batch is retrieved by the main process, another batch is sent to the appr
A new multiprocessing pipeline is suggested. In the suggested pipeline, there are 2 levels of workers:
* item_workers - Designated to generate one item at a time (by running the dataset \_\_getitem__ function), and send it to shared memory by item_queue
  * This worker is similar to the workers of the current design, but it receives and sends one item at a time (and not one batch at a time)
-* batch_workers - Designated to get items from shared memory, collect [batch_size] items, run the collate function, and send the prepared batch back to shared memory
+* batch_workers - Designated to get items from shared memory, collect [batch_size] items, run the collate function, and send the prepared batch back to shared memory, by worker_result_queue

From 26103433cfb6dc16fe01f215a916816b54af8279 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sun, 15 Sep 2024 00:47:03 +0300
Subject: [PATCH 043/201] aa

---
 RFC-0000-dataloader-echonomic.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index a55e7940..3a36722b 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -75,9 +75,9 @@ Each worker prepares one batch at a time, and sends it back to the main process b
After a batch is retrieved by the main process, another batch is sent to the appropriate worker.

A new multiprocessing pipeline is suggested. In the suggested pipeline, there are 2 levels of workers:
-* item_workers - Designated to generate one item at a time (by running the dataset \_\_getitem__ function), and send it to shared memory by item_queue
+* item_workers - Designated to generate one item at a time (by running the dataset \_\_getitem__ function), and send it to shared memory
  * This worker is similar to the workers of the current design, but it receives and sends one item at a time (and not one batch at a time)
-* batch_workers - Designated to get items from shared memory, collect [batch_size] items, run the collate function, and send the prepared batch back to shared memory, by worker_result_queue
+* batch_workers - Designated to get items from shared memory, collect [batch_size] items, run the collate function, and send the prepared batch back to shared memory

From bf2af024287373d59796cc7306255f5c6c903b29 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sun, 15 Sep 2024 00:47:42 +0300
Subject: [PATCH 044/201] aa

---
 RFC-0000-dataloader-echonomic.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 3a36722b..88de93d8 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -79,7 +79,7 @@ A new multiprocessing pipeline is suggested. In the suggested pipeline, there are
  * This worker is similar to the workers of the current design, but it receives and sends one item at a time (and not one batch at a time)
* batch_workers - Designated to get items from shared memory, collect [batch_size] items, run the collate function, and send the prepared batch back to shared memory

-By the new design, data will flow in the following order: main_process -> item_workers -> batch_workers -> main_process
+The new design's data flow is according to the following order: main_process -> item_workers -> batch_workers -> main_process

### **main process high-level flow**
* Send one item at a time to item_workers (by index_queues)

From c4f7c68216ad5148b665f168a93edc49189d8a43 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sun, 15 Sep 2024 00:49:29 +0300
Subject: [PATCH 045/201] aa

---
 RFC-0000-dataloader-echonomic.md | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 88de93d8..b8fd41a7 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -79,7 +79,11 @@ A new multiprocessing pipeline is suggested. In the suggested pipeline, there are
  * This worker is similar to the workers of the current design, but it receives and sends one item at a time (and not one batch at a time)
* batch_workers - Designated to get items from shared memory, collect [batch_size] items, run the collate function, and send the prepared batch back to shared memory

-The new design's data flow is according to the following order: main_process -> item_workers -> batch_workers -> main_process
+
+### **dataflow**
+Current design dataflow: main_process -> workers -> main_process
+
+New design dataflow: main_process -> item_workers -> batch_workers -> main_process

### **main process high-level flow**
* Send one item at a time to item_workers (by index_queues)

From d6afbda942caff006fb82fffd70e9a595cb01de4 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sun, 15 Sep 2024 00:49:42 +0300
Subject: [PATCH 046/201] aa

---
 RFC-0000-dataloader-echonomic.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index b8fd41a7..c9f3d7bc 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -83,7 +83,7 @@ A new multiprocessing pipeline is suggested. In the suggested pipeline, there are
### **dataflow**
Current design dataflow: main_process -> workers -> main_process

-New design dataflow: main_process -> item_workers -> batch_workers -> main_process
+Suggested design dataflow: main_process -> item_workers -> batch_workers -> main_process

### **main process high-level flow**
* Send one item at a time to item_workers (by index_queues)

From f367699605f0a1059adf63da68815b1178b620dd Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sun, 15 Sep 2024 00:50:27 +0300
Subject: [PATCH 047/201] aa

---
 RFC-0000-dataloader-echonomic.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index c9f3d7bc..7fadd292 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -85,7 +85,7 @@ Current design dataflow: main_process -> workers -> main_process

Suggested design dataflow: main_process -> item_workers -> batch_workers -> main_process

-### **main process high-level flow**
+### **Suggested main process high-level flow**
* Send one item at a time to item_workers (by index_queues)
* Each item should include the following data: (item_idx, batch_idx, item_index, iw_idx, bw_idx)
* Track the number of items at work ("work-load") for each worker.
@@ -98,12 +98,12 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main * Make sure to reduce work-load for the relevant batch_worker and for each relevant batch_worker when retriving the batch * Once the next required batch is retrived (by , return batch to caller function -### **items_worker main-loop flow** +### **Suggested items_worker main-loop flow** * get item from index_queue * run dataset.\_\_getitem__(item_index) * send item to batch_worker by item_queue[bw_idx] -### **batches_worker main-loop flow** +### **Suggested batches_worker main-loop flow** * get items from item_queue * Once all items of a given batch are recived, run collate_fn and send the prepared batch to worker_result_queue From eb20e6d25f82806d72f028f1cbef19e6cbba369e Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sun, 15 Sep 2024 00:51:13 +0300 Subject: [PATCH 048/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 7fadd292..093040fb 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -113,7 +113,7 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main * There is no reason to use a larger value than prefetch_factor. However, smaller value may be considered by the user, if collate_fn is very fast ## **Metrics ** -The new flow should require significantly less shared memory, while preserving TPT, using similar configurations. \ +The suggested flow should require significantly less shared memory, while preserving TPT, using similar configurations. \ To monitor shared memory usage, type in linux server terminal: \ $ monitor -n0.1 df -h \ and review /dev/shm "used" column. From d7057912d56d5d961a4a9f20b6012313f5885628 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sun, 15 Sep 2024 00:53:58 +0300 Subject: [PATCH 049/201] aa --- RFC-0000-dataloader-echonomic.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 093040fb..5c0f593b 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -35,7 +35,7 @@ nor about implementing the described feature until some time in the future. A new pytorch dataloader multiprocessing pipline is suggested. This pipline is designated to significantly reduce random-access-memory (RAM) usage, without any significant reduction in throughput (TPT). ## **Motivation** -Model input batch may require significant amounts of RAM. For example, in video processing or in 3D graphics applications. +Model input batch may require significant amounts of RAM. For example, in video processing or in 3D graphics applications. By current dataloader multiprocessing pipline design, workers simultaneously prepere batches and send them into shared memory, by a queue. In practice, about [num_workers] batches are simultenously stored in shared memory, nearly after epoch start. @@ -48,7 +48,7 @@ This limitation can produce a bottleneck over training TPT, not allowing to incr Alternatively, in order to increase num_workers, a severs with more RAM can be used, increaseing sever cost. A new dataloader multiprocessing pipeline is suggested. -In this pipline, only up to [prefetch_factor] batches are simultenously processed by all the workers together. 
+In this pipeline, only up to [prefetch_factor] batches are simultaneously processed by all the workers together, and sent into shared memory.
This decoupling from [num_workers] allows increasing [num_workers] without any significant increase in shared memory consumption.
As in the current implementation, workers are not expected to enter idle state, hence no TPT reduction is expected.

From cb85c7bf03b2924a8b801005b8ec08a1adef45ec Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sun, 15 Sep 2024 01:00:39 +0300
Subject: [PATCH 050/201] aa

---
 RFC-0000-dataloader-echonomic.md | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 5c0f593b..62214499 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -50,9 +50,12 @@ Alternatively, in order to increase num_workers, a server with more RAM can be u
A new dataloader multiprocessing pipeline is suggested.
In this pipeline, only up to [prefetch_factor] batches are simultaneously processed by all the workers together, and sent into shared memory.
This decoupling from [num_workers] allows increasing [num_workers] without any significant increase in shared memory consumption.
-As in the current implementation, workers are not expected to enter idle state, hence no TPT reduction is expected.

-The suggested flow is designated to reduce RAM-related bottlenecks and/or requirements, and improve training cost-effectiveness.
+As in the current implementation, item-generating workers are not expected to enter idle state, hence no TPT reduction is expected.
+Additionally, the new flow introduces only minor modifications to the dataloader interface, making the transition almost transparent to the user.
+
+
+
## **Proposed Implementation**
### **Definitions**

From 06779a2af72aa44f0f8696c8a1fee39eed2d5ff2 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sun, 15 Sep 2024 01:01:30 +0300
Subject: [PATCH 051/201] aa

---
 RFC-0000-dataloader-echonomic.md | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 62214499..7b2b6750 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -111,9 +111,7 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main
* Once all items of a given batch are received, run collate_fn and send the prepared batch to worker_result_queue

### **Notes**
-* A new dataloader parameter: num_batch_workers should be introduced
-  * By default, this parameter should be set to prefetch_factor.
-  * There is no reason to use a larger value than prefetch_factor. However, a smaller value may be considered by the user, if collate_fn is very fast
+* A new dataloader parameter: num_batch_workers should be introduced. By default, this parameter should be set to prefetch_factor. There is no reason to use a larger value than prefetch_factor. However, a smaller value may be considered by the user, if collate_fn is very fast

## **Metrics**
The suggested flow should require significantly less shared memory, while preserving TPT, using similar configurations. \
From fb15e735e8af32d6ecf9f03455be07ce4ac6def0 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sun, 15 Sep 2024 01:13:37 +0300
Subject: [PATCH 052/201] aa

---
 RFC-0000-dataloader-echonomic.md | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 7b2b6750..334365dc 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -99,7 +99,7 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main
* Make sure that the sum of the item_workers' work-load is always <= [prefetch_factor] * [batch_size]. Stop sending items when reaching this limit.
* Retrieve and store prepared batches from batch_workers (by worker_result_queue)
* Make sure to reduce the work-load for the relevant batch_worker and for each relevant item_worker when retrieving the batch
-* Once the next required batch is retrieved (by , return batch to caller function
+* Once the next required batch is retrieved, return batch to caller function

### **Suggested items_worker main-loop flow**
* get item from index_queue
@@ -110,8 +110,10 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main
* get items from item_queue
* Once all items of a given batch are received, run collate_fn and send the prepared batch to worker_result_queue

-### **Notes**
-* A new dataloader parameter: num_batch_workers should be introduced. By default, this parameter should be set to prefetch_factor. There is no reason to use a larger value than prefetch_factor. However, a smaller value may be considered by the user, if collate_fn is very fast
+### **New parameters**
+* A new dataloader parameter: num_batch_workers should be introduced. By default, this parameter should be set to prefetch_factor.
+  * There is no reason to use a larger value than prefetch_factor
+  * An increase of the prefetch_factor default value from 2 to 3 may be considered.

## **Metrics**
The suggested flow should require significantly less shared memory, while preserving TPT, using similar configurations. \
@@ -120,10 +122,9 @@ $ watch -n0.1 df -h \
and review the /dev/shm "used" column.

## **Drawbacks**
-* In the suggested implementation, the prefetch_factor becomes more prominent.
-It determines the total number of items sent simultaneously to all workers, and (by default) also determines num_batch_workers.
-Hence, this parameter should be set with more attention. Additionally, a larger default value may be considered (possibly 3 instead of 2).
+* An additional layer of batch_workers is required, somewhat increasing flow complexity.
+* The number of workers required for the same TPT increases by num_batch_workers.

From 304972bede44eb24f2e11acd1495a366fc4a3df1 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sun, 15 Sep 2024 01:42:45 +0300
Subject: [PATCH 053/201] aa

---
 RFC-0000-dataloader-echonomic.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 334365dc..ed5f9d2c 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -113,7 +113,6 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main
### **New parameters**
* A new dataloader parameter: num_batch_workers should be introduced. By default, this parameter should be set to prefetch_factor.
  * There is no reason to use a larger value than prefetch_factor
-  * An increase of the prefetch_factor default value from 2 to 3 may be considered.

## **Metrics**
The suggested flow should require significantly less shared memory, while preserving TPT, using similar configurations. \
@@ -123,7 +122,7 @@ and review the /dev/shm "used" column.

## **Drawbacks**
* An additional layer of batch_workers is required, somewhat increasing flow complexity.
-* The number of workers required for the same TPT increases by num_batch_workers.
+* The number of workers required for the same TPT increases by num_batch_workers (by default: num_batch_workers = prefetch_factor = 2).

From aefc0d970be6724d910acb0dd1945273b8ff902b Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sun, 15 Sep 2024 01:43:36 +0300
Subject: [PATCH 054/201] aa

---
 RFC-0000-dataloader-echonomic.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index ed5f9d2c..9f381bf7 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -127,7 +127,7 @@ and review the /dev/shm "used" column.

## **How we teach this**
* dataloader documentation should be updated to include:
-  * Add a new parameter: num_batch_workers
+  * Add a new parameter: num_batch_workers (which equals prefetch_factor by default)
  * Revise parameter description: prefetch_factor

From 2b5b2c9b4fbdcfb48db216669fb5ccf74759314f Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sun, 15 Sep 2024 01:49:03 +0300
Subject: [PATCH 055/201] aa

---
 RFC-0000-dataloader-echonomic.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 9f381bf7..e7f5ec0d 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -127,7 +127,9 @@ and review the /dev/shm "used" column.

## **How we teach this**
* dataloader documentation should be updated to include:
-  * Add a new parameter: num_batch_workers (which equals prefetch_factor by default)
+  * Add a new parameter: num_batch_workers
+    * Default value should be prefetch_factor
+    * If num_batch_workers > prefetch_factor, a warning should be issued: "There is no benefit in setting num_batch_workers > prefetch_factor, please consider setting it to None. This would set num_batch_workers = prefetch_factor by default"
  * Revise parameter description: prefetch_factor

From 691dffd90e87a3e875f93618ba292ee64ed733b1 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sun, 15 Sep 2024 01:50:44 +0300
Subject: [PATCH 056/201] aa

---
 RFC-0000-dataloader-echonomic.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index e7f5ec0d..37bb9fa7 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -126,10 +126,10 @@ and review the /dev/shm "used" column.

## **How we teach this**
-* dataloader documentation should be updated to include:
+* Dataloader documentation updates:
  * Add a new parameter: num_batch_workers
-    * Default value should be prefetch_factor
-    * If num_batch_workers > prefetch_factor, a warning should be issued: "There is no benefit in setting num_batch_workers > prefetch_factor, please consider setting it to None. This would set num_batch_workers = prefetch_factor by default"
+    * Default value should be num_batch_workers = prefetch_factor
+    * If num_batch_workers > prefetch_factor, a warning should be issued: "There is no benefit in setting num_batch_workers > prefetch_factor, please consider setting it to None. This would set num_batch_workers = prefetch_factor, by default"
  * Revise parameter description: prefetch_factor

From 8b6334061edc799ef3667095e1af25ce51b18ee0 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sun, 15 Sep 2024 01:50:57 +0300
Subject: [PATCH 057/201] aa

---
 RFC-0000-dataloader-echonomic.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 37bb9fa7..353001ae 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -130,7 +130,7 @@ and review the /dev/shm "used" column.
  * Add a new parameter: num_batch_workers
  * Default value should be num_batch_workers = prefetch_factor
  * If num_batch_workers > prefetch_factor, a warning should be issued: "There is no benefit in setting num_batch_workers > prefetch_factor, please consider setting it to None. This would set num_batch_workers = prefetch_factor, by default"
-  * Revise parameter description: prefetch_factor
+  * Adjust parameter description: prefetch_factor

From 244ebfe8dc9185fdd18a7f52114d60568cec7ef9 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sun, 15 Sep 2024 01:52:52 +0300
Subject: [PATCH 058/201] aa

---
 RFC-0000-dataloader-echonomic.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 353001ae..96253bb3 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -45,7 +45,7 @@ The main process operates in parallel to the workers, to extract one batch after
Simultaneously storing about [num_workers] batches in shared memory imposes a limit over [num_workers]:\
[num_workers < servers_total_available_ram_in_bytes / batch_size_in_bytes]\
This limitation can produce a bottleneck over training TPT, not allowing num_workers to be increased, due to the server's RAM limitations.
-Alternatively, in order to increase num_workers, a server with more RAM can be used, increasing server cost.
+Alternatively, in order to increase num_workers, a server with more RAM must be used, increasing server cost.

From a8f2b88877d2471773473ad07755f06adc47d14f Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sun, 15 Sep 2024 01:55:36 +0300
Subject: [PATCH 059/201] aa

---
 RFC-0000-dataloader-echonomic.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 96253bb3..42b06df1 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -80,7 +80,7 @@ After a batch is retrieved by the main process, another batch is sent to the appr
A new multiprocessing pipeline is suggested. In the suggested pipeline, there are 2 levels of workers:
* item_workers - Designated to generate one item at a time (by running the dataset \_\_getitem__ function), and send it to shared memory
  * This worker is similar to the workers of the current design, but it receives and sends one item at a time (and not one batch at a time)
-* batch_workers - Designated to get items from shared memory, collect [batch_size] items, run the collate function, and send the prepared batch back to shared memory
+* batch_workers - Designated to get items from shared memory, collect [batch_size] items, run the collate function, and send the prepared batch back to shared memory, for consumption by the main process

From 6f5ce0bc244e789f31f9595a166b99ca20e09e4e Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sun, 15 Sep 2024 01:57:27 +0300
Subject: [PATCH 060/201] aa

---
 RFC-0000-dataloader-echonomic.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 42b06df1..5d35d010 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -98,7 +98,7 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main
* Select the bw_idx of the batch_worker with the minimal work-load
* Make sure that the sum of the item_workers' work-load is always <= [prefetch_factor] * [batch_size]. Stop sending items when reaching this limit.
* Retrieve and store prepared batches from batch_workers (by worker_result_queue)
-* Make sure to reduce the work-load for the relevant batch_worker and for each relevant item_worker when retrieving the batch
+* Make sure to reduce the work-load counter for the relevant batch_worker and for each relevant item_worker when retrieving the batch
* Once the next required batch is retrieved, return batch to caller function

From 90e8b77774d9be98854223209d7b0185ab14fd40 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sun, 15 Sep 2024 01:58:31 +0300
Subject: [PATCH 061/201] aa

---
 RFC-0000-dataloader-echonomic.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 5d35d010..d36aa398 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -104,7 +104,7 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main
### **Suggested items_worker main-loop flow**
* get item from index_queue
* run dataset.\_\_getitem__(item_index)
-* send item to batch_worker by item_queue[bw_idx]
+* send item to the appropriate batch_worker by item_queue

From de9847bd4d066ea3ba8936cebe873d8d5ddff312 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sun, 15 Sep 2024 02:02:10 +0300
Subject: [PATCH 062/201] aa

---
 RFC-0000-dataloader-echonomic.md | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index d36aa398..a8cb639e 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -60,16 +60,17 @@ Additionally, the new flow introduces only minor modifications to the dataloader
## **Proposed Implementation**
### **Definitions**

-| symbol | description |
-|---------------------|:------------------------------------------------------------------------------------------------------|
-| index_queue | A queue used to send the item's index and metadata from the main process to an item_worker. There is a separate queue for each item_worker. |
-| item_queue | A queue used to send items from item_workers to a batch_worker. There is a separate queue for each batch_worker. |
-| worker_result_queue | A queue used to send prepared batches from batch_workers to the main process. |
-| item_idx | Item serial index from epoch start (0 for first item, 1 for next item, etc) |
-| batch_idx | Batch serial index from epoch start (0 for first batch, 1 for next batch, etc) |
-| item_index | Item's dataset index, as in dataset.__getitem__(index) |
-| iw_idx | Item_worker index
-| bw_idx | Batch_worker index
+| symbol | description |
+|---------------------|:------------------------------------------------------------------------------------------------------|
+| index_queue | A queue used to send the item's index and metadata from the main process to an item_worker. There is a separate queue for each item_worker. |
+| item_queue | A queue used to send items from item_workers to a batch_worker. There is a separate queue for each batch_worker. |
+| worker_result_queue | A queue used to send prepared batches from batch_workers to the main process. |
+| item_idx | Item serial index from epoch start (0 for first item, 1 for next item, etc) |
+| batch_idx | Batch serial index from epoch start (0 for first batch, 1 for next batch, etc) |
+| item_index | Item's dataset index, as in dataset.__getitem__(index) |
+| iw_idx | Item_worker index
+| bw_idx | Batch_worker index
+| batch_size | Batch size (may be smaller for the last batch in an epoch) |

@@ -107,7 +108,7 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main
### **Suggested batches_worker main-loop flow**
-* get items from item_queue
+* get one item at a time from item_queue and append it into batches, by the item's batch_idx
* Once all items of a given batch are received, run collate_fn and send the prepared batch to worker_result_queue

From fe96da42e5defd15c58bf3929a48b9e2833c93c8 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sun, 15 Sep 2024 23:58:14 +0300
Subject: [PATCH 063/201] aa

---
 RFC-0000-dataloader-echonomic.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index a8cb639e..f3e61cb9 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -45,7 +45,7 @@ The main process operates in parallel to the workers, to extract one batch after
Simultaneously storing about [num_workers] batches in shared memory imposes a limit over [num_workers]:\
[num_workers < servers_total_available_ram_in_bytes / batch_size_in_bytes]\
This limitation can produce a bottleneck over training TPT, not allowing num_workers to be increased, due to the server's RAM limitations.
-Alternatively, in order to increase num_workers, a server with more RAM must be used, increasing server cost.
+Alternatively, in order to increase num_workers, a server with more RAM is required, increasing server cost.

A new dataloader multiprocessing pipeline is suggested.
In this pipeline, only up to [prefetch_factor] batches are simultaneously processed by all the workers together, and sent into shared memory.
From e0ef89dd8b58c19b5855738f7342a537b766e0b3 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Mon, 16 Sep 2024 00:00:20 +0300 Subject: [PATCH 064/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index f3e61cb9..bf74859b 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -51,7 +51,7 @@ A new dataloader multiprocessing pipeline is suggested. In this pipline, only up to [prefetch_factor] batches are simultenously processed by all the workers together, and sent into shared memory. This decoupling from [num_workers], allowes to increase [num_workers], without any significant increase in shared memory consumption. -As in current implemnation, items generating workers are not expected to enter idle state, hence no TPT reduction is expected. +As in current implemnation, the workers continuesly generate items during epoch, and are not expected to enter idle state. Hence no TPT reduction is expected. Additionally, the new flow is introducing only minor modifications in dataloader interface, making the transition almost transparent to the user. From 8ba4f8715ec5e99ba8699eaf3e24efc15f6c6be7 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Mon, 16 Sep 2024 00:01:38 +0300 Subject: [PATCH 065/201] aa --- RFC-0000-dataloader-echonomic.md | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index bf74859b..d9d39dc2 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -60,17 +60,17 @@ Additionally, the new flow is introducing only minor modifications in dataloader ## **Proposed Implementation** ### **Definitions** -| symbol | description | -|---------------------|:--------------------------------------------------------------------------------------------------------------------------------| -| index_queue | A queue used to send item's index and metadata from main process to item_worker. There is a seperate queue to each item_worker. | -| item_queue | A queue used to send item from item_workers to batch_worker. There is a seperate queue to each batch_worker. | -| worker_result_queue | A queue used to send prepared batches from batch_workers to main process. | -| item_idx | Item serial index from epoch start (0 for first item, 1 for next item, etc) | -| batch_idx | Batch serial index from epoch start (0 for first batch, 1 for next batch, etc) | -| item_index | Item's dataset index, as in dataset.__getitem__(index) | -| iw_idx | Item_worker index -| bw_idx | Batch_worker index -| batch_size | batch size (may be smaller for last batch in epoch) | +| symbol | description | +|---------------------|:-------------------------------------------------------------------------------------------------------------------------| +| index_queue | Queue to send item's index and metadata from main process to item_worker. There is a seperate queue to each item_worker. | +| item_queue | Queue to send item from item_workers to batch_worker. There is a seperate queue to each batch_worker. | +| worker_result_queue | Queue to send prepared batches from batch_workers to main process. 
+| item_idx | Item serial index from epoch start (0 for first item, 1 for next item, etc) |
+| batch_idx | Batch serial index from epoch start (0 for first batch, 1 for next batch, etc) |
+| item_index | Item's dataset index, as in dataset.__getitem__(index) |
+| iw_idx | Item_worker index
+| bw_idx | Batch_worker index
+| batch_size | Batch size (may be smaller for the last batch in an epoch) |

From 7725db30b8d05ee3a8703c65ad4086deb66f5a47 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Mon, 16 Sep 2024 00:02:43 +0300
Subject: [PATCH 066/201] aa

---
 RFC-0000-dataloader-echonomic.md | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index d9d39dc2..81dd08bc 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -60,17 +60,17 @@ Additionally, the new flow introduces only minor modifications to the dataloader
## **Proposed Implementation**
### **Definitions**

-| symbol | description |
-|---------------------|:---------------------------------------------------------------------------------------------------|
-| index_queue | Queue to send the item's index and metadata from the main process to an item_worker. There is a separate queue for each item_worker. |
-| item_queue | Queue to send items from item_workers to a batch_worker. There is a separate queue for each batch_worker. |
-| worker_result_queue | Queue to send prepared batches from batch_workers to the main process. |
-| item_idx | Item serial index from epoch start (0 for first item, 1 for next item, etc) |
-| batch_idx | Batch serial index from epoch start (0 for first batch, 1 for next batch, etc) |
-| item_index | Item's dataset index, as in dataset.__getitem__(index) |
-| iw_idx | Item_worker index
-| bw_idx | Batch_worker index
-| batch_size | Batch size (may be smaller for the last batch in an epoch) |
+| symbol | description |
+|---------------------|:-----------------------------------------------------------------------------------------------------|
+| index_queue | Queue to send items' indices and metadata from the main process to an item_worker. There is a separate queue for each item_worker. |
+| item_queue | Queue to send items from item_workers to a batch_worker. There is a separate queue for each batch_worker. |
+| worker_result_queue | Queue to send prepared batches from batch_workers to the main process. |
+| item_idx | Item serial index from epoch start (0 for first item, 1 for next item, etc) |
+| batch_idx | Batch serial index from epoch start (0 for first batch, 1 for next batch, etc) |
+| item_index | Item's dataset index, as in dataset.__getitem__(index) |
+| iw_idx | Item_worker index
+| bw_idx | Batch_worker index
+| batch_size | Batch size (may be smaller for the last batch in an epoch) |

By the current multiprocessing pipeline, a single level of workers is used.
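Putting the two worker levels together, the following is a condensed sketch of the proposed loops under simplifying assumptions (fixed batch_size, None as a shutdown sentinel; error handling and pin_memory are omitted); it is an illustration of the described flow, not a definitive implementation:

```python
def item_worker_loop(dataset, index_queue, item_queues):
    while True:
        msg = index_queue.get()
        if msg is None:
            break
        item_idx, batch_idx, item_index, bw_idx = msg
        item = dataset[item_index]                  # one item at a time via __getitem__
        item_queues[bw_idx].put((batch_idx, item))  # hand off to the designated batch_worker

def batch_worker_loop(item_queue, worker_result_queue, collate_fn, batch_size):
    pending = {}  # batch_idx -> items received so far
    while True:
        msg = item_queue.get()
        if msg is None:
            break
        batch_idx, item = msg
        pending.setdefault(batch_idx, []).append(item)
        if len(pending[batch_idx]) == batch_size:   # batch complete: collate and ship
            worker_result_queue.put((batch_idx, collate_fn(pending.pop(batch_idx))))
```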
From f73bcd919cceab95c58e5d97fe142ce220ee431d Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Mon, 16 Sep 2024 00:03:19 +0300 Subject: [PATCH 067/201] aa --- RFC-0000-dataloader-echonomic.md | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 81dd08bc..4e68c764 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -60,17 +60,17 @@ Additionally, the new flow is introducing only minor modifications in dataloader ## **Proposed Implementation** ### **Definitions** -| symbol | description | -|---------------------|:--------------------------------------------------------------------------------------------------------------------------| -| index_queue | Queue to send items indices and metadata from main process to item_worker. There is a seperate queue to each item_worker. | -| item_queue | Queue to send items from item_workers to batch_worker. There is a seperate queue to each batch_worker. | -| worker_result_queue | Queue to send prepared batches from batch_workers to main process. | -| item_idx | Item serial index from epoch start (0 for first item, 1 for next item, etc) | -| batch_idx | Batch serial index from epoch start (0 for first batch, 1 for next batch, etc) | -| item_index | Item's dataset index, as in dataset.__getitem__(index) | -| iw_idx | Item_worker index -| bw_idx | Batch_worker index -| batch_size | batch size (may be smaller for last batch in epoch) | +| symbol | description | +|---------------------|:----------------------------------------------------------------------------------------------------------------------------| +| index_queue | A queue to send items indices and metadata from main process to item_worker. There is a seperate queue to each item_worker. | +| item_queue | A queue to send items from item_workers to batch_worker. There is a seperate queue to each batch_worker. | +| worker_result_queue | A queue to send prepared batches from batch_workers to main process. | +| item_idx | Item serial index from epoch start (0 for first item, 1 for next item, etc) | +| batch_idx | Batch serial index from epoch start (0 for first batch, 1 for next batch, etc) | +| item_index | Item's dataset index, as in dataset.__getitem__(index) | +| iw_idx | Item_worker index +| bw_idx | Batch_worker index +| batch_size | batch size (may be smaller for last batch in epoch) | By the current multiprocessing pipeline, a single level of workers is used. From e2a97317678655e4428f4e3f8362935661d2d97f Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Mon, 16 Sep 2024 00:04:02 +0300 Subject: [PATCH 068/201] aa --- RFC-0000-dataloader-echonomic.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 4e68c764..68739cdb 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -65,8 +65,8 @@ Additionally, the new flow is introducing only minor modifications in dataloader | index_queue | A queue to send items indices and metadata from main process to item_worker. There is a seperate queue to each item_worker. | | item_queue | A queue to send items from item_workers to batch_worker. There is a seperate queue to each batch_worker. | | worker_result_queue | A queue to send prepared batches from batch_workers to main process. 
| -| item_idx | Item serial index from epoch start (0 for first item, 1 for next item, etc) | -| batch_idx | Batch serial index from epoch start (0 for first batch, 1 for next batch, etc) | +| item_idx | Item serial index in epoch (0 for first item, 1 for next item, etc) | +| batch_idx | Batch serial index in epoch (0 for first batch, 1 for next batch, etc) | | item_index | Item's dataset index, as in dataset.__getitem__(index) | | iw_idx | Item_worker index | bw_idx | Batch_worker index From bdf1e0a7acbd864b7689361ced0350545a4b4018 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Mon, 16 Sep 2024 00:05:01 +0300 Subject: [PATCH 069/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 68739cdb..2e1ca137 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -74,7 +74,7 @@ Additionally, the new flow is introducing only minor modifications in dataloader By the current multiprocessing pipeline, a single level of workers is used. -The main process sends [prefetch_factor] batches to each worker, by the worker's index_queue. +The main process sends [prefetch_factor] batches to each worker. Each worker prepares one batch at a time, and send it back to the main process by worker_result_queue. After a batch is retrived by the main process, another batch is sent to the appropriate worker. From 156f3e8d18c9626718741739d531efa2832af654 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Mon, 16 Sep 2024 00:05:56 +0300 Subject: [PATCH 070/201] aa --- RFC-0000-dataloader-echonomic.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 2e1ca137..cc47469e 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -79,9 +79,9 @@ Each worker prepares one batch at a time, and send it back to the main process b After a batch is retrived by the main process, another batch is sent to the appropriate worker. A new multiprocessing pipline is suggested. In the suggested pipeine, there are 2 levels of workers: -* item_workers - Designated to generate one item at a time (by running dataset \_\_getitem__ function), and send it to shared memory +* item_workers - designated to generate one item at a time (by running dataset \_\_getitem__ function), and send it to shared memory * This worker is similar to the workers of the current design, but it recieves and sends one item at a time (and not one batch at a time) -* batch_workers - Designated to get items from shared memory, collect [batch_size] items, run collate function, and send the prepared batch back to shared memory, for consumption by the main process +* batch_workers - designated to get items from shared memory, collect [batch_size] items, run collate function, and send the prepared batch back to shared memory, for consumption by the main process ### **dataflow** From 34c76c266bcfd7ab2999d76d340f3d2c3c35bb5c Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Mon, 16 Sep 2024 00:06:50 +0300 Subject: [PATCH 071/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index cc47469e..62ca31f7 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -84,7 +84,7 @@ A new multiprocessing pipline is suggested. 
In the suggested pipeine, there are * batch_workers - designated to get items from shared memory, collect [batch_size] items, run collate function, and send the prepared batch back to shared memory, for consumption by the main process -### **dataflow** +### **Data flow** Current design dataflow: main_process -> workers -> main_process Suggested design dataflow: main_process -> item_workers -> batch_workers -> main_process From 8272220ac1782ad75596932df93ee1d62bb867f3 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Mon, 16 Sep 2024 00:08:58 +0300 Subject: [PATCH 072/201] aa --- RFC-0000-dataloader-echonomic.md | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 62ca31f7..3ce8e9e6 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -72,6 +72,7 @@ Additionally, the new flow is introducing only minor modifications in dataloader | bw_idx | Batch_worker index | batch_size | batch size (may be smaller for last batch in epoch) | +### **High level** By the current multiprocessing pipeline, a single level of workers is used. The main process sends [prefetch_factor] batches to each worker. @@ -83,13 +84,11 @@ A new multiprocessing pipline is suggested. In the suggested pipeine, there are * This worker is similar to the workers of the current design, but it recieves and sends one item at a time (and not one batch at a time) * batch_workers - designated to get items from shared memory, collect [batch_size] items, run collate function, and send the prepared batch back to shared memory, for consumption by the main process - -### **Data flow** Current design dataflow: main_process -> workers -> main_process Suggested design dataflow: main_process -> item_workers -> batch_workers -> main_process -### **Suggested main process high-level flow** +### **Main process flow** * Send one item at a time to item_workers (by index_queues) * Each item should include the following data: (item_idx, batch_idx, item_index, iw_idx, bw_idx): * Track number of items at work ("work-load") for each worker. 
@@ -102,12 +101,12 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main * Make sure to reduce work-load counter for the relevant batch_worker and for each relevant batch_worker when retriving the batch * Once the next required batch is retrived, return batch to caller function -### **Suggested items_worker main-loop flow** +### **items_worker flow** * get item from index_queue * run dataset.\_\_getitem__(item_index) * send item to the appropriate batch_worker by item_queue -### **Suggested batches_worker main-loop flow** +### **batches_worker flow** * get one item at a time from item_queue and append them into batches, by item batch_idx * Once all items of a given batch are recived, run collate_fn and send the prepared batch to worker_result_queue From ccfc64c5bc554a00df535ce1c6bc78f35c45dc07 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Mon, 16 Sep 2024 00:25:25 +0300 Subject: [PATCH 073/201] aa --- RFC-0000-dataloader-echonomic.md | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 3ce8e9e6..cdd10c7f 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -88,25 +88,25 @@ Current design dataflow: main_process -> workers -> main_process Suggested design dataflow: main_process -> item_workers -> batch_workers -> main_process -### **Main process flow** -* Send one item at a time to item_workers (by index_queues) - * Each item should include the following data: (item_idx, batch_idx, item_index, iw_idx, bw_idx): - * Track number of items at work ("work-load") for each worker. - * A different iw_idx should be assigned to each item - * Select iw_idx of the items_worker with the minimal work-load - * An identical bw_idx should be assigned to all items in the same batch - * Select bw_idx of the batches_worker with the minimal work-load - * Make sure that the sum of item_workers work-load is always <= [prefetch_factor] * [batch_size]. Stop sending items when reaching this limit. +### **Main process loop description** * Retrive and store prepared batches from batch_workers (by worker_result_queue) - * Make sure to reduce work-load counter for the relevant batch_worker and for each relevant batch_worker when retriving the batch + * Track number of items at work ("work-load") by each worker. Make sure to reduce work-load counter for the relevant batch_worker, and for each of the relevant item-workers, when retriving the batch * Once the next required batch is retrived, return batch to caller function +* Send batches of items to item_workers, one batch at a time + * A possibly different iw_idx should be assigned to each item + * Select iw_idx of the items_worker with the minimal work-load + * An identical bw_idx should be assigned to all items in the same batch + * Select bw_idx of the batches_worker with the minimal work-load + * Make sure that the sum of item_workers work-load is always <= [prefetch_factor] * [batch_size]. Stop sending batches when reaching this limit. 
* Make sure to increase work-load counter for the relevant batch_worker, and for each of the relevant item-workers, when sending the batch of items
+    * Each item should include the following data: (item_idx, batch_idx, item_index, iw_idx, bw_idx):
+* Once the next required batch is retrived, return batch to caller function
 
-### **items_worker flow**
+### **items_worker loop description**
 * get item from index_queue
 * run dataset.\_\_getitem__(item_index)
 * send item to the appropriate batch_worker by item_queue
 
-### **batches_worker flow**
+### **batches_worker loop description**
 * get one item at a time from item_queue and append them into batches, by item batch_idx
 * Once all items of a given batch are recived, run collate_fn and send the prepared batch to worker_result_queue

From 10b663a6955b2d0c5c9b85c441e5ea9a18afafe4 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Mon, 16 Sep 2024 00:30:23 +0300
Subject: [PATCH 074/201] aa

---
 RFC-0000-dataloader-echonomic.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index cdd10c7f..ee0b07f9 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -111,8 +111,10 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main
 * Once all items of a given batch are recived, run collate_fn and send the prepared batch to worker_result_queue
 
 ### **New parameters**
-* A new dataloader parameter: num_batch_workers should be introduced. By default, this parameter should be set to prefetch_factor.
+* A new dataloader parameter: num_batch_workers should be introduced
+   * Default value should be num_batch_workers = prefetch_factor = 2
 * There is no reason to use a larger value than prefetch_factor
+   * If num_batch_workers > prefetch_factor, a warning should be issued: "There is no benefit in setting num_batch_workers > prefetch_factor, please consider setting it to None. This would set num_batch_workers = prefetch_factor, by default"
 
 ## **Metrics**
 The suggested flow should require significantly less shared memory, while preserving TPT, using similar configurations. \
 To monitor shared memory usage, type in linux server terminal: \
 $ watch -n0.1 df -h \
 and review /dev/shm "used" column.
@@ -128,8 +130,6 @@ and review /dev/shm "used" column.
 ## **How we teach this**
 * Dataloader documentation updates:
   * Add a new parameter: num_batch_workers
-    * Default value should be num_batch_workers = prefetch_factor
-    * If num_batch_workers > prefetch_factor, a warining should be issued: "There is no benefit in setting num_batch_workers > prefetch_factor, please consider setting it to None. This would set num_batch_workers = prefetch_factor, by default"
   * Adjust parameter description: prefetch_factor
 
 ## Resolution

From 0ca3801315431425aae931f4fba2aa267b82baff Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Tue, 17 Sep 2024 21:32:36 +0300
Subject: [PATCH 075/201] aa

---
 RFC-0000-dataloader-echonomic.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index ee0b07f9..1842f992 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -54,6 +54,9 @@ This decoupling from [num_workers], allowes to increase [num_workers], without a
 As in current implemnation, the workers continuesly generate items during epoch, and are not expected to enter idle state. Hence no TPT reduction is expected.
 Additionally, the new flow is introducing only minor modifications in dataloader interface, making the transition almost transparent to the user. 
+Another smaller advantage is that in the proposed implementation, the first batch in each epoch is generated by multiple workers, while in current implementation it is generated by a single worker. +Hence, epoch can potentially start faster, using the suggested implementation + From ead9f2f6e54528ce88926d54c4c9474f2709ae9a Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Wed, 18 Sep 2024 18:27:44 +0300 Subject: [PATCH 076/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 1842f992..8a311049 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -43,7 +43,7 @@ At most, [num_workers * prefetch_factor] may be simultenously stored in shared m The main process operates in parallel to the workers, to extract one batch after another, from shared memory, and inject it into the model for training/validation/test. Simultenously storing about [num_workers] batches in shared memory, imposes a limit over [num_workers]:\ -[num_workers < servers_total_available_ram_in_bytes / batch_size_in_bytes]\ +[num_workers < total_available_ram_in_bytes / batch_size_in_bytes]\ This limitation can produce a bottleneck over training TPT, not allowing to increase num_workers, due to server's RAM limitations. Alternatively, in order to increase num_workers, a severs with more RAM is required, increaseing sever cost. From e7e14673076f25333fc47f9254ecbcd8831e274d Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Wed, 18 Sep 2024 18:29:37 +0300 Subject: [PATCH 077/201] aa --- RFC-0000-dataloader-echonomic.md | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 8a311049..1a91849d 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -51,12 +51,10 @@ A new dataloader multiprocessing pipeline is suggested. In this pipline, only up to [prefetch_factor] batches are simultenously processed by all the workers together, and sent into shared memory. This decoupling from [num_workers], allowes to increase [num_workers], without any significant increase in shared memory consumption. -As in current implemnation, the workers continuesly generate items during epoch, and are not expected to enter idle state. Hence no TPT reduction is expected. -Additionally, the new flow is introducing only minor modifications in dataloader interface, making the transition almost transparent to the user. - -Another smaller advantage is that in the proposed implementation, the first batch in each epoch is generated by multiple workers, while in current implementation it is generated by a single worker. +As in current implemnation, the workers continuesly generate items during epoch, and are not expected to enter idle state. Hence no TPT reduction is expected. Another smaller advantage is that in the proposed implementation, the first batch in each epoch is generated by multiple workers, while in current implementation it is generated by a single worker. Hence, epoch can potentially start faster, using the suggested implementation +Additionally, the new flow is introducing only minor modifications in dataloader interface, making the transition almost transparent to the user. 
From 9e50057fa6a31069de413107ec854999ae9a2bbc Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Wed, 18 Sep 2024 18:30:45 +0300 Subject: [PATCH 078/201] aa --- RFC-0000-dataloader-echonomic.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 1a91849d..b616240d 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -51,10 +51,12 @@ A new dataloader multiprocessing pipeline is suggested. In this pipline, only up to [prefetch_factor] batches are simultenously processed by all the workers together, and sent into shared memory. This decoupling from [num_workers], allowes to increase [num_workers], without any significant increase in shared memory consumption. -As in current implemnation, the workers continuesly generate items during epoch, and are not expected to enter idle state. Hence no TPT reduction is expected. Another smaller advantage is that in the proposed implementation, the first batch in each epoch is generated by multiple workers, while in current implementation it is generated by a single worker. +As in current implemnation, the workers continuesly generate items during epoch, and are not expected to enter idle state. Hence no TPT reduction is expected. + +Another smaller advantage is that in the proposed implementation, the first batch in each epoch is generated by multiple workers, while in current implementation it is generated by a single worker. Hence, epoch can potentially start faster, using the suggested implementation -Additionally, the new flow is introducing only minor modifications in dataloader interface, making the transition almost transparent to the user. +The new flow is introducing only minor modifications in dataloader interface, making the transition almost transparent to the user. From 11f1420e14ffd97f2b76b73d2bc20459a314e3c4 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Wed, 18 Sep 2024 18:32:13 +0300 Subject: [PATCH 079/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index b616240d..32d2a3f4 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -54,7 +54,7 @@ This decoupling from [num_workers], allowes to increase [num_workers], without a As in current implemnation, the workers continuesly generate items during epoch, and are not expected to enter idle state. Hence no TPT reduction is expected. Another smaller advantage is that in the proposed implementation, the first batch in each epoch is generated by multiple workers, while in current implementation it is generated by a single worker. -Hence, epoch can potentially start faster, using the suggested implementation +Hence, using the suggested implementation, epoch can potentially start faster. The new flow is introducing only minor modifications in dataloader interface, making the transition almost transparent to the user. 
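The "Main process loop description" being refined in these patches boils down to two bookkeeping rules: pick the least-loaded worker for each assignment, and never exceed [prefetch_factor] * [batch_size] in-flight items. A rough sketch under those rules (every name here is invented for illustration; the matching decrements happen when a prepared batch is retrieved from worker_result_queue):

```python
def pick_least_loaded(work_load):
    # work_load[i] is the number of in-flight items assigned to worker i.
    return min(range(len(work_load)), key=lambda i: work_load[i])


def try_dispatch_batch(batch_indices, batch_idx, index_queues,
                       iw_load, bw_load, prefetch_factor, batch_size):
    # Global cap: at most prefetch_factor * batch_size items in flight
    # across all item_workers; the caller retries after a batch returns.
    if sum(iw_load) + len(batch_indices) > prefetch_factor * batch_size:
        return False
    bw_idx = pick_least_loaded(bw_load)      # one batch_worker per batch
    bw_load[bw_idx] += len(batch_indices)
    for item_idx_in_batch, item_index in enumerate(batch_indices):
        iw_idx = pick_least_loaded(iw_load)  # item_workers chosen per item
        iw_load[iw_idx] += 1
        index_queues[iw_idx].put(
            (item_idx_in_batch, batch_idx, item_index, bw_idx, len(batch_indices)))
    return True
```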
From 6fa4a578ce28c82184871ff3607a1b4f2696e915 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Wed, 18 Sep 2024 18:55:51 +0300 Subject: [PATCH 080/201] aa --- RFC-0000-dataloader-echonomic.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 32d2a3f4..4d1d663c 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -71,8 +71,8 @@ The new flow is introducing only minor modifications in dataloader interface, ma | item_idx | Item serial index in epoch (0 for first item, 1 for next item, etc) | | batch_idx | Batch serial index in epoch (0 for first batch, 1 for next batch, etc) | | item_index | Item's dataset index, as in dataset.__getitem__(index) | -| iw_idx | Item_worker index -| bw_idx | Batch_worker index +| iw_idx | Item_worker index {0, 1, ..., num_workers - 1} +| bw_idx | Batch_worker index {0, 1, ..., num_batch_workers - 1} | batch_size | batch size (may be smaller for last batch in epoch) | ### **High level** From 35bec79ecf59b1c052be8c93b007797ffe424822 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Wed, 18 Sep 2024 18:56:45 +0300 Subject: [PATCH 081/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 4d1d663c..ef2a5277 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -79,7 +79,7 @@ The new flow is introducing only minor modifications in dataloader interface, ma By the current multiprocessing pipeline, a single level of workers is used. The main process sends [prefetch_factor] batches to each worker. -Each worker prepares one batch at a time, and send it back to the main process by worker_result_queue. +Each worker prepares one batch at a time, and sends it back to the main process by worker_result_queue. After a batch is retrived by the main process, another batch is sent to the appropriate worker. A new multiprocessing pipline is suggested. In the suggested pipeine, there are 2 levels of workers: From 9b1a567711589f17f2cb1ecb9a5d1141e060781b Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Wed, 18 Sep 2024 19:01:48 +0300 Subject: [PATCH 082/201] aa --- RFC-0000-dataloader-echonomic.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index ef2a5277..c73e32ed 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -93,8 +93,7 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main ### **Main process loop description** * Retrive and store prepared batches from batch_workers (by worker_result_queue) - * Track number of items at work ("work-load") by each worker. Make sure to reduce work-load counter for the relevant batch_worker, and for each of the relevant item-workers, when retriving the batch -* Once the next required batch is retrived, return batch to caller function + * Track number of items at work ("work-load") by each worker. 
Make sure to reduce work-load counter for the relevant batch_worker, and for each of the relevant item-workers, when retriving the batch * Send batches of items to item_workers, one batch at a time * A possibly different iw_idx should be assigned to each item * Select iw_idx of the items_worker with the minimal work-load @@ -103,6 +102,7 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main * Make sure that the sum of item_workers work-load is always <= [prefetch_factor] * [batch_size]. Stop sending batches when reaching this limit. * Make sure to increase work-load counter for the relevant batch_worker, and for each of the relevant item-workers, when sending the batch of items * Each item should include the following data: (item_idx, batch_idx, item_index, iw_idx, bw_idx): +* Once the next required batch is retrived, return batch to caller function ### **items_worker loop description** * get item from index_queue From 9c55d0ca1d18cdd9b9a34aedb1614a2534570cb0 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Wed, 18 Sep 2024 19:04:31 +0300 Subject: [PATCH 083/201] aa --- RFC-0000-dataloader-echonomic.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index c73e32ed..927f133b 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -101,7 +101,7 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main * Select bw_idx of the batches_worker with the minimal work-load * Make sure that the sum of item_workers work-load is always <= [prefetch_factor] * [batch_size]. Stop sending batches when reaching this limit. * Make sure to increase work-load counter for the relevant batch_worker, and for each of the relevant item-workers, when sending the batch of items - * Each item should include the following data: (item_idx, batch_idx, item_index, iw_idx, bw_idx): + * Each item should include the following data: (item_idx, batch_idx, item_index, iw_idx, bw_idx, batch_size): * Once the next required batch is retrived, return batch to caller function ### **items_worker loop description** @@ -110,7 +110,7 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main * send item to the appropriate batch_worker by item_queue ### **batches_worker loop description** -* get one item at a time from item_queue and append them into batches, by item batch_idx +* get one item at a time from item_queue and append them into batches, by item batch_idx (and batch_size) * Once all items of a given batch are recived, run collate_fn and send the prepared batch to worker_result_queue ### **New parameters** From 82a463c2362f498a793c554592ef2a7a444a50d7 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 11:52:20 +0300 Subject: [PATCH 084/201] aa --- RFC-0000-dataloader-echonomic.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 927f133b..0ff57031 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -69,10 +69,11 @@ The new flow is introducing only minor modifications in dataloader interface, ma | item_queue | A queue to send items from item_workers to batch_worker. There is a seperate queue to each batch_worker. | | worker_result_queue | A queue to send prepared batches from batch_workers to main process. 
| | item_idx | Item serial index in epoch (0 for first item, 1 for next item, etc) | -| batch_idx | Batch serial index in epoch (0 for first batch, 1 for next batch, etc) | +| item_idx_in_batch | Item serial index in batch | +| batch_idx | Batch serial index in epoch (0 for first batch, 1 for next batch, etc) | | item_index | Item's dataset index, as in dataset.__getitem__(index) | -| iw_idx | Item_worker index {0, 1, ..., num_workers - 1} -| bw_idx | Batch_worker index {0, 1, ..., num_batch_workers - 1} +| iw_idx | Item_worker index {0, 1, ..., num_workers - 1} +| bw_idx | Batch_worker index {0, 1, ..., num_batch_workers - 1} | batch_size | batch size (may be smaller for last batch in epoch) | ### **High level** @@ -101,7 +102,7 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main * Select bw_idx of the batches_worker with the minimal work-load * Make sure that the sum of item_workers work-load is always <= [prefetch_factor] * [batch_size]. Stop sending batches when reaching this limit. * Make sure to increase work-load counter for the relevant batch_worker, and for each of the relevant item-workers, when sending the batch of items - * Each item should include the following data: (item_idx, batch_idx, item_index, iw_idx, bw_idx, batch_size): + * Each item should include the following data: (item_idx_in_batch, batch_idx, item_index, iw_idx, bw_idx, batch_size): * Once the next required batch is retrived, return batch to caller function ### **items_worker loop description** From 45855ca6be02b94bc3129a0e9cd90c9f63d0a6ab Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 12:06:00 +0300 Subject: [PATCH 085/201] aa --- RFC-0000-dataloader-echonomic.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 0ff57031..40ea9b69 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -32,7 +32,9 @@ nor about implementing the described feature until some time in the future. * @yoadbs ## **Summary** -A new pytorch dataloader multiprocessing pipline is suggested. This pipline is designated to significantly reduce random-access-memory (RAM) usage, without any significant reduction in throughput (TPT). +A new PyTorch dataloader multiprocessing pipline is suggested. This pipline splits the batch generation, into 2 types of workers:\ +item generating workers (by dataset.__getitem__ function), and batch generating workers (by collate_fn). +This pipline is designated to significantly reduce random-access-memory (RAM) usage, without any significant reduction in throughput (TPT). ## **Motivation** Model input batch may require significant amounts of RAM. For example, in video processing or in 3D graphics applications. @@ -84,7 +86,7 @@ Each worker prepares one batch at a time, and sends it back to the main process After a batch is retrived by the main process, another batch is sent to the appropriate worker. A new multiprocessing pipline is suggested. 
In the suggested pipeine, there are 2 levels of workers: -* item_workers - designated to generate one item at a time (by running dataset \_\_getitem__ function), and send it to shared memory +* item_workers - designated to generate one item at a time (by running dataset.\_\_getitem__ function), and send it to shared memory * This worker is similar to the workers of the current design, but it recieves and sends one item at a time (and not one batch at a time) * batch_workers - designated to get items from shared memory, collect [batch_size] items, run collate function, and send the prepared batch back to shared memory, for consumption by the main process From 5447dedd34c96855b61f5a94697185721b0e8a31 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 13:18:19 +0300 Subject: [PATCH 086/201] aa --- RFC-0000-dataloader-echonomic.md | 19 +++++++++++++++---- 1 file changed, 15 insertions(+), 4 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 40ea9b69..3f24ff70 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -49,10 +49,10 @@ Simultenously storing about [num_workers] batches in shared memory, imposes a li This limitation can produce a bottleneck over training TPT, not allowing to increase num_workers, due to server's RAM limitations. Alternatively, in order to increase num_workers, a severs with more RAM is required, increaseing sever cost. -A new dataloader multiprocessing pipeline is suggested. -In this pipline, only up to [prefetch_factor] batches are simultenously processed by all the workers together, and sent into shared memory. -This decoupling from [num_workers], allowes to increase [num_workers], without any significant increase in shared memory consumption. - +A new dataloader multiprocessing pipeline is suggested. In this pipline, there are two types of workers: +item generating workers (by dataset.__getitem__ function), and batch generating workers (by collate_fn). +This design allows to simultenously process only up to [prefetch_factor] batches by all the workers together. +This decoupling from [num_workers], allowes to increase [num_workers], without any significant increase in shared memory consumption. As in current implemnation, the workers continuesly generate items during epoch, and are not expected to enter idle state. Hence no TPT reduction is expected. Another smaller advantage is that in the proposed implementation, the first batch in each epoch is generated by multiple workers, while in current implementation it is generated by a single worker. 
@@ -63,6 +63,17 @@ The new flow is introducing only minor modifications in dataloader interface, ma ## **Proposed Implementation** +The following dataloader input parameters were modified / added: + +| name | description | +|----------------------------|----------------------------------------------------------------------------| +| num_workers (modified) | number of item workers | +| prefetch_factor (modified) | number of batches sent for processing by all workers (2 by default) | +| num_workers_batches (new) | number of batch workers (default is prefetch_factor) | + + + + ### **Definitions** | symbol | description | From abb5b2b05064f2cebdd688f64995857a79726e39 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 13:33:48 +0300 Subject: [PATCH 087/201] aa --- RFC-0000-dataloader-echonomic.md | 32 ++++++++++++++++---------------- 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 3f24ff70..3fdfd2c3 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -30,29 +30,29 @@ nor about implementing the described feature until some time in the future. **Authors:** * @yoadbs - + ## **Summary** -A new PyTorch dataloader multiprocessing pipline is suggested. This pipline splits the batch generation, into 2 types of workers:\ -item generating workers (by dataset.__getitem__ function), and batch generating workers (by collate_fn). +A new PyTorch dataloader multiprocessing pipline design is suggested. This pipline splits the task of batch generation, into 2 types of workers:\ +item generating workers (by dataset.`__getitem__` function), and batch generating workers (by collate_fn). This pipline is designated to significantly reduce random-access-memory (RAM) usage, without any significant reduction in throughput (TPT). ## **Motivation** Model input batch may require significant amounts of RAM. For example, in video processing or in 3D graphics applications. By current dataloader multiprocessing pipline design, workers simultaneously prepere batches and send them into shared memory, by a queue. -In practice, about [num_workers] batches are simultenously stored in shared memory, nearly after epoch start. -At most, [num_workers * prefetch_factor] may be simultenously stored in shared memory. +In practice, about _num_workers_ batches are simultenously stored in shared memory, nearly after epoch start. +At most, _num_workers_ * _prefetch_factor_ may be simultenously stored in shared memory. The main process operates in parallel to the workers, to extract one batch after another, from shared memory, and inject it into the model for training/validation/test. -Simultenously storing about [num_workers] batches in shared memory, imposes a limit over [num_workers]:\ -[num_workers < total_available_ram_in_bytes / batch_size_in_bytes]\ +Simultenously storing about _num_workers_ batches in shared memory, imposes a limit over _num_workers_:\ +_num_workers_ < _total_available_ram_in_bytes_ / _batch_size_in_bytes_.\ This limitation can produce a bottleneck over training TPT, not allowing to increase num_workers, due to server's RAM limitations. Alternatively, in order to increase num_workers, a severs with more RAM is required, increaseing sever cost. A new dataloader multiprocessing pipeline is suggested. In this pipline, there are two types of workers: -item generating workers (by dataset.__getitem__ function), and batch generating workers (by collate_fn). 
-This design allows to simultenously process only up to [prefetch_factor] batches by all the workers together. -This decoupling from [num_workers], allowes to increase [num_workers], without any significant increase in shared memory consumption. +item generating workers (by dataset.`__getitem__` function), and batch generating workers (by collate_fn). +This design allows to simultenously process only up to _prefetch_factor_ batches by all the workers together. +This decoupling from _num_workers_, allowes to increase _num_workers_, without any significant increase in shared memory consumption. As in current implemnation, the workers continuesly generate items during epoch, and are not expected to enter idle state. Hence no TPT reduction is expected. Another smaller advantage is that in the proposed implementation, the first batch in each epoch is generated by multiple workers, while in current implementation it is generated by a single worker. @@ -84,7 +84,7 @@ The following dataloader input parameters were modified / added: | item_idx | Item serial index in epoch (0 for first item, 1 for next item, etc) | | item_idx_in_batch | Item serial index in batch | | batch_idx | Batch serial index in epoch (0 for first batch, 1 for next batch, etc) | -| item_index | Item's dataset index, as in dataset.__getitem__(index) | +| item_index | Item's dataset index, as in dataset.`__getitem__`(index) | | iw_idx | Item_worker index {0, 1, ..., num_workers - 1} | bw_idx | Batch_worker index {0, 1, ..., num_batch_workers - 1} | batch_size | batch size (may be smaller for last batch in epoch) | @@ -92,14 +92,14 @@ The following dataloader input parameters were modified / added: ### **High level** By the current multiprocessing pipeline, a single level of workers is used. -The main process sends [prefetch_factor] batches to each worker. +The main process sends _prefetch_factor_ batches to each worker. Each worker prepares one batch at a time, and sends it back to the main process by worker_result_queue. After a batch is retrived by the main process, another batch is sent to the appropriate worker. A new multiprocessing pipline is suggested. In the suggested pipeine, there are 2 levels of workers: -* item_workers - designated to generate one item at a time (by running dataset.\_\_getitem__ function), and send it to shared memory +* item_workers - designated to generate one item at a time (by running dataset.`__getitem__` function), and send it to shared memory * This worker is similar to the workers of the current design, but it recieves and sends one item at a time (and not one batch at a time) -* batch_workers - designated to get items from shared memory, collect [batch_size] items, run collate function, and send the prepared batch back to shared memory, for consumption by the main process +* batch_workers - designated to get items from shared memory, collect _batch_size_ items, run collate function, and send the prepared batch back to shared memory, for consumption by the main process Current design dataflow: main_process -> workers -> main_process @@ -113,14 +113,14 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main * Select iw_idx of the items_worker with the minimal work-load * An identical bw_idx should be assigned to all items in the same batch * Select bw_idx of the batches_worker with the minimal work-load - * Make sure that the sum of item_workers work-load is always <= [prefetch_factor] * [batch_size]. Stop sending batches when reaching this limit. 
+ * Make sure that the sum of item_workers work-load is always <= _prefetch_factor_ * _batch_size_. Stop sending batches when reaching this limit. * Make sure to increase work-load counter for the relevant batch_worker, and for each of the relevant item-workers, when sending the batch of items * Each item should include the following data: (item_idx_in_batch, batch_idx, item_index, iw_idx, bw_idx, batch_size): * Once the next required batch is retrived, return batch to caller function ### **items_worker loop description** * get item from index_queue -* run dataset.\_\_getitem__(item_index) +* run dataset`.__getitem__`(item_index) * send item to the appropriate batch_worker by item_queue ### **batches_worker loop description** From 803c6c35437232a0750a28f23030f502273b2dc1 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 13:39:57 +0300 Subject: [PATCH 088/201] aa --- RFC-0000-dataloader-echonomic.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 3fdfd2c3..512fc63d 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -33,7 +33,7 @@ nor about implementing the described feature until some time in the future. ## **Summary** A new PyTorch dataloader multiprocessing pipline design is suggested. This pipline splits the task of batch generation, into 2 types of workers:\ -item generating workers (by dataset.`__getitem__` function), and batch generating workers (by collate_fn). +item generating workers (by calling `dataset.__getitem__` function), and batch generating workers (by calling `collate_fn`). This pipline is designated to significantly reduce random-access-memory (RAM) usage, without any significant reduction in throughput (TPT). ## **Motivation** @@ -50,7 +50,7 @@ This limitation can produce a bottleneck over training TPT, not allowing to incr Alternatively, in order to increase num_workers, a severs with more RAM is required, increaseing sever cost. A new dataloader multiprocessing pipeline is suggested. In this pipline, there are two types of workers: -item generating workers (by dataset.`__getitem__` function), and batch generating workers (by collate_fn). +item generating workers (by `dataset.__getitem__` function), and batch generating workers (by collate_fn). This design allows to simultenously process only up to _prefetch_factor_ batches by all the workers together. This decoupling from _num_workers_, allowes to increase _num_workers_, without any significant increase in shared memory consumption. As in current implemnation, the workers continuesly generate items during epoch, and are not expected to enter idle state. Hence no TPT reduction is expected. @@ -82,9 +82,9 @@ The following dataloader input parameters were modified / added: | item_queue | A queue to send items from item_workers to batch_worker. There is a seperate queue to each batch_worker. | | worker_result_queue | A queue to send prepared batches from batch_workers to main process. 
| | item_idx | Item serial index in epoch (0 for first item, 1 for next item, etc) | -| item_idx_in_batch | Item serial index in batch | +| item_idx_in_batch | Item serial index in batch | | batch_idx | Batch serial index in epoch (0 for first batch, 1 for next batch, etc) | -| item_index | Item's dataset index, as in dataset.`__getitem__`(index) | +| item_index | Item's dataset index, as in `dataset.__getitem__(index)` | | iw_idx | Item_worker index {0, 1, ..., num_workers - 1} | bw_idx | Batch_worker index {0, 1, ..., num_batch_workers - 1} | batch_size | batch size (may be smaller for last batch in epoch) | @@ -97,7 +97,7 @@ Each worker prepares one batch at a time, and sends it back to the main process After a batch is retrived by the main process, another batch is sent to the appropriate worker. A new multiprocessing pipline is suggested. In the suggested pipeine, there are 2 levels of workers: -* item_workers - designated to generate one item at a time (by running dataset.`__getitem__` function), and send it to shared memory +* item_workers - designated to generate one item at a time (by running `dataset.__getitem__` function), and send it to shared memory * This worker is similar to the workers of the current design, but it recieves and sends one item at a time (and not one batch at a time) * batch_workers - designated to get items from shared memory, collect _batch_size_ items, run collate function, and send the prepared batch back to shared memory, for consumption by the main process @@ -120,7 +120,7 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main ### **items_worker loop description** * get item from index_queue -* run dataset`.__getitem__`(item_index) +* run `dataset.__getitem__(item_index)` * send item to the appropriate batch_worker by item_queue ### **batches_worker loop description** From 270cc46499c9222d54facb8a4decdb9af2771b92 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 13:53:34 +0300 Subject: [PATCH 089/201] aa --- RFC-0000-dataloader-echonomic.md | 46 ++++++++++++++++---------------- 1 file changed, 23 insertions(+), 23 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 512fc63d..d37b25c9 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -32,28 +32,28 @@ nor about implementing the described feature until some time in the future. * @yoadbs ## **Summary** -A new PyTorch dataloader multiprocessing pipline design is suggested. This pipline splits the task of batch generation, into 2 types of workers:\ +A new PyTorch dataloader multiprocessing pipeline design is suggested. This pipeline splits the task of batch generation, into 2 types of workers:\ item generating workers (by calling `dataset.__getitem__` function), and batch generating workers (by calling `collate_fn`). -This pipline is designated to significantly reduce random-access-memory (RAM) usage, without any significant reduction in throughput (TPT). +This pipeline is designated to significantly reduce random-access-memory (RAM) usage, without any significant reduction in throughput (TPT). ## **Motivation** Model input batch may require significant amounts of RAM. For example, in video processing or in 3D graphics applications. -By current dataloader multiprocessing pipline design, workers simultaneously prepere batches and send them into shared memory, by a queue. 
-In practice, about _num_workers_ batches are simultenously stored in shared memory, nearly after epoch start. -At most, _num_workers_ * _prefetch_factor_ may be simultenously stored in shared memory. +By current dataloader multiprocessing pipeline design, workers simultaneously prepere batches and send them into shared memory, by a queue. +In practice, about _num_workers_ batches are simultaneously stored in shared memory, nearly after epoch start. +At most, _num_workers_ * _prefetch_factor_ may be simultaneously stored in shared memory. The main process operates in parallel to the workers, to extract one batch after another, from shared memory, and inject it into the model for training/validation/test. -Simultenously storing about _num_workers_ batches in shared memory, imposes a limit over _num_workers_:\ +Simultaneously storing about _num_workers_ batches in shared memory, imposes a limit over _num_workers_:\ _num_workers_ < _total_available_ram_in_bytes_ / _batch_size_in_bytes_.\ This limitation can produce a bottleneck over training TPT, not allowing to increase num_workers, due to server's RAM limitations. -Alternatively, in order to increase num_workers, a severs with more RAM is required, increaseing sever cost. +Alternatively, in order to increase num_workers, a severs with more RAM is required, increasing sever cost. -A new dataloader multiprocessing pipeline is suggested. In this pipline, there are two types of workers: +A new dataloader multiprocessing pipeline is suggested. In this pipeline, there are two types of workers: item generating workers (by `dataset.__getitem__` function), and batch generating workers (by collate_fn). -This design allows to simultenously process only up to _prefetch_factor_ batches by all the workers together. +This design allows to simultaneously process only up to _prefetch_factor_ batches by all the workers together. This decoupling from _num_workers_, allowes to increase _num_workers_, without any significant increase in shared memory consumption. -As in current implemnation, the workers continuesly generate items during epoch, and are not expected to enter idle state. Hence no TPT reduction is expected. +As in current implemenation, the workers continuously generate items during epoch, and are not expected to enter idle state. Hence no TPT reduction is expected. Another smaller advantage is that in the proposed implementation, the first batch in each epoch is generated by multiple workers, while in current implementation it is generated by a single worker. Hence, using the suggested implementation, epoch can potentially start faster. @@ -78,13 +78,13 @@ The following dataloader input parameters were modified / added: | symbol | description | |---------------------|:----------------------------------------------------------------------------------------------------------------------------| -| index_queue | A queue to send items indices and metadata from main process to item_worker. There is a seperate queue to each item_worker. | -| item_queue | A queue to send items from item_workers to batch_worker. There is a seperate queue to each batch_worker. | +| index_queue | A queue to send items indices and metadata from main process to item_worker. There is a separate queue to each item_worker. | +| item_queue | A queue to send items from item_workers to batch_worker. There is a separate queue to each batch_worker. | | worker_result_queue | A queue to send prepared batches from batch_workers to main process. 
| -| item_idx | Item serial index in epoch (0 for first item, 1 for next item, etc) | +| item_idx | Item serial index in epoch (0 for first item, 1 for next item, etc.) | | item_idx_in_batch | Item serial index in batch | -| batch_idx | Batch serial index in epoch (0 for first batch, 1 for next batch, etc) | -| item_index | Item's dataset index, as in `dataset.__getitem__(index)` | +| batch_idx | Batch serial index in epoch (0 for first batch, 1 for next batch, etc.) | +| item_index | Item's dataset index, as in `dataset.__getitem__(index)` | | iw_idx | Item_worker index {0, 1, ..., num_workers - 1} | bw_idx | Batch_worker index {0, 1, ..., num_batch_workers - 1} | batch_size | batch size (may be smaller for last batch in epoch) | @@ -94,11 +94,11 @@ The following dataloader input parameters were modified / added: By the current multiprocessing pipeline, a single level of workers is used. The main process sends _prefetch_factor_ batches to each worker. Each worker prepares one batch at a time, and sends it back to the main process by worker_result_queue. -After a batch is retrived by the main process, another batch is sent to the appropriate worker. +After a batch is retrieved by the main process, another batch is sent to the appropriate worker. -A new multiprocessing pipline is suggested. In the suggested pipeine, there are 2 levels of workers: +A new multiprocessing pipeline is suggested. In the suggested pipeine, there are 2 levels of workers: * item_workers - designated to generate one item at a time (by running `dataset.__getitem__` function), and send it to shared memory - * This worker is similar to the workers of the current design, but it recieves and sends one item at a time (and not one batch at a time) + * This worker is similar to the workers of the current design, but it receives and sends one item at a time (and not one batch at a time) * batch_workers - designated to get items from shared memory, collect _batch_size_ items, run collate function, and send the prepared batch back to shared memory, for consumption by the main process Current design dataflow: main_process -> workers -> main_process @@ -106,8 +106,8 @@ Current design dataflow: main_process -> workers -> main_process Suggested design dataflow: main_process -> item_workers -> batch_workers -> main_process ### **Main process loop description** -* Retrive and store prepared batches from batch_workers (by worker_result_queue) - * Track number of items at work ("work-load") by each worker. Make sure to reduce work-load counter for the relevant batch_worker, and for each of the relevant item-workers, when retriving the batch +* Retrieve and store prepared batches from batch_workers (by worker_result_queue) + * Track number of items at work ("work-load") by each worker. Make sure to reduce work-load counter for the relevant batch_worker, and for each of the relevant item-workers, when retrieving the batch * Send batches of items to item_workers, one batch at a time * A possibly different iw_idx should be assigned to each item * Select iw_idx of the items_worker with the minimal work-load @@ -116,7 +116,7 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main * Make sure that the sum of item_workers work-load is always <= _prefetch_factor_ * _batch_size_. Stop sending batches when reaching this limit. 
* Make sure to increase work-load counter for the relevant batch_worker, and for each of the relevant item-workers, when sending the batch of items
 * Each item should include the following data: (item_idx_in_batch, batch_idx, item_index, iw_idx, bw_idx, batch_size):
-* Once the next required batch is retrived, return batch to caller function
+* Once the next required batch is retrieved, return batch to caller function
 
 ### **items_worker loop description**
 * get item from index_queue
 * run `dataset.__getitem__(item_index)`
 * send item to the appropriate batch_worker by item_queue
 
 ### **batches_worker loop description**
 * get one item at a time from item_queue and append them into batches, by item batch_idx (and batch_size)
-* Once all items of a given batch are recived, run collate_fn and send the prepared batch to worker_result_queue
+* Once all items of a given batch are received, run collate_fn and send the prepared batch to worker_result_queue
 
 ### **New parameters**
 * A new dataloader parameter: num_batch_workers should be introduced
   * Default value should be num_batch_workers = prefetch_factor = 2
 * There is no reason to use a larger value than prefetch_factor
   * If num_batch_workers > prefetch_factor, a warning should be issued: "There is no benefit in setting num_batch_workers > prefetch_factor, please consider setting it to None. This would set num_batch_workers = prefetch_factor, by default"
 
 ## **Metrics**
 The suggested flow should require significantly less shared memory, while preserving TPT, using similar configurations. \
-To monitor shared memory usage, type in linux server terminal: \
+To monitor shared memory usage, type in Linux server terminal: \
 $ watch -n0.1 df -h \
 and review /dev/shm "used" column.

From 1fe602aa73dd762e3d121fb0b7ef53e0bb2de06d Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sat, 21 Sep 2024 13:56:19 +0300
Subject: [PATCH 090/201] aa

---
 RFC-0000-dataloader-echonomic.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index d37b25c9..093015f6 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -39,7 +39,7 @@ This pipeline is designated to significantly reduce random-access-memory (RAM) u
 ## **Motivation**
 Model input batch may require significant amounts of RAM. For example, in video processing or in 3D graphics applications.
 
-By current dataloader multiprocessing pipeline design, workers simultaneously prepere batches and send them into shared memory, by a queue.
+By current dataloader multiprocessing pipeline design, workers simultaneously prepare batches and send them into shared memory, by a queue.
 In practice, about _num_workers_ batches are simultaneously stored in shared memory, nearly after epoch start.
 At most, _num_workers_ * _prefetch_factor_ may be simultaneously stored in shared memory.
@@ -52,8 +52,8 @@ Alternatively, in order to increase num_workers, a severs with more RAM is requi
 A new dataloader multiprocessing pipeline is suggested. In this pipeline, there are two types of workers:
 item generating workers (by `dataset.__getitem__` function), and batch generating workers (by collate_fn).
 This design allows to simultaneously process only up to _prefetch_factor_ batches by all the workers together.
-This decoupling from _num_workers_, allowes to increase _num_workers_, without any significant increase in shared memory consumption.
-As in current implemenation, the workers continuously generate items during epoch, and are not expected to enter idle state. Hence no TPT reduction is expected.
+This decoupling from _num_workers_, allows to increase _num_workers_, without any significant increase in shared memory consumption.
+As in current implementation, the workers continuously generate items during epoch, and are not expected to enter idle state. Hence no TPT reduction is expected. 
+This decoupling from _num_workers_, allows to increase _num_workers_, without any significant increase in shared memory consumption. +As in current implementation, the workers continuously generate items during epoch, and are not expected to enter idle state. Hence no TPT reduction is expected. Another smaller advantage is that in the proposed implementation, the first batch in each epoch is generated by multiple workers, while in current implementation it is generated by a single worker. Hence, using the suggested implementation, epoch can potentially start faster. From 8d7cfcd696154b5c588854b1ccecb9a42fc7c189 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 13:57:15 +0300 Subject: [PATCH 091/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 093015f6..06f566ae 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -96,7 +96,7 @@ The main process sends _prefetch_factor_ batches to each worker. Each worker prepares one batch at a time, and sends it back to the main process by worker_result_queue. After a batch is retrieved by the main process, another batch is sent to the appropriate worker. -A new multiprocessing pipeline is suggested. In the suggested pipeine, there are 2 levels of workers: +A new multiprocessing pipeline is suggested. In the suggested pipeline, there are 2 levels of workers: * item_workers - designated to generate one item at a time (by running `dataset.__getitem__` function), and send it to shared memory * This worker is similar to the workers of the current design, but it receives and sends one item at a time (and not one batch at a time) * batch_workers - designated to get items from shared memory, collect _batch_size_ items, run collate function, and send the prepared batch back to shared memory, for consumption by the main process From 5cf04db2f7de5a7baccaa2d854188eeac7d87f22 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 13:59:31 +0300 Subject: [PATCH 092/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 06f566ae..6d365e5a 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -47,7 +47,7 @@ The main process operates in parallel to the workers, to extract one batch after Simultaneously storing about _num_workers_ batches in shared memory, imposes a limit over _num_workers_:\ _num_workers_ < _total_available_ram_in_bytes_ / _batch_size_in_bytes_.\ This limitation can produce a bottleneck over training TPT, not allowing to increase num_workers, due to server's RAM limitations. -Alternatively, in order to increase num_workers, a severs with more RAM is required, increasing sever cost. +Alternatively, to increase num_workers, a severs with more RAM is required, increasing sever cost. A new dataloader multiprocessing pipeline is suggested. In this pipeline, there are two types of workers: item generating workers (by `dataset.__getitem__` function), and batch generating workers (by collate_fn). 
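To make the RAM constraint above concrete, here is a rough, illustrative calculation of the shared-memory footprint under both designs; all sizes are assumed example values, not measurements:

```python
# Back-of-the-envelope sketch only; every number here is an assumed example.
GIB = 1024 ** 3

batch_size_bytes = 4 * GIB       # assumed size of one prepared batch (e.g. video)
available_shm_bytes = 64 * GIB   # assumed /dev/shm budget on the server
prefetch_factor = 2

# Current design: roughly num_workers prepared batches live in shared memory,
# so available RAM caps the worker count.
max_num_workers_current = available_shm_bytes // batch_size_bytes
print(max_num_workers_current)    # 16 -> num_workers is bounded by RAM

# Proposed design: only about prefetch_factor batches live in shared memory,
# independent of num_workers.
shm_bytes_proposed = prefetch_factor * batch_size_bytes
print(shm_bytes_proposed // GIB)  # 8 GiB, regardless of num_workers
```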
From 8001cdcd0eda2f509cea0b7ee106d5007fe39854 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 14:00:39 +0300 Subject: [PATCH 093/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 6d365e5a..f84b6ade 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -98,7 +98,7 @@ After a batch is retrieved by the main process, another batch is sent to the app A new multiprocessing pipeline is suggested. In the suggested pipeline, there are 2 levels of workers: * item_workers - designated to generate one item at a time (by running `dataset.__getitem__` function), and send it to shared memory - * This worker is similar to the workers of the current design, but it receives and sends one item at a time (and not one batch at a time) + * This worker is like the workers of the current design, but it receives and sends one item at a time (and not one batch at a time) * batch_workers - designated to get items from shared memory, collect _batch_size_ items, run collate function, and send the prepared batch back to shared memory, for consumption by the main process Current design dataflow: main_process -> workers -> main_process From b68d6ca537d865e733c56d62606b45d2ab542e57 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 14:01:15 +0300 Subject: [PATCH 094/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index f84b6ade..6d365e5a 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -98,7 +98,7 @@ After a batch is retrieved by the main process, another batch is sent to the app A new multiprocessing pipeline is suggested. In the suggested pipeline, there are 2 levels of workers: * item_workers - designated to generate one item at a time (by running `dataset.__getitem__` function), and send it to shared memory - * This worker is like the workers of the current design, but it receives and sends one item at a time (and not one batch at a time) + * This worker is similar to the workers of the current design, but it receives and sends one item at a time (and not one batch at a time) * batch_workers - designated to get items from shared memory, collect _batch_size_ items, run collate function, and send the prepared batch back to shared memory, for consumption by the main process Current design dataflow: main_process -> workers -> main_process From 6df21b8d8d506cd0c309497d731a94082d4072e4 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 14:02:00 +0300 Subject: [PATCH 095/201] aa --- RFC-0000-dataloader-echonomic.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 6d365e5a..53fadb62 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -107,14 +107,14 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main ### **Main process loop description** * Retrieve and store prepared batches from batch_workers (by worker_result_queue) - * Track number of items at work ("work-load") by each worker. 
Make sure to reduce work-load counter for the relevant batch_worker, and for each of the relevant item-workers, when retrieving the batch + * Track number of items at work (workload) by each worker. Make sure to reduce workload counter for the relevant batch_worker, and for each of the relevant item-workers, when retrieving the batch * Send batches of items to item_workers, one batch at a time * A possibly different iw_idx should be assigned to each item - * Select iw_idx of the items_worker with the minimal work-load + * Select iw_idx of the items_worker with the minimal workload * An identical bw_idx should be assigned to all items in the same batch - * Select bw_idx of the batches_worker with the minimal work-load - * Make sure that the sum of item_workers work-load is always <= _prefetch_factor_ * _batch_size_. Stop sending batches when reaching this limit. - * Make sure to increase work-load counter for the relevant batch_worker, and for each of the relevant item-workers, when sending the batch of items + * Select bw_idx of the batches_worker with the minimal workload + * Make sure that the sum of item_workers workload is always <= _prefetch_factor_ * _batch_size_. Stop sending batches when reaching this limit. + * Make sure to increase workload counter for the relevant batch_worker, and for each of the relevant item-workers, when sending the batch of items * Each item should include the following data: (item_idx_in_batch, batch_idx, item_index, iw_idx, bw_idx, batch_size): * Once the next required batch is retrieved, return batch to caller function From d5c8ee5e4412e587759eff141d3aec5265c99362 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 14:02:22 +0300 Subject: [PATCH 096/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 53fadb62..8839e8a0 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -131,7 +131,7 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main * A new dataloader parameter: num_batch_workers should be introduced * Default value should be num_batch_workers = prefetch_factor = 2 * There is no reason to use a larger value than prefetch_factor - * If num_batch_workers > prefetch_factor, a warining should be issued: "There is no benefit in setting num_batch_workers > prefetch_factor, please consider setting it to None. This would set num_batch_workers = prefetch_factor, by default" + * If num_batch_workers > prefetch_factor, a warning should be issued: "There is no benefit in setting num_batch_workers > prefetch_factor, please consider setting it to None. This would set num_batch_workers = prefetch_factor, by default" ## **Metrics ** The suggested flow should require significantly less shared memory, while preserving TPT, using similar configurations. \ From 89c48ce93c96cd6e51dd325fcb553180337cee58 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 14:02:46 +0300 Subject: [PATCH 097/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 8839e8a0..257f92a8 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -140,7 +140,7 @@ $ monitor -n0.1 df -h \ and review /dev/shm "used" column. 
## **Drawbacks** -* Additional layer of batch_workers is required, somewhat increasing flow compexity. +* Additional layer of batch_workers is required, somewhat increasing flow complexity. * Number of workers required for the same TPT increases by num_batches_workers (by default: num_batch_workers = prefetch_factor = 2). From 63baac52a07efb93c2ea69bd372ecf773f0fff72 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 14:05:19 +0300 Subject: [PATCH 098/201] aa --- RFC-0000-dataloader-echonomic.md | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 257f92a8..3cae4110 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -76,18 +76,18 @@ The following dataloader input parameters were modified / added: ### **Definitions** -| symbol | description | -|---------------------|:----------------------------------------------------------------------------------------------------------------------------| -| index_queue | A queue to send items indices and metadata from main process to item_worker. There is a separate queue to each item_worker. | -| item_queue | A queue to send items from item_workers to batch_worker. There is a separate queue to each batch_worker. | -| worker_result_queue | A queue to send prepared batches from batch_workers to main process. | -| item_idx | Item serial index in epoch (0 for first item, 1 for next item, etc.) | -| item_idx_in_batch | Item serial index in batch | -| batch_idx | Batch serial index in epoch (0 for first batch, 1 for next batch, etc.) | -| item_index | Item's dataset index, as in `dataset.__getitem__(index)` | -| iw_idx | Item_worker index {0, 1, ..., num_workers - 1} -| bw_idx | Batch_worker index {0, 1, ..., num_batch_workers - 1} -| batch_size | batch size (may be smaller for last batch in epoch) | +| symbol | description | +|---------------------|:-----------------------------------------------------------------------------------------------------------------------------| +| index_queue | A queue to send items indices and metadata from main process to item_worker. There is a separate queue to each item_worker. | +| item_queue | A queue to send items from item_workers to batch_worker. There is a separate queue to each batch_worker. | +| worker_result_queue | A queue to send prepared batches from batch_workers to main process. | +| item_idx | Item serial index in epoch (0 for first item, 1 for next item, etc.) | +| item_idx_in_batch | Item serial index in batch | +| batch_idx | Batch serial index in epoch (0 for first batch, 1 for next batch, etc.) | +| item_index | Item's dataset index, as in `dataset.__getitem__(index)` | +| iw_idx | Item_worker index {0, 1, ..., num_workers - 1} | +| bw_idx | Batch_worker index {0, 1, ..., num_batch_workers - 1} | +| batch_size | batch size (may be smaller for last batch in epoch) | ### **High level** From 0c61dc2269d5221424621ae0aeb0d47ca5dfd2f6 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 14:06:48 +0300 Subject: [PATCH 099/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 3cae4110..bee1e6ba 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -33,7 +33,7 @@ nor about implementing the described feature until some time in the future. 
## **Summary** A new PyTorch dataloader multiprocessing pipeline design is suggested. This pipeline splits the task of batch generation, into 2 types of workers:\ -item generating workers (by calling `dataset.__getitem__` function), and batch generating workers (by calling `collate_fn`). +item generating workers (by calling `dataset.__getitem__`), and batch generating workers (by calling `collate_fn`). This pipeline is designated to significantly reduce random-access-memory (RAM) usage, without any significant reduction in throughput (TPT). ## **Motivation** From 8133931c299c6eeb079e6f70831b0aa4e098f79b Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 14:07:24 +0300 Subject: [PATCH 100/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index bee1e6ba..766cb539 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -37,7 +37,7 @@ item generating workers (by calling `dataset.__getitem__`), and batch generating This pipeline is designated to significantly reduce random-access-memory (RAM) usage, without any significant reduction in throughput (TPT). ## **Motivation** -Model input batch may require significant amounts of RAM. For example, in video processing or in 3D graphics applications. +Model input batch may require large amounts of RAM. For example, in video processing or in 3D graphics applications. By current dataloader multiprocessing pipeline design, workers simultaneously prepare batches and send them into shared memory, by a queue. In practice, about _num_workers_ batches are simultaneously stored in shared memory, nearly after epoch start. From 0e83d002843a760a409660283d3f9bbe84ee7dd8 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 14:09:14 +0300 Subject: [PATCH 101/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 766cb539..c078758d 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -37,7 +37,7 @@ item generating workers (by calling `dataset.__getitem__`), and batch generating This pipeline is designated to significantly reduce random-access-memory (RAM) usage, without any significant reduction in throughput (TPT). ## **Motivation** -Model input batch may require large amounts of RAM. For example, in video processing or in 3D graphics applications. +In serveral applications, the input batch of a PyTorch model may require large amounts of RAM. Such applications may include video processing or in 3D graphics. By current dataloader multiprocessing pipeline design, workers simultaneously prepare batches and send them into shared memory, by a queue. In practice, about _num_workers_ batches are simultaneously stored in shared memory, nearly after epoch start. 
From 9a3dc2d153c27799476dc2096421571b7e2e5eca Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 14:09:36 +0300 Subject: [PATCH 102/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index c078758d..16bb624b 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -37,7 +37,7 @@ item generating workers (by calling `dataset.__getitem__`), and batch generating This pipeline is designated to significantly reduce random-access-memory (RAM) usage, without any significant reduction in throughput (TPT). ## **Motivation** -In serveral applications, the input batch of a PyTorch model may require large amounts of RAM. Such applications may include video processing or in 3D graphics. +In serveral applications, the input batch of a PyTorch model may require large amounts of RAM. Such applications may include video processing and 3D graphics. By current dataloader multiprocessing pipeline design, workers simultaneously prepare batches and send them into shared memory, by a queue. In practice, about _num_workers_ batches are simultaneously stored in shared memory, nearly after epoch start. From 3da3e820d0681e095aad44fd79d151474effc130 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 14:10:13 +0300 Subject: [PATCH 103/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 16bb624b..380eeb7c 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -37,7 +37,7 @@ item generating workers (by calling `dataset.__getitem__`), and batch generating This pipeline is designated to significantly reduce random-access-memory (RAM) usage, without any significant reduction in throughput (TPT). ## **Motivation** -In serveral applications, the input batch of a PyTorch model may require large amounts of RAM. Such applications may include video processing and 3D graphics. +In serveral applications, the input batch of a PyTorch model may require large amounts of RAM. Such applications may include video processing, 3D graphics, etc. By current dataloader multiprocessing pipeline design, workers simultaneously prepare batches and send them into shared memory, by a queue. In practice, about _num_workers_ batches are simultaneously stored in shared memory, nearly after epoch start. From 88c8404b17ca3ec5a2a3ef21f2c7da144dc9405e Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 14:13:02 +0300 Subject: [PATCH 104/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 380eeb7c..a3988c39 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -50,7 +50,7 @@ This limitation can produce a bottleneck over training TPT, not allowing to incr Alternatively, to increase num_workers, a severs with more RAM is required, increasing sever cost. A new dataloader multiprocessing pipeline is suggested. In this pipeline, there are two types of workers: -item generating workers (by `dataset.__getitem__` function), and batch generating workers (by collate_fn). +item generating workers (by `dataset.__getitem__` function), and batch generating workers (by `collate_fn`). 
This design allows to simultaneously process only up to _prefetch_factor_ batches by all the workers together.
This decoupling from _num_workers_, allows to increase _num_workers_, without any significant increase in shared memory consumption.
As in current implementation, the workers continuously generate items during epoch, and are not expected to enter idle state. Hence no TPT reduction is expected.

From 0416e78f55941d439b2b5769535d87c672dfa37d Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sat, 21 Sep 2024 14:18:31 +0300
Subject: [PATCH 105/201] aa

---
 RFC-0000-dataloader-echonomic.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index a3988c39..5efb74ea 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -51,8 +51,8 @@ Alternatively, to increase num_workers, a severs with more RAM is required, incr
 A new dataloader multiprocessing pipeline is suggested. In this pipeline, there are two types of workers:
 item generating workers (by `dataset.__getitem__` function), and batch generating workers (by `collate_fn`).
-This design allows to simultaneously process only up to _prefetch_factor_ batches by all the workers together.
-This decoupling from _num_workers_, allows to increase _num_workers_, without any significant increase in shared memory consumption.
+This design allows to simultaneously process only up to _prefetch_factor_ batches by all the workers together (no dependency in _num_workers_).
+The decoupling from _num_workers_, allows to increase _num_workers_, without any significant increase in shared memory consumption.
 As in current implementation, the workers continuously generate items during epoch, and are not expected to enter idle state. Hence no TPT reduction is expected.

From 6c585658548f24ac00b21ab615944785b5b6d5c5 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sat, 21 Sep 2024 14:58:12 +0300
Subject: [PATCH 106/201] aa

---
 RFC-0000-dataloader-echonomic.md | 31 ++++++++++++------------------
 1 file changed, 12 insertions(+), 19 deletions(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 5efb74ea..2e746aee 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -51,8 +51,8 @@ Alternatively, to increase num_workers, a severs with more RAM is requi
 A new dataloader multiprocessing pipeline is suggested. In this pipeline, there are two types of workers:
 item generating workers (by `dataset.__getitem__` function), and batch generating workers (by `collate_fn`).
-This design allows to simultaneously process only up to _prefetch_factor_ batches by all the workers together (no dependency in _num_workers_).
-The decoupling from _num_workers_, allows to increase _num_workers_, without any significant increase in shared memory consumption.
+This design allows simultaneous processing of at most _prefetch_factor_ batches by all the workers together.
+Decoupling the number of batches being processed from _num_workers_ allows increasing _num_workers_ without any significant increase in shared memory consumption.
 As in current implementation, the workers continuously generate items during epoch, and are not expected to enter idle state. Hence no TPT reduction is expected.
Another smaller advantage is that in the proposed implementation, the first batch in each epoch is generated by multiple workers, while in current implementation it is generated by a single worker. @@ -60,19 +60,7 @@ Hence, using the suggested implementation, epoch can potentially start faster. The new flow is introducing only minor modifications in dataloader interface, making the transition almost transparent to the user. - - ## **Proposed Implementation** -The following dataloader input parameters were modified / added: - -| name | description | -|----------------------------|----------------------------------------------------------------------------| -| num_workers (modified) | number of item workers | -| prefetch_factor (modified) | number of batches sent for processing by all workers (2 by default) | -| num_workers_batches (new) | number of batch workers (default is prefetch_factor) | - - - ### **Definitions** @@ -128,10 +116,15 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main * Once all items of a given batch are received, run collate_fn and send the prepared batch to worker_result_queue ### **New parameters** -* A new dataloader parameter: num_batch_workers should be introduced - * Default value should be num_batch_workers = prefetch_factor = 2 - * There is no reason to use a larger value than prefetch_factor - * If num_batch_workers > prefetch_factor, a warning should be issued: "There is no benefit in setting num_batch_workers > prefetch_factor, please consider setting it to None. This would set num_batch_workers = prefetch_factor, by default" + +The following dataloader input parameters were modified / added: + +| name | description | +|----------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------| +| num_workers (modified) | number of item workers. Setting it to 0 disables multiprocessing (as today). There is no benefit in increasing it beyond _prefetch_factor_ * _batch_size_ | +| | | +| prefetch_factor (modified) | number of batches simultanously sent for processing by all workers (2 by default) | +| num_workers_batches (new) | number of batch workers (default is prefetch_factor). There is no benefit in increasing it beyond _prefetch_factor_ | ## **Metrics ** The suggested flow should require significantly less shared memory, while preserving TPT, using similar configurations. \ @@ -141,7 +134,7 @@ and review /dev/shm "used" column. ## **Drawbacks** * Additional layer of batch_workers is required, somewhat increasing flow complexity. -* Number of workers required for the same TPT increases by num_batches_workers (by default: num_batch_workers = prefetch_factor = 2). 
+* The user should consider increaseing _prefetch_factor_, if `collate_fn` is very slow and becomes a bottleneck ## **How we teach this** From 249397aec4dd9cd7dac1132871eaaf0780ae810b Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 14:58:40 +0300 Subject: [PATCH 107/201] aa --- RFC-0000-dataloader-echonomic.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 2e746aee..9b55bbcc 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -119,12 +119,12 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main The following dataloader input parameters were modified / added: -| name | description | -|----------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------| -| num_workers (modified) | number of item workers. Setting it to 0 disables multiprocessing (as today). There is no benefit in increasing it beyond _prefetch_factor_ * _batch_size_ | -| | | -| prefetch_factor (modified) | number of batches simultanously sent for processing by all workers (2 by default) | -| num_workers_batches (new) | number of batch workers (default is prefetch_factor). There is no benefit in increasing it beyond _prefetch_factor_ | +| name | description | +|------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------| +| _num_workers_ (modified) | number of item workers. Setting it to 0 disables multiprocessing (as today). There is no benefit in increasing it beyond _prefetch_factor_ * _batch_size_ | +| | | +| _prefetch_factor_ (modified) | number of batches simultanously sent for processing by all workers (2 by default) | +| _num_workers_batches_ (new) | number of batch workers (default is prefetch_factor). There is no benefit in increasing it beyond _prefetch_factor_ | ## **Metrics ** The suggested flow should require significantly less shared memory, while preserving TPT, using similar configurations. \ From 85901f706c49ec15694e55f5087f6990e69fcdf7 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 14:59:02 +0300 Subject: [PATCH 108/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 9b55bbcc..1ea6bc93 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -124,7 +124,7 @@ The following dataloader input parameters were modified / added: | _num_workers_ (modified) | number of item workers. Setting it to 0 disables multiprocessing (as today). There is no benefit in increasing it beyond _prefetch_factor_ * _batch_size_ | | | | | _prefetch_factor_ (modified) | number of batches simultanously sent for processing by all workers (2 by default) | -| _num_workers_batches_ (new) | number of batch workers (default is prefetch_factor). There is no benefit in increasing it beyond _prefetch_factor_ | +| _num_workers_batches_ (new) | number of batch workers (default is _prefetch_factor_). There is no benefit in increasing it beyond _prefetch_factor_ | ## **Metrics ** The suggested flow should require significantly less shared memory, while preserving TPT, using similar configurations. 
\
From 211dc53a93e70c1733e040bfb6b191d78b30ff6c Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sat, 21 Sep 2024 15:31:53 +0300
Subject: [PATCH 109/201] aa

---
 RFC-0000-dataloader-echonomic.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 1ea6bc93..565bdf58 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -40,7 +40,7 @@ This pipeline is designated to significantly reduce random-access-memory (RAM) u
 In serveral applications, the input batch of a PyTorch model may require large amounts of RAM. Such applications may include video processing, 3D graphics, etc.

 By current dataloader multiprocessing pipeline design, workers simultaneously prepare batches and send them into shared memory, by a queue.
-In practice, about _num_workers_ batches are simultaneously stored in shared memory, nearly after epoch start.
+In practice, about _num_workers_ prepared batches are simultaneously stored in shared memory, shortly after epoch start.
 At most, _num_workers_ * _prefetch_factor_ may be simultaneously stored in shared memory.
 The main process operates in parallel to the workers, to extract one batch after another, from shared memory, and inject it into the model for training/validation/test.

From 06cf95ab1768d47067ac1f1eeaf5ff5c1d7f1c09 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sat, 21 Sep 2024 15:34:13 +0300
Subject: [PATCH 110/201] aa

---
 RFC-0000-dataloader-echonomic.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 565bdf58..ffe43eb2 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -47,7 +47,7 @@ The main process operates in parallel to the workers, to extract one batch after
 Simultaneously storing about _num_workers_ batches in shared memory, imposes a limit over _num_workers_:\
 _num_workers_ < _total_available_ram_in_bytes_ / _batch_size_in_bytes_.\
 This limitation can produce a bottleneck over training TPT, not allowing to increase num_workers, due to server's RAM limitations.
-Alternatively, to increase num_workers, a severs with more RAM is required, increasing sever cost.
+Alternatively, to increase num_workers, a sever with more RAM is required, increasing sever cost.

From ceb05ad3c5f6c4ea8b06a0ce4317f045d24d8f87 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sat, 21 Sep 2024 15:34:38 +0300
Subject: [PATCH 111/201] aa

---
 RFC-0000-dataloader-echonomic.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index ffe43eb2..486ccd17 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -47,7 +47,7 @@ The main process operates in parallel to the workers, to extract one batch after
 Simultaneously storing about _num_workers_ batches in shared memory, imposes a limit over _num_workers_:\
 _num_workers_ < _total_available_ram_in_bytes_ / _batch_size_in_bytes_.\
 This limitation can produce a bottleneck over training TPT, not allowing to increase num_workers, due to server's RAM limitations.
-Alternatively, to increase num_workers, a sever with more RAM is required, increasing sever cost.
+Alternatively, to increase num_workers, a server with more available RAM is required, increasing server cost.

From d01b635990695bda453a01eb49d8c9d39ea21e Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sat, 21 Sep 2024 15:41:19 +0300
Subject: [PATCH 112/201] aa

---
 RFC-0000-dataloader-echonomic.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 486ccd17..52d13000 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -77,7 +77,7 @@ The new flow is introducing only minor modifications in dataloader interface, ma
 | bw_idx | Batch_worker index {0, 1, ..., num_batch_workers - 1}
 | batch_size | batch size (may be smaller for last batch in epoch) |

-### **High level**
+### **High Level Description**

 By the current multiprocessing pipeline, a single level of workers is used.
 The main process sends _prefetch_factor_ batches to each worker.
@@ -93,7 +93,7 @@ Current design dataflow: main_process -> workers -> main_process

 Suggested design dataflow: main_process -> item_workers -> batch_workers -> main_process

-### **Main process loop description**
+### **Main Process**
 * Retrieve and store prepared batches from batch_workers (by worker_result_queue)
  * Track number of items at work (workload) by each worker. Make sure to reduce workload counter for the relevant batch_worker, and for each of the relevant item-workers, when retrieving the batch
@@ -106,16 +106,16 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main
 * Once the next required batch is retrieved, return batch to caller function

-### **items_worker loop description**
+### **Item Worker**
 * get item from index_queue
 * run `dataset.__getitem__(item_index)`
 * send item to the appropriate batch_worker by item_queue

-### **batches_worker loop description**
+### **Batch Worker**
 * get one item at a time from item_queue and append them into batches, by item batch_idx (and batch_size)
 * Once all items of a given batch are received, run collate_fn and send the prepared batch to worker_result_queue

### **New parameters**

From e3b24af2db62c479ec05ec1ec21567f395bb8408 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sat, 21 Sep 2024 15:42:55 +0300
Subject: [PATCH 113/201] aa

---
 RFC-0000-dataloader-echonomic.md | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 52d13000..f82942e5 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -64,18 +64,18 @@ The new flow is introducing only minor modifications in dataloader interface, ma
 ### **Definitions**

-| symbol | description |
-|---------------------|:-----------------------------------------------------------------------------------------------------------------------------|
-| index_queue | A queue to send items indices and metadata from main process to item_worker. There is a separate queue to each item_worker. |
-| item_queue | A queue to send items from item_workers to batch_worker.
There is a separate queue to each batch_worker. | -| worker_result_queue | A queue to send prepared batches from batch_workers to main process. | -| item_idx | Item serial index in epoch (0 for first item, 1 for next item, etc.) | -| item_idx_in_batch | Item serial index in batch | -| batch_idx | Batch serial index in epoch (0 for first batch, 1 for next batch, etc.) | -| item_index | Item's dataset index, as in `dataset.__getitem__(index)` | -| iw_idx | Item_worker index {0, 1, ..., num_workers - 1} | -| bw_idx | Batch_worker index {0, 1, ..., num_batch_workers - 1} | -| batch_size | batch size (may be smaller for last batch in epoch) | +| symbol | description | +|-----------------------|:----------------------------------------------------------------------------------------------------------------------------| +| _index_queue_ | A queue to send items indices and metadata from main process to item_worker. There is a separate queue to each item_worker. | +| _item_queue_ | A queue to send items from item_workers to batch_worker. There is a separate queue to each batch_worker. | +| _worker_result_queue_ | A queue to send prepared batches from batch_workers to main process. | +| _item_idx_ | Item serial index in epoch (0 for first item, 1 for next item, etc.) | +| _item_idx_in_batch_ | Item serial index in batch | +| _batch_idx_ | Batch serial index in epoch (0 for first batch, 1 for next batch, etc.) | +| _item_index_ | Item's dataset index, as in `dataset.__getitem__(index)` | +| _iw_idx_ | Item_worker index {0, 1, ..., _num_workers_ - 1} | +| _bw_idx_ | Batch_worker index {0, 1, ..., _num_batch_workers_ - 1} | +| _batch_size_ | batch size (may be smaller for last batch in epoch) | ### **High Level Description** From c273863db7f12c6d9deaca1a9701f1526b1d68c9 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 15:43:54 +0300 Subject: [PATCH 114/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index f82942e5..00589e96 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -84,7 +84,7 @@ The main process sends _prefetch_factor_ batches to each worker. Each worker prepares one batch at a time, and sends it back to the main process by worker_result_queue. After a batch is retrieved by the main process, another batch is sent to the appropriate worker. -A new multiprocessing pipeline is suggested. 
In the suggested pipeline, there are 2 levels of workers: +In the suggested pipeline, there are 2 levels of workers: * item_workers - designated to generate one item at a time (by running `dataset.__getitem__` function), and send it to shared memory * This worker is similar to the workers of the current design, but it receives and sends one item at a time (and not one batch at a time) * batch_workers - designated to get items from shared memory, collect _batch_size_ items, run collate function, and send the prepared batch back to shared memory, for consumption by the main process From 5de494caca41782f7be7456616de89da530c4efb Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 15:44:30 +0300 Subject: [PATCH 115/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 00589e96..bb6ea6c0 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -86,7 +86,7 @@ After a batch is retrieved by the main process, another batch is sent to the app In the suggested pipeline, there are 2 levels of workers: * item_workers - designated to generate one item at a time (by running `dataset.__getitem__` function), and send it to shared memory - * This worker is similar to the workers of the current design, but it receives and sends one item at a time (and not one batch at a time) + * This worker is similar to the workers in the current design, but it receives and sends one item at a time (and not one batch at a time) * batch_workers - designated to get items from shared memory, collect _batch_size_ items, run collate function, and send the prepared batch back to shared memory, for consumption by the main process Current design dataflow: main_process -> workers -> main_process From b2b1d5c69f21bbf050d7eba0ce3bf7a4fe9d0ddf Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 15:47:07 +0300 Subject: [PATCH 116/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index bb6ea6c0..993429db 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -87,7 +87,7 @@ After a batch is retrieved by the main process, another batch is sent to the app In the suggested pipeline, there are 2 levels of workers: * item_workers - designated to generate one item at a time (by running `dataset.__getitem__` function), and send it to shared memory * This worker is similar to the workers in the current design, but it receives and sends one item at a time (and not one batch at a time) -* batch_workers - designated to get items from shared memory, collect _batch_size_ items, run collate function, and send the prepared batch back to shared memory, for consumption by the main process +* batch_workers - designated to get items from shared memory, prepare batchs by running collate_fn, and send them back to shared memory, for consumption by the main process Current design dataflow: main_process -> workers -> main_process From b4d349156df44b04b694caf9afe27c8c00016052 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 15:47:36 +0300 Subject: [PATCH 117/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 993429db..bf8f45d0 100644 --- 
a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -87,7 +87,7 @@ After a batch is retrieved by the main process, another batch is sent to the app In the suggested pipeline, there are 2 levels of workers: * item_workers - designated to generate one item at a time (by running `dataset.__getitem__` function), and send it to shared memory * This worker is similar to the workers in the current design, but it receives and sends one item at a time (and not one batch at a time) -* batch_workers - designated to get items from shared memory, prepare batchs by running collate_fn, and send them back to shared memory, for consumption by the main process +* batch_workers - designated to get items from shared memory, prepare batchs by running collate_fn, and send the prepared batches back to shared memory, for consumption by the main process Current design dataflow: main_process -> workers -> main_process From bd0cab3586d165f6ccf60233d182524009ad2b49 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 15:48:35 +0300 Subject: [PATCH 118/201] aa --- RFC-0000-dataloader-echonomic.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index bf8f45d0..f6c2edc6 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -93,7 +93,7 @@ Current design dataflow: main_process -> workers -> main_process Suggested design dataflow: main_process -> item_workers -> batch_workers -> main_process -### **Main Process** +### **Main Process Flow Description** * Retrieve and store prepared batches from batch_workers (by worker_result_queue) * Track number of items at work (workload) by each worker. Make sure to reduce workload counter for the relevant batch_worker, and for each of the relevant item-workers, when retrieving the batch * Send batches of items to item_workers, one batch at a time @@ -106,16 +106,16 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main * Each item should include the following data: (item_idx_in_batch, batch_idx, item_index, iw_idx, bw_idx, batch_size): * Once the next required batch is retrieved, return batch to caller function -### **Item Worker** +### **Item Worker Flow Description** * get item from index_queue * run `dataset.__getitem__(item_index)` * send item to the appropriate batch_worker by item_queue -### **Batch Worker** +### **Batch Worker Flow Description** * get one item at a time from item_queue and append them into batches, by item batch_idx (and batch_size) * Once all items of a given batch are received, run collate_fn and send the prepared batch to worker_result_queue -### **New parameters** +### **New Parameters** The following dataloader input parameters were modified / added: From 6c8d8ef0a1ade6a18da14a32040193fd85d9eb6b Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 15:49:56 +0300 Subject: [PATCH 119/201] aa --- RFC-0000-dataloader-echonomic.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index f6c2edc6..ccf446a4 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -98,9 +98,9 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main * Track number of items at work (workload) by each worker. 
Make sure to reduce workload counter for the relevant batch_worker, and for each of the relevant item-workers, when retrieving the batch * Send batches of items to item_workers, one batch at a time * A possibly different iw_idx should be assigned to each item - * Select iw_idx of the items_worker with the minimal workload + * Select iw_idx by the items_worker with the minimal workload * An identical bw_idx should be assigned to all items in the same batch - * Select bw_idx of the batches_worker with the minimal workload + * Select bw_idx by the batches_worker with the minimal workload * Make sure that the sum of item_workers workload is always <= _prefetch_factor_ * _batch_size_. Stop sending batches when reaching this limit. * Make sure to increase workload counter for the relevant batch_worker, and for each of the relevant item-workers, when sending the batch of items * Each item should include the following data: (item_idx_in_batch, batch_idx, item_index, iw_idx, bw_idx, batch_size): From 8b06c7840bada77401804161e5f33fb8c7dfcbbb Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 15:51:34 +0300 Subject: [PATCH 120/201] aa --- RFC-0000-dataloader-echonomic.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index ccf446a4..8124feea 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -45,7 +45,7 @@ At most, _num_workers_ * _prefetch_factor_ may be simultaneously stored in share The main process operates in parallel to the workers, to extract one batch after another, from shared memory, and inject it into the model for training/validation/test. Simultaneously storing about _num_workers_ batches in shared memory, imposes a limit over _num_workers_:\ -_num_workers_ < _total_available_ram_in_bytes_ / _batch_size_in_bytes_.\ +_num_workers_ < (_total_available_ram_in_bytes_ / _batch_size_in_bytes_) \ This limitation can produce a bottleneck over training TPT, not allowing to increase num_workers, due to server's RAM limitations. Alternatively, to increase num_workers, a sever with more available RAM is required, increasing sever cost. @@ -101,7 +101,7 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main * Select iw_idx by the items_worker with the minimal workload * An identical bw_idx should be assigned to all items in the same batch * Select bw_idx by the batches_worker with the minimal workload - * Make sure that the sum of item_workers workload is always <= _prefetch_factor_ * _batch_size_. Stop sending batches when reaching this limit. + * Make sure that the sum of item_workers workload is always <= (_prefetch_factor_ * _batch_size_). Stop sending batches when reaching this limit. 
* Make sure to increase workload counter for the relevant batch_worker, and for each of the relevant item-workers, when sending the batch of items * Each item should include the following data: (item_idx_in_batch, batch_idx, item_index, iw_idx, bw_idx, batch_size): * Once the next required batch is retrieved, return batch to caller function From 6483e86931f58996e35312cac212b42c511a55cc Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 15:53:33 +0300 Subject: [PATCH 121/201] aa --- RFC-0000-dataloader-echonomic.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 8124feea..76e6cea8 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -96,14 +96,15 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main ### **Main Process Flow Description** * Retrieve and store prepared batches from batch_workers (by worker_result_queue) * Track number of items at work (workload) by each worker. Make sure to reduce workload counter for the relevant batch_worker, and for each of the relevant item-workers, when retrieving the batch -* Send batches of items to item_workers, one batch at a time +* Send batches of items metadata to item_workers, one batch at a time + * Each item should include the following metadata: (item_idx_in_batch, batch_idx, item_index, iw_idx, bw_idx, batch_size): * A possibly different iw_idx should be assigned to each item * Select iw_idx by the items_worker with the minimal workload * An identical bw_idx should be assigned to all items in the same batch * Select bw_idx by the batches_worker with the minimal workload * Make sure that the sum of item_workers workload is always <= (_prefetch_factor_ * _batch_size_). Stop sending batches when reaching this limit. * Make sure to increase workload counter for the relevant batch_worker, and for each of the relevant item-workers, when sending the batch of items - * Each item should include the following data: (item_idx_in_batch, batch_idx, item_index, iw_idx, bw_idx, batch_size): + * Once the next required batch is retrieved, return batch to caller function ### **Item Worker Flow Description** From fb22db25e201d9a7ede7ac5f100f5d3e3e0ce0bc Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 15:58:58 +0300 Subject: [PATCH 122/201] aa --- RFC-0000-dataloader-echonomic.md | 52 ++++++++++++++++---------------- 1 file changed, 26 insertions(+), 26 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 76e6cea8..1c4a7e7a 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -70,7 +70,7 @@ The new flow is introducing only minor modifications in dataloader interface, ma | _item_queue_ | A queue to send items from item_workers to batch_worker. There is a separate queue to each batch_worker. | | _worker_result_queue_ | A queue to send prepared batches from batch_workers to main process. | | _item_idx_ | Item serial index in epoch (0 for first item, 1 for next item, etc.) | -| _item_idx_in_batch_ | Item serial index in batch | +| _item_idx_in_batch_ | Item serial index in batch. | | _batch_idx_ | Batch serial index in epoch (0 for first batch, 1 for next batch, etc.) 
| | _item_index_ | Item's dataset index, as in `dataset.__getitem__(index)` | | _iw_idx_ | Item_worker index {0, 1, ..., _num_workers_ - 1} | @@ -85,47 +85,47 @@ Each worker prepares one batch at a time, and sends it back to the main process After a batch is retrieved by the main process, another batch is sent to the appropriate worker. In the suggested pipeline, there are 2 levels of workers: -* item_workers - designated to generate one item at a time (by running `dataset.__getitem__` function), and send it to shared memory +* item_workers - designated to generate one item at a time (by running `dataset.__getitem__` function), and send it to shared memory. * This worker is similar to the workers in the current design, but it receives and sends one item at a time (and not one batch at a time) -* batch_workers - designated to get items from shared memory, prepare batchs by running collate_fn, and send the prepared batches back to shared memory, for consumption by the main process +* batch_workers - designated to get items from shared memory, prepare batchs by running collate_fn, and send the prepared batches back to shared memory, for consumption by the main process. Current design dataflow: main_process -> workers -> main_process Suggested design dataflow: main_process -> item_workers -> batch_workers -> main_process ### **Main Process Flow Description** -* Retrieve and store prepared batches from batch_workers (by worker_result_queue) - * Track number of items at work (workload) by each worker. Make sure to reduce workload counter for the relevant batch_worker, and for each of the relevant item-workers, when retrieving the batch -* Send batches of items metadata to item_workers, one batch at a time +* Retrieve and store prepared batches from batch_workers (by worker_result_queue). + * Track number of items at work (workload) by each worker. Make sure to reduce workload counter for the relevant batch_worker, and for each of the relevant item-workers, when retrieving the batch. +* Send batches of items metadata to item_workers, one batch at a time. * Each item should include the following metadata: (item_idx_in_batch, batch_idx, item_index, iw_idx, bw_idx, batch_size): - * A possibly different iw_idx should be assigned to each item - * Select iw_idx by the items_worker with the minimal workload - * An identical bw_idx should be assigned to all items in the same batch - * Select bw_idx by the batches_worker with the minimal workload + * A possibly different iw_idx should be assigned to each item. + * Select iw_idx by the items_worker with the minimal workload. + * An identical bw_idx should be assigned to all items in the same batch. + * Select bw_idx by the batches_worker with the minimal workload. * Make sure that the sum of item_workers workload is always <= (_prefetch_factor_ * _batch_size_). Stop sending batches when reaching this limit. - * Make sure to increase workload counter for the relevant batch_worker, and for each of the relevant item-workers, when sending the batch of items + * Make sure to increase workload counter for the relevant batch_worker, and for each of the relevant item-workers, when sending the batch of items. -* Once the next required batch is retrieved, return batch to caller function +* Once the next required batch is retrieved, return batch to caller function. ### **Item Worker Flow Description** -* get item from index_queue -* run `dataset.__getitem__(item_index)` -* send item to the appropriate batch_worker by item_queue +* get item from index_queue. 
+* run `dataset.__getitem__(item_index)`.
+* send item to the appropriate batch_worker by item_queue.

### **Batch Worker Flow Description**
-* get one item at a time from item_queue and append them into batches, by item batch_idx (and batch_size)
-* Once all items of a given batch are received, run collate_fn and send the prepared batch to worker_result_queue
+* get one item at a time from item_queue and append it to the corresponding batch, by item batch_idx (and batch_size).
+* Once all items of a given batch are received, run collate_fn and send the prepared batch to worker_result_queue.

### **New Parameters**

The following dataloader input parameters were modified / added:

| name | description |
|------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|
| _num_workers_ (modified) | number of item workers. Setting it to 0 disables multiprocessing (as today). There is no benefit in increasing it beyond _prefetch_factor_ * _batch_size_. |
| | |
| _prefetch_factor_ (modified) | number of batches simultanously sent for processing by all workers (2 by default). |
| _num_workers_batches_ (new) | number of batch workers (default is _prefetch_factor_). There is no benefit in increasing it beyond _prefetch_factor_. |

## **Metrics **
The suggested flow should require significantly less shared memory, while preserving TPT, using similar configurations. \
@@ -135,13 +135,13 @@ and review /dev/shm "used" column.

## **Drawbacks**
* Additional layer of batch_workers is required, somewhat increasing flow complexity.
-* The user should consider increaseing _prefetch_factor_, if `collate_fn` is very slow and becomes a bottleneck
+* The user should consider increasing _prefetch_factor_, if `collate_fn` is very slow and becomes a bottleneck.

## **How we teach this**
* Dataloader documentation updates:
- * Add a new parameter: num_batch_workers
- * Adjust parameter description: prefetch_factor
+ * Add a new parameter: num_batch_workers.
+ * Adjust parameter description: prefetch_factor.

## Resolution
We decided to do it. X% of the engineering team actively approved of this change.
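The item_worker and batch_worker loops described in the patches above can be summarized by the following deliberately simplified, non-authoritative sketch; the queue wiring is minimal, the worker counts are assumed example values, and shared-memory tensor handling, pin_memory, shutdown, and error paths are all omitted:

```python
# Toy sketch of the proposed two-level pipeline (illustrative only, not the
# actual implementation). Metadata tuples follow the RFC's ordering:
# (item_idx_in_batch, batch_idx, item_index, iw_idx, bw_idx, batch_size)
import multiprocessing as mp

def item_worker(dataset, index_queue, item_queues):
    # Receive one item's metadata at a time, produce the item, route it on.
    while True:
        meta = index_queue.get()
        if meta is None:                         # sentinel: shutdown
            break
        item = dataset[meta[2]]                  # dataset.__getitem__(item_index)
        item_queues[meta[4]].put((meta, item))   # send to the assigned batch_worker

def batch_worker(collate_fn, item_queue, worker_result_queue):
    # Collect items per batch_idx; collate and emit once a batch is complete.
    pending = {}                                 # batch_idx -> [(idx_in_batch, item)]
    while True:
        entry = item_queue.get()
        if entry is None:
            break
        meta, item = entry
        item_idx_in_batch, batch_idx, _, _, _, batch_size = meta
        pending.setdefault(batch_idx, []).append((item_idx_in_batch, item))
        if len(pending[batch_idx]) == batch_size:
            items = [it for _, it in sorted(pending.pop(batch_idx))]
            worker_result_queue.put((batch_idx, collate_fn(items)))

if __name__ == "__main__":
    # Wiring sketch: one index_queue per item_worker, one item_queue per
    # batch_worker, and a single worker_result_queue back to the main process.
    num_workers, num_batch_workers = 8, 2        # assumed example values
    index_queues = [mp.Queue() for _ in range(num_workers)]
    item_queues = [mp.Queue() for _ in range(num_batch_workers)]
    worker_result_queue = mp.Queue()
```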
From 12f16fc4cf8726d82f6d30019e425a8c4c229740 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 16:00:33 +0300 Subject: [PATCH 123/201] aa --- RFC-0000-dataloader-echonomic.md | 36 ++++++++++++++++---------------- 1 file changed, 18 insertions(+), 18 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 1c4a7e7a..e7b20a84 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -64,18 +64,18 @@ The new flow is introducing only minor modifications in dataloader interface, ma ### **Definitions** -| symbol | description | -|-----------------------|:----------------------------------------------------------------------------------------------------------------------------| -| _index_queue_ | A queue to send items indices and metadata from main process to item_worker. There is a separate queue to each item_worker. | -| _item_queue_ | A queue to send items from item_workers to batch_worker. There is a separate queue to each batch_worker. | -| _worker_result_queue_ | A queue to send prepared batches from batch_workers to main process. | -| _item_idx_ | Item serial index in epoch (0 for first item, 1 for next item, etc.) | -| _item_idx_in_batch_ | Item serial index in batch. | -| _batch_idx_ | Batch serial index in epoch (0 for first batch, 1 for next batch, etc.) | -| _item_index_ | Item's dataset index, as in `dataset.__getitem__(index)` | -| _iw_idx_ | Item_worker index {0, 1, ..., _num_workers_ - 1} | -| _bw_idx_ | Batch_worker index {0, 1, ..., _num_batch_workers_ - 1} | -| _batch_size_ | batch size (may be smaller for last batch in epoch) | +| symbol | description | +|-----------------------|:------------------------------------------------------------------------------------------------------------------------------| +| _index_queue_ | A queue to send items indices and metadata from main process to item_worker. There is a separate queue to each item_worker | +| _item_queue_ | A queue to send items from item_workers to batch_worker. There is a separate queue to each batch_worker | +| _worker_result_queue_ | A queue to send prepared batches from batch_workers to main process | +| _item_idx_ | Item serial index in epoch (0 for first item, 1 for next item, etc.) | +| _item_idx_in_batch_ | Item serial index in batch | +| _batch_idx_ | Batch serial index in epoch (0 for first batch, 1 for next batch, etc.) | +| _item_index_ | Item's dataset index, as in `dataset.__getitem__(index)` | +| _iw_idx_ | Item_worker index {0, 1, ..., _num_workers_ - 1} | +| _bw_idx_ | Batch_worker index {0, 1, ..., _num_batch_workers_ - 1} | +| _batch_size_ | batch size (may be smaller for last batch in epoch) | ### **High Level Description** @@ -120,12 +120,12 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main The following dataloader input parameters were modified / added: -| name | description | -|------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------| -| _num_workers_ (modified) | number of item workers. Setting it to 0 disables multiprocessing (as today). There is no benefit in increasing it beyond _prefetch_factor_ * _batch_size_. | -| | | -| _prefetch_factor_ (modified) | number of batches simultanously sent for processing by all workers (2 by default). | -| _num_workers_batches_ (new) | number of batch workers (default is _prefetch_factor_). 
There is no benefit in increasing it beyond _prefetch_factor_. | +| name | description | +|------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------| +| _num_workers_ (modified) | number of item workers. Setting it to 0 disables multiprocessing (as today). There is no benefit in increasing it beyond _prefetch_factor_ * _batch_size_ | +| | | +| _prefetch_factor_ (modified) | number of batches simultanously sent for processing by all workers (2 by default) | +| _num_workers_batches_ (new) | number of batch workers (default is _prefetch_factor_). There is no benefit in increasing it beyond _prefetch_factor_ | ## **Metrics ** The suggested flow should require significantly less shared memory, while preserving TPT, using similar configurations. \ From 3ae4de047909184575e0e7fc47fc510b93782328 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 16:02:12 +0300 Subject: [PATCH 124/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index e7b20a84..19baf903 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -37,7 +37,7 @@ item generating workers (by calling `dataset.__getitem__`), and batch generating This pipeline is designated to significantly reduce random-access-memory (RAM) usage, without any significant reduction in throughput (TPT). ## **Motivation** -In serveral applications, the input batch of a PyTorch model may require large amounts of RAM. Such applications may include video processing, 3D graphics, etc. +In several applications, the input batch of a PyTorch model may require large amounts of RAM. Such applications may include video processing, 3D graphics, etc. By current dataloader multiprocessing pipeline design, workers simultaneously prepare batches and send them into shared memory, by a queue. In practice, about _num_workers_ prepared batches are simultaneously stored in shared memory, nearly after epoch start. From 95510a4848f8808df62c07d08966e5d530475e23 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 16:03:45 +0300 Subject: [PATCH 125/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 19baf903..c3b6eb3e 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -87,7 +87,7 @@ After a batch is retrieved by the main process, another batch is sent to the app In the suggested pipeline, there are 2 levels of workers: * item_workers - designated to generate one item at a time (by running `dataset.__getitem__` function), and send it to shared memory. * This worker is similar to the workers in the current design, but it receives and sends one item at a time (and not one batch at a time) -* batch_workers - designated to get items from shared memory, prepare batchs by running collate_fn, and send the prepared batches back to shared memory, for consumption by the main process. +* batch_workers - designated to get items from shared memory, prepare batches by running collate_fn, and send the prepared batches back to shared memory, for consumption by the main process. 
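Since both hops of this pipeline go through shared memory, it helps to be able to watch its footprint while experimenting. A small helper along these lines (a sketch assuming a Linux host where `/dev/shm` is the tmpfs backing the shared-memory segments) logs the same number as the "used" column of `df -h`:

```python
# Sketch: log /dev/shm usage from Python (Linux-only assumption).
import shutil
import time

def log_shm_usage(interval_s=0.5, steps=10):
    for _ in range(steps):
        usage = shutil.disk_usage("/dev/shm")  # (total, used, free) in bytes
        print(f"/dev/shm used: {usage.used / 2**20:.1f} MiB"
              f" of {usage.total / 2**20:.1f} MiB")
        time.sleep(interval_s)

if __name__ == "__main__":
    log_shm_usage()
```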
Current design dataflow: main_process -> workers -> main_process From 68d6016421fbbcbee7ec2f9e42ced96dca038847 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 16:04:26 +0300 Subject: [PATCH 126/201] aa --- RFC-0000-dataloader-echonomic.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index c3b6eb3e..ed0b90ff 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -120,12 +120,12 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main The following dataloader input parameters were modified / added: -| name | description | -|------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------| -| _num_workers_ (modified) | number of item workers. Setting it to 0 disables multiprocessing (as today). There is no benefit in increasing it beyond _prefetch_factor_ * _batch_size_ | -| | | -| _prefetch_factor_ (modified) | number of batches simultanously sent for processing by all workers (2 by default) | -| _num_workers_batches_ (new) | number of batch workers (default is _prefetch_factor_). There is no benefit in increasing it beyond _prefetch_factor_ | +| name | description | +|------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------| +| _num_workers_ (modified) | number of item workers. Setting it to 0 disables multiprocessing (as today). There is no benefit in increasing it beyond _prefetch_factor_ * _batch_size_ | +| | | +| _prefetch_factor_ (modified) | number of batches simultaneously sent for processing by all workers (2 by default) | +| _num_workers_batches_ (new) | number of batch workers (default is _prefetch_factor_). There is no benefit in increasing it beyond _prefetch_factor_ | ## **Metrics ** The suggested flow should require significantly less shared memory, while preserving TPT, using similar configurations. \ From 6a371bfaa099bdf8ead99d8f3b1dad561e65720b Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 16:05:00 +0300 Subject: [PATCH 127/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index ed0b90ff..1303c42f 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -135,7 +135,7 @@ and review /dev/shm "used" column. ## **Drawbacks** * Additional layer of batch_workers is required, somewhat increasing flow complexity. -* The user should consider increaseing _prefetch_factor_, if `collate_fn` is very slow and becomes a bottleneck. +* The user should consider increasing _prefetch_factor_, if `collate_fn` is very slow and becomes a bottleneck. 
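A quick way to judge whether `collate_fn` is slow enough to matter is to time it against item generation before touching _prefetch_factor_. The toy `__getitem__` and collate below are assumptions of this sketch; substitute the real ones:

```python
# Sketch: compare per-batch __getitem__ time vs collate time.
import time

def toy_getitem(i):
    return [float(i)] * 1000  # stand-in for dataset.__getitem__(i)

def toy_collate(items):
    return [sum(col) for col in zip(*items)]  # stand-in for collate_fn

batch_size = 32
t0 = time.perf_counter()
items = [toy_getitem(i) for i in range(batch_size)]
t1 = time.perf_counter()
batch = toy_collate(items)
t2 = time.perf_counter()
print(f"getitem: {(t1 - t0) * 1e3:.2f} ms/batch, collate: {(t2 - t1) * 1e3:.2f} ms/batch")
```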
## **How we teach this** From f8a1275248d59e9e57d2aa1a4039e3ba1cf1df5e Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 16:11:31 +0300 Subject: [PATCH 128/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 1303c42f..43e7c95a 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -41,7 +41,7 @@ In several applications, the input batch of a PyTorch model may require large am By current dataloader multiprocessing pipeline design, workers simultaneously prepare batches and send them into shared memory, by a queue. In practice, about _num_workers_ prepared batches are simultaneously stored in shared memory, nearly after epoch start. -At most, _num_workers_ * _prefetch_factor_ may be simultaneously stored in shared memory. +At most, (_num_workers_ * _prefetch_factor_) may be simultaneously stored in shared memory. The main process operates in parallel to the workers, to extract one batch after another, from shared memory, and inject it into the model for training/validation/test. Simultaneously storing about _num_workers_ batches in shared memory, imposes a limit over _num_workers_:\ From c612cdf74ea1a4113881bb28aba4a4c338f98288 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 16:12:06 +0300 Subject: [PATCH 129/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 43e7c95a..1cb06030 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -41,7 +41,7 @@ In several applications, the input batch of a PyTorch model may require large am By current dataloader multiprocessing pipeline design, workers simultaneously prepare batches and send them into shared memory, by a queue. In practice, about _num_workers_ prepared batches are simultaneously stored in shared memory, nearly after epoch start. -At most, (_num_workers_ * _prefetch_factor_) may be simultaneously stored in shared memory. +At most, (_num_workers_ * _prefetch_factor_) prepared batches may be simultaneously stored in shared memory. The main process operates in parallel to the workers, to extract one batch after another, from shared memory, and inject it into the model for training/validation/test. Simultaneously storing about _num_workers_ batches in shared memory, imposes a limit over _num_workers_:\ From 686d572ac6ade733323cd32a5990d0d15e8c2452 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 16:14:27 +0300 Subject: [PATCH 130/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 1cb06030..36461768 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -50,7 +50,7 @@ This limitation can produce a bottleneck over training TPT, not allowing to incr Alternatively, to increase num_workers, a sever with more available RAM is required, increasing sever cost. A new dataloader multiprocessing pipeline is suggested. In this pipeline, there are two types of workers: -item generating workers (by `dataset.__getitem__` function), and batch generating workers (by `collate_fn`). +item generating workers (by calling `dataset.__getitem__`), and batch generating workers (by calling `collate_fn`). 
This design allows to simultaneously process only up to _prefetch_factor_ batches by all the workers together. The decoupling of number of processed batches from _num_workers_, allows to increase _num_workers_, without any significant increase in shared memory consumption. As in current implementation, the workers continuously generate items during epoch, and are not expected to enter idle state. Hence no TPT reduction is expected. From 81224d368311f66817ff66c2f1b8c089dcbf3e67 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 16:19:59 +0300 Subject: [PATCH 131/201] aa --- RFC-0000-dataloader-echonomic.md | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 36461768..d972b843 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -77,6 +77,17 @@ The new flow is introducing only minor modifications in dataloader interface, ma | _bw_idx_ | Batch_worker index {0, 1, ..., _num_batch_workers_ - 1} | | _batch_size_ | batch size (may be smaller for last batch in epoch) | +### **New Parameters** + +The following dataloader input parameters were modified / added: + +| name | description | +|------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------| +| _num_workers_ (modified) | number of item workers. Setting it to 0 disables multiprocessing (as today). There is no benefit in increasing it beyond _prefetch_factor_ * _batch_size_ | +| | | +| _prefetch_factor_ (modified) | number of batches simultaneously sent for processing by all workers (2 by default) | +| _num_workers_batches_ (new) | number of batch workers (default is _prefetch_factor_). There is no benefit in increasing it beyond _prefetch_factor_ | + ### **High Level Description** By the current multiprocessing pipeline, a single level of workers is used. @@ -116,17 +127,6 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main * get one item at a time from item_queue and append them into batches, by item batch_idx (and batch_size). * Once all items of a given batch are received, run collate_fn and send the prepared batch to worker_result_queue. -### **New Parameters** - -The following dataloader input parameters were modified / added: - -| name | description | -|------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------| -| _num_workers_ (modified) | number of item workers. Setting it to 0 disables multiprocessing (as today). There is no benefit in increasing it beyond _prefetch_factor_ * _batch_size_ | -| | | -| _prefetch_factor_ (modified) | number of batches simultaneously sent for processing by all workers (2 by default) | -| _num_workers_batches_ (new) | number of batch workers (default is _prefetch_factor_). There is no benefit in increasing it beyond _prefetch_factor_ | - ## **Metrics ** The suggested flow should require significantly less shared memory, while preserving TPT, using similar configurations. 
\ To monitor shared memory usage, type in Linux server terminal: \ From bf15a58ab6bef7267d3b51c22e347abaa4bb6334 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 16:22:52 +0300 Subject: [PATCH 132/201] aa --- RFC-0000-dataloader-echonomic.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index d972b843..207ab0a0 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -104,7 +104,7 @@ Current design dataflow: main_process -> workers -> main_process Suggested design dataflow: main_process -> item_workers -> batch_workers -> main_process -### **Main Process Flow Description** +#### **Main Process Flow** * Retrieve and store prepared batches from batch_workers (by worker_result_queue). * Track number of items at work (workload) by each worker. Make sure to reduce workload counter for the relevant batch_worker, and for each of the relevant item-workers, when retrieving the batch. * Send batches of items metadata to item_workers, one batch at a time. @@ -118,12 +118,12 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main * Once the next required batch is retrieved, return batch to caller function. -### **Item Worker Flow Description** +#### **Item Worker Flow** * get item from index_queue. * run `dataset.__getitem__(item_index)`. * send item to the appropriate batch_worker by item_queue. -### **Batch Worker Flow Description** +#### **Batch Worker Flow** * get one item at a time from item_queue and append them into batches, by item batch_idx (and batch_size). * Once all items of a given batch are received, run collate_fn and send the prepared batch to worker_result_queue. From 9d4fb6f3f17a07a9a533340b13124836b923f062 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 16:24:11 +0300 Subject: [PATCH 133/201] aa --- RFC-0000-dataloader-echonomic.md | 42 ++++++++++++++++---------------- 1 file changed, 21 insertions(+), 21 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 207ab0a0..c7b59f55 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -96,36 +96,36 @@ Each worker prepares one batch at a time, and sends it back to the main process After a batch is retrieved by the main process, another batch is sent to the appropriate worker. In the suggested pipeline, there are 2 levels of workers: -* item_workers - designated to generate one item at a time (by running `dataset.__getitem__` function), and send it to shared memory. +* item_workers - designated to generate one item at a time (by running `dataset.__getitem__` function), and send it to shared memory * This worker is similar to the workers in the current design, but it receives and sends one item at a time (and not one batch at a time) -* batch_workers - designated to get items from shared memory, prepare batches by running collate_fn, and send the prepared batches back to shared memory, for consumption by the main process. +* batch_workers - designated to get items from shared memory, prepare batches by running collate_fn, and send the prepared batches back to shared memory, for consumption by the main process Current design dataflow: main_process -> workers -> main_process Suggested design dataflow: main_process -> item_workers -> batch_workers -> main_process #### **Main Process Flow** -* Retrieve and store prepared batches from batch_workers (by worker_result_queue). 
- * Track number of items at work (workload) by each worker. Make sure to reduce workload counter for the relevant batch_worker, and for each of the relevant item-workers, when retrieving the batch. -* Send batches of items metadata to item_workers, one batch at a time. +* Retrieve and store prepared batches from batch_workers (by worker_result_queue) + * Track number of items at work (workload) by each worker. Make sure to reduce workload counter for the relevant batch_worker, and for each of the relevant item-workers, when retrieving the batch +* Send batches of items metadata to item_workers, one batch at a time * Each item should include the following metadata: (item_idx_in_batch, batch_idx, item_index, iw_idx, bw_idx, batch_size): - * A possibly different iw_idx should be assigned to each item. - * Select iw_idx by the items_worker with the minimal workload. - * An identical bw_idx should be assigned to all items in the same batch. - * Select bw_idx by the batches_worker with the minimal workload. - * Make sure that the sum of item_workers workload is always <= (_prefetch_factor_ * _batch_size_). Stop sending batches when reaching this limit. - * Make sure to increase workload counter for the relevant batch_worker, and for each of the relevant item-workers, when sending the batch of items. + * A possibly different iw_idx should be assigned to each item + * Select iw_idx by the items_worker with the minimal workload + * An identical bw_idx should be assigned to all items in the same batch + * Select bw_idx by the batches_worker with the minimal workload + * Make sure that the sum of item_workers workload is always <= (_prefetch_factor_ * _batch_size_). Stop sending batches when reaching this limit + * Make sure to increase workload counter for the relevant batch_worker, and for each of the relevant item-workers, when sending the batch of items -* Once the next required batch is retrieved, return batch to caller function. +* Once the next required batch is retrieved, return batch to caller function #### **Item Worker Flow** -* get item from index_queue. -* run `dataset.__getitem__(item_index)`. -* send item to the appropriate batch_worker by item_queue. +* get item from index_queue +* run `dataset.__getitem__(item_index)` +* send item to the appropriate batch_worker by item_queue #### **Batch Worker Flow** -* get one item at a time from item_queue and append them into batches, by item batch_idx (and batch_size). -* Once all items of a given batch are received, run collate_fn and send the prepared batch to worker_result_queue. +* get one item at a time from item_queue and append them into batches, by item batch_idx (and batch_size) +* Once all items of a given batch are received, run collate_fn and send the prepared batch to worker_result_queue ## **Metrics ** The suggested flow should require significantly less shared memory, while preserving TPT, using similar configurations. \ @@ -134,14 +134,14 @@ $ monitor -n0.1 df -h \ and review /dev/shm "used" column. ## **Drawbacks** -* Additional layer of batch_workers is required, somewhat increasing flow complexity. -* The user should consider increasing _prefetch_factor_, if `collate_fn` is very slow and becomes a bottleneck. +* Additional layer of batch_workers is required, somewhat increasing flow complexity +* The user should consider increasing _prefetch_factor_, if `collate_fn` is very slow and becomes a bottleneck ## **How we teach this** * Dataloader documentation updates: - * Add a new parameter: num_batch_workers. 
- * Adjust parameter description: prefetch_factor. + * Add a new parameter: num_batch_workers + * Adjust parameter description: prefetch_factor ## Resolution We decided to do it. X% of the engineering team actively approved of this change. From 4bb8057288fd2e5d87df8070730ee61519a0ca34 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 16:24:54 +0300 Subject: [PATCH 134/201] aa --- RFC-0000-dataloader-echonomic.md | 21 ++++++++++----------- 1 file changed, 10 insertions(+), 11 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index c7b59f55..d21dd82b 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -77,17 +77,6 @@ The new flow is introducing only minor modifications in dataloader interface, ma | _bw_idx_ | Batch_worker index {0, 1, ..., _num_batch_workers_ - 1} | | _batch_size_ | batch size (may be smaller for last batch in epoch) | -### **New Parameters** - -The following dataloader input parameters were modified / added: - -| name | description | -|------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------| -| _num_workers_ (modified) | number of item workers. Setting it to 0 disables multiprocessing (as today). There is no benefit in increasing it beyond _prefetch_factor_ * _batch_size_ | -| | | -| _prefetch_factor_ (modified) | number of batches simultaneously sent for processing by all workers (2 by default) | -| _num_workers_batches_ (new) | number of batch workers (default is _prefetch_factor_). There is no benefit in increasing it beyond _prefetch_factor_ | - ### **High Level Description** By the current multiprocessing pipeline, a single level of workers is used. @@ -127,6 +116,16 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main * get one item at a time from item_queue and append them into batches, by item batch_idx (and batch_size) * Once all items of a given batch are received, run collate_fn and send the prepared batch to worker_result_queue +#### **New Parameters** +The following dataloader input parameters were modified / added: + +| name | description | +|------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------| +| _num_workers_ (modified) | number of item workers. Setting it to 0 disables multiprocessing (as today). There is no benefit in increasing it beyond _prefetch_factor_ * _batch_size_ | +| | | +| _prefetch_factor_ (modified) | number of batches simultaneously sent for processing by all workers (2 by default) | +| _num_workers_batches_ (new) | number of batch workers (default is _prefetch_factor_). There is no benefit in increasing it beyond _prefetch_factor_ | + ## **Metrics ** The suggested flow should require significantly less shared memory, while preserving TPT, using similar configurations. 
\ To monitor shared memory usage, type in Linux server terminal: \ From 005bdc8c6bdd42adc7c57ab545ae019db3088ab9 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 16:29:54 +0300 Subject: [PATCH 135/201] aa --- RFC-0000-dataloader-echonomic.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index d21dd82b..527bb8ba 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -134,7 +134,8 @@ and review /dev/shm "used" column. ## **Drawbacks** * Additional layer of batch_workers is required, somewhat increasing flow complexity -* The user should consider increasing _prefetch_factor_, if `collate_fn` is very slow and becomes a bottleneck +* CPU usage is somewhat higher in the suggested flow, due to the additional _num_workers_batches_ processes +* The user should be aware that if `collate_fn` is very slow and becomes a bottleneck, an increase in _prefetch_factor_ should be considered ## **How we teach this** From 58495e9961ea991c5e2e87eb620d905f97de3342 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 20:41:03 +0300 Subject: [PATCH 136/201] aa --- RFC-0000-dataloader-echonomic.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 527bb8ba..6c41ed2b 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -26,13 +26,13 @@ nor about implementing the described feature until some time in the future. -# Pytorch-DataLoader-Economic +# PyTorch-DataLoader-Economic **Authors:** * @yoadbs ## **Summary** -A new PyTorch dataloader multiprocessing pipeline design is suggested. This pipeline splits the task of batch generation, into 2 types of workers:\ +A new dataloader multiprocessing pipeline design is suggested. This pipeline splits the task of batch generation, into 2 types of workers:\ item generating workers (by calling `dataset.__getitem__`), and batch generating workers (by calling `collate_fn`). This pipeline is designated to significantly reduce random-access-memory (RAM) usage, without any significant reduction in throughput (TPT). From 2dcd339ac27997a12df23d8dc0d32ec917d1ce5d Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 20:43:30 +0300 Subject: [PATCH 137/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 6c41ed2b..a14b46a7 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -49,7 +49,7 @@ _num_workers_ < (_total_available_ram_in_bytes_ / _batch_size_in_bytes_) \ This limitation can produce a bottleneck over training TPT, not allowing to increase num_workers, due to server's RAM limitations. Alternatively, to increase num_workers, a sever with more available RAM is required, increasing sever cost. -A new dataloader multiprocessing pipeline is suggested. In this pipeline, there are two types of workers: +A new dataloader multiprocessing pipeline design is suggested. In this pipeline, there are two types of workers: item generating workers (by calling `dataset.__getitem__`), and batch generating workers (by calling `collate_fn`). This design allows to simultaneously process only up to _prefetch_factor_ batches by all the workers together. 
The decoupling of number of processed batches from _num_workers_, allows to increase _num_workers_, without any significant increase in shared memory consumption. From 8df058eb2d23145f422fd8283d9e48ce7a216ad7 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 20:45:40 +0300 Subject: [PATCH 138/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index a14b46a7..180b8fde 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -56,7 +56,7 @@ The decoupling of number of processed batches from _num_workers_, allows to incr As in current implementation, the workers continuously generate items during epoch, and are not expected to enter idle state. Hence no TPT reduction is expected. Another smaller advantage is that in the proposed implementation, the first batch in each epoch is generated by multiple workers, while in current implementation it is generated by a single worker. -Hence, using the suggested implementation, epoch can potentially start faster. +Hence, epoch can potentially start faster. The new flow is introducing only minor modifications in dataloader interface, making the transition almost transparent to the user. From f7a004ca176eca6344126cf841eccd68068f40d0 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 20:48:43 +0300 Subject: [PATCH 139/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 180b8fde..13e1e85c 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -85,7 +85,7 @@ Each worker prepares one batch at a time, and sends it back to the main process After a batch is retrieved by the main process, another batch is sent to the appropriate worker. 
In the suggested pipeline, there are 2 levels of workers: -* item_workers - designated to generate one item at a time (by running `dataset.__getitem__` function), and send it to shared memory +* item_workers - designated to generate one item at a time (by running `dataset.__getitem__`), and send it to shared memory * This worker is similar to the workers in the current design, but it receives and sends one item at a time (and not one batch at a time) * batch_workers - designated to get items from shared memory, prepare batches by running collate_fn, and send the prepared batches back to shared memory, for consumption by the main process From cdf56f8f1dd7fcd271eb25218773e6f1bf52594f Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 20:49:25 +0300 Subject: [PATCH 140/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 13e1e85c..618955ab 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -87,7 +87,7 @@ After a batch is retrieved by the main process, another batch is sent to the app In the suggested pipeline, there are 2 levels of workers: * item_workers - designated to generate one item at a time (by running `dataset.__getitem__`), and send it to shared memory * This worker is similar to the workers in the current design, but it receives and sends one item at a time (and not one batch at a time) -* batch_workers - designated to get items from shared memory, prepare batches by running collate_fn, and send the prepared batches back to shared memory, for consumption by the main process +* batch_workers - designated to get items from shared memory, prepare batches by running `collate_fn`, and send the prepared batches back to shared memory, for consumption by the main process Current design dataflow: main_process -> workers -> main_process From d7395436b7f3a8dcba3cd76b4459ffc502896f2d Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 20:51:26 +0300 Subject: [PATCH 141/201] aa --- RFC-0000-dataloader-echonomic.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 618955ab..12d56e99 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -99,9 +99,9 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main * Send batches of items metadata to item_workers, one batch at a time * Each item should include the following metadata: (item_idx_in_batch, batch_idx, item_index, iw_idx, bw_idx, batch_size): * A possibly different iw_idx should be assigned to each item - * Select iw_idx by the items_worker with the minimal workload + * Select iw_idx by the item_worker with the minimal workload * An identical bw_idx should be assigned to all items in the same batch - * Select bw_idx by the batches_worker with the minimal workload + * Select bw_idx by the batch_worker with the minimal workload * Make sure that the sum of item_workers workload is always <= (_prefetch_factor_ * _batch_size_). 
Stop sending batches when reaching this limit * Make sure to increase workload counter for the relevant batch_worker, and for each of the relevant item-workers, when sending the batch of items From 78d07694c20aae45270c868a73c9aa8cfb3acc09 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 20:56:46 +0300 Subject: [PATCH 142/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 12d56e99..ef063445 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -96,7 +96,7 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main #### **Main Process Flow** * Retrieve and store prepared batches from batch_workers (by worker_result_queue) * Track number of items at work (workload) by each worker. Make sure to reduce workload counter for the relevant batch_worker, and for each of the relevant item-workers, when retrieving the batch -* Send batches of items metadata to item_workers, one batch at a time +* Send batches of items for preparation to item_workers, one batch at a time * Each item should include the following metadata: (item_idx_in_batch, batch_idx, item_index, iw_idx, bw_idx, batch_size): * A possibly different iw_idx should be assigned to each item * Select iw_idx by the item_worker with the minimal workload From 5c60b34f4ed7b5f65a17c1ad1a288c78b107b6e6 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 20:58:33 +0300 Subject: [PATCH 143/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index ef063445..89b0671e 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -103,7 +103,7 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main * An identical bw_idx should be assigned to all items in the same batch * Select bw_idx by the batch_worker with the minimal workload * Make sure that the sum of item_workers workload is always <= (_prefetch_factor_ * _batch_size_). 
Stop sending batches when reaching this limit - * Make sure to increase workload counter for the relevant batch_worker, and for each of the relevant item-workers, when sending the batch of items + * Make sure to increase workload counter for the relevant batch_worker, and for each of the relevant item_workers, when sending the batch of items * Once the next required batch is retrieved, return batch to caller function From b1fd6bfa568bb29e87ca6e8b9256721a275a8247 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 20:59:49 +0300 Subject: [PATCH 144/201] aa --- RFC-0000-dataloader-echonomic.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 89b0671e..cba70de6 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -107,12 +107,12 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main * Once the next required batch is retrieved, return batch to caller function -#### **Item Worker Flow** +#### **Item_worker Flow** * get item from index_queue * run `dataset.__getitem__(item_index)` * send item to the appropriate batch_worker by item_queue -#### **Batch Worker Flow** +#### **Batch_worker Flow** * get one item at a time from item_queue and append them into batches, by item batch_idx (and batch_size) * Once all items of a given batch are received, run collate_fn and send the prepared batch to worker_result_queue From 4729473d3860466d2e9a3a27b826035314e28bff Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 21:00:31 +0300 Subject: [PATCH 145/201] aa --- RFC-0000-dataloader-echonomic.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index cba70de6..0affe0f6 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -121,10 +121,10 @@ The following dataloader input parameters were modified / added: | name | description | |------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------| -| _num_workers_ (modified) | number of item workers. Setting it to 0 disables multiprocessing (as today). There is no benefit in increasing it beyond _prefetch_factor_ * _batch_size_ | +| _num_workers_ (modified) | number of item_workers. Setting it to 0 disables multiprocessing (as today). There is no benefit in increasing it beyond _prefetch_factor_ * _batch_size_ | | | | | _prefetch_factor_ (modified) | number of batches simultaneously sent for processing by all workers (2 by default) | -| _num_workers_batches_ (new) | number of batch workers (default is _prefetch_factor_). There is no benefit in increasing it beyond _prefetch_factor_ | +| _num_workers_batches_ (new) | number of batch_workers (default is _prefetch_factor_). There is no benefit in increasing it beyond _prefetch_factor_ | ## **Metrics ** The suggested flow should require significantly less shared memory, while preserving TPT, using similar configurations. 
\
To monitor shared memory usage, type in Linux server terminal: \
$ watch -n0.1 df -h \
and review /dev/shm "used" column.

From 88416417793926de913205e639663ca8e256b5db Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sat, 21 Sep 2024 21:03:28 +0300
Subject: [PATCH 146/201] aa

---
 RFC-0000-dataloader-echonomic.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 0affe0f6..a871a7b5 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -110,7 +110,7 @@ Suggested 
design dataflow: main_process -> item_workers -> batch_workers -> main * Send item to the appropriate item_queue (by item's bw_idx) #### **Batch_worker Flow** -* Get one item at a time from item_queue and append them into batches, by item batch_idx (and batch_size) +* Get one item at a time from item_queue and collect them into batches, by item batch_idx (and batch_size) * Once all items of a given batch are received, run collate_fn and send the prepared batch to worker_result_queue #### **New Parameters** From a5963d18fc9713dfa09285a470b2f19ae45e593e Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 21:08:38 +0300 Subject: [PATCH 150/201] aa --- RFC-0000-dataloader-echonomic.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 6e6a4615..d0a9020e 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -81,7 +81,7 @@ The new flow is introducing only minor modifications in dataloader interface, ma By the current multiprocessing pipeline, a single level of workers is used. The main process sends _prefetch_factor_ batches to each worker. -Each worker prepares one batch at a time, and sends it back to the main process by worker_result_queue. +Each worker prepares one batch at a time, and sends it back to the main process by _worker_result_queue_. After a batch is retrieved by the main process, another batch is sent to the appropriate worker. In the suggested pipeline, there are 2 levels of workers: @@ -94,10 +94,10 @@ Current design dataflow: main_process -> workers -> main_process Suggested design dataflow: main_process -> item_workers -> batch_workers -> main_process #### **Main Process Flow** -* Retrieve and store prepared batches from batch_workers (by worker_result_queue) +* Retrieve and store prepared batches from batch_workers (by _worker_result_queue_) * Track number of items at work (workload) by each worker. 
Make sure to reduce workload counter for the relevant batch_worker, and for each of the relevant item-workers, when retrieving the batch * Send batches of items for preparation to item_workers, one batch at a time - * Each item should include the following metadata: (item_idx_in_batch, batch_idx, item_index, iw_idx, bw_idx, batch_size): + * Each item should include the following metadata: (_item_idx_in_batch_, _batch_idx_, _item_index_, _iw_idx_, _bw_idx_, _batch_size_): * A possibly different iw_idx should be assigned to each item * Select iw_idx by the item_worker with the minimal workload * An identical bw_idx should be assigned to all items in the same batch @@ -108,13 +108,13 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main * Once the next required batch is retrieved, return batch to caller function #### **Item_worker Flow** -* Get item from index_queue +* Get item from _index_queue_ * Run `dataset.__getitem__(item_index)` -* Send item to the appropriate item_queue (by item's bw_idx) +* Send item to the appropriate _item_queue_ (by item's bw_idx) #### **Batch_worker Flow** -* Get one item at a time from item_queue and collect them into batches, by item batch_idx (and batch_size) -* Once all items of a given batch are received, run collate_fn and send the prepared batch to worker_result_queue +* Get one item at a time from _item_queue_ and collect them into batches, by item batch_idx (and batch_size) +* Once all items of a given batch are received, run collate_fn and send the prepared batch to _worker_result_queue_ #### **New Parameters** The following dataloader input parameters were modified / added: From 5acf4305e7769a84663cd2743320a93732324488 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 21:13:19 +0300 Subject: [PATCH 151/201] aa --- RFC-0000-dataloader-echonomic.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index d0a9020e..5e3b591a 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -94,13 +94,13 @@ Current design dataflow: main_process -> workers -> main_process Suggested design dataflow: main_process -> item_workers -> batch_workers -> main_process #### **Main Process Flow** -* Retrieve and store prepared batches from batch_workers (by _worker_result_queue_) +* Retrieve and store prepared batches from _worker_result_queue_ * Track number of items at work (workload) by each worker. Make sure to reduce workload counter for the relevant batch_worker, and for each of the relevant item-workers, when retrieving the batch * Send batches of items for preparation to item_workers, one batch at a time * Each item should include the following metadata: (_item_idx_in_batch_, _batch_idx_, _item_index_, _iw_idx_, _bw_idx_, _batch_size_): - * A possibly different iw_idx should be assigned to each item + * A possibly different item_worker should be assigned to each item * Select iw_idx by the item_worker with the minimal workload - * An identical bw_idx should be assigned to all items in the same batch + * The same batch_worker should be assigned to all items in the same batch * Select bw_idx by the batch_worker with the minimal workload * Make sure that the sum of item_workers workload is always <= (_prefetch_factor_ * _batch_size_). 
Stop sending batches when reaching this limit * Make sure to increase workload counter for the relevant batch_worker, and for each of the relevant item_workers, when sending the batch of items From 692d2adec48ab3ba03e2ad906e86bcd585add7f3 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 21:15:39 +0300 Subject: [PATCH 152/201] aa --- RFC-0000-dataloader-echonomic.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 5e3b591a..f64c4dd2 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -103,9 +103,8 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main * The same batch_worker should be assigned to all items in the same batch * Select bw_idx by the batch_worker with the minimal workload * Make sure that the sum of item_workers workload is always <= (_prefetch_factor_ * _batch_size_). Stop sending batches when reaching this limit - * Make sure to increase workload counter for the relevant batch_worker, and for each of the relevant item_workers, when sending the batch of items - -* Once the next required batch is retrieved, return batch to caller function + * Make sure to increase workload counter for the relevant batch_worker, and for each of the relevant item_workers, when sending the batch of items +* Once the next required batch is available (by _batch_idx_), return batch to caller function #### **Item_worker Flow** * Get item from _index_queue_ From 22ea48eb9a6ccd2b1aea652855298e66aa73a2cd Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 21:18:14 +0300 Subject: [PATCH 153/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index f64c4dd2..7195b60c 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -96,7 +96,7 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main #### **Main Process Flow** * Retrieve and store prepared batches from _worker_result_queue_ * Track number of items at work (workload) by each worker. 
Make sure to reduce workload counter for the relevant batch_worker, and for each of the relevant item-workers, when retrieving the batch -* Send batches of items for preparation to item_workers, one batch at a time +* Send batches of items for preparation to _index_queues_, one batch at a time * Each item should include the following metadata: (_item_idx_in_batch_, _batch_idx_, _item_index_, _iw_idx_, _bw_idx_, _batch_size_): * A possibly different item_worker should be assigned to each item * Select iw_idx by the item_worker with the minimal workload From 5a255f3130bb6f186d99aa386affa758176146ac Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 21:19:26 +0300 Subject: [PATCH 154/201] aa --- RFC-0000-dataloader-echonomic.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 7195b60c..5e715b3b 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -118,12 +118,12 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main #### **New Parameters** The following dataloader input parameters were modified / added: -| name | description | -|------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------| -| _num_workers_ (modified) | number of item_workers. Setting it to 0 disables multiprocessing (as today). There is no benefit in increasing it beyond _prefetch_factor_ * _batch_size_ | -| | | -| _prefetch_factor_ (modified) | number of batches simultaneously sent for processing by all workers (2 by default) | -| _num_workers_batches_ (new) | number of batch_workers (default is _prefetch_factor_). There is no benefit in increasing it beyond _prefetch_factor_ | +| name | description | +|------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------| +| _num_workers_ (modified) | number of item_workers. Setting it to 0 disables multiprocessing (as today). There is no benefit in increasing it beyond (_prefetch_factor_ * _batch_size_) | +| | | +| _prefetch_factor_ (modified) | number of batches simultaneously sent for processing by all workers (2 by default) | +| _num_workers_batches_ (new) | number of batch_workers (default is _prefetch_factor_). There is no benefit in increasing it beyond _prefetch_factor_ | ## **Metrics ** The suggested flow should require significantly less shared memory, while preserving TPT, using similar configurations. \ From 3c9b9c43889df2f7d9181546bfddc11c130087f6 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 21:25:21 +0300 Subject: [PATCH 155/201] aa --- RFC-0000-dataloader-echonomic.md | 34 ++------------------------------ 1 file changed, 2 insertions(+), 32 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 5e715b3b..e996045e 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -138,35 +138,5 @@ and review /dev/shm "used" column. ## **How we teach this** -* Dataloader documentation updates: - * Add a new parameter: num_batch_workers - * Adjust parameter description: prefetch_factor - -## Resolution -We decided to do it. X% of the engineering team actively approved of this change. 
- -### Level of Support -Choose one of the following: -* 1: Overwhelming positive feedback. -* 2: Positive feedback. -* 3: Majority Acceptance, with conflicting Feedback. -* 4: Acceptance, with Little Feedback. -* 5: Unclear Resolution. -* 6: RFC Rejected. -* 7: RFC Rejected, with Conflicting Feedback. - - -#### Additional Context -Some people were in favor of it, but some people didn’t want it for project X. - - -### Next Steps -Will implement it. - - -#### Tracking issue - - - -#### Exceptions -Not implementing on project X now. Will revisit the decision in 1 year. +Update Dataloader documentation to include the description of the suggested pipeline. +Add/update description of the new/modified parameters. \ No newline at end of file From dcb1dee8f7ea317009c03a3dce82aee7bb2b5145 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 21:34:44 +0300 Subject: [PATCH 156/201] aa --- RFC-0000-dataloader-echonomic.md | 28 ---------------------------- 1 file changed, 28 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index e996045e..9ee2a9d4 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -1,31 +1,3 @@ - - -
-Instructions - click to expand - -- Fork the rfcs repo: https://github.com/pytorch/rfcs -- Copy `RFC-0000-template.md` to `RFC-00xx-my-feature.md`, or write your own open-ended proposal. Put care into the details. -- Submit a pull request titled `RFC-00xx-my-feature`. - - Assign the `draft` label while composing the RFC. You may find it easier to use a WYSIWYG editor (like Google Docs) when working with a few close collaborators; feel free to use whatever platform you like. Ideally this document is publicly visible and is linked to from the PR. - - When opening the RFC for general discussion, copy your document into the `RFC-00xx-my-feature.md` file on the PR and assign the `commenting` label. -- Build consensus for your proposal, integrate feedback and revise it as needed, and summarize the outcome of the discussion via a [resolution template](https://github.com/pytorch/rfcs/blob/master/RFC-0000-template.md#resolution). - - If the RFC is idle here (no activity for 2 weeks), assign the label `stalled` to the PR. -- Once the discussion has settled, assign a new label based on the level of support: - - `accepted` if a decision has been made in the RFC - - `draft` if the author needs to rework the RFC’s proposal - - `shelved` if there are no plans to move ahead with the current RFC’s proposal. We want neither to think about evaluating the proposal -nor about implementing the described feature until some time in the future. -- A state of `accepted` means that the core team has agreed in principle to the proposal, and it is ready for implementation. -- The author (or any interested developer) should next open a tracking issue on Github corresponding to the RFC. - - This tracking issue should contain the implementation next steps. Link to this tracking issue on the RFC (in the Resolution > Next Steps section) -- Once all relevant PRs are merged, the RFC’s status label can be finally updated to `closed`. - -
-
-
-
-
-
 # PyTorch-DataLoader-Economic
 **Authors:**
 * @yoadbs

From 4c2ea2b85a9f5338d11a87b638a84705849a2ac7 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sat, 21 Sep 2024 21:35:21 +0300
Subject: [PATCH 157/201] aa

---
 RFC-0000-dataloader-echonomic.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 9ee2a9d4..f18c7b57 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -97,7 +97,7 @@ The following dataloader input parameters were modified / added:
 | _prefetch_factor_ (modified) | number of batches simultaneously sent for processing by all workers (2 by default) |
 | _num_workers_batches_ (new)  | number of batch_workers (default is _prefetch_factor_). There is no benefit in increasing it beyond _prefetch_factor_ |
 
-## **Metrics **
+## **Metrics**
 The suggested flow should require significantly less shared memory, while preserving TPT, using similar configurations. \
 To monitor shared memory usage, type in Linux server terminal: \
 $ watch -n0.1 df -h \

From 00feec4841aa30b280eec89b7422b37b6a9a83af Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sat, 21 Sep 2024 21:38:48 +0300
Subject: [PATCH 158/201] aa

---
 RFC-0000-dataloader-echonomic.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index f18c7b57..2a5fca61 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -1,4 +1,4 @@
-# PyTorch-DataLoader-Economic
+# DataLoader-Economic
 **Authors:**
 * @yoadbs
 

From 54a732ecd71c3fb7c75aa5d499d151182f93615f Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Sat, 21 Sep 2024 21:44:41 +0300
Subject: [PATCH 159/201] aa

---
 RFC-0000-dataloader-echonomic.md | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index 2a5fca61..fff9cb61 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -36,18 +36,18 @@
 
 ### **Definitions**
 
-| symbol                | description                                                                                                                     |
-|-----------------------|:------------------------------------------------------------------------------------------------------------------------------|
-| _index_queue_         | A queue to send items indices and metadata from main process to item_worker. There is a separate queue to each item_worker |
-| _item_queue_          | A queue to send items from item_workers to batch_worker. There is a separate queue to each batch_worker |
-| _worker_result_queue_ | A queue to send prepared batches from batch_workers to main process |
-| _item_idx_            | Item serial index in epoch (0 for first item, 1 for next item, etc.) |
-| _item_idx_in_batch_   | Item serial index in batch |
-| _batch_idx_           | Batch serial index in epoch (0 for first batch, 1 for next batch, etc.) |
-| _item_index_          | Item's dataset index, as in `dataset.__getitem__(index)` |
-| _iw_idx_              | Item_worker index {0, 1, ..., _num_workers_ - 1} |
-| _bw_idx_              | Batch_worker index {0, 1, ..., _num_batch_workers_ - 1} |
-| _batch_size_          | batch size (may be smaller for last batch in epoch) |
+| symbol                | description                                                                                                                  |
+|-----------------------|:---------------------------------------------------------------------------------------------------------------------------|
+| _index_queue_         | A queue to send items indices and metadata from main process to item_worker. 
There is a separate queue to each item_worker | +| _item_queue_ | A queue to send items from item_workers to batch_worker. There is a separate queue to each batch_worker | +| _worker_result_queue_ | A queue to send prepared batches from batch_workers to main process | +| _item_idx_ | Item serial index in epoch (0 for first item, 1 for next item, etc.) | +| _item_idx_in_batch_ | Item serial index in batch | +| _batch_idx_ | Batch serial index in epoch (0 for first batch, 1 for next batch, etc.) | +| _item_index_ | Item's dataset index, as in `dataset.__getitem__(index=item_index)` | +| _iw_idx_ | Item_worker index {0, 1, ..., _num_workers_ - 1} | +| _bw_idx_ | Batch_worker index {0, 1, ..., _num_batch_workers_ - 1} | +| _batch_size_ | batch size (may be smaller for last batch in epoch) | ### **High Level Description** From 65987293ace724d7379bb4e73055fa1365072e38 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 21:46:28 +0300 Subject: [PATCH 160/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index fff9cb61..829305e3 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -58,7 +58,7 @@ After a batch is retrieved by the main process, another batch is sent to the app In the suggested pipeline, there are 2 levels of workers: * item_workers - designated to generate one item at a time (by running `dataset.__getitem__`), and send it to shared memory - * This worker is similar to the workers in the current design, but it receives and sends one item at a time (and not one batch at a time) + * These workers are similar to the workers in the current design, but it receives and sends one item at a time (and not one batch at a time) * batch_workers - designated to get items from shared memory, prepare batches by running `collate_fn`, and send the prepared batches back to shared memory, for consumption by the main process Current design dataflow: main_process -> workers -> main_process From e8d285a6da8b00a5be406160b96b230a93c5fd66 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 21:47:34 +0300 Subject: [PATCH 161/201] aa --- RFC-0000-dataloader-echonomic.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 829305e3..4fd6e3e1 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -57,8 +57,8 @@ Each worker prepares one batch at a time, and sends it back to the main process After a batch is retrieved by the main process, another batch is sent to the appropriate worker. 
In the suggested pipeline, there are 2 levels of workers: -* item_workers - designated to generate one item at a time (by running `dataset.__getitem__`), and send it to shared memory - * These workers are similar to the workers in the current design, but it receives and sends one item at a time (and not one batch at a time) +* item_worker - designated to generate one item at a time (by running `dataset.__getitem__`), and send it to shared memory + * This worker is similar to the worker in the current design, but it receives and sends one item at a time (and not one batch at a time) * batch_workers - designated to get items from shared memory, prepare batches by running `collate_fn`, and send the prepared batches back to shared memory, for consumption by the main process Current design dataflow: main_process -> workers -> main_process From 96f4d2d410f150c50b1049871dfa2a7343000f7e Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 21:47:51 +0300 Subject: [PATCH 162/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 4fd6e3e1..a4a0f7de 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -59,7 +59,7 @@ After a batch is retrieved by the main process, another batch is sent to the app In the suggested pipeline, there are 2 levels of workers: * item_worker - designated to generate one item at a time (by running `dataset.__getitem__`), and send it to shared memory * This worker is similar to the worker in the current design, but it receives and sends one item at a time (and not one batch at a time) -* batch_workers - designated to get items from shared memory, prepare batches by running `collate_fn`, and send the prepared batches back to shared memory, for consumption by the main process +* batch_worker - designated to get items from shared memory, prepare batches by running `collate_fn`, and send the prepared batches back to shared memory, for consumption by the main process Current design dataflow: main_process -> workers -> main_process From 91febb495af29cf95b28a592b1d5a24a77355b35 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Sat, 21 Sep 2024 21:48:19 +0300 Subject: [PATCH 163/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index a4a0f7de..56d7e598 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -58,7 +58,7 @@ After a batch is retrieved by the main process, another batch is sent to the app In the suggested pipeline, there are 2 levels of workers: * item_worker - designated to generate one item at a time (by running `dataset.__getitem__`), and send it to shared memory - * This worker is similar to the worker in the current design, but it receives and sends one item at a time (and not one batch at a time) + * This worker is similar to the workers in the current design, but it receives and sends one item at a time (and not one batch at a time) * batch_worker - designated to get items from shared memory, prepare batches by running `collate_fn`, and send the prepared batches back to shared memory, for consumption by the main process Current design dataflow: main_process -> workers -> main_process From 28a6cdefeb1e9f36968e3b2499f6a67e4b88513c Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Wed, 25 Sep 2024 00:05:56 +0300 Subject: [PATCH 
164/201] aa --- RFC-0000-dataloader-echonomic.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 56d7e598..e2ce37a6 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -92,10 +92,10 @@ The following dataloader input parameters were modified / added: | name | description | |------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------| -| _num_workers_ (modified) | number of item_workers. Setting it to 0 disables multiprocessing (as today). There is no benefit in increasing it beyond (_prefetch_factor_ * _batch_size_) | +| _num_workers_ (modified) | Number of item_workers. Setting it to 0 disables multiprocessing (as today). There is no benefit in increasing it beyond (_prefetch_factor_ * _batch_size_) | | | | -| _prefetch_factor_ (modified) | number of batches simultaneously sent for processing by all workers (2 by default) | -| _num_workers_batches_ (new) | number of batch_workers (default is _prefetch_factor_). There is no benefit in increasing it beyond _prefetch_factor_ | +| _prefetch_factor_ (modified) | Number of batches simultaneously sent for processing by all workers (2 by default) | +| _num_workers_batches_ (new) | Number of batch_workers (default is _prefetch_factor_). There is no benefit in increasing it beyond _prefetch_factor_ | ## **Metrics** The suggested flow should require significantly less shared memory, while preserving TPT, using similar configurations. \ From ebcce3d84ace932c580823f6ef9703755bbe65b8 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Wed, 25 Sep 2024 00:08:14 +0300 Subject: [PATCH 165/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index e2ce37a6..2b3fe03f 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -95,7 +95,7 @@ The following dataloader input parameters were modified / added: | _num_workers_ (modified) | Number of item_workers. Setting it to 0 disables multiprocessing (as today). There is no benefit in increasing it beyond (_prefetch_factor_ * _batch_size_) | | | | | _prefetch_factor_ (modified) | Number of batches simultaneously sent for processing by all workers (2 by default) | -| _num_workers_batches_ (new) | Number of batch_workers (default is _prefetch_factor_). There is no benefit in increasing it beyond _prefetch_factor_ | +| _num_workers_batches_ (new) | Number of batch_workers (equal to _prefetch_factor_ by default). There is no benefit in increasing it beyond _prefetch_factor_ | ## **Metrics** The suggested flow should require significantly less shared memory, while preserving TPT, using similar configurations. \ From 81f1c813bb3e90a4c250541ecda1cad33dd88f52 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Wed, 25 Sep 2024 00:09:42 +0300 Subject: [PATCH 166/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 2b3fe03f..85f80ccc 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -95,7 +95,7 @@ The following dataloader input parameters were modified / added: | _num_workers_ (modified) | Number of item_workers. 
Setting it to 0 disables multiprocessing (as today). There is no benefit in increasing it beyond (_prefetch_factor_ * _batch_size_) | | | | | _prefetch_factor_ (modified) | Number of batches simultaneously sent for processing by all workers (2 by default) | -| _num_workers_batches_ (new) | Number of batch_workers (equal to _prefetch_factor_ by default). There is no benefit in increasing it beyond _prefetch_factor_ | +| _num_workers_batches_ (new) | Number of batch_workers (defaults to _prefetch_factor_). There is no benefit in increasing it beyond _prefetch_factor_ | ## **Metrics** The suggested flow should require significantly less shared memory, while preserving TPT, using similar configurations. \ From 7c6d6551c195e28b15517976a86bd1e37e02ffdc Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Fri, 27 Sep 2024 09:09:18 +0300 Subject: [PATCH 167/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 85f80ccc..9bfcd238 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -52,7 +52,7 @@ The new flow is introducing only minor modifications in dataloader interface, ma ### **High Level Description** By the current multiprocessing pipeline, a single level of workers is used. -The main process sends _prefetch_factor_ batches to each worker. +The main process sends _prefetch_factor_ batches to each worker, by a queue. Each worker prepares one batch at a time, and sends it back to the main process by _worker_result_queue_. After a batch is retrieved by the main process, another batch is sent to the appropriate worker. From b2b345460e5be1f1b86b3508c5241365948ad3ae Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Fri, 27 Sep 2024 09:10:21 +0300 Subject: [PATCH 168/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 9bfcd238..65e98b9f 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -52,7 +52,7 @@ The new flow is introducing only minor modifications in dataloader interface, ma ### **High Level Description** By the current multiprocessing pipeline, a single level of workers is used. -The main process sends _prefetch_factor_ batches to each worker, by a queue. +The main process sends _prefetch_factor_ batches to each worker, by the worker's index_queue. Each worker prepares one batch at a time, and sends it back to the main process by _worker_result_queue_. After a batch is retrieved by the main process, another batch is sent to the appropriate worker. From 875fc21d55547c43709956c47ca6a127018dbd5b Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Fri, 27 Sep 2024 09:10:53 +0300 Subject: [PATCH 169/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 65e98b9f..8a7c729a 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -52,7 +52,7 @@ The new flow is introducing only minor modifications in dataloader interface, ma ### **High Level Description** By the current multiprocessing pipeline, a single level of workers is used. -The main process sends _prefetch_factor_ batches to each worker, by the worker's index_queue. +The main process sends _prefetch_factor_ batches to each worker, by index_queue. 
Each worker prepares one batch at a time, and sends it back to the main process by _worker_result_queue_. After a batch is retrieved by the main process, another batch is sent to the appropriate worker. From e1bba422bd228767b76b2cc7f10d860f88a08f92 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Fri, 27 Sep 2024 09:11:22 +0300 Subject: [PATCH 170/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 8a7c729a..65e98b9f 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -52,7 +52,7 @@ The new flow is introducing only minor modifications in dataloader interface, ma ### **High Level Description** By the current multiprocessing pipeline, a single level of workers is used. -The main process sends _prefetch_factor_ batches to each worker, by index_queue. +The main process sends _prefetch_factor_ batches to each worker, by the worker's index_queue. Each worker prepares one batch at a time, and sends it back to the main process by _worker_result_queue_. After a batch is retrieved by the main process, another batch is sent to the appropriate worker. From f35b640bc9e662cc428bc60e7d509ae58a575a84 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Fri, 27 Sep 2024 09:11:59 +0300 Subject: [PATCH 171/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 65e98b9f..8a7c729a 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -52,7 +52,7 @@ The new flow is introducing only minor modifications in dataloader interface, ma ### **High Level Description** By the current multiprocessing pipeline, a single level of workers is used. -The main process sends _prefetch_factor_ batches to each worker, by the worker's index_queue. +The main process sends _prefetch_factor_ batches to each worker, by index_queue. Each worker prepares one batch at a time, and sends it back to the main process by _worker_result_queue_. After a batch is retrieved by the main process, another batch is sent to the appropriate worker. From 6cd88377f3a43b7713f045dd0656de6cc8401018 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Fri, 27 Sep 2024 09:33:37 +0300 Subject: [PATCH 172/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 8a7c729a..473f2dd5 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -54,7 +54,7 @@ The new flow is introducing only minor modifications in dataloader interface, ma By the current multiprocessing pipeline, a single level of workers is used. The main process sends _prefetch_factor_ batches to each worker, by index_queue. Each worker prepares one batch at a time, and sends it back to the main process by _worker_result_queue_. -After a batch is retrieved by the main process, another batch is sent to the appropriate worker. +After a batch is retrieved by the main process, another batch is sent. 
In the suggested pipeline, there are 2 levels of workers: * item_worker - designated to generate one item at a time (by running `dataset.__getitem__`), and send it to shared memory From 4d886d6026730fc320c7ad647b0cb6161531b1b7 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Fri, 27 Sep 2024 09:42:41 +0300 Subject: [PATCH 173/201] aa --- RFC-0000-dataloader-echonomic.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 473f2dd5..e0149dee 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -57,9 +57,9 @@ Each worker prepares one batch at a time, and sends it back to the main process After a batch is retrieved by the main process, another batch is sent. In the suggested pipeline, there are 2 levels of workers: -* item_worker - designated to generate one item at a time (by running `dataset.__getitem__`), and send it to shared memory - * This worker is similar to the workers in the current design, but it receives and sends one item at a time (and not one batch at a time) -* batch_worker - designated to get items from shared memory, prepare batches by running `collate_fn`, and send the prepared batches back to shared memory, for consumption by the main process +* item_worker - designated to generate one item at a time (by running `dataset.__getitem__`), and send it to a batch_worker, by item_queue + * The item_worker is similar to the workers in the current design, but it receives and sends one item at a time (and not one batch at a time) +* batch_worker - designated to get items from item_workers, prepare batches by running `collate_fn`, and send the prepared batches back to the main process by item_results_queue Current design dataflow: main_process -> workers -> main_process From 85e2b7a69aec15bbfb0b8f8760e4d5a74541a587 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Fri, 27 Sep 2024 09:43:14 +0300 Subject: [PATCH 174/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index e0149dee..695ebe1c 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -59,7 +59,7 @@ After a batch is retrieved by the main process, another batch is sent. 
In the suggested pipeline, there are 2 levels of workers: * item_worker - designated to generate one item at a time (by running `dataset.__getitem__`), and send it to a batch_worker, by item_queue * The item_worker is similar to the workers in the current design, but it receives and sends one item at a time (and not one batch at a time) -* batch_worker - designated to get items from item_workers, prepare batches by running `collate_fn`, and send the prepared batches back to the main process by item_results_queue +* batch_worker - designated to retrive items from item_queue, prepare batches by running `collate_fn`, and send the prepared batches back to the main process by item_results_queue Current design dataflow: main_process -> workers -> main_process From 019df386fb26e705e408da89ec582a33dba8065e Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Fri, 27 Sep 2024 09:43:48 +0300 Subject: [PATCH 175/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 695ebe1c..2c09bda4 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -59,7 +59,7 @@ After a batch is retrieved by the main process, another batch is sent. In the suggested pipeline, there are 2 levels of workers: * item_worker - designated to generate one item at a time (by running `dataset.__getitem__`), and send it to a batch_worker, by item_queue * The item_worker is similar to the workers in the current design, but it receives and sends one item at a time (and not one batch at a time) -* batch_worker - designated to retrive items from item_queue, prepare batches by running `collate_fn`, and send the prepared batches back to the main process by item_results_queue +* batch_worker - designated to retrive items from item_queue, prepare batches by running `collate_fn`, and send them back to the main process by item_results_queue Current design dataflow: main_process -> workers -> main_process From f7817efa1dcc61bfaaebf1de9518dfd255600a5c Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Fri, 27 Sep 2024 09:45:07 +0300 Subject: [PATCH 176/201] aa --- RFC-0000-dataloader-echonomic.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 2c09bda4..d0a36a90 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -57,9 +57,9 @@ Each worker prepares one batch at a time, and sends it back to the main process After a batch is retrieved by the main process, another batch is sent. 
In the suggested pipeline, there are 2 levels of workers: -* item_worker - designated to generate one item at a time (by running `dataset.__getitem__`), and send it to a batch_worker, by item_queue +* item_worker - generates one item at a time (by running `dataset.__getitem__`), and send it to a designated batch_worker, by item_queue * The item_worker is similar to the workers in the current design, but it receives and sends one item at a time (and not one batch at a time) -* batch_worker - designated to retrive items from item_queue, prepare batches by running `collate_fn`, and send them back to the main process by item_results_queue +* batch_worker - retrives items from item_queue, prepare batches by running `collate_fn`, and send them back to the main process by item_results_queue Current design dataflow: main_process -> workers -> main_process From dfd4006d12757b7947ade0599c7b5391d5c7b707 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Fri, 27 Sep 2024 09:45:23 +0300 Subject: [PATCH 177/201] aa --- RFC-0000-dataloader-echonomic.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index d0a36a90..a791beae 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -57,9 +57,9 @@ Each worker prepares one batch at a time, and sends it back to the main process After a batch is retrieved by the main process, another batch is sent. In the suggested pipeline, there are 2 levels of workers: -* item_worker - generates one item at a time (by running `dataset.__getitem__`), and send it to a designated batch_worker, by item_queue +* item_worker - Generates one item at a time (by running `dataset.__getitem__`), and send it to a designated batch_worker, by item_queue * The item_worker is similar to the workers in the current design, but it receives and sends one item at a time (and not one batch at a time) -* batch_worker - retrives items from item_queue, prepare batches by running `collate_fn`, and send them back to the main process by item_results_queue +* batch_worker - Retrives items from item_queue, prepare batches by running `collate_fn`, and send them back to the main process by item_results_queue Current design dataflow: main_process -> workers -> main_process From fae3a3bdec11866eca636fb4727a79b3d1fe32cc Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Fri, 27 Sep 2024 09:46:09 +0300 Subject: [PATCH 178/201] aa --- RFC-0000-dataloader-echonomic.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index a791beae..211e320c 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -57,9 +57,9 @@ Each worker prepares one batch at a time, and sends it back to the main process After a batch is retrieved by the main process, another batch is sent. 
In the suggested pipeline, there are 2 levels of workers: -* item_worker - Generates one item at a time (by running `dataset.__getitem__`), and send it to a designated batch_worker, by item_queue +* Item_worker - Generates one item at a time (by running `dataset.__getitem__`), and send it to a designated batch_worker, by item_queue * The item_worker is similar to the workers in the current design, but it receives and sends one item at a time (and not one batch at a time) -* batch_worker - Retrives items from item_queue, prepare batches by running `collate_fn`, and send them back to the main process by item_results_queue +* Batch_worker - Retrives items from item_queue, prepare batches by running `collate_fn`, and send them back to the main process by item_results_queue Current design dataflow: main_process -> workers -> main_process From 2893c9decd7e9519ad0b5007b9991a2d43201000 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Fri, 27 Sep 2024 09:47:37 +0300 Subject: [PATCH 179/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 211e320c..1e6bcddc 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -59,7 +59,7 @@ After a batch is retrieved by the main process, another batch is sent. In the suggested pipeline, there are 2 levels of workers: * Item_worker - Generates one item at a time (by running `dataset.__getitem__`), and send it to a designated batch_worker, by item_queue * The item_worker is similar to the workers in the current design, but it receives and sends one item at a time (and not one batch at a time) -* Batch_worker - Retrives items from item_queue, prepare batches by running `collate_fn`, and send them back to the main process by item_results_queue +* Batch_worker - Retrives items from item_queue, prepare batches by running `collate_fn`, and sends them back to the main process by item_results_queue Current design dataflow: main_process -> workers -> main_process From 0854707c63f205d3cd472a6c869ce8a3be43c316 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Fri, 27 Sep 2024 09:50:14 +0300 Subject: [PATCH 180/201] aa --- RFC-0000-dataloader-echonomic.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 1e6bcddc..8fbbf310 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -79,8 +79,8 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main * Once the next required batch is available (by _batch_idx_), return batch to caller function #### **Item_worker Flow** -* Get item from _index_queue_ -* Run `dataset.__getitem__(item_index)` +* Get item metadata from _index_queue_ +* Run `dataset.__getitem__(item_index)` to generate item * Send item to the appropriate _item_queue_ (by item's bw_idx) #### **Batch_worker Flow** From 139a983daa13211060294946d40c69359293ea39 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Fri, 27 Sep 2024 09:50:52 +0300 Subject: [PATCH 181/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 8fbbf310..1018c0c9 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -80,7 +80,7 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main #### 
**Item_worker Flow** * Get item metadata from _index_queue_ -* Run `dataset.__getitem__(item_index)` to generate item +* Generate item, by running `dataset.__getitem__(item_index)` * Send item to the appropriate _item_queue_ (by item's bw_idx) #### **Batch_worker Flow** From 676895fa2729e58b29c60ddc7d2746ac35cfe377 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Fri, 27 Sep 2024 09:52:16 +0300 Subject: [PATCH 182/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 1018c0c9..e94cc343 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -84,7 +84,7 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main * Send item to the appropriate _item_queue_ (by item's bw_idx) #### **Batch_worker Flow** -* Get one item at a time from _item_queue_ and collect them into batches, by item batch_idx (and batch_size) +* Get one item at a time from _item_queue_ and collect them into batches, by item's batch_idx, item_idx_in_batch, and batch_size * Once all items of a given batch are received, run collate_fn and send the prepared batch to _worker_result_queue_ #### **New Parameters** From 46ca604f0e5682114a0c0f6c28f5edd3c918c2cf Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Fri, 27 Sep 2024 09:53:11 +0300 Subject: [PATCH 183/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index e94cc343..dd5fd16c 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -84,7 +84,7 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main * Send item to the appropriate _item_queue_ (by item's bw_idx) #### **Batch_worker Flow** -* Get one item at a time from _item_queue_ and collect them into batches, by item's batch_idx, item_idx_in_batch, and batch_size +* Get one item at a time from _item_queue_ and collect them into batches, by item's metadata (batch_idx, item_idx_in_batch, batch_size) * Once all items of a given batch are received, run collate_fn and send the prepared batch to _worker_result_queue_ #### **New Parameters** From ad9d88f2e1de1192d341759e7df7bc91e6f84960 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Fri, 27 Sep 2024 09:53:31 +0300 Subject: [PATCH 184/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index dd5fd16c..9abd0ff7 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -84,7 +84,7 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main * Send item to the appropriate _item_queue_ (by item's bw_idx) #### **Batch_worker Flow** -* Get one item at a time from _item_queue_ and collect them into batches, by item's metadata (batch_idx, item_idx_in_batch, batch_size) +* Get one item at a time from _item_queue_ and collect them into batches, by item's metadata (batch_idx, item_idx_in_batch, and batch_size) * Once all items of a given batch are received, run collate_fn and send the prepared batch to _worker_result_queue_ #### **New Parameters** From 4e824af69f4dd623444c24fbeff70106506c5be5 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Fri, 27 Sep 2024 09:56:05 +0300 Subject: [PATCH 185/201] aa --- RFC-0000-dataloader-echonomic.md 
| 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 9abd0ff7..c398b8e4 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -52,14 +52,14 @@ The new flow is introducing only minor modifications in dataloader interface, ma ### **High Level Description** By the current multiprocessing pipeline, a single level of workers is used. -The main process sends _prefetch_factor_ batches to each worker, by index_queue. +The main process sends _prefetch_factor_ batches to each worker, by _index_queue_. Each worker prepares one batch at a time, and sends it back to the main process by _worker_result_queue_. After a batch is retrieved by the main process, another batch is sent. In the suggested pipeline, there are 2 levels of workers: -* Item_worker - Generates one item at a time (by running `dataset.__getitem__`), and send it to a designated batch_worker, by item_queue +* Item_worker - Generates one item at a time (by running `dataset.__getitem__`), and send it to a designated batch_worker, by _item_queue_ * The item_worker is similar to the workers in the current design, but it receives and sends one item at a time (and not one batch at a time) -* Batch_worker - Retrives items from item_queue, prepare batches by running `collate_fn`, and sends them back to the main process by item_results_queue +* Batch_worker - Retrives items from _item_queue_, prepare batches by running `collate_fn`, and sends them back to the main process by _item_results_queue_ Current design dataflow: main_process -> workers -> main_process @@ -68,7 +68,7 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main #### **Main Process Flow** * Retrieve and store prepared batches from _worker_result_queue_ * Track number of items at work (workload) by each worker. Make sure to reduce workload counter for the relevant batch_worker, and for each of the relevant item-workers, when retrieving the batch -* Send batches of items for preparation to _index_queues_, one batch at a time +* Send batches of items for preparation to index_queues, one batch at a time * Each item should include the following metadata: (_item_idx_in_batch_, _batch_idx_, _item_index_, _iw_idx_, _bw_idx_, _batch_size_): * A possibly different item_worker should be assigned to each item * Select iw_idx by the item_worker with the minimal workload From 2e2775c026d3fc53563b40a7c216da08bde2afab Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Fri, 27 Sep 2024 09:59:20 +0300 Subject: [PATCH 186/201] aa --- RFC-0000-dataloader-echonomic.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index c398b8e4..e7bb92f3 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -95,7 +95,7 @@ The following dataloader input parameters were modified / added: | _num_workers_ (modified) | Number of item_workers. Setting it to 0 disables multiprocessing (as today). There is no benefit in increasing it beyond (_prefetch_factor_ * _batch_size_) | | | | | _prefetch_factor_ (modified) | Number of batches simultaneously sent for processing by all workers (2 by default) | -| _num_workers_batches_ (new) | Number of batch_workers (defaults to _prefetch_factor_). There is no benefit in increasing it beyond _prefetch_factor_ | +| _num_batch_workers_ (new) | Number of batch_workers (defaults to _prefetch_factor_). 
There is no benefit in increasing it beyond _prefetch_factor_ |
+| _num_batch_workers_ (new)    | Number of batch_workers (defaults to _prefetch_factor_). There is no benefit in increasing it beyond _prefetch_factor_ |
 
 ## **Metrics**
 The suggested flow should require significantly less shared memory, while preserving TPT, using similar configurations. \
 To monitor shared memory usage, type in Linux server terminal: \
 $ watch -n0.1 df -h \
 and review /dev/shm "used" column.
 
 ## **Drawbacks**
 * Additional layer of batch_workers is required, somewhat increasing flow complexity
-* CPU usage is somewhat higher in the suggested flow, due to the additional _num_workers_batches_ processes
+* CPU usage is somewhat higher in the suggested flow, due to the additional _num_batch_workers_ processes
 * The user should be aware that if `collate_fn` is very slow and becomes a bottleneck, an increase in _prefetch_factor_ should be considered

From f9148e69a0007470f7b6faf323060c2ec2864ca6 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Fri, 27 Sep 2024 10:01:01 +0300
Subject: [PATCH 187/201] aa

---
 RFC-0000-dataloader-echonomic.md | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index e7bb92f3..ad553aac 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -110,5 +110,11 @@ and review /dev/shm "used" column.
 
 ## **How we teach this**
-Update Dataloader documentation to include the description of the suggested pipeline.
-Add/update description of the new/modified parameters.
\ No newline at end of file
+* Update Dataloader documentation to include the description of the suggested pipeline.
+* Add/update description of the new/modified parameters.
+
+
+
+
+
+

From 3c276c570df3c2f601655a13cdc62dcd4de7e73f Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Fri, 27 Sep 2024 10:01:38 +0300
Subject: [PATCH 188/201] aa

---
 RFC-0000-dataloader-echonomic.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index ad553aac..5f0e0129 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -109,7 +109,7 @@ and review /dev/shm "used" column.
 * The user should be aware that if `collate_fn` is very slow and becomes a bottleneck, an increase in _prefetch_factor_ should be considered
 
 
-## **How we teach this**
+## **How We Teach This**
 * Update Dataloader documentation to include the description of the suggested pipeline.
 * Add/update description of the new/modified parameters.
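
For readers who want to see the suggested dataflow end to end, the sketch below models it with plain `multiprocessing` primitives. It is only an illustration, not the proposed `_MultiProcessingDataLoaderIter` implementation: a plain list stands in for the dataset, `list` stands in for `collate_fn`, items are dispatched round-robin rather than to the least-loaded item_worker, and only the _prefetch_factor_ throttling of the main process is modeled. The queue topology and the item metadata follow the Definitions table above; everything else is an assumption.

```python
import multiprocessing as mp


def item_worker(index_queue, item_queues, dataset):
    # Receive one item's metadata at a time, build the item via
    # dataset.__getitem__(item_index), and forward it to the designated
    # batch_worker's item_queue.
    for meta in iter(index_queue.get, None):  # None is the shutdown signal
        item_idx_in_batch, batch_idx, item_index, bw_idx, batch_size = meta
        item = dataset[item_index]
        item_queues[bw_idx].put((item_idx_in_batch, batch_idx, batch_size, item))


def batch_worker(item_queue, worker_result_queue, collate_fn):
    # Bucket incoming items by batch_idx; when a bucket is full, collate it
    # and send the prepared batch back to the main process.
    buckets = {}
    for item_idx_in_batch, batch_idx, batch_size, item in iter(item_queue.get, None):
        bucket = buckets.setdefault(batch_idx, [None] * batch_size)
        bucket[item_idx_in_batch] = item
        if all(x is not None for x in bucket):
            worker_result_queue.put((batch_idx, collate_fn(buckets.pop(batch_idx))))


if __name__ == "__main__":
    num_workers, num_batch_workers, batch_size, prefetch_factor = 4, 2, 8, 2
    dataset = list(range(64))                     # stand-in for a map-style Dataset
    num_batches = len(dataset) // batch_size
    index_queues = [mp.Queue() for _ in range(num_workers)]
    item_queues = [mp.Queue() for _ in range(num_batch_workers)]
    worker_result_queue = mp.Queue()
    procs = [mp.Process(target=item_worker, args=(q, item_queues, dataset))
             for q in index_queues]
    procs += [mp.Process(target=batch_worker, args=(q, worker_result_queue, list))
              for q in item_queues]
    for p in procs:
        p.start()

    def dispatch(batch_idx):
        # Round-robin assignment stands in for the RFC's minimal-workload policy.
        for item_idx_in_batch in range(batch_size):
            item_index = batch_idx * batch_size + item_idx_in_batch
            iw_idx, bw_idx = item_index % num_workers, batch_idx % num_batch_workers
            index_queues[iw_idx].put(
                (item_idx_in_batch, batch_idx, item_index, bw_idx, batch_size))

    primed = min(prefetch_factor, num_batches)
    for batch_idx in range(primed):               # prime prefetch_factor batches
        dispatch(batch_idx)
    results, next_to_send = {}, primed
    for _ in range(num_batches):
        batch_idx, batch = worker_result_queue.get()
        results[batch_idx] = batch
        if next_to_send < num_batches:            # top up one batch per batch retrieved
            dispatch(next_to_send)
            next_to_send += 1

    for q in index_queues + item_queues:
        q.put(None)                               # release every worker
    for p in procs:
        p.join()
    print([results[b] for b in sorted(results)])
```

Because the main process only tops up one batch for each batch it retrieves, the queues never hold more than about _prefetch_factor_ batches' worth of items, no matter how large _num_workers_ grows; that is the decoupling this RFC argues for.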
From 17d265b6bee3c9923fb436fd90aee7612fa4dad9 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Fri, 27 Sep 2024 10:02:35 +0300 Subject: [PATCH 189/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 5f0e0129..7aed3528 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -1,4 +1,4 @@ -# DataLoader-Economic +# DataLoader-Economic Feature Suggestion **Authors:** * @yoadbs From b0e6538b50085b8387773c6f6c9b4ad310e8828c Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Fri, 27 Sep 2024 10:02:51 +0300 Subject: [PATCH 190/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 7aed3528..75566dbf 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -1,4 +1,4 @@ -# DataLoader-Economic Feature Suggestion +# DataLoader-Economic Pipline Suggestion **Authors:** * @yoadbs From c20a4e615557589ddc436a8c73e13be8a886849e Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Fri, 27 Sep 2024 10:03:29 +0300 Subject: [PATCH 191/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 75566dbf..8d73f8f2 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -1,4 +1,4 @@ -# DataLoader-Economic Pipline Suggestion +# DataLoader-Economic Multiprocessing Pipeline Suggestion **Authors:** * @yoadbs From 67a7aa8b8cd6feadcc3025f5636aca4dd8dde04c Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Fri, 27 Sep 2024 10:14:12 +0300 Subject: [PATCH 192/201] aa --- RFC-0000-dataloader-echonomic.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 8d73f8f2..76b40979 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -110,8 +110,8 @@ and review /dev/shm "used" column. ## **How We Teach This** -* Update Dataloader documentation to include the description of the suggested pipeline. -* Add/update description of the new/modified parameters. 
+* Update Dataloader documentation to include the description of the suggested pipeline +* Add/update description of the new/modified parameters From b5d7d14f9fdd05b700558d440a5f1bb5a967cade Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Fri, 27 Sep 2024 17:21:50 +0300 Subject: [PATCH 193/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 76b40979..5c925284 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -1,4 +1,4 @@ -# DataLoader-Economic Multiprocessing Pipeline Suggestion +# DataLoader-Economic Multiprocessing Pipeline Design Suggestion **Authors:** * @yoadbs From 53a6086acd4a68d23b75e7c3527019bf11d51482 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Fri, 27 Sep 2024 17:24:10 +0300 Subject: [PATCH 194/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 5c925284..c0e49a33 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -1,4 +1,4 @@ -# DataLoader-Economic Multiprocessing Pipeline Design Suggestion +# Economic DataLoader: Multiprocessing Pipeline Design Suggestion **Authors:** * @yoadbs From 39c1f50895a40d7b6a45ea547b6f1238ab80bcfa Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Fri, 27 Sep 2024 17:35:47 +0300 Subject: [PATCH 195/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index c0e49a33..18be4327 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -59,7 +59,7 @@ After a batch is retrieved by the main process, another batch is sent. In the suggested pipeline, there are 2 levels of workers: * Item_worker - Generates one item at a time (by running `dataset.__getitem__`), and send it to a designated batch_worker, by _item_queue_ * The item_worker is similar to the workers in the current design, but it receives and sends one item at a time (and not one batch at a time) -* Batch_worker - Retrives items from _item_queue_, prepare batches by running `collate_fn`, and sends them back to the main process by _item_results_queue_ +* Batch_worker - Retrives items from _item_queue_, prepare batches by running `collate_fn`, and sends them back to the main process by _worker_result_queue_ Current design dataflow: main_process -> workers -> main_process From ddfc8f09d6e07cc167f82b35f68cf1a384fd8bfa Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Fri, 27 Sep 2024 17:43:19 +0300 Subject: [PATCH 196/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 18be4327..91310eb8 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -57,7 +57,7 @@ Each worker prepares one batch at a time, and sends it back to the main process After a batch is retrieved by the main process, another batch is sent. 
In the suggested pipeline, there are 2 levels of workers: -* Item_worker - Generates one item at a time (by running `dataset.__getitem__`), and send it to a designated batch_worker, by _item_queue_ +* Item_worker - Generates one item at a time (by running `dataset.__getitem__`), and sends it to a designated batch_worker, by _item_queue_ * The item_worker is similar to the workers in the current design, but it receives and sends one item at a time (and not one batch at a time) * Batch_worker - Retrives items from _item_queue_, prepare batches by running `collate_fn`, and sends them back to the main process by _worker_result_queue_ From 75fd80e168bc10465524e28fce95b34dcd8c1abb Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Fri, 27 Sep 2024 17:44:28 +0300 Subject: [PATCH 197/201] aa --- RFC-0000-dataloader-echonomic.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 91310eb8..77054c41 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -59,7 +59,7 @@ After a batch is retrieved by the main process, another batch is sent. In the suggested pipeline, there are 2 levels of workers: * Item_worker - Generates one item at a time (by running `dataset.__getitem__`), and sends it to a designated batch_worker, by _item_queue_ * The item_worker is similar to the workers in the current design, but it receives and sends one item at a time (and not one batch at a time) -* Batch_worker - Retrives items from _item_queue_, prepare batches by running `collate_fn`, and sends them back to the main process by _worker_result_queue_ +* Batch_worker - Retrives items from _item_queue_, prepares batches by running `collate_fn`, and sends them back to the main process by _worker_result_queue_ Current design dataflow: main_process -> workers -> main_process From 2422d7e33e48ae54b2309a60615c5c9ab5b80855 Mon Sep 17 00:00:00 2001 From: Yoad Bar-Shean Date: Fri, 27 Sep 2024 17:45:48 +0300 Subject: [PATCH 198/201] aa --- RFC-0000-dataloader-echonomic.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md index 77054c41..dc5374e7 100644 --- a/RFC-0000-dataloader-echonomic.md +++ b/RFC-0000-dataloader-echonomic.md @@ -57,9 +57,9 @@ Each worker prepares one batch at a time, and sends it back to the main process After a batch is retrieved by the main process, another batch is sent. 
 In the suggested pipeline, there are 2 levels of workers:
-* Item_worker - Generates one item at a time (by running `dataset.__getitem__`), and sends it to a designated batch_worker, by _item_queue_
+* Item_worker - Generate one item at a time (by running `dataset.__getitem__`), and send it to a designated batch_worker, by _item_queue_
   * The item_worker is similar to the workers in the current design, but it receives and sends one item at a time (and not one batch at a time)
-* Batch_worker - Retrives items from _item_queue_, prepares batches by running `collate_fn`, and sends them back to the main process by _worker_result_queue_
+* Batch_worker - Retrieve items from _item_queue_, prepare batches by running `collate_fn`, and send them back to the main process by _worker_result_queue_
 
 Current design dataflow: main_process -> workers -> main_process
 

From 4bedb9a400cd43b8a1126f83a52d2f96250ddb4a Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Fri, 27 Sep 2024 17:47:28 +0300
Subject: [PATCH 199/201] aa

---
 RFC-0000-dataloader-echonomic.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-dataloader-echonomic.md
index dc5374e7..61c73363 100644
--- a/RFC-0000-dataloader-echonomic.md
+++ b/RFC-0000-dataloader-echonomic.md
@@ -67,7 +67,7 @@ Suggested design dataflow: main_process -> item_workers -> batch_workers -> main
 
 #### **Main Process Flow**
 * Retrieve and store prepared batches from _worker_result_queue_
-  * Track number of items at work (workload) by each worker. Make sure to reduce workload counter for the relevant batch_worker, and for each of the relevant item-workers, when retrieving the batch
+  * Track number of items at work (workload) by each worker. Make sure to reduce workload counter for the relevant batch_worker, and for each of the relevant item_workers, when retrieving the batch
 * Send batches of items for preparation to index_queues, one batch at a time

From 38717ac8be717c2de45adf75d05b8b0fc03757e6 Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Fri, 27 Sep 2024 17:52:49 +0300
Subject: [PATCH 200/201] aa

---
 ...000-dataloader-echonomic.md => RFC-0000-economic-dataloader.md | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename RFC-0000-dataloader-echonomic.md => RFC-0000-economic-dataloader.md (100%)

diff --git a/RFC-0000-dataloader-echonomic.md b/RFC-0000-economic-dataloader.md
similarity index 100%
rename from RFC-0000-dataloader-echonomic.md
rename to RFC-0000-economic-dataloader.md

From c7b25aa53b0879bcf35a792e7f61ed6c44b0e02e Mon Sep 17 00:00:00 2001
From: Yoad Bar-Shean
Date: Fri, 27 Sep 2024 18:11:27 +0300
Subject: [PATCH 201/201] aa

---
 ...0000-economic-dataloader.md => RFC-0001-economic-dataloader.md | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename RFC-0000-economic-dataloader.md => RFC-0001-economic-dataloader.md (100%)

diff --git a/RFC-0000-economic-dataloader.md b/RFC-0001-economic-dataloader.md
similarity index 100%
rename from RFC-0000-economic-dataloader.md
rename to RFC-0001-economic-dataloader.md
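
One way to make the Metrics expectation concrete before measuring /dev/shm: the High Level Description implies that the current pipeline can hold up to _num_workers_ * _prefetch_factor_ prepared batches in shared memory (since _prefetch_factor_ batches are outstanding per worker), while the suggested pipeline holds about _prefetch_factor_ batches in total, plus the individual items in flight. The snippet below simply evaluates these two bounds; the batch size and worker counts are illustrative assumptions, not measurements.

```python
# Back-of-envelope shared-memory bounds for the two designs.
# All sizes here are assumed for illustration.
GIB = 1024 ** 3
batch_size_bytes = 2 * GIB   # e.g. a large video or 3D-graphics batch
num_workers = 16             # item_workers
prefetch_factor = 2          # dataloader default

# Current design: up to prefetch_factor batches outstanding per worker.
current_bound = num_workers * prefetch_factor * batch_size_bytes
# Suggested design: about prefetch_factor batches outstanding in total
# (ignoring the per-item overhead of the item_queues).
suggested_bound = prefetch_factor * batch_size_bytes

print(f"current:   up to {current_bound / GIB:.0f} GiB in shared memory")
print(f"suggested: about {suggested_bound / GIB:.0f} GiB in shared memory")
```

Quadrupling _num_workers_ moves only the first number; the second stays put, which is what the /dev/shm check in the Metrics section should show on a real run.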