From a710c6804c673f9d08aa3448e903ffd96fd1e770 Mon Sep 17 00:00:00 2001
From: sirluk
Date: Tue, 8 Oct 2024 22:37:17 +0200
Subject: [PATCH] incremental pca rfc

---
 RFC-0039-incremental-pca.md | 91 +++++++++++++++++++++++++++++++++++++
 1 file changed, 91 insertions(+)
 create mode 100644 RFC-0039-incremental-pca.md

diff --git a/RFC-0039-incremental-pca.md b/RFC-0039-incremental-pca.md
new file mode 100644
index 0000000..8d45f33
--- /dev/null
+++ b/RFC-0039-incremental-pca.md
@@ -0,0 +1,91 @@
# Implementation of Incremental PCA

**Authors:**
* @sirluk


## **Summary**
Implement a class for Incremental PCA in vanilla PyTorch with native GPU support.


## **Motivation**
- Incremental PCA is important when data arrives in streams or is too large to fit in memory.
- PyTorch currently lacks a built-in implementation of incremental PCA, and the best available implementations, such as the one in `sklearn`, do not have GPU support.
- Requested in the issue tracker: [issue 40770](https://github.com/pytorch/pytorch/issues/40770)


## **Proposed Implementation**
The proposed implementation would be similar to the one in `sklearn` ([sklearn.decomposition.IncrementalPCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.IncrementalPCA.html)) but with the following differences:
- `torch.Tensor` objects instead of NumPy arrays
- GPU support
- Option to use truncated SVD via `torch.svd_lowrank`

A draft implementation can be found [here](https://github.com/sirluk/pytorch_incremental_pca). It could potentially make sense to inherit from `torch.nn.Module` and define the internal state as buffers. A rough sketch of the envisioned API is shown below.
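To make the proposal concrete, here is a minimal sketch of what the class could look like, assuming the update rule used by `sklearn.decomposition.IncrementalPCA` and the truncated-SVD option mentioned above. Class, method, and attribute names (`partial_fit`, `components_`, the `lowrank` flag) are illustrative, not a final design.

```python
import math

import torch


class IncrementalPCA:
    """Illustrative sketch only; names and signatures are not final."""

    def __init__(self, n_components: int, lowrank: bool = False):
        self.n_components = n_components
        self.lowrank = lowrank  # use torch.svd_lowrank instead of a full SVD
        self.mean_ = None
        self.components_ = None
        self.singular_values_ = None
        self.n_samples_seen_ = 0

    def partial_fit(self, X: torch.Tensor) -> "IncrementalPCA":
        # Incremental update following the rule used by sklearn's IncrementalPCA,
        # expressed with torch ops so it runs on CPU or GPU.
        n = X.shape[0]
        batch_mean = X.mean(dim=0)
        total = self.n_samples_seen_ + n

        if self.n_samples_seen_ == 0:
            new_mean = batch_mean
            stacked = X - batch_mean
        else:
            new_mean = self.mean_ + (batch_mean - self.mean_) * (n / total)
            # Correction row accounting for the shift of the running mean.
            mean_correction = math.sqrt(self.n_samples_seen_ * n / total) * (
                self.mean_ - batch_mean
            )
            stacked = torch.cat(
                [
                    self.singular_values_.unsqueeze(1) * self.components_,
                    X - batch_mean,
                    mean_correction.unsqueeze(0),
                ],
                dim=0,
            )

        if self.lowrank:
            # Truncated SVD; cheaper when n_components is much smaller than n_features.
            _, S, V = torch.svd_lowrank(stacked, q=self.n_components)
            Vt = V.mT
        else:
            _, S, Vt = torch.linalg.svd(stacked, full_matrices=False)

        self.components_ = Vt[: self.n_components]
        self.singular_values_ = S[: self.n_components]
        self.mean_ = new_mean
        self.n_samples_seen_ = total
        return self

    def transform(self, X: torch.Tensor) -> torch.Tensor:
        return (X - self.mean_) @ self.components_.T
```

Whether the internal state (`mean_`, `components_`, `singular_values_`) should instead be registered as `torch.nn.Module` buffers is left as an open question below.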
## **Metrics**
- Ability to handle datasets larger than available RAM
- GPU vs. CPU performance comparison
- Accuracy comparison with sklearn's IncrementalPCA


## **Drawbacks**
- The maintenance cost should be low: the implementation is relatively self-contained, requires no additional dependencies, can reuse existing linear algebra functions in PyTorch, and a draft implementation already exists.
- Tests are required to ensure numerical stability across various data types and sizes.


## **Alternatives**
What other designs have been considered?
- Use scikit-learn's Incremental PCA by moving the data to the CPU. Very slow.

What is the impact of not doing this?
- Users might resort to less efficient workarounds or third-party libraries.
- Users might migrate to other tools for PCA-heavy workflows.


## **Prior Art**
- Implemented in sklearn: [sklearn.decomposition.IncrementalPCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.IncrementalPCA.html)
- Large parts of the sklearn implementation can be reused.


## **How we teach this**
- "IncrementalPCA" or "IPCA" for the main class name; this clearly indicates the incremental nature of the algorithm.
- Use terms like "partial_fit" for incremental updates.
- Emphasize use cases: streaming data, large datasets, online learning.
- Demonstrate GPU speedup with benchmarks.
- Provide code examples demonstrating integration with PyTorch datasets and dataloaders (an illustrative usage sketch is included at the end of this document).


## **Unresolved questions**
- Where in the PyTorch codebase should this live?
- What should the API look like?
- Should this be implemented as a torch.nn.Module or as a standalone class?
- Should the device be implicit (inferred from the input tensor) or explicit?
- Should this work with multiple GPUs out of the box?

Out of scope for this RFC:
- Multi-GPU support


## Resolution
We decided to do it. X% of the engineering team actively approved of this change.

### Level of Support
Choose one of the following:
* 1: Overwhelming positive feedback.
* 2: Positive feedback.
* 3: Majority Acceptance, with Conflicting Feedback.
* 4: Acceptance, with Little Feedback.
* 5: Unclear Resolution.
* 6: RFC Rejected.
* 7: RFC Rejected, with Conflicting Feedback.


#### Additional Context
Some people were in favor of it, but some people didn’t want it for project X.


### Next Steps
Will implement it.


#### Tracking issue


#### Exceptions
Not implementing on project X now. Will revisit the decision in 1 year.
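#### Appendix: Illustrative usage sketch
For illustration, here is a possible usage pattern with a `DataLoader`, assuming the API sketched in the Proposed Implementation section. The import path is hypothetical, since where the class should live in the codebase is one of the unresolved questions above.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical import; the final location of IncrementalPCA in the codebase is
# an open question, so this assumes the sketch above is importable or in scope.
# from torch.decomposition import IncrementalPCA

device = "cuda" if torch.cuda.is_available() else "cpu"

data = torch.randn(10_000, 256)          # toy dataset: 10k samples, 256 features
loader = DataLoader(TensorDataset(data), batch_size=512)

ipca = IncrementalPCA(n_components=16)
for (batch,) in loader:
    ipca.partial_fit(batch.to(device))   # fit one mini-batch at a time

reduced = ipca.transform(data.to(device))  # shape: (10_000, 16)
```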