Subsampling with replacement #471

noahho · 2025-08-27T09:24:59Z

Motivation and Context

Public API Changes

No Public API changes
Yes, Public API changes (Details below)

How Has This Been Tested?

Checklist

The changes have been tested locally.
Documentation has been updated (if the public API or usage changes).
An entry has been added to CHANGELOG.md (if relevant for users).
The code follows the project's style guidelines.
I have considered the impact of these changes on the public API.

Copilot

Pull Request Overview

This PR enhances the generate_index_permutations function to support subsampling with replacement, providing more flexibility for data sampling strategies.

Adds a with_replacement parameter to enable sampling indices multiple times
Refactors parameter validation logic for better clarity and consistency
Updates documentation to reflect the new functionality and parameter changes
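As a rough illustration of the two strategies the overview describes, the sketch below draws index arrays with and without replacement. The helper name `sample_indices`, its `seed` argument, and the use of `np.random.default_rng` are assumptions made for this example; they are not the signature of `generate_index_permutations` in the PR.

```python
import numpy as np


def sample_indices(max_index, k, n, *, with_replacement=False, seed=0):
    """Hypothetical helper: draw n index arrays from range(max_index)."""
    rng = np.random.default_rng(seed)
    if with_replacement:
        # Indices may repeat, so k is allowed to exceed max_index.
        return [rng.choice(max_index, size=k, replace=True) for _ in range(n)]
    # Without replacement each index appears at most once, so cap k.
    k = min(k, max_index)
    return [rng.permutation(max_index)[:k] for _ in range(n)]


# Ask for 8 indices out of a population of 5, three times each.
with_rep = sample_indices(5, k=8, n=3, with_replacement=True)
without_rep = sample_indices(5, k=8, n=3)
```

With replacement, each array has the requested 8 entries and necessarily contains repeats; without replacement, each array is capped at the 5 distinct indices available.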

src/tabpfn/preprocessing.py

gemini-code-assist

Code Review

This pull request introduces subsampling with replacement by adding a with_replacement flag to the generate_index_permutations function. The implementation is well-structured and correctly uses numpy.random.Generator.choice for sampling with replacement. The refactoring also improves the existing logic for determining the sample size and provides clearer validation and error messages.

My main feedback is to address an edge case where max_index is 0, which can cause a ValueError when sampling with replacement. Adding a check at the beginning of the function will make it more robust.

gemini-code-assist · 2025-08-27T09:26:52Z

src/tabpfn/preprocessing.py

    _, rng = infer_random_state(random_state)
-    if isinstance(subsample, int):
-        if subsample < 1:
-            raise ValueError(f"{subsample=} must be larger than 1 if int")
-        subsample = min(subsample, max_index)
-
-        return [rng.permutation(max_index)[:subsample] for _ in range(n)]

+    # Determine the number of items to subsample (k)
    if isinstance(subsample, float):
-        if not (0 < subsample < 1):
-            raise ValueError(f"{subsample=} must be in (0, 1) if float")
-        subsample = int(subsample * max_index) + 1
-        return [rng.permutation(max_index)[:subsample] for _ in range(n)]
-
-    raise ValueError(f"{subsample=} must be int or float.")
+        if not (0.0 < subsample <= 1.0):
+            raise ValueError(f"If float, {subsample=} must be in (0, 1].")
+        # Ensure at least one sample is drawn
+        k = max(1, int(subsample * max_index))
+    elif isinstance(subsample, int):
+        if subsample < 1:
+            raise ValueError(f"If int, {subsample=} must be at least 1.")
+        k = subsample
+    else:
+        raise TypeError(f"{subsample=} must be an int or float.")
+
+    # Generate n lists of indices based on the replacement strategy
+    if with_replacement:
+        # Sample with replacement. The sample size `k` can be larger than `max_index`.
+        return [rng.choice(max_index, size=k, replace=True) for _ in range(n)]
+    # Sample without replacement. The sample size cannot exceed the population size.
+    sample_size = min(k, max_index)
+    return [rng.permutation(max_index)[:sample_size] for _ in range(n)]


When max_index is 0, k is calculated as at least 1. This causes rng.choice(0, size=k, ...) to raise a ValueError when sampling with replacement, as it's impossible to sample from an empty population. The behavior is also inconsistent with the with_replacement=False path, which correctly returns an empty array.
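The inconsistency can be reproduced in isolation with NumPy alone. This is a minimal sketch that uses `np.random.default_rng` directly instead of the project's `infer_random_state` helper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Without replacement: slicing a permutation of an empty range is harmless
# and simply yields an empty array.
empty = rng.permutation(0)[:1]

# With replacement: drawing one item from an empty population raises.
try:
    rng.choice(0, size=1, replace=True)
    choice_failed = False
except ValueError:
    choice_failed = True
```

After running this, `empty` has shape `(0,)` and `choice_failed` is `True`, which matches the behavior described in the comment above.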

To handle this edge case gracefully and ensure consistent behavior, it's best to add a check for max_index == 0 at the beginning of the function's logic. It would also be good practice to validate that max_index is not negative.

    if max_index < 0:
        raise ValueError(f"max_index must be non-negative, but got {max_index}")
    if max_index == 0:
        return [np.array([], dtype=np.int64) for _ in range(n)]

    _, rng = infer_random_state(random_state)

    # Determine the number of items to subsample (k)
    if isinstance(subsample, float):
        if not (0.0 < subsample <= 1.0):
            raise ValueError(f"If float, {subsample=} must be in (0, 1].")
        # Ensure at least one sample is drawn
        k = max(1, int(subsample * max_index))
    elif isinstance(subsample, int):
        if subsample < 1:
            raise ValueError(f"If int, {subsample=} must be at least 1.")
        k = subsample
    else:
        raise TypeError(f"{subsample=} must be an int or float.")

    # Generate n lists of indices based on the replacement strategy
    if with_replacement:
        # Sample with replacement. The sample size `k` can be larger than `max_index`.
        return [rng.choice(max_index, size=k, replace=True) for _ in range(n)]
    # Sample without replacement. The sample size cannot exceed the population size.
    sample_size = min(k, max_index)
    return [rng.permutation(max_index)[:sample_size] for _ in range(n)]
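Separately from the empty-population case, the diff changes how a float `subsample` is turned into a count: the removed code used `int(subsample * max_index) + 1` and accepted only values in (0, 1), while the new code uses `max(1, int(subsample * max_index))` and accepts (0, 1]. The sketch below isolates just that sizing rule; the function names `old_float_size` and `new_float_size` are hypothetical and exist only for this comparison.

```python
def old_float_size(subsample: float, max_index: int) -> int:
    """Sizing rule from the removed lines of the diff."""
    if not (0 < subsample < 1):
        raise ValueError(f"{subsample=} must be in (0, 1) if float")
    return int(subsample * max_index) + 1


def new_float_size(subsample: float, max_index: int) -> int:
    """Sizing rule from the added lines of the diff."""
    if not (0.0 < subsample <= 1.0):
        raise ValueError(f"If float, {subsample=} must be in (0, 1].")
    # Ensure at least one sample is drawn.
    return max(1, int(subsample * max_index))


half_old = old_float_size(0.5, 10)
half_new = new_float_size(0.5, 10)
full_new = new_float_size(1.0, 10)
```

For `subsample=0.5` and `max_index=10`, the old rule yields 6 and the new rule yields 5, so callers passing a fraction may see one fewer index per draw after this change. A fraction of exactly 1.0 was rejected before and now selects the full population.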

noahho and others added 30 commits July 15, 2025 16:32

- Change default estimators for classifier from 4 to 8

42464e3

Merge remote-tracking branch 'origin/main'

abc48a8

Merge remote-tracking branch 'origin/main'

afa7fb3

Merge remote-tracking branch 'origin/main'

775bb5e

Merge remote-tracking branch 'origin/main'

dad171f

Merge remote-tracking branch 'origin/main'

5062414

Merge remote-tracking branch 'origin/main'

2ce9d5b

Merge remote-tracking branch 'origin/main'

eee5d91

attempt to fix the naming of bar distributions

db3a919

ruff fix

bef0d24

ruff fix on the ipynb

402b3da

naming change

ad52ac4

resolve gemini suggestions

16efb80

ruff

0c030c9

debug test

39914d9

Delete my local runs

1902ac1

adding attributes to allow using both the old naming convention and t…

7a7c914

…he new names

python compatibility issue on dataclass

882e758

ruff

cf94778

Fixed the comments

76ef0fe

call the ys znorm. the x are still preprocessed.

f769a8f

simplify preprocessing bardist attribute

2995386

Merge remote-tracking branch 'origin/main'

06014f0

refactor internally attempt

5ad66a9

ruff

753fbe8

Merge remote-tracking branch 'origin/main'

879d428

Merge branch 'main' into finetuning-debugging

ac9a8bd

Merge remote-tracking branch 'origin/main' into finetuning-debugging

32a8ff6

# Conflicts: # examples/notebooks/TabPFN_Demo_Local.ipynb # src/tabpfn/preprocessing.py

- add subsampling with replacement

a283d4f

Merge remote-tracking branch 'origin/main' into subsampling-with-repl…

e3eadb5

…acement

Copilot AI review requested due to automatic review settings August 27, 2025 09:24

Copilot AI reviewed Aug 27, 2025

gemini-code-assist bot reviewed Aug 27, 2025

noahho and others added 5 commits August 27, 2025 19:41

- add subsampling with replacement

56ff60b

Update src/tabpfn/preprocessing.py

b4968d0

Co-authored-by: Copilot <[email protected]>

- add subsampling with replacement

d46571c

Merge remote-tracking branch 'origin/subsampling-with-replacement' in…

809327e

…to subsampling-with-replacement

- add subsampling with replacement

3714ec2

LeoGrin requested review from LeoGrin and removed request for LeoGrin October 14, 2025 13:40
