diff --git a/README.md b/README.md
index 0bd90be..1163fec 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-⚡ANSR:
-Flash Amortized Neural Symbolic Regression
+⚡Flash-ANSR:
+Fast Amortized Neural Symbolic Regression
@@ -106,7 +106,7 @@ Coming soon
  title = {Flash Amortized Neural Symbolic Regression},
  year = {2024},
  publisher = {GitHub},
-  version = {0.4.4},
+  version = {0.4.5},
  url = {https://github.com/psaegert/flash-ansr}
}
```
diff --git a/docs/evaluation.md b/docs/evaluation.md
index 337c6ea..6e50d58 100644
--- a/docs/evaluation.md
+++ b/docs/evaluation.md
@@ -41,217 +41,46 @@
4. Install the required sympytorch fork: `pip install git+https://github.com/pakamienny/sympytorch.git`.
5. Download the pretrained checkpoint to `e2e/model1.pt` (mirror of https://dl.fbaipublicfiles.com/symbolicregression/model1.pt). Keep the filename as-is; the scaling config points there.

-## Express
+## Configs at a glance

-Use, copy or modify a config in `./configs`:
+- Evaluation configs live under `configs/evaluation/` (families: `scaling/`, `noise_sweep/`, `support_sweep/`).
+- Each file is a single run definition: `data_source`, `model_adapter`, and `runner` blocks.
+- Multi-experiment configs run **all** experiments when `--experiment` is omitted; pass a name to isolate one.
+- Outputs default to `results/evaluation/...` as specified in the config; override with `-o/--output-file`.
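+
+To eyeball a run file before launching it, load it as plain YAML (a minimal sketch; it assumes PyYAML is available, and the exact layout varies per config):
+
+```python
+import yaml
+
+with open("configs/evaluation/scaling/v23.0-20M_fastsrb.yaml") as f:
+    cfg = yaml.safe_load(f)
+
+# Top-level entries (in multi-experiment files, one per experiment)
+print(list(cfg))
+```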
-```
-./configs
-├── my_config
-│   ├── dataset_train.yaml         # Link to skeleton pool and padding for training
-│   ├── dataset_val.yaml           # Link to skeleton pool and padding for validation
-│   ├── tokenizer.yaml             # Tokenizer settings
-│   ├── model.yaml                 # Model settings and link to simplipy engine
-│   ├── skeleton_pool_train.yaml   # Sampling and holdout settings for training
-│   ├── skeleton_pool_val.yaml     # Sampling and holdout settings for validation
-│   └── train.yaml                 # Data and schedule for training
-```
-
-Use the helper scripts to import data, build validation sets, and kick off training:
-
-```sh
-./scripts/import_test_sets.sh                    # optional, required only once per checkout
-./scripts/generate_validation_set.sh my_config   # prepares validation skeletons
-./scripts/train.sh my_config                     # trains using configs/my_config
-```
-
-For more information see below.
-
-## Manual
-
-### 0. Prerequisites
-
-Test data structured as follows:
-
-```sh
-./data/ansr-data/test_set
-├── fastsrb
-│   └── expressions.yaml
-```
-
-The test data can be cloned from the Hugging Face data repository:
-
-```sh
-git clone https://huggingface.co/psaegert/ansr-data data/ansr-data
-```
-
-### 1. Import test data
-
-External datasets must be imported into the supported format:
+## Step-by-step run guide

-```sh
-flash_ansr import-data -i "{{ROOT}}/data/ansr-data/test_set/fastsrb/expressions.yaml" -p "fastsrb" -e "dev_7-3" -b "{{ROOT}}/configs/test_set/skeleton_pool.yaml" -o "{{ROOT}}/data/ansr-data/test_set/fastsrb/skeleton_pool" -v
-```
-
-with
-
-- `-i` the input file
-- `-p` the name of the parser implemented in `./src/flash_ansr/compat/convert_data.py`
-- `-e` the SimpliPy engine version to use for simplification
-- `-b` the config of a base skeleton pool to add the data to
-- `-o` the output directory for the resulting skeleton pool
-- `-v` verbose output
-
-This will create and save a skeleton pool with the parsed imported skeletons in the specified directory:
+### 0. Benchmark data

-```sh
-./data/ansr-data/test_set/
-└── skeleton_pool
-    ├── skeleton_pool.yaml
-    └── skeletons.pkl
-```
-
-### 2. Generate validation data
-
-Validation data is generated by randomly sampling according to the settings in the skeleton pool config:
+Fetch the FastSRB benchmark once (if you do not already have `data/ansr-data/test_set/fastsrb/expressions.yaml`):

```sh
-flash_ansr generate-skeleton-pool -c {{ROOT}}/configs/${CONFIG}/skeleton_pool_val.yaml -o {{ROOT}}/data/ansr-data/${CONFIG}/skeleton_pool_val -s 5000 -v
+mkdir -p "{{ROOT}}/data/ansr-data/test_set/fastsrb"
+wget -O "{{ROOT}}/data/ansr-data/test_set/fastsrb/expressions.yaml" \
+  "https://raw.githubusercontent.com/viktmar/FastSRB/refs/heads/main/src/expressions.yaml"
```

-with
-
-- `-c` the skeleton pool config
-- `-o` the output directory to save the skeleton pool
-- `-s` the number of unique skeletons to sample
-- `-v` verbose output
+This downloads the raw FastSRB `expressions.yaml` to the path the evaluation configs expect.
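+
+To confirm the download parses cleanly before launching long runs, load it once (a minimal sketch; it assumes nothing beyond valid YAML and PyYAML being installed):
+
+```python
+import yaml
+
+with open("data/ansr-data/test_set/fastsrb/expressions.yaml") as f:
+    expressions = yaml.safe_load(f)
+
+# Report the number of top-level benchmark entries
+print(f"Loaded {len(expressions)} entries")
+```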
-### 3. Train the model
-
-```sh
-flash_ansr train -c {{ROOT}}/configs/${CONFIG}/train.yaml -o {{ROOT}}/models/ansr-models/${CONFIG} -v -ci 100000 -vi 10000
-```
-
-with
-
-- `-c` the training config
-- `-o` the output directory to save the model and checkpoints
-- `-v` verbose output
-- `-ci` the interval to save checkpoints
-- `-vi` the interval for validation
-
-### 4. Evaluate the model
-
-⚡ANSR, PySR, NeSymReS, E2E, skeleton-pool, brute-force, and the FastSRB benchmark run through a shared evaluation engine.
-Each run is configured in a single YAML that wires a **data source**, a **model adapter**, and runtime **runner** settings.
-The common CLI entry point is:
+### 1. Run evaluation

```sh
flash_ansr evaluate-run -c configs/evaluation/scaling/v23.0-20M_fastsrb.yaml --experiment flash_ansr_fastsrb_choices_00032 -v
```

-Use `-n/--limit`, `--save-every`, `-o/--output-file`, `--experiment `, or `--no-resume` to temporarily override the config without editing the file. When a config defines multiple experiments (see `configs/evaluation/scaling/`), omitting `--experiment` now runs **all** of them sequentially; pass an explicit name if you only want a single sweep entry.
-
-#### 4.1 Config-driven workflow
-
-Every run config (see `configs/evaluation/*.yaml`) follows the same structure:
-
-```yaml
-run:
-  data_source:     # how to create evaluation samples
-    ...
-  model_adapter:   # which model/baseline to call
-    ...
-  runner:          # bookkeeping + persistence
-    limit: 5000
-    save_every: 250
-    output: "{{ROOT}}/results/evaluation/v23.0-20M/fastsrb.pkl"
-    resume: true
-```
-
-- **`data_source`** selects where problems come from. `type: skeleton_dataset` streams from a `FlashANSRDataset`, while `type: fastsrb` reads the FastSRB YAML benchmark. Common knobs include `n_support`, `noise_level`, and target sizes. Provide `datasets_per_expression` to iterate each skeleton or FastSRB equation deterministically with a fixed number of generated datasets (handy for reproducible evaluation sweeps).
-- **`model_adapter`** declares the solver. Supported values today are `flash_ansr`, `pysr`, `nesymres`, `skeleton_pool`, `brute_force`, and `e2e`, each with their own required fields (model paths, timeout/beam/samples knobs, etc.).
-- **`runner`** controls persistence: `limit` caps the number of processed samples, `save_every` checkpoints incremental progress to `output`, and `resume` decides whether to load previous results from that file.
-
-When `resume` is enabled the engine simply reloads the existing pickle, skips that many deterministic samples, and keeps writing to the same file.
-If a dataset cannot be generated within `max_trials`, the runner now appends a placeholder entry (`placeholder=True`, `placeholder_reason=...`) so the results length still reflects every attempted expression/dataset pair. Downstream analysis can filter those placeholders, but their presence keeps pause/resume logic trivial and avoids juggling extra state files. Skeleton dataset evaluations remain sequential—`datasets_per_expression` (default `1`) controls how many deterministic datasets are emitted per skeleton, and the previous random sampling mode has been removed.
-
-Running `flash_ansr evaluate-run ...` loads the config, resumes any previously saved pickle, instantiates the requested data/model pair, and streams results back into the same output file.
-
-#### 4.2 Example run configs
-
-Ready-to-use configs live under `configs/evaluation/scaling/` (with matching `noise_sweep/` and `support_sweep/` variants). All shipped experiments target FastSRB; the `*_v23_val.yaml` siblings swap in the v23 validation skeleton pool.
-
-##### 4.2.1 FlashANSR
-
-`configs/evaluation/scaling/v23.0-20M_fastsrb.yaml` (plus the 3M and 120M variants) sweep SoftmaxSampling `choices`. Example:
+or

```sh
-flash_ansr evaluate-run \
-    -c configs/evaluation/scaling/v23.0-20M_fastsrb.yaml \
-    --experiment flash_ansr_fastsrb_choices_00032 -v
+flash_ansr evaluate-run -c configs/evaluation/scaling/v23.0-20M_fastsrb.yaml -v
```

+to run all experiments in the config.
-
-##### 4.2.2 PySR
-
-`configs/evaluation/scaling/pysr_fastsrb.yaml` mirrors the same sweep over `niterations`. Run a single point with:
-
-```sh
-flash_ansr evaluate-run \
-    -c configs/evaluation/scaling/pysr_fastsrb.yaml \
-    --experiment pysr_fastsrb_iter_00032 -v
-```
-
-For long sweeps, `python scripts/evaluate_PySR.py -c --experiment -v` restarts jobs if PySR stalls.
-
-##### 4.2.3 NeSymReS
-
-`configs/evaluation/scaling/nesymres_fastsrb.yaml` varies `beam_width` for the 100M checkpoint tracked under `models/nesymres/`. Example:
-
-```sh
-flash_ansr evaluate-run \
-    -c configs/evaluation/scaling/nesymres_fastsrb.yaml \
-    --experiment nesymres_fastsrb_beam_width_00008 -v
-```
-
-##### 4.2.4 Skeleton pool baseline
-
-`configs/evaluation/scaling/skeleton_pool_fastsrb.yaml` samples skeletons directly from `data/ansr-data/test_set/fastsrb/skeleton_pool_max8`. Example:
-
-```sh
-flash_ansr evaluate-run \
-    -c configs/evaluation/scaling/skeleton_pool_fastsrb.yaml \
-    --experiment skeleton_pool_fastsrb_samples_00032 -v
-```
-
-##### 4.2.5 Brute force baseline
-
-`configs/evaluation/scaling/brute_force_fastsrb.yaml` exhaustively enumerates skeletons up to `max_expressions`. Example:
-
-```sh
-flash_ansr evaluate-run \
-    -c configs/evaluation/scaling/brute_force_fastsrb.yaml \
-    --experiment brute_force_fastsrb_max_expressions_00064 -v
-```
-
-##### 4.2.6 E2E baseline
-
-`configs/evaluation/scaling/e2e_fastsrb.yaml` sweeps `model_adapter.candidates_per_bag` (the beam size). Example:
-
-```sh
-flash_ansr evaluate-run \
-    -c configs/evaluation/scaling/e2e_fastsrb.yaml \
-    --experiment e2e_fastsrb_candidates_00016 -v
-```
-
-##### 4.2.7 Compute-scaling sweeps
-
-All scaling configs are multi-experiment. Omit `--experiment` to run the full sweep; the primary knobs are:
+- Adjust `-c` to any file under `configs/evaluation/` and optionally set `--experiment`.
+- Override on the fly: `-n/--limit`, `--save-every`, `-o/--output-file`, `--no-resume`.
+- The runner loads existing partial pickles, skips processed items, and appends new results.
+  If sample generation fails within `max_trials`, a placeholder entry is written to preserve counts, as shown in the sketch below.
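+
+Placeholders are easy to drop in downstream analysis (a minimal sketch; it assumes the output pickle holds a list of per-sample dicts):
+
+```python
+import pickle
+
+# Path taken from the `output` field of the run config
+with open("results/evaluation/v23.0-20M/fastsrb.pkl", "rb") as f:
+    results = pickle.load(f)
+
+# Keep only entries with real results; placeholders mark failed dataset generation
+real = [r for r in results if not r.get("placeholder", False)]
+print(f"{len(real)} / {len(results)} samples carry real results")
+```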
-- **FlashANSR**: `generation_overrides.kwargs.choices`
-- **PySR**: `niterations`
-- **NeSymReS**: `beam_width`
-- **SkeletonPool**: `samples`
-- **BruteForce**: `max_expressions`
-- **E2E**: `candidates_per_bag`
+### 2. Example configs

-Outputs are namespaced under `results/evaluation/scaling///...` so sweeps can run back-to-back.
+- FlashANSR v23.0-20M scaling: `configs/evaluation/scaling/v23.0-20M_fastsrb.yaml`
+- PySR scaling: `configs/evaluation/scaling/pysr_fastsrb.yaml`
+- NeSymReS scaling: `configs/evaluation/scaling/nesymres_fastsrb.yaml`
+- E2E baseline: `configs/evaluation/scaling/e2e_fastsrb.yaml`
diff --git a/docs/getting_started.md b/docs/getting_started.md
index 6a3f1e0..c49aa00 100644
--- a/docs/getting_started.md
+++ b/docs/getting_started.md
@@ -15,29 +15,44 @@ See [all available models on Hugging Face](https://huggingface.co/models?search=
## Minimal inference Example

```python
-import numpy as np
-from flash_ansr import FlashANSR, SoftmaxSamplingConfig, get_path
-
-# Define some data
-X = np.random.randn(256, 2)
-y = X[:, 0] + X[:, 1]
-
-# Load the model (assuming v23.0-120M is installed)
+import numpy as np
+import torch
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+
+# Import flash_ansr
+from flash_ansr import (
+    FlashANSR,
+    SoftmaxSamplingConfig,
+    install_model,
+    get_path,
+)
+
+# Select a model from Hugging Face
+# https://huggingface.co/models?search=flash-ansr-v23.0
+MODEL = "psaegert/flash-ansr-v23.0-120M"
+
+# Download the latest snapshot of the model
+# By default, the model is downloaded to the directory `./models/` in the package root
+install_model(MODEL)
+
+# Load the model
model = FlashANSR.load(
-    directory=get_path('models', 'psaegert/flash-ansr-v23.0-120M'),
-    generation_config=SoftmaxSamplingConfig(choices=256),
-) # .to(device) for GPU. Highly recommended.
+    directory=get_path('models', MODEL),
+    generation_config=SoftmaxSamplingConfig(choices=32),  # or BeamSearchConfig / MCTSGenerationConfig
+    n_restarts=8,
+).to(device)

-# Find an expression that fits the data by sampling from the model
+# Define example data: y = x1 + x2
+X = np.random.randn(256, 2)
+y = X[:, 0] + X[:, 1]
+
+# Fit the model to the data
model.fit(X, y, verbose=True)

-print("Expression:", model.get_expression())
+# Show the best expression
+print(model.get_expression())

+# Predict with the best expression
y_pred = model.predict(X)
-print("Predictions:", y_pred[:5])
-
-# All results are stored in model.results as a pandas DataFrame
-model.results
```
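+
+An earlier revision of this example noted that every candidate fit is stored in `model.results` as a pandas DataFrame; assuming that attribute is unchanged in this release, you can inspect the runner-up expressions as well:
+
+```python
+# All results are stored in model.results as a pandas DataFrame
+print(model.results.head())
+```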
Find more details in the [API Reference](api.md).
diff --git a/docs/index.md b/docs/index.md
index c33e536..77fce91 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -15,16 +15,44 @@ pip install flash-ansr
flash_ansr install psaegert/flash-ansr-v23.0-120M
```

```python
-import numpy as np
-from flash_ansr import FlashANSR, SoftmaxSamplingConfig, get_path
+import numpy as np
+import torch
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

-X = np.random.randn(256, 2)
-model = FlashANSR.load(
-    directory=get_path('models', 'psaegert/flash-ansr-v23.0-120M'),
-    generation_config=SoftmaxSamplingConfig(choices=512),
+# Import flash_ansr
+from flash_ansr import (
+    FlashANSR,
+    SoftmaxSamplingConfig,
+    install_model,
+    get_path,
)
-expr = model.fit(X, X[:, 0] + X[:, 1])
-print(expr)
+
+# Select a model from Hugging Face
+# https://huggingface.co/models?search=flash-ansr-v23.0
+MODEL = "psaegert/flash-ansr-v23.0-120M"
+
+# Download the latest snapshot of the model
+# By default, the model is downloaded to the directory `./models/` in the package root
+install_model(MODEL)
+
+# Load the model
+model = FlashANSR.load(
+    directory=get_path('models', MODEL),
+    generation_config=SoftmaxSamplingConfig(choices=32),  # or BeamSearchConfig / MCTSGenerationConfig
+    n_restarts=8,
+).to(device)
+
+# Define example data: y = x1 + x2
+X = np.random.randn(256, 2)
+y = X[:, 0] + X[:, 1]
+
+# Fit the model to the data
+model.fit(X, y, verbose=True)
+
+# Show the best expression
+print(model.get_expression())
+
+# Predict with the best expression
+y_pred = model.predict(X)
```
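+
+To gauge the quality of the recovered expression, compare `y_pred` against `y` with plain NumPy (no flash-ansr API involved):
+
+```python
+# Coefficient of determination (R^2) of the fitted expression
+r2 = 1.0 - np.sum((y - y_pred) ** 2) / np.sum((y - y.mean()) ** 2)
+print(f"R^2 = {r2:.4f}")
+```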
## Serving these docs locally
diff --git a/docs/training.md b/docs/training.md
index 3f99049..72a707f 100644
--- a/docs/training.md
+++ b/docs/training.md
@@ -19,8 +19,13 @@
```

Produces checkpoints under `models/ansr-models/test/` with `model.yaml`, `tokenizer.yaml`, and `state_dict.pt`.

+## Helper scripts
+- `./scripts/import_test_sets.sh`: import benchmark skeletons once so training excludes evaluation holdouts.
+- `./scripts/generate_validation_set.sh `: create held-out skeleton pools matching your bundle.
+- `./scripts/train.sh `: convenience wrapper to launch training with the bundle.
+
## Full training workflow

-1. **Import test sets**: Ajdust and run `./scripts/import_test_sets.sh` to import test sets. The data generating processes during training will exclude these skeletons to ensure fair evaluation.
+1. **Import test sets**: Adjust and run `./scripts/import_test_sets.sh` to import test sets. The data generating processes during training will exclude these skeletons to ensure fair evaluation.
2. **Configure skeleton pools and datasets**: Adjust the `skeleton_pool_*.yaml` and `dataset_*.yaml` files inside your chosen config bundle to set operator priors, expression depths, and data sampling strategies.
3. **Prepare held out skeleton pools** (optional if reusing shipped ones):

   ```bash
diff --git a/pyproject.toml b/pyproject.toml
index 2622b90..60cbc8d 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -1,26 +1,63 @@
[project]
name = "flash_ansr"
-description = "Flash Amortized Neural Symbolic Regression"
+description = "Flash-ANSR: Fast Amortized Neural Symbolic Regression - Discover symbolic expressions from tabular data using SetTransformer and Transformer architectures"
authors = [
    {name = "Paul Saegert"},
-  ]
+]
readme = "README.md"
requires-python = ">=3.12"
dynamic = ["dependencies"]
-version = "0.4.4"
+version = "0.4.5"
license = "MIT"
license-files = ["LICEN[CS]E*"]
+keywords = [
+    "symbolic-regression",
+    "neural-symbolic-regression",
+    "simulation-based-inference",
+    "transformer",
+    "set-transformer",
+    "machine-learning",
+    "pytorch",
+    "expression-discovery",
+    "equation-discovery",
+    "tabular-data",
+    "mathematical-modeling"
+]
+
+classifiers = [
+    # Development status
+    "Development Status :: 4 - Beta",
+
+    # Intended audience
+    "Intended Audience :: Science/Research",
+    "Intended Audience :: Developers",
+
+    # Topic areas - focus on core functionality
+    "Topic :: Scientific/Engineering :: Artificial Intelligence",
+    "Topic :: Scientific/Engineering :: Mathematics",
+    "Topic :: Software Development :: Libraries :: Python Modules",
+
+    # Programming language
+    "Programming Language :: Python :: 3",
+    "Programming Language :: Python :: 3.12",
+
+    # Operating system
+    "Operating System :: OS Independent",
+]
+
[project.urls]
Homepage = "https://github.com/psaegert/flash-ansr"
+Documentation = "https://flash-ansr.readthedocs.io/en/latest/"
+Repository = "https://github.com/psaegert/flash-ansr"
Issues = "https://github.com/psaegert/flash-ansr/issues"
PyPI = "https://pypi.org/project/flash-ansr/"
-ReadtheDocs = "https://flash-ansr.readthedocs.io/"
+"Bug Reports" = "https://github.com/psaegert/flash-ansr/issues"
+"Demo Notebook" = "https://github.com/psaegert/flash-ansr/blob/main/experimental/demo.ipynb"

[project.scripts]
flash_ansr = "flash_ansr.__main__:main"
-
[tool.setuptools.dynamic]
dependencies = {file = ["requirements.txt"]}
diff --git a/src/flash_ansr/baselines/brute_force_model.py b/src/flash_ansr/baselines/brute_force_model.py
index 903355e..2c312cb 100644
--- a/src/flash_ansr/baselines/brute_force_model.py
+++ b/src/flash_ansr/baselines/brute_force_model.py
@@ -60,7 +60,7 @@ def __init__(
        self.refiner_method = refiner_method
        self.refiner_p0_noise = refiner_p0_noise
        if refiner_p0_noise_kwargs == 'default':
-            refiner_p0_noise_kwargs = {'low': -5, 'high': 5}
+            refiner_p0_noise_kwargs = {'loc': 0.0, 'scale': 5.0}
        self.refiner_p0_noise_kwargs = copy.deepcopy(refiner_p0_noise_kwargs) if refiner_p0_noise_kwargs is not None else None
        self.numpy_errors = numpy_errors
        self.parsimony = parsimony
diff --git a/src/flash_ansr/baselines/skeleton_pool_model.py b/src/flash_ansr/baselines/skeleton_pool_model.py
index 0c764df..419d4bc 100644
--- a/src/flash_ansr/baselines/skeleton_pool_model.py
+++ b/src/flash_ansr/baselines/skeleton_pool_model.py
@@ -67,7 +67,7 @@ def __init__(
        self.refiner_method = refiner_method
        self.refiner_p0_noise = refiner_p0_noise
        if refiner_p0_noise_kwargs == 'default':
-            refiner_p0_noise_kwargs = {'low': -5, 'high': 5}
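+            # 'default' now parameterizes the normal noise sampler (loc/scale)
+            # rather than the former uniform one (low/high)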
+            refiner_p0_noise_kwargs = {'loc': 0.0, 'scale': 5.0}
        self.refiner_p0_noise_kwargs = copy.deepcopy(refiner_p0_noise_kwargs) if refiner_p0_noise_kwargs is not None else None
        self.numpy_errors = numpy_errors
        self.parsimony = parsimony
diff --git a/src/flash_ansr/flash_ansr.py b/src/flash_ansr/flash_ansr.py
index 6f22a44..e5ff241 100644
--- a/src/flash_ansr/flash_ansr.py
+++ b/src/flash_ansr/flash_ansr.py
@@ -202,7 +202,7 @@ class FlashANSR(BaseEstimator):
        perturbations.
    refiner_p0_noise_kwargs : dict or {'default'} or None, optional
        Keyword arguments forwarded to the noise sampler. ``'default'`` yields
-        ``{'low': -5, 'high': 5}`` for the uniform distribution.
+        ``{'loc': 0.0, 'scale': 5.0}`` for the normal distribution.
    numpy_errors : {'ignore', 'warn', 'raise', 'call', 'print', 'log'} or None, optional
        Desired NumPy error handling strategy applied during constant refinement.
    parsimony : float, optional
@@ -261,7 +261,7 @@ def __init__(
        self.tokenizer = tokenizer

        if refiner_p0_noise_kwargs == 'default':
-            refiner_p0_noise_kwargs = {'low': -5, 'high': 5}
+            refiner_p0_noise_kwargs = {'loc': 0.0, 'scale': 5.0}

        if generation_config is None:
            generation_config = SoftmaxSamplingConfig()
@@ -300,7 +300,7 @@ def load(
        cls,
        directory: str,
        generation_config: GenerationConfig | None = None,
-        n_restarts: int = 1,
+        n_restarts: int = 8,
        refiner_method: Literal[
            'curve_fit_lm',
            'minimize_bfgs',
@@ -333,7 +333,7 @@ def load(
        Distribution used to perturb initial constant guesses.
    refiner_p0_noise_kwargs : dict or {'default'} or None, optional
        Additional keyword arguments for the noise sampler. ``'default'``
-        resolves to ``{'low': -5, 'high': 5}``.
+        resolves to ``{'loc': 0.0, 'scale': 5.0}``.
    numpy_errors : {'ignore', 'warn', 'raise', 'call', 'print', 'log'} or None, optional
        NumPy floating-point error policy applied during refinement.
    parsimony : float, optional
diff --git a/src/flash_ansr/refine.py b/src/flash_ansr/refine.py
index d49b3f5..f32ff3c 100644
--- a/src/flash_ansr/refine.py
+++ b/src/flash_ansr/refine.py
@@ -125,7 +125,7 @@ def fit(
        p0: np.ndarray | None = None,
        p0_noise: Literal['uniform', 'normal'] | None = 'normal',
        p0_noise_kwargs: dict | None = None,
-        n_restarts: int = 1,
+        n_restarts: int = 8,
        method: Literal[
            'curve_fit_lm',
            'minimize_bfgs',
diff --git a/tests/test_eval/test_evaluation.py b/tests/test_eval/test_evaluation.py
index 2059b26..01b6bbb 100644
--- a/tests/test_eval/test_evaluation.py
+++ b/tests/test_eval/test_evaluation.py
@@ -17,7 +17,7 @@ from flash_ansr.expressions import SkeletonPool

-MODEL = "psaegert/flash-ansr-v19.0-6M"
+MODEL = "psaegert/flash-ansr-v23.0-3M"

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")