Release 24.3.3.0 #1211

Merged: 59 commits, merged on Mar 19, 2024.

Changes from all commits (59 commits):
bf440f6  fix 1040 (paxcema, Jan 22, 2024)
130fdfb  fix #1189: remove RnnEncoder, tests and helper methods (paxcema, Jan 22, 2024)
b3515c8  Merge pull request #1202 from mindsdb/fix/1040 (paxcema, Jan 22, 2024)
9b59d6b  add unit mixer docs (paxcema, Jan 22, 2024)
71942c2  Merge pull request #1203 from mindsdb/fix/1189 (paxcema, Jan 22, 2024)
ac894a6  Merge pull request #1204 from mindsdb/fix/1009 (paxcema, Jan 22, 2024)
403e9a0  fix: switch GITHUB_TOKEN to CLA_TOKEN (paxcema, Jan 23, 2024)
6d00e36  fix: add explicit permissions to github token (lucas-koontz, Jan 23, 2024)
a1d682d  updated docs link (martyna-mindsdb, Feb 28, 2024)
2ca281b  Merge pull request #1205 from martyna-mindsdb/docs-link (ZoranPandovski, Feb 29, 2024)
1acbd75  fix: add Test() method for lightwood predictors (paxcema, Feb 29, 2024)
3e499bf  lint: flake8 (paxcema, Feb 29, 2024)
7ce79c9  all metrics work via label encoding on the evaluator (paxcema, Mar 5, 2024)
a18f9cb  Merge pull request #1206 from mindsdb/fix_988 (paxcema, Mar 5, 2024)
3e2fe67  Bump mindsdb_evaluator >= 0.0.12 (paxcema, Mar 5, 2024)
a98ecc3  numeric encoder weight support (QuantumPlumber, Mar 7, 2024)
b6bb825  lightgbm numeric weight support (QuantumPlumber, Mar 7, 2024)
acf1e38  xgboost numeric weight support (QuantumPlumber, Mar 7, 2024)
1b80446  formatting and wip code duplication (QuantumPlumber, Mar 7, 2024)
1f1d7ba  wip index weights on numeric datasets (QuantumPlumber, Mar 7, 2024)
d11d4ed  Merge branch 'staging' into weighted-regression (QuantumPlumber, Mar 11, 2024)
a909ecf  wip unit testing (QuantumPlumber, Mar 11, 2024)
1a17bd2  pass target weights to numerical types (QuantumPlumber, Mar 12, 2024)
8ba9089  move dataset weighting into numeric encoder (QuantumPlumber, Mar 12, 2024)
d276182  wip testing (QuantumPlumber, Mar 12, 2024)
9e3a68f  poetry, first try (paxcema, Mar 13, 2024)
a14ce9f  fix extras (paxcema, Mar 13, 2024)
7f72387  fix optional deps (paxcema, Mar 14, 2024)
2bd11a7  add poetry.lock (paxcema, Mar 14, 2024)
1905f0a  add poetry dep (paxcema, Mar 14, 2024)
9e1df22  fix (paxcema, Mar 14, 2024)
f04b75f  fix (paxcema, Mar 14, 2024)
1a26db5  fix (paxcema, Mar 14, 2024)
395176a  fix (paxcema, Mar 14, 2024)
f6b1daa  fix (paxcema, Mar 14, 2024)
e6ce02f  fix (paxcema, Mar 14, 2024)
bc867c6  fix (paxcema, Mar 14, 2024)
b98b517  fix (paxcema, Mar 14, 2024)
d08f3b9  refresh poetry lock (paxcema, Mar 14, 2024)
453288e  refresh poetry lock (paxcema, Mar 14, 2024)
c9e2bc7  refresh poetry lock (paxcema, Mar 14, 2024)
619da88  refresh poetry lock (paxcema, Mar 14, 2024)
871ba29  wip xgboost weighting test. (QuantumPlumber, Mar 14, 2024)
5d06b98  lint: flake8 (paxcema, Mar 15, 2024)
b6030f7  Merge pull request #1209 from mindsdb/fix/1207_pin_reqs (paxcema, Mar 15, 2024)
3cca968  wip numeric encoder as target (QuantumPlumber, Mar 15, 2024)
ac8bff6  wip numeric encoder as target (QuantumPlumber, Mar 15, 2024)
061de05  wip numeric encoder test (QuantumPlumber, Mar 15, 2024)
ca78b92  wip added lgihtgbm to requirements.txt (QuantumPlumber, Mar 15, 2024)
4cc2ee9  rm leftover folder (paxcema, Mar 18, 2024)
37358f9  version bump: 24.3.3.0 (paxcema, Mar 18, 2024)
5aabc28  remove unused deps (paxcema, Mar 18, 2024)
1ede8ca  move build section (paxcema, Mar 18, 2024)
acbce9c  poetry lock (paxcema, Mar 18, 2024)
ae8b628  Merge pull request #1212 from mindsdb/fix/remove_unused_deps (paxcema, Mar 18, 2024)
8cc8a92  wip pr fixes (QuantumPlumber, Mar 18, 2024)
7adc8e8  Merge branch 'staging' into weighted-regression (QuantumPlumber, Mar 18, 2024)
665a130  wip skip lightgbm a different way to avoid flake9 error (QuantumPlumber, Mar 18, 2024)
8d1559c  Merge pull request #1210 from mindsdb/weighted-regression (QuantumPlumber, Mar 18, 2024)

2 changes: 2 additions & 0 deletions .github/workflows/doc_build.yml
@@ -10,6 +10,8 @@ on:
 jobs:
   doc_build:
     runs-on: ubuntu-latest
+    permissions:
+      contents: write

     steps:
       - name: checkout and set up
9 changes: 4 additions & 5 deletions .github/workflows/ligthtwood.yml
@@ -26,9 +26,8 @@ jobs:
       - name: Install dependencies
         run: |
           python -m pip install --upgrade pip
-          pip install --no-cache-dir -e .
-          pip install -r requirements_image.txt
-          pip install flake8
+          python -m pip install setuptools poetry
+          poetry install -E dev -E image
       - name: Install dependencies OSX
         run: |
           if [ "$RUNNER_OS" == "macOS" ]; then
@@ -39,11 +38,11 @@
           CHECK_FOR_UPDATES: False
       - name: Lint with flake8
         run: |
-          python -m flake8 .
+          poetry run python -m flake8 .
       - name: Test with unittest
         run: |
           # Run all the "standard" tests
-          python -m unittest discover tests
+          poetry run python -m unittest discover tests

   deploy:
     runs-on: ubuntu-latest
1 change: 0 additions & 1 deletion MANIFEST.in

This file was deleted.

2 changes: 1 addition & 1 deletion README.md
@@ -45,7 +45,7 @@ We predominantly use PyTorch based approaches, but can support other models.

 ## Usage

-We invite you to check out our [documentation](https://lightwood.io) for specific guidelines and tutorials! Please stay tuned for updates and changes.
+We invite you to check out our [documentation](https://mindsdb.github.io/lightwood/) for specific guidelines and tutorials! Please stay tuned for updates and changes.

 ### Quick use cases
 Lightwood works with `pandas.DataFrames`. Once a DataFrame is loaded, define a "ProblemDefinition" via a dictionary. The only thing a user needs to specify is the name of the column to predict (via the key `target`).
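For illustration, a minimal sketch of the quick-start flow the README describes, assuming the high-level API (`predictor_from_problem`); the dataset file and column names are hypothetical, and this snippet is not part of the PR's diff:

import pandas as pd
from lightwood.api.high_level import predictor_from_problem
from lightwood.api.types import ProblemDefinition

df = pd.read_csv('home_rentals.csv')  # any tabular dataset (hypothetical file)
# only the target column name is required
pdef = ProblemDefinition.from_dict({'target': 'rental_price'})

predictor = predictor_from_problem(df, pdef)  # generates and compiles a pipeline
predictor.learn(df)                           # train on the DataFrame
predictions = predictor.predict(df)           # returns a DataFrame of predictions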
4 changes: 2 additions & 2 deletions docssrc/source/index.rst
@@ -19,7 +19,7 @@ Lightwood works with a variety of data types such as numbers, dates, categories,

 Our JSON-AI syntax allows users to change any and all parts of the models Lightwood automatically generates. The syntax outlines the specific details in each step of the modeling pipeline. Users may override default values (for example, changing the type of a column) or alternatively, entirely replace steps with their own methods (ex: use a random forest model for a predictor). Lightwood creates a "JSON-AI" object from this syntax which can then be used to automatically generate python code to represent your pipeline.

-For details as to how Lightwood works, check out the `Lightwood Philosophy <https://lightwood.io/lightwood_philosophy.html>`_ .
+For details as to how Lightwood works, check out the `Lightwood Philosophy <https://mindsdb.github.io/lightwood/lightwood_philosophy.html>`_ .

 Quick Guide
 =======================
@@ -124,7 +124,7 @@ BYOM: Bring your own models

 Lightwood supports user architectures/approaches so long as you follow the abstractions provided within each step.

-Our `tutorials <https://lightwood.io/tutorials.html>`_ provide specific use cases for how to introduce customization into your pipeline. Check out "custom cleaner", "custom splitter", "custom explainer", and "custom mixer". Stay tuned for further updates.
+Our `tutorials <https://mindsdb.github.io/lightwood/tutorials.html>`_ provide specific use cases for how to introduce customization into your pipeline. Check out "custom cleaner", "custom splitter", "custom explainer", and "custom mixer". Stay tuned for further updates.
2 changes: 1 addition & 1 deletion lightwood/__about__.py
@@ -1,6 +1,6 @@
 __title__ = 'lightwood'
 __package_name__ = 'lightwood'
-__version__ = '23.12.4.0'
+__version__ = '24.3.3.0'
 __description__ = "Lightwood is a toolkit for automatic machine learning model building"
 __email__ = "[email protected]"
 __author__ = 'MindsDB Inc'
5 changes: 5 additions & 0 deletions lightwood/api/json_ai.py
@@ -91,6 +91,11 @@ def lookup_encoder(
                 "positive_domain"
             ] = "$statistical_analysis.positive_domain"

+        if problem_defintion.target_weights is not None:
+            encoder_dict["args"][
+                "target_weights"
+            ] = problem_defintion.target_weights
+
         # Time-series representations require more advanced flags
         if tss.is_timeseries:
             gby = tss.group_by if tss.group_by is not None else []
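The hunk above forwards ``target_weights`` from the problem definition into the numeric encoder's constructor arguments (note the pre-existing ``problem_defintion`` spelling of the source parameter). A hedged sketch of how a user would populate that field for a numeric target; the edges, weights, and column name are illustrative assumptions:

from lightwood.api.types import ProblemDefinition

# For a numeric target, keys act as interval edges over the target's domain and
# values are relative sample weights (see NumericEncoder.get_weights further below).
pdef = ProblemDefinition.from_dict({
    'target': 'price',
    'target_weights': {0.0: 1.0, 100.0: 2.0, 1000.0: 5.0},
})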
20 changes: 18 additions & 2 deletions lightwood/api/predictor.py
@@ -28,6 +28,7 @@ class PredictorInterface:
     You can also use the predictor to now estimate new data:

     - ``predict``: Deploys the chosen best model, and evaluates the given data to provide target estimates.
+    - ``test``: Similar to predict, but the user also passes an accuracy function that will be used to compute a metric with the generated predictions.
     - ``save``: Saves the Predictor object for further use.

     The ``PredictorInterface`` is created via J{ai}son's custom code creation. A problem inherits from this class with pre-populated routines to fill out expected results, given the nature of each problem type.
@@ -127,12 +128,27 @@ def adjust(self, new_data: pd.DataFrame, old_data: Optional[pd.DataFrame] = None

     def predict(self, data: pd.DataFrame, args: Dict[str, object] = {}) -> pd.DataFrame:
         """
-        Intakes raw data to provide predicted values for your trained model.
+        Intakes raw data to provide model predictions.

         :param data: Data (n_samples, n_columns) that the model will use as input to predict the corresponding target value for each sample.
         :param args: any parameters used to customize inference behavior. Wrapped as a ``PredictionArguments`` object.

         :returns: A dataframe containing predictions and additional sample-wise information. `n_samples` rows.
         """  # noqa
         pass

+    def test(
+            self, data: pd.DataFrame, metrics: list, args: Dict[str, object] = {}, strict: bool = False
+    ) -> pd.DataFrame:
+        """
+        Intakes raw data to compute values for a list of provided metrics using a Lightwood predictor.
+
+        :param data: Data (n_samples, n_columns) that the model(s) will evaluate on and provide the target prediction.
+        :param metrics: A list of metrics to evaluate the model's performance on.
+        :param args: parameters needed to update the predictor ``PredictionArguments`` object, which holds any parameters relevant for prediction.
+        :param strict: If True, the function will raise an error if the model does not support any of the requested metrics. Otherwise it skips them.
+
+        :returns: A dataframe with `n_metrics` columns, each cell containing the respective score of each metric.
+        """  # noqa
+        pass
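A hedged usage sketch for the new ``test`` entry point; the metric name 'r2' and the variable names are illustrative assumptions, not taken from this diff:

# predictor: a trained Lightwood predictor; holdout_df: unseen rows including the target
scores = predictor.test(holdout_df, metrics=['r2'], strict=False)
print(scores)  # one column per requested metric, each cell holding that metric's score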
2 changes: 1 addition & 1 deletion lightwood/data/encoded_ds.py
@@ -14,7 +14,7 @@ def __init__(self, encoders: Dict[str, BaseEncoder], data_frame: pd.DataFrame, t

         Note: normal behavior is to cache encoded representations to avoid duplicated computations. If you want an option to disable this, please open an issue.

-        :param encoders: list of Lightwood encoders used to encode the data per each column.
+        :param encoders: dictionary of Lightwood encoders used to encode the data per each column.
         :param data_frame: original dataframe.
         :param target: name of the target column to predict.
         """  # noqa
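Consistent with the corrected docstring, ``EncodedDs`` expects a dictionary keyed by column name. A minimal sketch under that assumption; the toy frame and the choice of ``NumericEncoder`` for every column are illustrative:

import pandas as pd
from lightwood.data.encoded_ds import EncodedDs
from lightwood.encoder import NumericEncoder

df = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [2.0, 4.0, 6.0]})
encoders = {col: NumericEncoder(is_target=(col == 'y')) for col in df.columns}
for col, enc in encoders.items():
    enc.prepare(df[col])  # each encoder must be prepared before it can encode
ds = EncodedDs(encoders, df, target='y')  # caches encoded representations per column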
3 changes: 1 addition & 2 deletions lightwood/encoder/__init__.py
@@ -8,7 +8,6 @@
 from lightwood.encoder.array.ts_num_array import TsArrayNumericEncoder
 from lightwood.encoder.text.short import ShortTextEncoder
 from lightwood.encoder.text.vocab import VocabularyEncoder
-from lightwood.encoder.text.rnn import RnnEncoder as TextRnnEncoder
 from lightwood.encoder.categorical.simple_label import SimpleLabelEncoder
 from lightwood.encoder.categorical.onehot import OneHotEncoder
 from lightwood.encoder.categorical.binary import BinaryEncoder
@@ -22,7 +21,7 @@

 __all__ = ['BaseEncoder', 'DatetimeEncoder', 'Img2VecEncoder', 'NumericEncoder', 'TsNumericEncoder',
-           'TsArrayNumericEncoder', 'ShortTextEncoder', 'VocabularyEncoder', 'TextRnnEncoder', 'OneHotEncoder',
+           'TsArrayNumericEncoder', 'ShortTextEncoder', 'VocabularyEncoder', 'OneHotEncoder',
            'CategoricalAutoEncoder', 'TimeSeriesEncoder', 'ArrayEncoder', 'MultiHotEncoder', 'TsCatArrayEncoder',
            'NumArrayEncoder', 'CatArrayEncoder', 'SimpleLabelEncoder',
            'PretrainedLangEncoder', 'BinaryEncoder', 'DatetimeNormalizerEncoder', 'MFCCEncoder']
44 changes: 38 additions & 6 deletions lightwood/encoder/numeric/numeric.py
@@ -1,5 +1,6 @@
 import math
-from typing import Union
+from typing import Union, Dict
+from copy import deepcopy as dc

 import torch
 import numpy as np
@@ -20,11 +21,15 @@ class NumericEncoder(BaseEncoder):
     The ``absolute_mean`` is computed in the ``prepare`` method and is just the mean of the absolute values of all numbers fed to prepare (which are not none)

     ``none`` stands for any number that is an actual python ``None`` value or any sort of non-numeric value (a string, nan, inf)
-    """  # noqa
+    """  # noqa

-    def __init__(self, data_type: dtype = None, is_target: bool = False, positive_domain: bool = False):
+    def __init__(self, data_type: dtype = None,
+                 target_weights: Dict[float, float] = None,
+                 is_target: bool = False,
+                 positive_domain: bool = False):
         """
         :param data_type: The data type of the number (integer, float, quantity)
+        :param target_weights: a dictionary of weights to use on the examples.
         :param is_target: Indicates whether the encoder refers to a target column or feature column (True==target)
         :param positive_domain: Forces the encoder to always output positive values
         """
@@ -34,12 +39,19 @@ def __init__(self, data_type: dtype = None, is_target: bool = False, positive_do
         self.decode_log = False
         self.output_size = 4 if not self.is_target else 3

+        # Weight-balance info if encoder represents target
+        self.target_weights = None
+        self.index_weights = None
+        if self.is_target and target_weights is not None:
+            self.target_weights = dc(target_weights)
+            self.index_weights = torch.tensor(list(self.target_weights.values()))
+
     def prepare(self, priming_data: pd.Series):
         """
         "NumericalEncoder" uses a rule-based form to prepare results on training (priming) data. The averages etc. are taken from this distribution.

         :param priming_data: an iterable data structure containing numbers which will be used to compute the values used for normalizing the encoded representations
-        """  # noqa
+        """  # noqa
         if self.is_prepared:
             raise Exception('You can only call "prepare" once for a given encoder.')
@@ -57,7 +69,8 @@ def encode(self, data: Union[np.ndarray, pd.Series]):
         if isinstance(data, pd.Series):
             data = data.values

-        inp_data = np.nan_to_num(data.astype(float), nan=0, posinf=np.finfo(np.float32).max, neginf=np.finfo(np.float32).min)  # noqa
+        inp_data = np.nan_to_num(data.astype(float), nan=0, posinf=np.finfo(np.float32).max,
+                                 neginf=np.finfo(np.float32).min)  # noqa
         if not self.positive_domain:
             sign = np.vectorize(self._sign_fn, otypes=[float])(inp_data)
         else:
@@ -97,7 +110,7 @@ def decode(self, encoded_values: torch.Tensor, decode_log: bool = None) -> list:
         :param decode_log: Whether to decode the ``log`` or ``linear`` part of the representation, since the encoded vector contains both a log and a linear part

         :returns: The decoded array
-        """  # noqa
+        """  # noqa

         if not self.is_prepared:
             raise Exception('You need to call "prepare" before calling "encode" or "decode".')
@@ -145,3 +158,22 @@ def decode(self, encoded_values: torch.Tensor, decode_log: bool = None) -> list:
             ret[mask_none] = None

         return ret.tolist()  # TODO: update signature on BaseEncoder and replace all encs to return ndarrays
+
+    def get_weights(self, label_data):
+        # get a sorted list of intervals to assign weights. Keys are the interval edges.
+        target_weight_keys = np.array(list(self.target_weights.keys()))
+        target_weight_values = np.array(list(self.target_weights.values()))
+        sorted_indices = np.argsort(target_weight_keys)
+
+        # get sorted arrays for vector numpy operations
+        target_weight_keys = target_weight_keys[sorted_indices]
+        target_weight_values = target_weight_values[sorted_indices]
+
+        # find the indices of the bins according to the keys. clip to the length of the weight values
+        # (searchsorted returns indices from 0 to N with N = len(target_weight_keys)).
+        assigned_target_weight_indices = np.clip(a=np.searchsorted(target_weight_keys, label_data),
+                                                 a_min=0,
+                                                 a_max=len(target_weight_keys) - 1).astype(np.int32)
+
+        return target_weight_values[assigned_target_weight_indices]
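The interval lookup in ``get_weights`` can be exercised in isolation; a small numpy sketch that mirrors the searchsorted-plus-clip logic above, with illustrative edges and labels (not from the PR's test suite):

import numpy as np

target_weights = {0.0: 1.0, 10.0: 2.0, 100.0: 5.0}  # interval edges -> weights
keys = np.array(sorted(target_weights))
vals = np.array([target_weights[k] for k in keys])

labels = np.array([-5.0, 3.0, 10.0, 50.0, 500.0])
idx = np.clip(np.searchsorted(keys, labels), 0, len(keys) - 1)
print(vals[idx])  # [1. 2. 2. 5. 5.]: out-of-range labels clip to the edge weights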

3 changes: 1 addition & 2 deletions lightwood/encoder/text/__init__.py
@@ -1,8 +1,7 @@
 from lightwood.encoder.text.pretrained import PretrainedLangEncoder
-from lightwood.encoder.text.rnn import RnnEncoder
 from lightwood.encoder.text.tfidf import TfidfEncoder
 from lightwood.encoder.text.short import ShortTextEncoder
 from lightwood.encoder.text.vocab import VocabularyEncoder


-__all__ = ['PretrainedLangEncoder', 'RnnEncoder', 'TfidfEncoder', 'ShortTextEncoder', 'VocabularyEncoder']
+__all__ = ['PretrainedLangEncoder', 'TfidfEncoder', 'ShortTextEncoder', 'VocabularyEncoder']
46 changes: 0 additions & 46 deletions lightwood/encoder/text/helpers/pretrained_helpers.py
@@ -4,7 +4,6 @@
 Basic helper functions for PretrainedLangEncoder
 """
 import torch
-from transformers import AdamW


 class TextEmbed(torch.utils.data.Dataset):
@@ -26,48 +25,3 @@ def __getitem__(self, idx):

     def __len__(self):
         return len(self.labels)
-
-
-def train_model(model, dataset, device, scheduler=None, log=None, optim=None, n_epochs=4):
-    """
-    Generic training function, given an arbitrary model.
-
-    Given a model, train for n_epochs.
-
-    model - torch.nn model;
-    dataset - torch.DataLoader; dataset to train
-    device - torch.device; cuda/cpu
-    log - lightwood.logger.log; print output
-    optim - transformers.optimization.AdamW; optimizer
-    n_epochs - number of epochs to train
-
-    """
-    if log is None:
-        from lightwood.helpers.log import log
-        log = log.debug
-    losses = []
-    model.train()
-    if optim is None:
-        optim = AdamW(model.parameters(), lr=5e-5)
-
-    for epoch in range(n_epochs):
-        total_loss = 0
-        for batch in dataset:
-            optim.zero_grad()
-
-            inpids = batch['input_ids'].to(device)
-            attn = batch['attention_mask'].to(device)
-            labels = batch['labels'].to(device)
-            outputs = model(inpids, attention_mask=attn, labels=labels)
-            loss = outputs[0]
-
-            total_loss += loss.item()
-
-            loss.backward()
-            optim.step()
-
-        if scheduler is not None:
-            scheduler.step()
-
-        log("Epoch", epoch + 1, "Loss", total_loss)
-    return model, losses