diff --git a/feature_store/README.md b/feature_store/README.md new file mode 100644 index 00000000..35ce4623 --- /dev/null +++ b/feature_store/README.md @@ -0,0 +1,65 @@ +# Oracle Feature Store (ADS) + +[![Python](https://img.shields.io/badge/python-3.8-blue?style=for-the-badge&logo=pypi&logoColor=white)](https://pypi.org/project/oracle-ads/) [![PysparkConda](https://img.shields.io/badge/fspyspark32_p38_cpu_v2-1.0-blue?style=for-the-badge&logo=pypi&logoColor=white)](https://docs.oracle.com/en-us/iaas/data-science/using/conda-pyspark-fam.htm) [![Notebook Examples](https://img.shields.io/badge/docs-notebook--examples-blue?style=for-the-badge&logo=pypi&logoColor=white)](https://github.com/oracle-samples/oci-data-science-ai-samples/tree/master/notebook_examples) [![Delta](https://img.shields.io/badge/delta-2.0.1-blue?style=for-the-badge&logo=pypi&logoColor=white)](https://delta.io/) [![PySpark](https://img.shields.io/badge/pyspark-3.2.1-blue?style=for-the-badge&logo=pypi&logoColor=white)](https://spark.apache.org/docs/3.2.1/api/python/index.html) [![Great Expectations](https://img.shields.io/badge/greatexpectations-0.17.19-blue?style=for-the-badge&logo=pypi&logoColor=white)](https://greatexpectations.io/) [![Pandas](https://img.shields.io/badge/pandas-1.5.3-blue?style=for-the-badge&logo=pypi&logoColor=white)](https://pandas.pydata.org/) [![PyArrow](https://img.shields.io/badge/pyarrow-11.0.0-blue?style=for-the-badge&logo=pypi&logoColor=white)](https://arrow.apache.org/docs/python/index.html) + +Managing many datasets, data sources, and transformations for machine learning is complex and costly. Poorly cleaned data, data issues, bugs in transformations, data drift, and training-serving skew all lead to increased model development time and poor model performance. A feature store solves many of these problems by providing a centralized way to transform and access data at training and serving time, and it helps define a standardized pipeline for ingesting and querying data. + +ADS feature store is a stack-based solution that is deployed in your tenancy using OCI Resource Manager. + +The following are brief descriptions of the key concepts and main components of ADS feature store. + +- ``Feature Vector``: The set of feature values for a single primary/identifier key. For example, all of the features (or a subset of them) for customer ID 2536 form one feature vector. +- ``Feature``: A feature is an individual measurable property or characteristic of an event being observed. +- ``Entity``: An entity is a group of semantically related features. The first thing a consumer of features typically does when accessing the feature store service is list the entities and the features associated with them. Another way to look at it is that an entity is an object or concept that is described by its features. Examples of entities are customer, product, transaction, review, image, and document. +- ``Feature Group``: A feature group in a feature store is a collection of related features that are often used together in ML models. It serves as an organizational unit within the feature store for users to manage, version, and share features across different ML projects. By organizing features into groups, data scientists and ML engineers can efficiently discover, reuse, and collaborate on features, reducing redundant work and ensuring consistency in feature engineering. +- ``Feature Group Job``: A feature group job is the processing instance of a feature group.
Each feature group job includes validation results and statistics results. +- ``Dataset``: A dataset is a collection of features that are used together to either train a model or perform model inference. +- ``Dataset Job``: A dataset job is the processing instance of a dataset. Each dataset job includes validation results and statistics results. + +## Documentation + + - [Oracle Feature Store SDK (ADS) Documentation](https://feature-store-accelerated-data-science.readthedocs.io/en/latest/) + - [OCI Data Science and AI services Examples](https://github.com/oracle/oci-data-science-ai-samples) + - [Oracle AI & Data Science Blog](https://blogs.oracle.com/ai-and-datascience/) + - [OCI Documentation](https://docs.oracle.com/en-us/iaas/data-science/using/data-science.htm) + +## Examples + +### Quick start examples + +| Jupyter Notebook | Description | +|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| [Feature store querying](https://github.com/oracle-samples/oci-data-science-ai-samples/blob/main/notebook_examples/feature_store_querying.ipynb) | - Ingestion, querying and exploration of data. | +| [Feature store quickstart](https://github.com/oracle-samples/oci-data-science-ai-samples/blob/main/notebook_examples/feature_store_quickstart.ipynb) | - Ingestion, querying and exploration of data. | +| [Schema enforcement and schema evolution](https://github.com/oracle-samples/oci-data-science-ai-samples/blob/main/notebook_examples/feature_store_schema_evolution.ipynb) | - `Schema evolution` allows you to easily change a table's current schema to accommodate data that is changing over time. `Schema enforcement`, also known as schema validation, is a safeguard in Delta Lake that ensures data quality by rejecting writes to a table that don't match the table's schema. | +| [Storage of medical records in feature store](https://github.com/oracle-samples/oci-data-science-ai-samples/blob/main/notebook_examples/feature_store_ehr_data.ipynb) | Example to demonstrate storage of medical records in feature store | + +### Big data operations using OCI DataFlow + +| Jupyter Notebook | Description | +|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------| +| [Big data operations with feature store](https://github.com/oracle-samples/oci-data-science-ai-samples/blob/main/notebook_examples/feature_store_spark_magic.ipynb) | - Ingestion of data using Spark Magic, querying and exploration of data using Spark Magic. 
| + +### LLM Use cases + +| Jupyter Notebook | Description | +|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| [Embeddings in Feature Store](https://github.com/oracle-samples/oci-data-science-ai-samples/blob/main/notebook_examples/feature_store_embeddings.ipynb) | - `Embedding feature stores` are optimized for fast and efficient retrieval of embeddings. This is important because embeddings can be high-dimensional and computationally expensive to calculate. By storing them in a dedicated store, you can avoid the need to recalculate embeddings for the same data repeatedly. | +| [Synthetic data generation in feature store using OpenAI and FewShotPromptTemplate](https://github.com/oracle-samples/oci-data-science-ai-samples/blob/main/notebook_examples/feature_store_medical_synthetic_data_openai.ipynb) | - `Synthetic data` is artificially generated data, rather than data collected from real-world events. It's used to simulate real data without compromising privacy or encountering real-world limitations. | +| [PII Data redaction, Summarise Content and Translate content using doctran and open AI](https://github.com/oracle-samples/oci-data-science-ai-samples/blob/main/notebook_examples/feature_store_pii_redaction_and_transformation.ipynb) | - One way to think of Doctran is a LLM-powered black box where messy strings go in and nice, clean, labelled strings come out. Another way to think about it is a modular, declarative wrapper over OpenAI's functional calling feature that significantly improves the developer experience. | +| [OpenAI embeddings in feature store](https://github.com/oracle-samples/oci-data-science-ai-samples/blob/main/notebook_examples/feature_store_embeddings_openai.ipynb) | - `Embedding feature stores` are optimized for fast and efficient retrieval of embeddings. This is important because embeddings can be high-dimensional and computationally expensive to calculate. By storing them in a dedicated store, you can avoid the need to recalculate embeddings for the same data repeatedly. | + + +## Contributing + +This project welcomes contributions from the community. Before submitting a pull request, please [review our contribution guide](./../../CONTRIBUTING.md) + +Find Getting Started instructions for developers in [README-development.md](https://github.com/oracle/accelerated-data-science/blob/main/README-development.md) + +## Security + +Consult the security guide [SECURITY.md](https://github.com/oracle/accelerated-data-science/blob/main/SECURITY.md) for our responsible security vulnerability disclosure process. + +## License + +Copyright (c) 2020, 2022 Oracle and/or its affiliates. 
Licensed under the [Universal Permissive License v1.0](https://oss.oracle.com/licenses/upl/) \ No newline at end of file diff --git a/feature_store/feature_store_creation_ingestion_with_jobs/README.md b/feature_store/feature_store_creation_ingestion_with_jobs/README.md new file mode 100644 index 00000000..bd55ad7e --- /dev/null +++ b/feature_store/feature_store_creation_ingestion_with_jobs/README.md @@ -0,0 +1,31 @@ +Feature Store Creation and Ingestion using ML Job +===================== + +In this example, you use the Oracle Cloud Infrastructure (OCI) Data Science service ML Job component to create OCI Feature Store design-time constructs and then ingest feature values into the offline feature store. + +The tutorial uses an Electronic Health Record (EHR) use case consisting of patient test results. The example demonstrates creation of the feature store, entity, transformation, and feature group design-time constructs using a Python script that is provided as a job artifact. Another job artifact demonstrates ingestion of feature values into the pre-created feature group. Both scripts read their configuration (compartment, metastore, and feature store service endpoint) from environment variables; a sketch of passing these values to the job is shown after the instructions below. + +# Prerequisites + +The notebooks connect to other OCI resources using [resource principals](https://docs.oracle.com/en-us/iaas/Content/Functions/Tasks/functionsaccessingociresources.htm). If you have not configured your tenancy to use resource principals, you can do so using [these instructions](https://docs.oracle.com/en-us/iaas/data-science/using/create-dynamic-groups.htm). Alternatively, you can use API keys, although resource principals are the preferred authentication method. + + +# Instructions + +1. Open a Data Science Notebook session (that is, JupyterLab). +2. Open a terminal by clicking File -> New -> Terminal. +3. In the terminal, run `odsc conda install -s fspyspark32_p38_cpu_v1` to install the feature store conda environment. +4. Run `conda activate /home/datascience/conda/fspyspark32_p38_cpu_v1` to activate it. +5. Copy the `notebooks` folder into the notebook session. +6. Open the notebook `notebook/feature_store_using_mljob.ipynb`. +7. Change the notebook kernel to `Python [conda env:fspyspark32_p38_cpu_v1]`. +8. Read the notebook and execute each cell. +9. Once the ML job run completes successfully, validate the creation of the feature store constructs using the feature store notebook UI extension. +10. Open the notebook `notebook/feature_store_ingestion_via_mljob.ipynb`. +11. Change the notebook kernel to `Python [conda env:fspyspark32_p38_cpu_v1]`. +12. Read the notebook and execute each cell. +13. Validate that the ingestion ML job executed successfully. +14. Validate the ingested data and other metadata using the feature store notebook UI extension.
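+
+# Passing configuration to the job (example)
+
+The creation and ingestion scripts read `COMPARTMENT_ID`, `METASTORE_ID`, `SERVICE_ENDPOINT` (and, for ingestion, `FEATURE_GROUP_ID`) from environment variables. The snippet below is a minimal sketch of how these values could be supplied when defining the ingestion job with ADS; the OCID placeholders and the `with_environment_variable` call are illustrative assumptions rather than values from this repository, so adapt them to your tenancy.
+
+```python
+import ads
+from ads.jobs import Job, DataScienceJob, ScriptRuntime
+
+ads.set_auth("resource_principal")
+
+# Compute shape and storage for the job run; add log group/log OCIDs to capture output.
+infrastructure = (
+    DataScienceJob()
+    .with_shape_name("VM.Standard2.24")
+    .with_block_storage_size(50)
+)
+
+# The ingestion script reads its configuration from environment variables,
+# so pass them through the job runtime (the values below are placeholders).
+runtime = (
+    ScriptRuntime()
+    .with_source("./feature_store_ingestion.py")
+    .with_service_conda("fspyspark32_p38_cpu_v3")
+    .with_environment_variable(
+        COMPARTMENT_ID="ocid1.compartment.oc1..<unique_id>",
+        METASTORE_ID="ocid1.datacatalogmetastore.oc1..<unique_id>",
+        SERVICE_ENDPOINT="<feature-store-service-endpoint>",
+        FEATURE_GROUP_ID="<feature-group-ocid>",
+    )
+)
+
+job = Job(name="fs_ingestion_example").with_infrastructure(infrastructure).with_runtime(runtime)
+job.create()
+job_run = job.run()   # returns a DataScienceJobRun
+job_run.watch()       # stream the job run logs
+```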
+ + diff --git a/feature_store/feature_store_creation_ingestion_with_jobs/feature_store_creation_example.py b/feature_store/feature_store_creation_ingestion_with_jobs/feature_store_creation_example.py new file mode 100644 index 00000000..8b3fd3c7 --- /dev/null +++ b/feature_store/feature_store_creation_ingestion_with_jobs/feature_store_creation_example.py @@ -0,0 +1,157 @@ +import ads +import os +import pandas as pd +from ads.feature_store.feature_store import FeatureStore +from ads.feature_store.feature_group import FeatureGroup +from ads.feature_store.transformation import Transformation, TransformationMode + +from ads.feature_store.feature_group_expectation import ExpectationType +from great_expectations.core import ExpectationSuite, ExpectationConfiguration +from ads.feature_store.common.enums import FeatureType +from ads.feature_store.input_feature_detail import FeatureDetail + +COMPARTMENT_ID = "COMPARTMENT_ID" +METASTORE_ID = "METASTORE_ID" +SERVICE_ENDPOINT = "SERVICE_ENDPOINT" + +print("Initiating feature store lazy entities creation") + +# Job configuration is supplied through environment variables. +compartment_id = os.environ.get(COMPARTMENT_ID, "ocid1.compartment...none") +metastore_id = os.environ.get(METASTORE_ID, "ocid1.metastore...none") +service_endpoint = os.environ.get(SERVICE_ENDPOINT, "") + +ads.set_auth(auth="resource_principal", client_kwargs={"fs_service_endpoint": service_endpoint}) + +patient_result_df = pd.read_csv("https://objectstorage.us-ashburn-1.oraclecloud.com/p/hh2NOgFJbVSg4amcLM3G3hkTuHyBD-8aE_iCsuZKEvIav1Wlld-3zfCawG4ycQGN/n/ociodscdev/b/oci-feature-store/o/beta/data/EHR/data-ori.csv") + +print(f"The dataset contains {patient_result_df.shape[0]} rows and {patient_result_df.shape[1]} columns") + +# Collect the feature columns (everything except SOURCE) and split them by type. +features = [feat for feat in patient_result_df.columns if feat != 'SOURCE'] +num_features = [feat for feat in features if patient_result_df[feat].dtype != object] +cat_features = [feat for feat in features if patient_result_df[feat].dtype == object] + +print(f"Total number of features : {len(features)}") +print(f"Number of numerical features : {len(num_features)}") +print(f"Number of categorical features : {len(cat_features)}\n") +print(patient_result_df.isna().mean().to_frame(name='Missing %')) +print(patient_result_df.nunique().to_frame(name='# of unique values')) +feature_store_resource = ( + FeatureStore(). + with_description("Electronic Health Data consisting of Patient Test Results"). + with_compartment_id(compartment_id). + with_name("EHR details MLJob").
+ with_offline_config(metastore_id=metastore_id) +) +feature_store = feature_store_resource.create() +print(feature_store) +entity = feature_store.create_entity( + name="EHR", + description="Electronic Health Record predictions" +) +print(entity) + +def chained_transformation(patient_result_df, **transformation_args): + def label_encoder_transformation(patient_result_df, **transformation_args): + from sklearn.preprocessing import LabelEncoder + # label-encode the configured categorical columns + labelencoder = LabelEncoder() + result_df = patient_result_df.copy() + column_labels = transformation_args.get("label_encode_column") + if isinstance(column_labels, list): + for col in column_labels: + result_df[col] = labelencoder.fit_transform(result_df[col]) + elif isinstance(column_labels, str): + result_df[column_labels] = labelencoder.fit_transform(result_df[column_labels]) + else: + return None + return result_df + + def min_max_scaler(patient_result_df, **transformation_args): + # scale the configured numerical columns to the [0, 1] range + from sklearn.preprocessing import MinMaxScaler + final_result_df = patient_result_df.copy() + scaler = MinMaxScaler(feature_range=(0, 1)) + column_labels = transformation_args.get("scaling_column_labels") + final_result_df[column_labels] = scaler.fit_transform(final_result_df[column_labels]) + return final_result_df + + def feature_removal(input_df, **transformation_args): + # drop the redundant feature columns + output_df = input_df.copy() + output_df.drop(transformation_args.get("redundant_feature_label"), axis=1, inplace=True) + return output_df + + out1 = label_encoder_transformation(patient_result_df, **transformation_args) + out2 = min_max_scaler(out1, **transformation_args) + return feature_removal(out2, **transformation_args) + +transformation_args = { + "label_encode_column": ["SEX", "SOURCE"], + "scaling_column_labels": num_features, + "redundant_feature_label": ["MCH", "MCHC", "MCV"] +} + +transformation = ( + Transformation() + .with_name("chained_transformation") + .with_feature_store_id(feature_store.id) + .with_source_code_function(chained_transformation) + .with_transformation_mode(TransformationMode.PANDAS) + .with_description("transformation to perform feature engineering") + .with_compartment_id(compartment_id) +) + +transformation.create() + +input_feature_details_ehr = [ + FeatureDetail("HAEMATOCRIT").with_feature_type(FeatureType.DOUBLE).with_order_number(1), + FeatureDetail("HAEMOGLOBINS").with_feature_type(FeatureType.DOUBLE).with_order_number(2), + FeatureDetail("LEUCOCYTE").with_feature_type(FeatureType.DOUBLE).with_order_number(3), + FeatureDetail("THROMBOCYTE").with_feature_type(FeatureType.LONG).with_order_number(4), + FeatureDetail("MCH").with_feature_type(FeatureType.DOUBLE).with_order_number(5), + FeatureDetail("MCHC").with_feature_type(FeatureType.DOUBLE).with_order_number(6), + FeatureDetail("MCV").with_feature_type(FeatureType.DOUBLE).with_order_number(7), + FeatureDetail("AGE").with_feature_type(FeatureType.LONG).with_order_number(8), + FeatureDetail("SEX").with_feature_type(FeatureType.STRING).with_order_number(9), + FeatureDetail("SOURCE").with_feature_type(FeatureType.STRING).with_order_number(10), + ] + +feature_group_ehr = ( + FeatureGroup() + .with_feature_store_id(feature_store.id) + .with_primary_keys([]) + .with_name("ehr_feature_group_mljob") + .with_entity_id(entity.id) + .with_compartment_id(compartment_id) + .with_input_feature_details(input_feature_details_ehr) + .with_transformation_id(transformation.id) + .with_transformation_kwargs(transformation_args) +) +feature_group_ehr.create() + + +print(feature_group_ehr.id) +
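+# Attach a Great Expectations suite to the feature group; validation results
+# are captured with each feature group job when feature values are ingested.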
+expectation_suite_ehr = ExpectationSuite( + expectation_suite_name="test_hcm_df" +) +expectation_suite_ehr.add_expectation( + ExpectationConfiguration( + expectation_type="expect_column_values_to_not_be_null", + kwargs={"column": "AGE"}, + ) +) +expectation_suite_ehr.add_expectation( + ExpectationConfiguration( + expectation_type="expect_column_values_to_be_between", + kwargs={"column": "HAEMOGLOBINS", "min_value": 0, "max_value": 30}, + ) +) + +feature_group_ehr.with_expectation_suite(expectation_suite_ehr, expectation_type = ExpectationType.STRICT) +feature_group_ehr.update() \ No newline at end of file diff --git a/feature_store/feature_store_creation_ingestion_with_jobs/feature_store_ingestion.py b/feature_store/feature_store_creation_ingestion_with_jobs/feature_store_ingestion.py new file mode 100644 index 00000000..0119095d --- /dev/null +++ b/feature_store/feature_store_creation_ingestion_with_jobs/feature_store_ingestion.py @@ -0,0 +1,51 @@ +import subprocess +import os +import shutil + +from ads.feature_store.feature_group import FeatureGroup +import pandas as pd +import ads + +home_dir = os.path.expanduser("~") +spark_conf_dir = os.path.join(home_dir, "spark_conf_dir") +common_jars_dir = os.path.join(spark_conf_dir, "common-jars") +datacatalog_metastore_client_jars_dir = os.path.join(spark_conf_dir, "datacatalog-metastore-client-jars") + +conda_prefix = os.environ.get("CONDA_PREFIX") + +os.environ["SPARK_CONF_DIR"] = spark_conf_dir +os.makedirs(spark_conf_dir, exist_ok=True) + +shutil.copytree(f"{conda_prefix}/common-jars", common_jars_dir) +shutil.copytree(f"{conda_prefix}/datacatalog-metastore-client-jars", datacatalog_metastore_client_jars_dir) + +shutil.copy(f"{conda_prefix}/spark-defaults.conf", os.path.join(spark_conf_dir, "spark-defaults.conf")) +shutil.copy(f"{conda_prefix}/core-site.xml", os.path.join(spark_conf_dir, "core-site.xml")) +shutil.copy(f"{conda_prefix}/log4j.properties", os.path.join(spark_conf_dir, "log4j.properties")) + +COMPARTMENT_ID = "COMPARTMENT_ID" +METASTORE_ID = "METASTORE_ID" +SERVICE_ENDPOINT = "SERVICE_ENDPOINT" +FEATURE_GROUP_ID = "FEATURE_GROUP_ID" + +print("Initiating feature store lazy entities creation") + +compartment_id = os.environ.get(COMPARTMENT_ID, "ocid1.compartment...none") +metastore_id = os.environ.get(METASTORE_ID, "ocid1.metastore...none") +service_endpoint = os.environ.get(SERVICE_ENDPOINT, "") +feature_group_id = os.environ.get(FEATURE_GROUP_ID, "") + +print(subprocess.run(["odsc", + "data-catalog", + "config", + "--authentication", + "resource_principal", + "--metastore", + metastore_id],capture_output=True)) + +ads.set_auth(auth="resource_principal", client_kwargs={"fs_service_endpoint": service_endpoint}) + +ehr_feature_group = FeatureGroup.from_id(feature_group_id) +patient_result_df = pd.read_csv("https://objectstorage.us-ashburn-1.oraclecloud.com/p/hh2NOgFJbVSg4amcLM3G3hkTuHyBD-8aE_iCsuZKEvIav1Wlld-3zfCawG4ycQGN/n/ociodscdev/b/oci-feature-store/o/beta/data/EHR/data-ori.csv") +if ehr_feature_group: + ehr_feature_group.materialise(patient_result_df) \ No newline at end of file diff --git a/feature_store/feature_store_creation_ingestion_with_jobs/notebook/feature_store_ingestion_via_mljob.ipynb b/feature_store/feature_store_creation_ingestion_with_jobs/notebook/feature_store_ingestion_via_mljob.ipynb new file mode 100644 index 00000000..0c778a69 --- /dev/null +++ b/feature_store/feature_store_creation_ingestion_with_jobs/notebook/feature_store_ingestion_via_mljob.ipynb @@ -0,0 +1,246 @@ +{ + "cells": [ + { + "cell_type": 
"markdown", + "id": "150bf66b", + "metadata": {}, + "source": [ + "### OCI Data Science - Useful Tips\n", + "
\n", + "Check for Public Internet Access\n", + "\n", + "```python\n", + "import requests\n", + "response = requests.get(\"https://oracle.com\")\n", + "assert response.status_code==200, \"Internet connection failed\"\n", + "```\n", + "
\n", + "
\n", + "Helpful Documentation \n", + "\n", + "
\n", + "
\n", + "Typical Cell Imports and Settings for ADS\n", + "\n", + "```python\n", + "%load_ext autoreload\n", + "%autoreload 2\n", + "%matplotlib inline\n", + "\n", + "import warnings\n", + "warnings.filterwarnings('ignore')\n", + "\n", + "import logging\n", + "logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.ERROR)\n", + "\n", + "import ads\n", + "from ads.dataset.factory import DatasetFactory\n", + "from ads.automl.provider import OracleAutoMLProvider\n", + "from ads.automl.driver import AutoML\n", + "from ads.evaluations.evaluator import ADSEvaluator\n", + "from ads.common.data import ADSData\n", + "from ads.explanations.explainer import ADSExplainer\n", + "from ads.explanations.mlx_global_explainer import MLXGlobalExplainer\n", + "from ads.explanations.mlx_local_explainer import MLXLocalExplainer\n", + "from ads.catalog.model import ModelCatalog\n", + "from ads.common.model_artifact import ModelArtifact\n", + "```\n", + "
\n", + "
\n", + "Useful Environment Variables\n", + "\n", + "```python\n", + "import os\n", + "print(os.environ[\"NB_SESSION_COMPARTMENT_OCID\"])\n", + "print(os.environ[\"PROJECT_OCID\"])\n", + "print(os.environ[\"USER_OCID\"])\n", + "print(os.environ[\"TENANCY_OCID\"])\n", + "print(os.environ[\"NB_REGION\"])\n", + "```\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "b512479a", + "metadata": {}, + "source": [ + "# Using Data Science Jobs to ingest feature values in to OCI feature store\n", + "

by the Oracle Cloud Infrastructure Data Science Team

\n", + "\n", + "***" + ] + }, + { + "cell_type": "markdown", + "id": "daf36be8", + "metadata": {}, + "source": [ + "# Introduction \n", + "\n", + "Data Science Jobs allow you to run customized tasks outside of a notebook session. You can have Compute on demand and only pay for the Compute that you need. With jobs, you can run applications that perform tasks such as data preparation, model training, hyperparameter tuning, and batch inference. When the task is complete, the compute automatically terminates. You can use the Logging service to capture output messages. In this notebook, we will use the Accelerated Data Science SDK (ADS) to help us define a Data Science Job to create design time entities of OCI feature store which can later be used to ingest feature data.\n", + "\n", + "For more information on using ADS for jobs, you can go to our [documentation](https://docs.oracle.com/en-us/iaas/tools/ads-sdk/latest/user_guide/jobs/index.html)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "359187c1", + "metadata": {}, + "outputs": [], + "source": [ + "from ads.jobs import Job\n", + "from ads.jobs import DataScienceJob, ScriptRuntime\n", + "import ads\n", + "\n", + "ads.set_auth('resource_principal')" + ] + }, + { + "cell_type": "markdown", + "id": "29fee819", + "metadata": {}, + "source": [ + "## Infrastructure\n", + "\n", + "Data Science Job infrastructure is defined by a `DataScienceJob` instance. \n", + "Important: If you want to use logging for the job, fill in the `log_group_id` and `log_id` in the cell below. You need to have set up the policies for the logging service. For more information about setting up logs for a job, you can go to our [documentation](https://docs.oracle.com/en-us/iaas/data-science/using/log-about.htm#jobs_about__job-logs)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "74af1cf3", + "metadata": {}, + "outputs": [], + "source": [ + "infrastructure = (\n", + " DataScienceJob()\n", + " .with_shape_name(\"VM.Standard2.24\")\n", + " .with_block_storage_size(50)\n", + " .with_log_group_id(\"\")\n", + " .with_log_id(\"\")\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "a4c45f54", + "metadata": {}, + "source": [ + "## Job Runtime\n", + "\n", + "`ScriptRuntime` allows you to run Python, Bash, and Java scripts from a single source file (.zip or .tar.gz) or code directory. You can configure a Data Science Conda Environment for running your code." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "18703e6b", + "metadata": {}, + "outputs": [], + "source": [ + "runtime = (\n", + " ScriptRuntime()\n", + " .with_source(\"./feature_store_ingestion.py\")\n", + " .with_service_conda(\"fspyspark32_p38_cpu_v3\")\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "aec991ec", + "metadata": {}, + "source": [ + "## Define Job\n", + "\n", + "With runtime and infrastructure, you can define a job and give it a name" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5f378277", + "metadata": {}, + "outputs": [], + "source": [ + "import time \n", + "epoch_time = int(time.time())\n", + "print(f'fs_ingestion_{epoch_time}')\n", + "job = Job(name= f'fs_ingestion_{epoch_time}').with_infrastructure(infrastructure).with_runtime(runtime)" + ] + }, + { + "cell_type": "markdown", + "id": "46174975", + "metadata": {}, + "source": [ + "## Create and Run Job\n", + "\n", + "You can call the `create()` method of a job instance to create a job. 
After the job is created, you can call the `run()` method to create and start a job run. The `run()` method returns a `DataScienceJobRun`. You can monitor the job run output by calling the `watch()` method of the `DataScienceJobRun` instance." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0956fbf9", + "metadata": {}, + "outputs": [], + "source": [ + "job.create()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bba577a5", + "metadata": {}, + "outputs": [], + "source": [ + "job_run = job.run()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b016f802", + "metadata": {}, + "outputs": [], + "source": [ + "job_run.watch()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d8c4929e", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.7" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/feature_store/feature_store_creation_ingestion_with_jobs/notebook/feature_store_using_mljob.ipynb b/feature_store/feature_store_creation_ingestion_with_jobs/notebook/feature_store_using_mljob.ipynb new file mode 100644 index 00000000..0eadb073 --- /dev/null +++ b/feature_store/feature_store_creation_ingestion_with_jobs/notebook/feature_store_using_mljob.ipynb @@ -0,0 +1,249 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "a204cdc2", + "metadata": {}, + "source": [ + "### OCI Data Science - Useful Tips\n", + "
\n", + "Check for Public Internet Access\n", + "\n", + "```python\n", + "import requests\n", + "response = requests.get(\"https://oracle.com\")\n", + "assert response.status_code==200, \"Internet connection failed\"\n", + "```\n", + "
\n", + "
\n", + "Helpful Documentation \n", + "\n", + "
\n", + "
\n", + "Typical Cell Imports and Settings for ADS\n", + "\n", + "```python\n", + "%load_ext autoreload\n", + "%autoreload 2\n", + "%matplotlib inline\n", + "\n", + "import warnings\n", + "warnings.filterwarnings('ignore')\n", + "\n", + "import logging\n", + "logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.ERROR)\n", + "\n", + "import ads\n", + "from ads.dataset.factory import DatasetFactory\n", + "from ads.automl.provider import OracleAutoMLProvider\n", + "from ads.automl.driver import AutoML\n", + "from ads.evaluations.evaluator import ADSEvaluator\n", + "from ads.common.data import ADSData\n", + "from ads.explanations.explainer import ADSExplainer\n", + "from ads.explanations.mlx_global_explainer import MLXGlobalExplainer\n", + "from ads.explanations.mlx_local_explainer import MLXLocalExplainer\n", + "from ads.catalog.model import ModelCatalog\n", + "from ads.common.model_artifact import ModelArtifact\n", + "```\n", + "
\n", + "
\n", + "Useful Environment Variables\n", + "\n", + "```python\n", + "import os\n", + "print(os.environ[\"NB_SESSION_COMPARTMENT_OCID\"])\n", + "print(os.environ[\"PROJECT_OCID\"])\n", + "print(os.environ[\"USER_OCID\"])\n", + "print(os.environ[\"TENANCY_OCID\"])\n", + "print(os.environ[\"NB_REGION\"])\n", + "```\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "43929671", + "metadata": {}, + "source": [ + "# Using Data Science Jobs to create design time entities of OCI feature store\n", + "

by the Oracle Cloud Infrastructure Data Science Team

\n", + "\n", + "***" + ] + }, + { + "cell_type": "markdown", + "id": "26506eac", + "metadata": {}, + "source": [ + "# Introduction \n", + "\n", + "Data Science Jobs allow you to run customized tasks outside of a notebook session. You can have Compute on demand and only pay for the Compute that you need. With jobs, you can run applications that perform tasks such as data preparation, model training, hyperparameter tuning, and batch inference. When the task is complete, the compute automatically terminates. You can use the Logging service to capture output messages. In this notebook, we will use the Accelerated Data Science SDK (ADS) to help us define a Data Science Job to create design time entities of OCI feature store which can later be used to ingest feature data.\n", + "\n", + "For more information on using ADS for jobs, you can go to our [documentation](https://docs.oracle.com/en-us/iaas/tools/ads-sdk/latest/user_guide/jobs/index.html)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6829fabd", + "metadata": {}, + "outputs": [], + "source": [ + "from ads.jobs import Job\n", + "from ads.jobs import DataScienceJob, ScriptRuntime\n", + "import ads \n", + "client_kwargs = dict(\n", + " fs_service_endpoint=\"\",\n", + ")\n", + "\n", + "ads.set_auth('resource_principal',client_kwargs=client_kwargs)" + ] + }, + { + "cell_type": "markdown", + "id": "291a0dcb", + "metadata": {}, + "source": [ + "## Infrastructure\n", + "\n", + "Data Science Job infrastructure is defined by a `DataScienceJob` instance. \n", + "Important: If you want to use logging for the job, fill in the `log_group_id` and `log_id` in the cell below. You need to have set up the policies for the logging service. For more information about setting up logs for a job, you can go to our [documentation](https://docs.oracle.com/en-us/iaas/data-science/using/log-about.htm#jobs_about__job-logs)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "97a3089b", + "metadata": {}, + "outputs": [], + "source": [ + "infrastructure = (\n", + " DataScienceJob()\n", + " .with_shape_name(\"VM.Standard2.24\")\n", + " .with_block_storage_size(50)\n", + " .with_log_group_id(\"\")\n", + " .with_log_id(\"\")\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "880938cb", + "metadata": {}, + "source": [ + "## Job Runtime\n", + "\n", + "`ScriptRuntime` allows you to run Python, Bash, and Java scripts from a single source file (.zip or .tar.gz) or code directory. You can configure a Data Science Conda Environment for running your code." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a4e6ae88", + "metadata": {}, + "outputs": [], + "source": [ + "runtime = (\n", + " ScriptRuntime()\n", + " .with_source(\"./feature_store_creation_example.py\")\n", + " .with_service_conda(\"fspyspark32_p38_cpu_v3\")\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "5ebaed82", + "metadata": {}, + "source": [ + "## Define Job\n", + "\n", + "With runtime and infrastructure, you can define a job and give it a name" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d6049898", + "metadata": {}, + "outputs": [], + "source": [ + "import time \n", + "epoch_time = int(time.time())\n", + "print(f'fs_dt_creation_{epoch_time}')\n", + "job = Job(name= f'fs_dt_creation_{epoch_time}').with_infrastructure(infrastructure).with_runtime(runtime)" + ] + }, + { + "cell_type": "markdown", + "id": "ba4a5a95", + "metadata": {}, + "source": [ + "## Create and Run Job\n", + "\n", + "You can call the `create()` method of a job instance to create a job. After the job is created, you can call the `run()` method to create and start a job run. The `run()` method returns a `DataScienceJobRun`. You can monitor the job run output by calling the `watch()` method of the `DataScienceJobRun` instance." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1b8382b5", + "metadata": {}, + "outputs": [], + "source": [ + "job.create()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "11330367", + "metadata": {}, + "outputs": [], + "source": [ + "job_run = job.run()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3fcf6ad3", + "metadata": {}, + "outputs": [], + "source": [ + "job_run.watch()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e69266ce", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.7" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/notebook_examples/_artifacts/conda/fspyspark32_p38_cpu_v2.txt b/notebook_examples/_artifacts/conda/fspyspark32_p38_cpu_v2.txt index 07246c3f..4698c3bd 100644 --- a/notebook_examples/_artifacts/conda/fspyspark32_p38_cpu_v2.txt +++ b/notebook_examples/_artifacts/conda/fspyspark32_p38_cpu_v2.txt @@ -1,4 +1,4 @@ -# packages in environment at /home/datascience/conda/fspyspark32_p38_cpu_v2: +# packages in environment at /home/datascience/conda/fspyspark32_p38_cpu_v3: # # Name Version Build Channel _libgcc_mutex 0.1 conda_forge conda-forge diff --git a/notebook_examples/feature_store_ehr_data.ipynb b/notebook_examples/feature_store_ehr_data.ipynb index 7643885c..fdd0943e 100644 --- a/notebook_examples/feature_store_ehr_data.ipynb +++ b/notebook_examples/feature_store_ehr_data.ipynb @@ -8,24 +8,22 @@ "qweews@notebook{feature_store-querying.ipynb,\n", " title: Using feature store for feature querying using pandas like interface for query and join,\n", " summary: Feature store quickstart guide to perform feature querying using pandas like interface for query and join.,\n", - " developed_on: pyspark32_p38_cpu_feature_store_v1,\n", + " developed_on: fspyspark32_p38_cpu_v3,\n", " keywords: feature store, querying,\n", " 
license: Universal Permissive License v 1.0\n", "}" ] }, { - "cell_type": "raw", - "id": "9efb9345", - "metadata": { - "ExecuteTime": { - "end_time": "2023-05-24T08:26:08.572567Z", - "start_time": "2023-05-24T08:26:08.328013Z" - } - }, + "cell_type": "code", + "execution_count": null, + "outputs": [], "source": [ - "!odsc conda install -s fspyspark32_p38_cpu_v2" - ] + "!odsc conda install -s fspyspark32_p38_cpu_v3" + ], + "metadata": { + "collapsed": false + } }, { "cell_type": "code", @@ -395,7 +393,7 @@ " FeatureStore().\n", " with_description(\"Electronic Heath Data consisting of Patient Test Results\").\n", " with_compartment_id(compartment_id).\n", - " with_display_name(\"EHR details\").\n", + " with_name(\"EHR details\").\n", " with_offline_config(metastore_id=metastore_id)\n", ")" ] @@ -443,7 +441,7 @@ "outputs": [], "source": [ "entity = feature_store.create_entity(\n", - " display_name=\"EHR\",\n", + " name=\"EHR\",\n", " description=\"Electronic Health Record predictions\"\n", ")\n", "entity" @@ -534,7 +532,7 @@ "\n", "transformation = (\n", " Transformation()\n", - " .with_display_name(\"chained_transformation\")\n", + " .with_name(\"chained_transformation\")\n", " .with_feature_store_id(feature_store.id)\n", " .with_source_code_function(chained_transformation)\n", " .with_transformation_mode(TransformationMode.PANDAS)\n", diff --git a/notebook_examples/feature_store_embeddings.ipynb b/notebook_examples/feature_store_embeddings.ipynb index 6ddd17ca..ea138e78 100644 --- a/notebook_examples/feature_store_embeddings.ipynb +++ b/notebook_examples/feature_store_embeddings.ipynb @@ -8,7 +8,7 @@ "qweews@notebook{feature_store_embeddings.ipynb,\n", " title: Using feature store for storage, retrieval, versioning and time travel of embeddings,\n", " summary: Feature store to store embeddings, version embeddings and time travel of embeddings.,\n", - " developed_on: pyspark32_p38_cpu_feature_store_v1,\n", + " developed_on: fspyspark32_p38_cpu_v3,\n", " keywords: feature store, llm, embeddings,\n", " license: Universal Permissive License v 1.0\n", "}" @@ -26,7 +26,7 @@ }, "outputs": [], "source": [ - "!odsc conda install -s fspyspark32_p38_cpu_v2" + "!odsc conda install -s fspyspark32_p38_cpu_v3" ] }, { @@ -370,7 +370,7 @@ " FeatureStore().\n", " with_description(\"SQUAD Dataset Feature Store\").\n", " with_compartment_id(compartment_id).\n", - " with_display_name(\"SQUAD details\").\n", + " with_name(\"SQUAD details\").\n", " with_offline_config(metastore_id=metastore_id)\n", ")" ] @@ -418,7 +418,7 @@ "outputs": [], "source": [ "entity = feature_store.create_entity(\n", - " display_name=\"Squad Entity\",\n", + " name=\"Squad Entity\",\n", " description=\"description for Squad entity details\"\n", ")\n", "entity" @@ -523,7 +523,7 @@ "squad_transformation = feature_store.create_transformation(\n", " transformation_mode=TransformationMode.PANDAS,\n", " source_code_func=squad_embedding_transformation,\n", - " display_name=\"squad_embedding_transformation\",\n", + " name=\"squad_embedding_transformation\",\n", ")\n", "\n", "squad_transformation" diff --git a/notebook_examples/feature_store_embeddings_openai.ipynb b/notebook_examples/feature_store_embeddings_openai.ipynb index edbc0004..6f5e16a2 100644 --- a/notebook_examples/feature_store_embeddings_openai.ipynb +++ b/notebook_examples/feature_store_embeddings_openai.ipynb @@ -8,7 +8,7 @@ "qweews@notebook{feature_store-querying.ipynb,\n", " title: Using feature store for feature querying using pandas like interface for query and join,\n", " 
summary: Feature store quickstart guide to perform feature querying using pandas like interface for query and join.,\n", - " developed_on: pyspark32_p38_cpu_feature_store_v1,\n", + " developed_on: fspyspark32_p38_cpu_v3,\n", " keywords: feature store, querying,\n", " license: Universal Permissive License v 1.0\n", "}" @@ -26,7 +26,7 @@ }, "outputs": [], "source": [ - "!odsc conda install -s fspyspark32_p38_cpu_v2" + "!odsc conda install -s fspyspark32_p38_cpu_v3" ] }, { @@ -348,7 +348,7 @@ " FeatureStore().\n", " with_description(\"Feature Store embeddings\").\n", " with_compartment_id(compartment_id).\n", - " with_display_name(\"Feature Store embeddings\").\n", + " with_name(\"Feature Store embeddings\").\n", " with_offline_config(metastore_id=metastore_id)\n", ")" ] @@ -396,7 +396,7 @@ "outputs": [], "source": [ "entity = feature_store.create_entity(\n", - " display_name=\"Feature Store embeddings\",\n", + " name=\"Feature Store embeddings\",\n", " description=\"Feature Store embeddings\"\n", ")\n", "entity" @@ -459,7 +459,7 @@ "embedding_transformation = feature_store.create_transformation(\n", " transformation_mode=TransformationMode.PANDAS,\n", " source_code_func=openai_generate_embedding_data,\n", - " display_name=\"openai_generate_embedding_data\",\n", + " name=\"openai_generate_embedding_data\",\n", ")\n", "\n", "embedding_transformation" diff --git a/notebook_examples/feature_store_medical_synthetic_data_openai.ipynb b/notebook_examples/feature_store_medical_synthetic_data_openai.ipynb index 42fff8df..b1377b33 100644 --- a/notebook_examples/feature_store_medical_synthetic_data_openai.ipynb +++ b/notebook_examples/feature_store_medical_synthetic_data_openai.ipynb @@ -8,7 +8,7 @@ "qweews@notebook{feature_store-querying.ipynb,\n", " title: Using feature store for synthetic data generation using openai,\n", " summary: Feature store quickstart guide to perform synthetic data generation using openai,\n", - " developed_on: pyspark32_p38_cpu_feature_store_v1,\n", + " developed_on: fspyspark32_p38_cpu_v3,\n", " keywords: feature store, querying, synthetic data generation\n", " license: Universal Permissive License v 1.0\n", "}" @@ -26,7 +26,7 @@ }, "outputs": [], "source": [ - "!odsc conda install -s fspyspark32_p38_cpu_v2" + "!odsc conda install -s fspyspark32_p38_cpu_v3" ] }, { @@ -350,7 +350,7 @@ " FeatureStore().\n", " with_description(\"Medical Synthetic Data Feature Store\").\n", " with_compartment_id(compartment_id).\n", - " with_display_name(\"Synthetic data details\").\n", + " with_name(\"Synthetic data details\").\n", " with_offline_config(metastore_id=metastore_id)\n", ")" ] @@ -398,7 +398,7 @@ "outputs": [], "source": [ "entity = feature_store.create_entity(\n", - " display_name=\"Synthetic Medical Entity\",\n", + " name=\"Synthetic Medical Entity\",\n", " description=\"description for medical entity details\"\n", ")\n", "entity" @@ -516,7 +516,7 @@ "synthetic_transformation = feature_store.create_transformation(\n", " transformation_mode=TransformationMode.PANDAS,\n", " source_code_func=generate_synthetic_data,\n", - " display_name=\"generate_synthetic_data\",\n", + " name=\"generate_synthetic_data\",\n", ")\n", "\n", "synthetic_transformation" diff --git a/notebook_examples/feature_store_pii_redaction_and_transformation.ipynb b/notebook_examples/feature_store_pii_redaction_and_transformation.ipynb index af2e833e..eddca1ed 100644 --- a/notebook_examples/feature_store_pii_redaction_and_transformation.ipynb +++ 
b/notebook_examples/feature_store_pii_redaction_and_transformation.ipynb @@ -8,7 +8,7 @@ "qweews@notebook{feature_store_pii_redaction_and_transformation.ipynb,\n", " title: Use feature store to perform PII data redaction, summarization, translation using openai,\n", " summary: Use feature store to perform PII data redaction, summarization, translation using openai.,\n", - " developed_on: pyspark32_p38_cpu_feature_store_v1,\n", + " developed_on: fspyspark32_p38_cpu_v3,\n", " keywords: feature store, querying,\n", " license: Universal Permissive License v 1.0\n", "}" @@ -26,7 +26,7 @@ }, "outputs": [], "source": [ - "!odsc conda install -s fspyspark32_p38_cpu_v2" + "!odsc conda install -s fspyspark32_p38_cpu_v3" ] }, { @@ -410,7 +410,7 @@ " FeatureStore().\n", " with_description(\"Data Redaction Feature Store\").\n", " with_compartment_id(compartment_id).\n", - " with_display_name(\"Data redaction\").\n", + " with_name(\"Data redaction\").\n", " with_offline_config(metastore_id=metastore_id)\n", ")" ] @@ -458,7 +458,7 @@ "outputs": [], "source": [ "entity = feature_store.create_entity(\n", - " display_name=\"Data Redaction Feature Store\",\n", + " name=\"Data Redaction Feature Store\",\n", " description=\"Data Redaction Feature Store\"\n", ")\n", "entity" @@ -519,7 +519,7 @@ "pii_data_redaction_transformation = feature_store.create_transformation(\n", " transformation_mode=TransformationMode.PANDAS,\n", " source_code_func=transform_dataframe,\n", - " display_name=\"transform_dataframe\",\n", + " name=\"transform_dataframe\",\n", ")\n", "\n", "pii_data_redaction_transformation" diff --git a/notebook_examples/feature_store_querying.ipynb b/notebook_examples/feature_store_querying.ipynb index cb6ff2b0..a243dad5 100644 --- a/notebook_examples/feature_store_querying.ipynb +++ b/notebook_examples/feature_store_querying.ipynb @@ -8,7 +8,7 @@ "qweews@notebook{feature_store-querying.ipynb,\n", " title: Feature store handling querying operations\n", " summary: Using feature store to transform, store and query your data using pandas like interface to query and join\n", - " developed_on: fspyspark32_p38_cpu_v2,\n", + " developed_on: fspyspark32_p38_cpu_v3,\n", " keywords: feature store, querying,\n", " license: Universal Permissive License v 1.0\n", "}" @@ -21,7 +21,7 @@ "metadata": {}, "outputs": [], "source": [ - "!odsc conda install -s fspyspark32_p38_cpu_v2" + "!odsc conda install -s fspyspark32_p38_cpu_v3" ] }, { @@ -124,9 +124,9 @@ "\n", "Notebook Sessions are accessible through the following conda environment: \n", "\n", - "* **PySpark 3.2 and Feature Store Python 3.8 (fspyspark32_p38_cpu_v2)**\n", + "* **PySpark 3.2 and Feature Store Python 3.8 (fspyspark32_p38_cpu_v3)**\n", "\n", - "You can customize `fspyspark32_p38_cpu_v2`, publish it, and use it as a runtime environment for a Notebook session.\n" + "You can customize `fspyspark32_p38_cpu_v3`, publish it, and use it as a runtime environment for a Notebook session.\n" ] }, { @@ -414,7 +414,7 @@ " FeatureStore().\n", " with_description(\"Data consisting of flights\").\n", " with_compartment_id(compartment_id).\n", - " with_display_name(\"flights details\").\n", + " with_name(\"flights details\").\n", " with_offline_config(metastore_id=metastore_id)\n", ")" ] @@ -460,7 +460,7 @@ "outputs": [], "source": [ "entity = feature_store.create_entity(\n", - " display_name=\"Flight details2\",\n", + " name=\"Flight details2\",\n", " description=\"description for flight details\"\n", ")\n", "entity" diff --git 
a/notebook_examples/feature_store_quickstart.ipynb b/notebook_examples/feature_store_quickstart.ipynb index 46c12fd7..7704f9ad 100644 --- a/notebook_examples/feature_store_quickstart.ipynb +++ b/notebook_examples/feature_store_quickstart.ipynb @@ -8,7 +8,7 @@ "@notebook{feature_store-quickstart.ipynb,\n", " title: Using feature store for feature ingestion and feature querying,\n", " summary: Introduction to the Oracle Cloud Infrastructure Feature Store.Use feature store for feature ingestion and feature querying,\n", - " developed_on: fspyspark32_p38_cpu_v2,\n", + " developed_on: fspyspark32_p38_cpu_v3,\n", " keywords: feature store,\n", " license: Universal Permissive License v 1.0\n", "}" @@ -21,7 +21,7 @@ "metadata": {}, "outputs": [], "source": [ - "!odsc conda install -s fspyspark32_p38_cpu_v2" + "!odsc conda install -s fspyspark32_p38_cpu_v3" ] }, { @@ -117,9 +117,9 @@ "\n", "# 2. Pre-requisites to Running this Notebook\n", "\n", - "Notebook Sessions are accessible using the PySpark 3.2 and Feature Store Python 3.8 (fspyspark32_p38_cpu_v2) conda environment.\n", + "Notebook Sessions are accessible using the PySpark 3.2 and Feature Store Python 3.8 (fspyspark32_p38_cpu_v3) conda environment.\n", "\n", - "You can customize `fspyspark32_p38_cpu_v2`, publish it, and use it as a runtime environment for a Notebook session cluster. " + "You can customize `fspyspark32_p38_cpu_v3`, publish it, and use it as a runtime environment for a Notebook session cluster. " ] }, { @@ -313,7 +313,7 @@ " FeatureStore().\n", " with_description(\"Data consisting of bike riders data\").\n", " with_compartment_id(compartment_id).\n", - " with_display_name(\"Bike rides\").\n", + " with_name(\"Bike rides\").\n", " with_offline_config(metastore_id=metastore_id)\n", ")" ] @@ -346,7 +346,7 @@ "outputs": [], "source": [ "entity = feature_store.create_entity(\n", - " display_name=\"Bike rides\",\n", + " name=\"Bike rides\",\n", " description=\"description for bike riders\"\n", ")" ] @@ -383,7 +383,7 @@ "transformation = feature_store.create_transformation(\n", " transformation_mode=TransformationMode.PANDAS,\n", " source_code_func=is_round_trip,\n", - " display_name=\"is_round_trip\",\n", + " name=\"is_round_trip\",\n", ")\n", "transformation" ] diff --git a/notebook_examples/feature_store_schema_evolution.ipynb b/notebook_examples/feature_store_schema_evolution.ipynb index 215b851b..3e4d629e 100644 --- a/notebook_examples/feature_store_schema_evolution.ipynb +++ b/notebook_examples/feature_store_schema_evolution.ipynb @@ -8,7 +8,7 @@ "qweews@notebook{feature_store_schema_evolution.ipynb,\n", " title: Schema Enforcement and Schema Evolution in Feature Store,\n", " summary: Perform Schema Enforcement and Schema Evolution in Feature Store when materialising the data.,\n", - " developed_on: fspyspark32_p38_cpu_v2,\n", + " developed_on: fspyspark32_p38_cpu_v3,\n", " keywords: feature store, querying ,schema enforcement,schema evolution\n", " license: Universal Permissive License v 1.0\n", "}" @@ -21,7 +21,7 @@ "metadata": {}, "outputs": [], "source": [ - "!odsc conda install -s fspyspark32_p38_cpu_v2" + "!odsc conda install -s fspyspark32_p38_cpu_v3" ] }, { @@ -114,9 +114,9 @@ "source": [ "\n", "# 2. 
Pre-requisites to Running this Notebook\n", - "Notebook Sessions are accessible using the PySpark 3.2 and Feature Store Python 3.8 (fspyspark32_p38_cpu_v2) conda environment.\n", + "Notebook Sessions are accessible using the PySpark 3.2 and Feature Store Python 3.8 (fspyspark32_p38_cpu_v3) conda environment.\n", "\n", - "You can customize `fspyspark32_p38_cpu_v2`, publish it, and use it as a runtime environment for a Notebook session.\n" + "You can customize `fspyspark32_p38_cpu_v3`, publish it, and use it as a runtime environment for a Notebook session.\n" ] }, { @@ -346,7 +346,7 @@ " FeatureStore().\n", " with_description(\"Data consisting of flights\").\n", " with_compartment_id(compartment_id).\n", - " with_display_name(\"flights details\").\n", + " with_name(\"flights details\").\n", " with_offline_config(metastore_id=metastore_id)\n", ")" ] @@ -394,7 +394,7 @@ "outputs": [], "source": [ "entity = feature_store.create_entity(\n", - " display_name=\"Flight details schema evolution/enforcement\",\n", + " name=\"Flight details schema evolution/enforcement\",\n", " description=\"description for flight details\"\n", ")\n", "entity" diff --git a/notebook_examples/feature_store_spark_magic.ipynb b/notebook_examples/feature_store_spark_magic.ipynb index 33e3ac1c..76886621 100644 --- a/notebook_examples/feature_store_spark_magic.ipynb +++ b/notebook_examples/feature_store_spark_magic.ipynb @@ -8,7 +8,7 @@ "qweews@notebook{feature_store_spark_magic.ipynb,\n", " title: Data Flow Studio : Big Data Operations in Feature Store.,\n", " summary: Run Feature Store on interactive Spark workloads on a long lasting Data Flow Cluster.,\n", - " developed_on: fspyspark32_p38_cpu_v2,\n", + " developed_on: fspyspark32_p38_cpu_v3,\n", " keywords: feature store, querying,spark magic,data flow\n", " license: Universal Permissive License v 1.0\n", "}" @@ -21,7 +21,7 @@ "metadata": {}, "outputs": [], "source": [ - "!odsc conda install -s fspyspark32_p38_cpu_v2" + "!odsc conda install -s fspyspark32_p38_cpu_v3" ] }, { @@ -129,7 +129,7 @@ "\n", "# 2. Pre-requisites to Running this Notebook\n", "\n", - "Data Flow Sessions are accessible using the PySpark 3.2 and Feature Store Python 3.8 (fspyspark32_p38_cpu_v2) conda environment.\n", + "Data Flow Sessions are accessible using the PySpark 3.2 and Feature Store Python 3.8 (fspyspark32_p38_cpu_v3) conda environment.\n", "\n", "The [Data Catalog Hive Metastore](https://docs.oracle.com/en-us/iaas/data-catalog/using/metastore.htm) provides schema definitions for objects in structured and unstructured data assets. The Metastore is the central metadata repository to understand tables backed by files on object storage. You can customize `fs_pyspark32_p38_cpu_v1`, publish it, and use it as a runtime environment for a Data Flow session cluster. The metastore id of hive metastore is tied to feature store construct of feature store service." ] @@ -249,7 +249,7 @@ "metastore_id = \"\"\n", "logs_bucket_uri = \"\"\n", "\n", - "custom_conda_environment_uri = \"oci://service-conda-packs@id19sfcrra6z/service_pack/cpu/PySpark_3.2_and_Feature_Store/1.0/fspyspark32_p38_cpu_v2#conda\"" + "custom_conda_environment_uri = \"oci://service-conda-packs@id19sfcrra6z/service_pack/cpu/PySpark_3.2_and_Feature_Store/1.0/fspyspark32_p38_cpu_v3#conda\"" ] }, { @@ -455,7 +455,7 @@ "feature_store_resource = FeatureStore(). \\\n", " with_description(\"Feature Store Description\"). \\\n", " with_compartment_id(compartment_id). \\\n", - " with_display_name(\"FeatureStore\"). 
\\\n", + " with_name(\"FeatureStore\"). \\\n", " with_offline_config(metastore_id=metastore_id)\n", "\n", "feature_store = feature_store_resource.create()\n", diff --git a/notebook_examples/index.json b/notebook_examples/index.json index 2514f65e..814b0246 100644 --- a/notebook_examples/index.json +++ b/notebook_examples/index.json @@ -596,7 +596,7 @@ "title": "XGBoost with RAPIDS" }, { - "developed_on": "fspyspark32_p38_cpu_v2", + "developed_on": "fspyspark32_p38_cpu_v3", "filename": "feature_store_quickstart.ipynb", "keywords": [ "pyspark", @@ -614,7 +614,7 @@ "title": "Feature Store Quickstart" }, { - "developed_on": "fspyspark32_p38_cpu_v2", + "developed_on": "fspyspark32_p38_cpu_v3", "filename": "feature_store_querying.ipynb", "keywords": [ "pyspark", @@ -632,7 +632,7 @@ "title": "Feature store handling querying operations" }, { - "developed_on": "fspyspark32_p38_cpu_v2", + "developed_on": "fspyspark32_p38_cpu_v3", "filename": "feature_store_schema_evolution.ipynb", "keywords": [ "pyspark", @@ -648,7 +648,7 @@ "title": "Schema Enforcement and Schema Evolution in Feature Store" }, { - "developed_on": "fspyspark32_p38_cpu_v2", + "developed_on": "fspyspark32_p38_cpu_v3", "filename": "feature_store_spark_magic.ipynb", "keywords": [ "pyspark", @@ -665,7 +665,7 @@ "title": "Data Flow Studio : Big Data Operations in Feature Store" }, { - "developed_on": "fspyspark32_p38_cpu_v2", + "developed_on": "fspyspark32_p38_cpu_v3", "filename": "feature_store_embeddings.ipynb", "keywords": [ "pyspark", @@ -682,7 +682,7 @@ "title": "Feature store embeddings using transformation" }, { - "developed_on": "fspyspark32_p38_cpu_v2", + "developed_on": "fspyspark32_p38_cpu_v3", "filename": "feature_store_medical_synthetic_data_openai.ipynb", "keywords": [ "pyspark", @@ -701,7 +701,7 @@ "title": "Feature store to perform synthetic data generation using openai" }, { - "developed_on": "fspyspark32_p38_cpu_v2", + "developed_on": "fspyspark32_p38_cpu_v3", "filename": "feature_store_pii_redaction_and_transformation.ipynb", "keywords": [ "pyspark", @@ -722,7 +722,7 @@ "title": "Feature store to perform PII data redaction, summarization, translation using openai and redact" }, { - "developed_on": "fspyspark32_p38_cpu_v2", + "developed_on": "fspyspark32_p38_cpu_v3", "filename": "feature_store_embeddings_openai.ipynb", "keywords": [ "pyspark", @@ -743,7 +743,7 @@ "title": "Use feature store to keep open ai embeddings using openai.Embedding.create" }, { - "developed_on": "fspyspark32_p38_cpu_v2", + "developed_on": "fspyspark32_p38_cpu_v3", "filename": "feature_store_ehr_data.ipynb", "keywords": [ "pyspark",