Comprehensive collection of data mining, machine learning, and deep learning sample codes and tutorials
This repository contains educational materials and practical implementations for:
- SJSU CMPE255 - Data Mining
- SJSU CMPE257 - Machine Learning
- SJSU CMPE258 - Deep Learning
- SJSU CMPE249 - Intelligent Autonomous Systems
- Comprehensive tutorials from basic Python to advanced deep learning
- Large Language Models (LLMs) - newly added section
- Google Colab integration (some examples require SJSU Google account)
- Modern documentation with Sphinx and Furo theme
- Multi-platform support (local, HPC, cloud)
Documentation: ReadTheDocs
- DeepDataMiningLearning
- Features
- Table of Contents
- Setup & Installation
- Sphinx Documentation
- Python Data Analytics
- Cloud Data Analytics
- Machine Learning Algorithms
- Deep Learning with PyTorch
- Deep Learning with TensorFlow
- Unsupervised Learning
- NLP and Text Mining
- SignalAI
Install this Python package via:
git clone https://github.com/lkk688/DeepDataMiningLearning.git
cd DeepDataMiningLearning
pip install flit
flit install --symlink
After installation, you can import the package in Python:
import DeepDataMiningLearning
Windows Permission Error: If you encounter OSError: [WinError 1314] A required privilege is not held by the client, enable Developer Mode:
- Go to Settings → System → For developers → Turn on Developer Mode
Detailed Documentation: See docs/python.rst for a comprehensive package description.
Launch Jupyter Lab:
jupyter lab --ip 0.0.0.0 --no-browser --allow-root
Foundation tutorials covering Python basics, NumPy, Pandas, data visualization, and exploratory data analysis
- Colab Tutorials:
- Colab basic features: colabfeatures
- Tutorial of Colab working with external data: colabexternaldata
- Python tutorial code: Python_tutorial.ipynb -- colablink
- Python NumPy tutorial code: Python NumPy tutorial -- colablink
- Data Mining introduction code:
- Python data apps based on streamlit:
- Streamlit tutorial: streamlittest
- Streamlit connect to data sources: streamlitdata
- Streamlit connect to Google Big Query: streamlitbigquery
- Deploy Streamlit to Google Cloud App Engine: streamlitappengine
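The Streamlit items above cover the basics; for orientation, a minimal Streamlit data app looks roughly like the sketch below (the file name app.py and the simulated data are placeholders, not files from this repo). Run it with `streamlit run app.py`.

```python
# app.py -- minimal Streamlit data-app sketch (hypothetical example)
import numpy as np
import pandas as pd
import streamlit as st

st.title("Demo data app")

# Simulated data standing in for a real data source (CSV, BigQuery, ...)
df = pd.DataFrame(
    np.random.randn(100, 2).cumsum(axis=0), columns=["series_a", "series_b"]
)

# Interactive widget: pick which column to plot
column = st.selectbox("Column to plot", df.columns)
st.line_chart(df[column])

# Show the raw table below the chart
st.dataframe(df.head(20))
```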
Google Cloud Platform integration for scalable data mining and analytics
- Data Mining based on Google Cloud:
- Google Cloud access via Colab: colablink
- Configure Gcloud, Google Cloud Storage, Compute Engine, Colab Terminal
- Google BigQuery with Colab/Jupyter introduction BigQuery-intro.ipynb -- colablink
- BigQuery setup, create and check BigQuery datasets
- Pandas EDA and visualization based on Natality dataset from BigQuery
- Weather data from Google BigQuery (revised based on Google's official sample), curve fitting via scipy
- COVID19 Data EDA and Visualization based on Google BigQuery (Fall 2022 updated): colablink
- COVID NYT data from BigQuery: prediction of CA cases via fbprophet, states with highest cases, cases (moving average) over time curve, heatmap of confirmed cases, statewise mask-usage habits, joined with population data from the Census, zipcode data, and the impact of mask usage.
- COVID-19 JHU data: case map view, top10 states, moving average, ARIMA model, save dataframe back to BigQuery
- Additional Google BigQuery examples: colablink
- Chicago Crime Dataset, Austin Waste Dataset, COVID Racial Dataset (race graph)
- Student Samples
- BigQuery ML examples: colablink
- COVID, CREDIT_CARD_FRAUD, Predict penguin weight, Natality, US Census Dataset Classification, time-series forecasting from Google Analytics data
- BigQuery Bigframe and ML examples (2024 update): colablink
- BigQuery DataFrames: bigframes
- BigQuery ML for supervised and unsupervised learning with an SKLearn-style API
- BigQuery LLM for text generation and code generation (Pandas API code)
- Text embedding, K-means clustering of text embeddings
- Use the PaLM 2 LLM to summarize text/complaints
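The BigQuery notebooks above share a common pattern: authenticate, submit SQL through the BigQuery client, and load the result into a Pandas DataFrame for EDA. A minimal sketch of that pattern (the project ID is a placeholder; the public NYT COVID table is one of the datasets referenced above and its exact path may differ):

```python
# Minimal BigQuery-to-Pandas sketch (project/table names are placeholders)
from google.cloud import bigquery

# In Colab you would typically authenticate first:
# from google.colab import auth; auth.authenticate_user()
client = bigquery.Client(project="your-gcp-project-id")

sql = """
SELECT date, SUM(confirmed_cases) AS cases
FROM `bigquery-public-data.covid19_nyt.us_states`
GROUP BY date
ORDER BY date
"""

df = client.query(sql).to_dataframe()  # needs pandas (and db-dtypes) installed
print(df.tail())
```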
Classical machine learning implementations with scikit-learn and custom algorithms
- Machine Learning introduction:
- MLIntro-Regression -- colablink
- Linear Regression via Normal Equation and own SGD implementation
- Diamond dataset from Kaggle, EDA, and predict the price of diamond via normal equation, numpy.linalg
- MLIntro-RegressionSKLearn -- colablink
- Simulation data: linear regression, polynomial regression, Ridge and Lasso regression
- Diamond dataset: Categorical Feature Encoding, Linear Regression and other regressors
- Boston housing dataset: Regression Model for All the variables, Ridge and Lasso Regression
- Diabetes Dataset: Linear regression, Ridge and Lasso Regression
- AUTO MPG Dataset: 1) Linear regression; 2) Polynomial Regression via SKlearn PolynomialFeatures; 3) Multiple Linear Regression; 4) SKLearn Pipelines combine feature engineering stages and the model into a single "model"; 5) ColumnTransformer and FunctionTransformer; 6) Cross Validation (see the Pipeline sketch after this list)
- MLIntro2-classification.ipynb --colablink
- Breast Cancer Dataset, Iris Dataset, BigQuery US Census Income Dataset, multiple classifiers.
- DecisionTree -- colablink
- SKlearn DecisionTree algorithm on the Iris dataset, Breast Cancer dataset, make_moons dataset, and DecisionTreeRegressor. A brief discussion of Gini impurity.
- GradientBoosting -- colablink
- Gradient boosting process, Gradient boosting regressor with scikit-learn, Gradient boosting classifier with scikit-learn
- XGBoost -- colablink
- XGBoost introduction, US Census Income Dataset from Big Query, UCI Dermatology dataset
- Naive Bayes Classifier -- colablink
- Bayesian Classification, Naive Bayes in SKlearn, Gaussian Naive Bayes, Multinomial Naive Bayes for classifying text (20 Newsgroups)
- Naive Bayes for MNIST from Torchvision
- NVIDIA RAPIDS installation, NVIDIA Cuml Naive Bayes vs Sklearn Naive Bayes based on News Aggregator dataset
- Support Vector Machine -- colablink
- Support Vector Machines introduction, SVM for Breast Cancer Dataset, SVM for Face Recognition
- SVM for MNIST in Pytorch, Neural Network MNIST Pytorch
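The AUTO MPG item in the list above combines feature engineering and the model with SKLearn Pipelines and ColumnTransformer; a minimal sketch of that pattern (column names here are illustrative, not the actual AUTO MPG schema):

```python
# Sketch of the Pipeline + ColumnTransformer pattern used in the regression notebooks
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

numeric_cols = ["weight", "horsepower"]      # illustrative numeric features
categorical_cols = ["origin"]                # illustrative categorical feature

preprocess = ColumnTransformer(
    transformers=[
        ("num", Pipeline([("scale", StandardScaler()),
                          ("poly", PolynomialFeatures(degree=2, include_bias=False))]),
         numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ]
)

# The whole pipeline behaves like a single estimator:
# model.fit(X_train, y_train); model.predict(X_test)
model = Pipeline([("preprocess", preprocess), ("regressor", LinearRegression())])
```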
Modern deep learning tutorials and implementations using PyTorch framework
Complete tutorial series with detailed documentation: ReadTheDocs
Tutorial 1: PyTorch Basics - CMPE_pytorch1
CMPE_pytorch1 Colab link
- PyTorch installation, Tensors, Tensor functions, Arithmetic operations
Tutorial 2: Regression & Classification - CMPE_pytorch2
CMPE_pytorch2 Colab link
- Polynomial fitting with NumPy and PyTorch
- Linear Regression, Binary/Multi-class Classification
- Logistic Regression with scikit-learn datasets, PyTorch Logistic Regression
Tutorial 3: Automatic Differentiation - CMPE_pytorch3_autograd
CMPE_pytorch3 Colab link
- Autograd system, Computing Gradients, Jacobian Product
- Sine wave fitting example
- PyTorch NN package, Optimizer package, nn.Module usage
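Tutorial 3's sine-fitting example boils down to letting autograd compute the gradients and applying manual SGD updates; a minimal sketch of that loop:

```python
# Minimal autograd sketch: fit y = sin(x) with a cubic polynomial via gradient descent
import math
import torch

x = torch.linspace(-math.pi, math.pi, 2000)
y = torch.sin(x)

# Polynomial coefficients, tracked by autograd
a, b, c, d = (torch.randn((), requires_grad=True) for _ in range(4))

lr = 1e-6
for step in range(2000):
    y_pred = a + b * x + c * x**2 + d * x**3
    loss = (y_pred - y).pow(2).sum()
    loss.backward()                      # gradients of loss w.r.t. a, b, c, d
    with torch.no_grad():                # manual SGD update, then reset gradients
        for p in (a, b, c, d):
            p -= lr * p.grad
            p.grad = None
print(f"final loss: {loss.item():.3f}")
```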
Tutorial 4: Neural Networks & MNIST - CMPE_pytorch4_MNIST
CMPE_pytorch4 Colab link
- MNIST Dataset handling
- torch.nn.Module and torch.nn.functional
- nn.Sequential and OrderedDict
- Loss functions (NLLLoss, CrossEntropyLoss)
- Regularization techniques (L1, L2, Elastic Net)
- Compare CrossEntropyLoss, NLLLoss, and weight decay
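A minimal sketch of how Tutorial 4's pieces fit together: an nn.Sequential classifier, CrossEntropyLoss, and L2 regularization via the optimizer's weight_decay (layer widths and the random stand-in batch are arbitrary):

```python
# nn.Sequential + CrossEntropyLoss + weight_decay (L2) sketch
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),                # 28x28 MNIST image -> 784-dim vector
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10),          # 10 digit classes (raw logits)
)

criterion = nn.CrossEntropyLoss()                               # LogSoftmax + NLLLoss
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

# One illustrative training step on random data standing in for an MNIST batch
images, labels = torch.randn(64, 1, 28, 28), torch.randint(0, 10, (64,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(loss.item())
```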
Tutorial 5: CNN Fundamentals - CMPE_pytorch5_imageclassification
CMPE_pytorch5 Colab link
- Multi-Layer Perceptron for MNIST and FashionMNIST
- CNN Filters and Feature Maps
- CNN architecture for CIFAR Dataset
- Comparison of MLP and CNN for MNIST, FashionMNIST, and CIFAR
Tutorial 6: Classic CNN Architectures - CMPE_pytorch6_mlp2resnet
- Multi-layer Perceptron baseline
- LeNet architecture implementation
- AlexNet for large-scale image classification
- VGG network variations
- ResNet and residual connections
Tutorial 7: Transfer Learning & TorchVision - CMPE_pytorch7_torchvision
- Flower dataset processing and handling
- VGGNet Transfer Learning implementation
- Pre-trained models (EfficientNet, ResNet, DenseNet, ViT)
- Feature extraction vs fine-tuning strategies
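Tutorial 7 contrasts feature extraction (freeze the pre-trained backbone, train only a new head) with fine-tuning (update all weights at a lower learning rate); a minimal torchvision sketch of the feature-extraction side (assumes the torchvision >= 0.13 weights API; the 5-class head is arbitrary):

```python
# Feature-extraction sketch with a pre-trained torchvision ResNet
# (fine-tuning would instead keep requires_grad=True and use a smaller LR)
import torch
import torch.nn as nn
from torchvision import models

num_classes = 5  # e.g., a small flower dataset

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():          # freeze the pre-trained backbone
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new trainable head

# Only the new head's parameters are handed to the optimizer
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```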
Tutorial 8: Advanced Training with TIMM - CMPE_pytorch8_timm
- Oxford-IIIT Pet Dataset and Imagenette
- TIMM library integration and model customization
- Advanced data augmentation (RandAugment, CutMix, Mixup)
- Learning rate scheduling and optimization
- Exponential Moving Average (EMA) models
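Tutorial 8 builds on the timm library; a minimal sketch of creating a pre-trained timm model with a custom head and the matching input transforms (the model name and class count are just examples):

```python
# Minimal timm sketch: pre-trained backbone with a new classification head
import timm
import torch

# Any name from timm.list_models(pretrained=True) works here
model = timm.create_model("resnet50", pretrained=True, num_classes=37)  # e.g., Oxford-IIIT Pet

# Build the transforms that match the model's pretraining configuration
cfg = timm.data.resolve_data_config({}, model=model)
transform = timm.data.create_transform(**cfg)

x = torch.randn(2, 3, 224, 224)
print(model(x).shape)   # -> torch.Size([2, 37])
```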
Tutorial 9: Inference Optimization - CMPE_pytorch9_inferenceoptimization
- Model export to TorchScript
- LibTorch installation and usage
- TIMM model conversion to TorchScript
- ONNX export for cross-platform deployment
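Tutorial 9's export steps look roughly like this sketch, tracing a model to TorchScript for LibTorch and exporting the same model to ONNX (the untrained ResNet-18 and file names are placeholders):

```python
# Export sketch: TorchScript (trace) and ONNX from the same model
import torch
from torchvision import models

model = models.resnet18(weights=None).eval()
example = torch.randn(1, 3, 224, 224)

# TorchScript: trace the model and save it for LibTorch / C++ deployment
scripted = torch.jit.trace(model, example)
scripted.save("resnet18_traced.pt")

# ONNX: export for cross-platform runtimes (ONNX Runtime, TensorRT, ...)
torch.onnx.export(
    model, example, "resnet18.onnx",
    input_names=["input"], output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}, "logits": {0: "batch"}},
)
```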
Tutorial 10: Hugging Face Integration - CMPE_pytorch10_huggingfaceimage
- Hugging Face Transformers for computer vision
- Training on Food-101, CIFAR-10, and Flower datasets
- Pre-trained vision models from Hugging Face Hub
- Pytorch introduction code (archived):
- Pytorch Introductions, Tensors, and Autograd: colablink
- Pytorch Regression and Logistic Regression: colablink
- Pytorch Simple Neural Networks: colab
- Pytorch NN modules: pytorch_nn.ipynb
- Pytorch advanced image classification:
- Pytorch image classification introduction (MNIST, CNN filters, CIFAR, VGGNet, Flowers): colablink
- Pytorch Single GPU image classification with/without automatic mixed precision (AMP) training: singleGPU (a minimal AMP sketch follows this list)
- Pytorch Multi-GPU image classification: multiGPU
- Pytorch Multi-GPU DDP test: testTorchDDP
- Pytorch Torchvision image classification (Efficientnet) notebook on HPC: torchvisionHPC.ipynb
- Pytorch Torchvision vision transformer (ViT) notebook on HPC: torchvisionvitHPC.ipynb
- Pytorch ViT implement from scratch on HPC: ViTHPC.ipynb
- Pytorch ImageNet classification example: imagenet
- Pytorch inference example for top-k class: inference.py
- TIMM models: testtimm.ipynb
- Huggingface Transformers for Image: hfvisionmain.py
- Huggingface Images via Transformers: huggingfaceimage.ipynb
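The single-GPU script above trains with and without automatic mixed precision (AMP); the core AMP training-step pattern is roughly this sketch (the tiny model and synthetic batch are stand-ins for a real classifier and dataloader):

```python
# Automatic mixed precision (AMP) training-step sketch with synthetic data
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

for step in range(10):
    images = torch.randn(32, 3, 32, 32, device=device)   # stand-in for a CIFAR batch
    labels = torch.randint(0, 10, (32,), device=device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=(device == "cuda")):
        loss = criterion(model(images), labels)           # forward pass in mixed precision
    scaler.scale(loss).backward()     # scaled backward avoids fp16 underflow
    scaler.step(optimizer)            # unscales gradients, then steps the optimizer
    scaler.update()
print(loss.item())
```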
- Advanced Multi-Modal Image Classification: githubrepo
- General-purpose framework for all-in-one image classification for Tensorflow and Pytorch
- Support for multiple datasets: imagenet_blurred, tiny-imagenet-200, hymenoptera_data, CIFAR10, MNIST, flower_photos
- Support for multiple custom models ('mlpmodel1', 'lenet', 'alexnet', 'resnetmodel1', 'customresnet', 'vggmodel1', 'vggcustom', 'cnnmodel1'), as well as all models from Torchvision and TorchHub
- Support for HPC training and evaluation
Other Deep Learning sample code based on Pytorch (under the "DeepDataMiningLearning" folder)
- Siamese network: siamese_network
- TensorRT example: tensorrt.ipynb
- Object detection (other repo)
- MultiModalDetector
- myyolov7: Adds YOLOv5 models to YOLOv7; trained on the COCO and WaymoCOCO datasets.
- myyolov5: My fork of YOLOv5; converts COCO to YOLO format and restructures the code as the base for YOLOv4, YOLOv5, and ScaledYOLOv4; trained on the COCO and WaymoCOCO datasets.
- WaymoObjectDetection
- Waymo Dataset Conversion to COCO format: WaymoCOCO
- torchvision_waymococo_train.py: performs Pytorch FasterRCNN training on the converted Waymo COCO-format data; this version can be applied to any dataset with COCO-format annotations
- WaymoCOCODetectron2train.py: WaymoCOCO training based on Detectron2
- mymmdetection2dtrain.py: Object Detection training and evaluation based on MMdetection2D
- CustomDetectron2
Legacy TensorFlow implementations for reference and learning. Deep learning notebooks (the Colab links are preferred):
- Tensorflow introduction code: CMPE-Tensorflow1.ipynb -- colablink
- Tensorflow image classification:
- Road sign data from Kaggle example: Tensorflow-Roadsignclassification.ipynb, colablink
- Flower dataset example with TF Dataset, TFRecord, Google Cloud Storage, TPU/GPU acceleration: colablink
Dimensionality reduction, clustering, and manifold learning techniques
- Unsupervised Learning Jupyter notebooks
- PCA: colablink
- Affine Transformation via Matrix Application, eigenvalues and eigenvectors, eigendecomposition
- Numpy/SKlearn SVD, SVD for linear regression
- Principal Component Analysis in SKLearn, PCA as dimensionality reduction, PCA for SkLearn Iris dataset
- PCA for Image Compression, PCA for digits and noise filtering, Grey Image Example, Color Image Example, eigenfaces, PCA vs LDA vs NCA
- Manifold Learning: colablink
- Multidimensional Scaling (MDS), Locally Linear Embedding (LLE), Isomap Embedding, T-distributed Stochastic Neighbor Embedding (t-SNE) for the HELLO, S-Curve, and Swiss roll datasets; Isomap on faces; Regression with Manifold Learning
- Clustering: colablink
- K-Means, Gaussian Mixture Models, Spectral Clustering, DBSCAN
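A minimal sketch of the PCA-as-dimensionality-reduction idea from the notebook above, projecting the SKLearn Iris dataset onto two principal components:

```python
# PCA dimensionality-reduction sketch on the SKLearn Iris dataset
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)       # PCA is scale-sensitive

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)                 # 4 features -> 2 components

print("explained variance ratio:", pca.explained_variance_ratio_)
print("projected shape:", X_2d.shape)              # (150, 2)
```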
Natural Language Processing and Large Language Models (LLMs) implementations
Text Representations - Colab Notebook
- One-Hot encoding, Bag-of-Words, TF-IDF
- Word2Vec with Gensim (Wikipedia and Shakespeare examples)
- Data gathering from Google and WordCloud visualization
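A minimal sketch of the Bag-of-Words and TF-IDF representations listed above (the three-sentence corpus is made up):

```python
# Bag-of-Words and TF-IDF sketch on a toy corpus
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "data mining finds patterns in data",
    "deep learning learns representations from data",
    "text mining applies data mining to text",
]

bow = CountVectorizer()
X_bow = bow.fit_transform(corpus)          # sparse term-count matrix
print(bow.get_feature_names_out())

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)      # counts reweighted by inverse document frequency
print(X_tfidf.shape)                       # (3 documents, vocabulary size)
```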
Text Extraction & NLTK - Colab Notebook
- Text extraction using Textract
- NLTK preprocessing pipelines
TensorFlow Text Processing - Colab Notebook
- Keras embedding layers
- Sentiment classification
- Skip-gram Word2Vec implementation
Text Classification Models - Colab Notebook
- RNN, LSTM, Transformer, BERT architectures
Twitter NLP Pipeline - Colab Notebook
- NLTK preprocessing
- LSTM, Bi-LSTM, GRU models
- BERT fine-tuning
- Recommendation
State-of-the-art NLP models using Hugging Face Transformers library
Foundation Tutorials:
- HuggingfaceTransformers.ipynb - Introduction notebook
- huggingfacetest.py - Basic usage examples
- hfdataset.py - Dataset handling
- huggingfaceHPC.ipynb - HPC training setup
- huggingfaceHPCdata.py - HPC data processing
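For orientation, the quickest way to exercise the Transformers library used by these scripts is the pipeline API; a minimal sketch (the sentiment task and default model are illustrative, not something the repo prescribes):

```python
# Minimal Hugging Face Transformers sketch using the pipeline API
from transformers import pipeline

# Downloads a default sentiment model on first use; pass model=... to pin one
classifier = pipeline("sentiment-analysis")
print(classifier("Data mining with deep learning is surprisingly fun."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```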
BERT Applications:
- BERTMTLfakehate.py - Multi-task learning for fake news and hate speech
- MLTclassifier.py - Multi-label text classification
- huggingfaceClassifierNER.ipynb - Classification and NER
Multi-modal Classification:
- huggingfaceclassifier2.py - Advanced multi-modal classifier
- huggingfaceclassifier.py - Basic multi-modal classifier
Translation & Summarization:
- huggingfaceSequence.ipynb - Translation and summarization models
Question Answering:
- huggingfaceQA.py - Q&A system implementation
Conversational AI:
- huggingfacechatbot.ipynb - Chatbot development
Custom Implementation:
- torchtransformer.py - Transformer from scratch in PyTorch
Open Source LLMs
- BERTLM.ipynb
- Masked Language Modeling: huggingfaceLM.ipynb
- llama2
LLMs Apps based on OpenAI API
LLMs Apps based on LangChain
Train a basic language model with plain Pytorch and the Torchtext WikiText2 dataset on HPC.
python nlp/torchtransformer.py
| epoch 1 | 2800/ 2928 batches | lr 5.00 | ms/batch 5.58 | loss 3.31 | ppl 27.49
-----------------------------------------------------------------------------------------
| end of epoch 1 | time: 24.00s | valid loss 1.96 | valid ppl 7.08
-----------------------------------------------------------------------------------------
| epoch 2 | 200/ 2928 batches | lr 4.75 | ms/batch 5.84 | loss 3.07 | ppl 21.57
| epoch 2 | 2800/ 2928 batches | lr 4.75 | ms/batch 5.49 | loss 2.58 | ppl 13.26
-----------------------------------------------------------------------------------------
| end of epoch 2 | time: 1655.94s | valid loss 1.52 | valid ppl 4.57
-----------------------------------------------------------------------------------------
| epoch 3 | 200/ 2928 batches | lr 4.51 | ms/batch 5.04 | loss 2.41 | ppl 11.15
-----------------------------------------------------------------------------------------
| end of epoch 3 | time: 15.41s | valid loss 1.44 | valid ppl 4.22
-----------------------------------------------------------------------------------------
=========================================================================================
| End of training | test loss 1.40 | test ppl 4.06
=========================================================================================
Train Masked Language model:
(mycondapy310) [010796032@cs001 DeepDataMiningLearning]$ python nlp/huggingfaceLM2.py --data_name="eli5" --model_checkpoint="distilroberta-base" --task="CLM" --subset=5000 --traintag="1115CLM" --usehpc=True --gpuid=1 --batch_size=32 --learningrate=2e-5
python nlp/huggingfaceLM2.py
data_type=huggingface data_name=eli5 dataconfig= subset=0 data_path=/data/cmpe249-fa23/Huggingfacecache model_checkpoint=distilroberta-base task=MLM unfreezename= outputdir=./output traintag=1116MLM training=True usehpc=False gpuid=0 total_epochs=8 save_every=2 batch_size=32 learningrate=2e-05
Trainoutput folder: ./output\distilroberta-base\eli5_1116MLM
....
Epoch 1: Perplexity: 12.102828644322578 | 6590/26360 [16:06:09<34:50:06, 6.34s/it]
Epoch 2: Perplexity: 15.187707787848385 | 9885/26360 [24:00:07<28:57:25, 6.33s/it]
Epoch 3: Perplexity: 15.063201196763071 | 13180/26360 [31:52:08<23:12:51, 6.34s/it]
Epoch 4: Perplexity: 16.583895970053355 | 16475/26360 [39:44:32<17:23:28, 6.33s/it]
Epoch 5: Perplexity: 16.27479412837067 | 19770/26360 [47:36:46<11:34:43, 6.33s/it]
Epoch 6: Perplexity: 16.424729093343636 | 23065/26360 [55:28:38<5:47:18, 6.32s/it]
Epoch 7: Perplexity: 17.22636450834783
Train GPT2 language models
(mycondapy310) [010796032@cs001 DeepDataMiningLearning]$ python nlp/huggingfaceLM2.py --model_checkpoint="gpt2" --task="CLM" --traintag="1115gpt2" --usehpc=True --gpuid=2 --batch_size=16
Train the Llama-2 7B model, unfreezing only the last layer: "model.layers.31" (needs 500GB) or "lm_head" (needs 40GB)
(mycondapy310) [010796032@cs001 DeepDataMiningLearning]$ python nlp/huggingfaceLM2.py --model_checkpoint="Llama-2-7b-chat-hf" --task="CLM" --unfreezename="lm_head" --traintag="1115llama2" --usehpc=True --gpuid=2 --batch_size=8
python nlp/huggingfaceLM2.py --model_checkpoint="Llama-2-7b-chat-hf" --pretrained=="/data/cmpe249-fa23/trainoutput/huggingface/Llama-2-7b-chat-hf/eli5_1115llama2/savedmodel.pth" --task="CLM" --unfreezename="lm_head" --traintag="1119llama2" --usehpc=True --gpuid=0 --batch_size=8
.....
Epoch 0: Perplexity: 9.858825392857694 | 2627/2627 [12:39:17<00:00, 3.30s/it]
Epoch 1: Perplexity: 10.051054027867561 | 21014/84056 [22:50:31<56:09:49, 3.21s/it]
Epoch 2: Perplexity: 10.181400762228291 | 31521/84056
Epoch 0: Perplexity: 9.289763256151375
Epoch 1: Perplexity: 9.530650993830372
Epoch 2: Perplexity: 9.692566051540275
Train translation models based on huggingfaceSequence
python nlp/huggingfaceSequence.py --data_name="kde4" --model_checkpoint="Helsinki-NLP/opus-mt-en-fr" --task="Seq2SeqLM" --traintag="1116" --usehpc=True --gpuid=0 --batch_size=8
epoch 0, BLEU score: 51.78
epoch 1, BLEU score: 52.73
epoch 2, BLEU score: 54.
epoch 3, BLEU score: 54.
epoch 4, BLEU score: 55.
epoch 5, BLEU score: 55.
epoch 6, BLEU score: 54.
epoch 7, BLEU score: 55.
python nlp/huggingfaceSequence.py --data_name="opus100" --model_checkpoint="facebook/wmt21-dense-24-wide-en-x" --task="Seq2SeqLM" --traintag="1121" --usehpc=True --gpuid=1 --batch_size=8
(mycondapy310) [010796032@cs001 DeepDataMiningLearning]$ python nlp/huggingfaceSequence2.py --data_name="opus100" --subset=0 --model_checkpoint="Helsinki-NLP/opus-mt-en-zh" --task="Seq2SeqLM" --traintag="1122" --evaluate="" --usehpc=True --gpuid=1 --batch_size=32
python nlp/huggingfaceSequence2.py --data_name="opus100" --subset=0 --model_checkpoint="Helsinki-NLP/opus-mt-en-zh" --pretrained="/data/cmpe249-fa23/trainoutput/huggingface/Helsinki-NLP/opus-mt-en-zh/opus100_1122/savedmodel.pth" --task="Seq2SeqLM" --target_lang="zh" --traintag="1122" --evaluate=True --usehpc=True --gpuid=1 --total_epochs=16 --batch_size=32
.....
epoch 14, BLEU score: 48.55
epoch 15, BLEU score: 48.53
python nlp/huggingfaceSequence2.py --data_name="wmt19" --subset=0 --model_checkpoint="Helsinki-NLP/opus-mt-en-zh" --task="Seq2SeqLM" --target_lang="zh" --traintag="1123" --evaluate="localevaluate" --usehpc=True --gpuid=2 --total_epochs=16 --batch_size=32
(mycondapy310) [010796032@cs002 DeepDataMiningLearning]$ python nlp/huggingfaceSequence2.py --data_name="wmt19" --subset=10000 --model_checkpoint="t5-base" --task="Seq2SeqLM" --target_lang="zh" --traintag="1123" --useHFaccelerator=True --evaluate="localevaluate" --usehpc=True --gpuid=2 --total_epochs=16 --batch_size=64
.....
Trainoutput folder: /data/cmpe249-fa23/trainoutput/huggingface/t5-base/wmt19_1123
epoch 0, BLEU score: 2.66
epoch 1, BLEU score: 3.99
epoch 2, BLEU score: 5.31
epoch 3, BLEU score: 6.64
epoch 4, BLEU score: 8.07
epoch 5, BLEU score: 9.51
epoch 9, BLEU score: 14.39
epoch 14, BLEU score: 18.99
epoch 15, BLEU score: 19.77
(mycondapy310) [010796032@cs002 DeepDataMiningLearning]$ python nlp/huggingfaceSequence2.py --data_name="wmt19" --subset=50000 --model_checkpoint="t5-base" --task="Seq2SeqLM" --target_lang="zh" --traintag="1124" --pretrained="/data/cmpe249-fa23/trainoutput/huggingface/t5-base/wmt19_1123/savedmodel.pth" --useHFaccelerator=True --evaluate="localevaluate" --usehpc=True --gpuid=2 --total_epochs=32 --batch_size=64
epoch 16, BLEU score: 50.83
epoch 31, BLEU score: 56.57
python nlp/huggingfaceSequence2.py --data_name="wmt19" --subset=50000 --model_checkpoint="t5-base" --task="Seq2SeqLM" --target_lang="zh" --source_prefix="translate English to Chinese: " --traintag="1124" --pretrained="/data/cmpe249-fa23/trainoutput/huggingface/t5-base/wmt19_1124/savedmodel.pth" --useHFaccelerator=True --evaluate="localevaluate" --usehpc=True --total_epochs=48 --batch_size=64
Trainoutput folder: /data/cmpe249-fa23/trainoutput/huggingface/t5-base/wmt19_1124
epoch 32, BLEU score: 9.67
epoch 33, BLEU score: 13.95
epoch 35, BLEU score: 20.78
epoch 45, BLEU score: 37.60
epoch 47, BLEU score: 39.47
python nlp/huggingfaceSequence2.py --data_name="wmt19" --subset=50000 --model_checkpoint="t5-base" --task="Seq2SeqLM" --target_lang="zh" --source_prefix="translate English to Chinese: " --traintag="1124" --pretrained="/data/cmpe249-fa23/trainoutput/huggingface/t5-base/wmt19_1124/savedmodel.pth" --useHFaccelerator=True --evaluate="localevaluate" --usehpc=True --total_epochs=64 --batch_size=64
epoch 48, BLEU score: 60.04
epoch 63, BLEU score: 59.26
$ python nlp/huggingfaceSequence2.py --data_name="wmt19" --subset=0 --model_checkpoint="t5-base" --task="Seq2SeqLM" --target_lang="zh" --source_prefix="translate English to Chinese: " --traintag="1124" --pretrained="/data/cmpe249-fa23/trainoutput/huggingface/t5-base/wmt19_1124/savedmodel.pth" --useHFaccelerator=False --evaluate="localevaluate" --usehpc=True --gpuid=2 --total_epochs=80 --batch_size=64
Train T5-base on a local computer
/nlp/huggingfaceSequence2.py
data_type=huggingface data_name=opus100 dataconfig= subset=0 data_path=/data/cmpe249-fa23/Huggingfacecache model_checkpoint=t5-base task=Seq2SeqLM evaluate=True source_lang=en target_lang=zh source_prefix=None pretrained= unfreezename= outputdir=./output traintag=1122 training=True usehpc=False useHFaccelerator=False gpuid=0 total_epochs=8 save_every=2 batch_size=16 learningrate=2e-05 lr_scheduler_type=linear weight_decay=0.0 gradient_accumulation_steps=1 pad_to_max_length=True max_source_length=128 max_target_length=128 num_beams=1
Trainoutput folder: ./output\t5-base\opus100_1122
epoch 0, BLEU score: 48.70
epoch 1, BLEU score: 50.81
epoch 2, BLEU score: 50.24
epoch 3, BLEU score: 51.93
epoch 4, BLEU score: 52.34
epoch 5, BLEU score: 52.54
epoch 6, BLEU score: 52.71
epoch 7, BLEU score: 52.91
(mycondapy39) PS C:\Users\lkk68\Documents\GitHub\DeepDataMiningLearning> cat .\output\t5-base\opus100_1122\eval_results.json
{"eval_bleu": 52.909234849408264}
nlp/huggingfaceSequence2.py
data_type=huggingface data_name=opus_books dataconfig= subset=0 data_path=/data/cmpe249-fa23/Huggingfacecache model_checkpoint=t5-base task=Seq2SeqLM evaluate=localevaluate source_lang=en target_lang=fr source_prefix=None pretrained= unfreezename= outputdir=./output traintag=1124 training=True usehpc=False useHFaccelerator=False gpuid=0 total_epochs=16 save_every=2 batch_size=16 learningrate=2e-05 lr_scheduler_type=linear weight_decay=0.0 gradient_accumulation_steps=1 pad_to_max_length=True max_source_length=128 max_target_length=128 num_beams=1
Trainoutput folder: ./output\t5-base\opus_books_1124
HF evaluator: 24.46
epoch 0, BLEU score: 24.47
HF evaluator: 26.00
epoch 14, BLEU score: 26.00
HF evaluator: 25.89
epoch 15, BLEU score: 25.90
Train a summarization model on the "cnn_dailymail" dataset
(mycondapy310) [010796032@cs001 DeepDataMiningLearning]$ python nlp/huggingfaceSequence3.py --data_name="cnn_dailymail" --subset=0 --model_checkpoint="t5-base" --training --usehpc --task="summarization" --source_prefix="summarize: " --traintag="1125" --gpuid=1 --total_epochs=8 --batch_size=32
useHFevaluator: False
dualevaluator: False
data_type=huggingface data_name=cnn_dailymail dataconfig= subset=0.0 data_path=/data/cmpe249-fa23/Huggingfacecache model_checkpoint=t5-base task=summarization hfevaluate=False dualevaluate=False source_lang=en target_lang=fr source_prefix=summarize: pretrained= unfreezename= outputdir=./output traintag=1125 training=True usehpc=True useHFaccelerator=False gpuid=1 total_epochs=8 save_every=2 batch_size=32 learningrate=2e-05 lr_scheduler_type=linear weight_decay=0.0 gradient_accumulation_steps=1 pad_to_max_length=True max_source_length=128 max_target_length=128 num_beams=1
Trainoutput folder: /data/cmpe249-fa23/trainoutput/huggingface/t5-base/cnn_dailymail_1125
epoch 0, evaluation metric: rouge
Evaluation result: {'rouge1': AggregateScore(low=Score(precision=0.41197173103605056, recall=0.3119407931701639, fmeasure=0.3419831177812576), mid=Score(precision=0.41485220723500893, recall=0.3142588821867073, fmeasure=0.3441474199740058), high=Score(precision=0.417600698303463, recall=0.3165054333064982, fmeasure=0.3464475097686251)), 'rouge2': AggregateScore(low=Score(precision=0.17079493326536863, recall=0.12954017819776567, fmeasure=0.14168846158539045), mid=Score(precision=0.1732732332588367, recall=0.13143856889123576, fmeasure=0.14365393002489873), high=Score(precision=0.17556212138667857, recall=0.1333017602307039, fmeasure=0.14553710285584648)), 'rougeL': AggregateScore(low=Score(precision=0.2965678275511713, recall=0.22464618504694173, fmeasure=0.24595299928001602), mid=Score(precision=0.2989426482943214, recall=0.22661167405971072, fmeasure=0.24784869235990342), high=Score(precision=0.30157116940828527, recall=0.22857528928797705, fmeasure=0.24990879681324848)), 'rougeLsum': AggregateScore(low=Score(precision=0.29642059763679396, recall=0.22452637387862667, fmeasure=0.2458545440066565), mid=Score(precision=0.298870626902608, recall=0.22656017215162805, fmeasure=0.24780842037777753), high=Score(precision=0.3013588263275077, recall=0.22874002983645225, fmeasure=0.24986052268174044))}
.....
epoch 7, evaluation metric: rouge
Evaluation result: {'rouge1': AggregateScore(low=Score(precision=0.4106620116159846, recall=0.3359116786022911, fmeasure=0.3571395048710354), mid=Score(precision=0.4113719796401594, recall=0.3365233125614775, fmeasure=0.3577153054159934), high=Score(precision=0.41210309679230933, recall=0.3371437303658093, fmeasure=0.3583162407770864)), 'rouge2': AggregateScore(low=Score(precision=0.17094983781529516, recall=0.140263702081492, fmeasure=0.1487497359572444), mid=Score(precision=0.17152532221996442, recall=0.14077116290613728, fmeasure=0.1492659697570951), high=Score(precision=0.17217316198104332, recall=0.14133598283522728, fmeasure=0.14982901584825192)), 'rougeL': AggregateScore(low=Score(precision=0.29437670375737585, recall=0.241532114907138, fmeasure=0.2562224943410107), mid=Score(precision=0.295004497912973, recall=0.24202153711108693, fmeasure=0.256706856828388), high=Score(precision=0.29558396252033226, recall=0.2425496385594932, fmeasure=0.25720842817590284)), 'rougeLsum': AggregateScore(low=Score(precision=0.2943661396563572, recall=0.24149928244505112, fmeasure=0.2561939698587155), mid=Score(precision=0.29501206428652565, recall=0.24205477312718807, fmeasure=0.25673184483277667), high=Score(precision=0.2956055704727816, recall=0.2425905810654754, fmeasure=0.25723630538221925))}
Train a summarization model on the "billsum" dataset
python nlp/huggingfaceSequence3.py --data_name="billsum" --subset=0 --model_checkpoint="t5-base" --training --usehpc --task="summarization" --source_prefix="summarize: " --traintag="1125" --gpuid=2 --total_epochs=8 --batch_size=64
epoch 0, evaluation metric: rouge
Evaluation result: {'rouge1': AggregateScore(low=Score(precision=0.47626942983509496, recall=0.25432053054933895, fmeasure=0.3104459523870372), mid=Score(precision=0.48118215557348354, recall=0.25758086912668254, fmeasure=0.31351244843801257), high=Score(precision=0.48633004345160047, recall=0.2610377756856256, fmeasure=0.31666293379599264)), 'rouge2': AggregateScore(low=Score(precision=0.22553581027732506, recall=0.1163356040224796, fmeasure=0.1426990222648773), mid=Score(precision=0.22980182467297072, recall=0.11916438705516172, fmeasure=0.14559973720411068), high=Score(precision=0.23448683629864495, recall=0.12186100986547159, fmeasure=0.14852517085299424)), 'rougeL': AggregateScore(low=Score(precision=0.37475945811360656, recall=0.20056913183887007, fmeasure=0.24410014882029277), mid=Score(precision=0.3791703619589465, recall=0.2035902141913645, fmeasure=0.24684120728025757), high=Score(precision=0.38361974259418913, recall=0.20659199136456374, fmeasure=0.24957227859403597)), 'rougeLsum': AggregateScore(low=Score(precision=0.3748793629512802, recall=0.200607724566369, fmeasure=0.24403997308680897), mid=Score(precision=0.379214986503034, recall=0.20357105315951063, fmeasure=0.24678602914830827), high=Score(precision=0.38368049681408073, recall=0.20677635875973946, fmeasure=0.2497749989951042))}
.....
epoch 7, evaluation metric: rouge
Evaluation result: {'rouge1': AggregateScore(low=Score(precision=0.4762322399523528, recall=0.3612750168693367, fmeasure=0.3859539337308937), mid=Score(precision=0.47741449894622395, recall=0.36223979353915375, fmeasure=0.3867387680371833), high=Score(precision=0.47857623413135664, recall=0.36315881475914685, fmeasure=0.387529008724504)), 'rouge2': AggregateScore(low=Score(precision=0.23795846378692775, recall=0.17739298212614543, fmeasure=0.1895061898823578), mid=Score(precision=0.23901236113717203, recall=0.17821036280792862, fmeasure=0.1902598464683951), high=Score(precision=0.24012950529243016, recall=0.1789825350527308, fmeasure=0.19103010704737328)), 'rougeL': AggregateScore(low=Score(precision=0.38008605052931843, recall=0.28951287826449035, fmeasure=0.307987126768515), mid=Score(precision=0.3811360590765148, recall=0.29038102311603803, fmeasure=0.30870601375792006), high=Score(precision=0.38221337267972383, recall=0.2911927830141367, fmeasure=0.3094361849338897)), 'rougeLsum': AggregateScore(low=Score(precision=0.3800857948090643, recall=0.2895777984554257, fmeasure=0.3080188262526704), mid=Score(precision=0.38110308526126024, recall=0.2903936893613658, fmeasure=0.3087163027737505), high=Score(precision=0.3821455084830249, recall=0.29123147942293837, fmeasure=0.3094164695199439))}
Train a summarization model on the "xsum" dataset
(mycondapy310) [010796032@cs001 DeepDataMiningLearning]$ python nlp/huggingfaceSequence3.py --data_name="xsum" --subset=0 --model_checkpoint="t5-base" --training --usehpc --task="summarization" --source_prefix="summarize: " --traintag="1125" --gpuid=1 --total_epochs=8 --batch_size=64
epoch 0, evaluation metric: rouge
Evaluation result: {'rouge1': AggregateScore(low=Score(precision=0.2206634425318806, recall=0.27398080027290334, fmeasure=0.23345078599469463), mid=Score(precision=0.2228479631573893, recall=0.2758314532199041, fmeasure=0.2352387063491203), high=Score(precision=0.22497474810005727, recall=0.27769139479498234, fmeasure=0.23719704189066906)), 'rouge2': AggregateScore(low=Score(precision=0.04855549563663333, recall=0.053939101970654914, fmeasure=0.04875180161702199), mid=Score(precision=0.049879370509959026, recall=0.05516244414389607, fmeasure=0.04995842715629054), high=Score(precision=0.05118590600673647, recall=0.05640622198150646, fmeasure=0.05111651526817903)), 'rougeL': AggregateScore(low=Score(precision=0.1618905690819521, recall=0.19780661742397618, fmeasure=0.16987613018217562), mid=Score(precision=0.16369315145103713, recall=0.19940773965682818, fmeasure=0.17143884336484203), high=Score(precision=0.16554847486855678, recall=0.20099105646119897, fmeasure=0.17303926810412723)), 'rougeLsum': AggregateScore(low=Score(precision=0.16171036371143727, recall=0.19775741367206912, fmeasure=0.16972996097941762), mid=Score(precision=0.16364819272318623, recall=0.19936460332720274, fmeasure=0.1713745081258375), high=Score(precision=0.16540503402328743, recall=0.20093144657877907, fmeasure=0.17297056660820934))}
....
epoch 7, evaluation metric: rouge
Evaluation result: {'rouge1': AggregateScore(low=Score(precision=0.3157722202905392, recall=0.30514034915305777, fmeasure=0.30098661341663313), mid=Score(precision=0.3164460427364083, recall=0.30573870098459877, fmeasure=0.3015941401174891), high=Score(precision=0.31710221504502506, recall=0.30627353921164463, fmeasure=0.3021231398386561)), 'rouge2': AggregateScore(low=Score(precision=0.09545526361853192, recall=0.08805500161867533, fmeasure=0.08929094188051703), mid=Score(precision=0.0959138137320461, recall=0.08846040412480724, fmeasure=0.08970127642926173), high=Score(precision=0.09638196118158575, recall=0.08887820756655522, fmeasure=0.09011808402709783)), 'rougeL': AggregateScore(low=Score(precision=0.2439907968202854, recall=0.23420304190160135, fmeasure=0.23192429294508443), mid=Score(precision=0.2445882204337065, recall=0.23468338778240327, fmeasure=0.23243230200308637), high=Score(precision=0.2451722573264371, recall=0.23517123971453874, fmeasure=0.2329239687124055)), 'rougeLsum': AggregateScore(low=Score(precision=0.24400824561132867, recall=0.23421060757686038, fmeasure=0.2319498933428253), mid=Score(precision=0.24459781572922448, recall=0.23469061202638752, fmeasure=0.2324341602958753), high=Score(precision=0.24514334749025868, recall=0.23516229230273153, fmeasure=0.23293317648718587))}
Run question answering on the SQuAD dataset with huggingfaceSequence4.py.
nlp/huggingfaceSequence4.py
HF evaluator: {'exact_match': 0.22, 'f1': 6.222554522104021}
Start training, total steps: 79696
epoch 0, evaluation metric: squad
Evaluation result: {'exact_match': 63.36, 'f1': 77.10714274394753}
epoch 15, evaluation metric: squad
Evaluation result: {'exact_match': 62.08, 'f1': 75.95170387159816}
Run question answering on the SQuAD dataset with a custom BERT model in huggingfaceSequence4.py.
nlp/huggingfaceSequence4.py
epoch 0: {'exact_match': 0.7663197729422895, 'f1': 0.8230842005676446}
epoch 1: {'exact_match': 0.7947019867549668, 'f1': 0.8360138757489753}
epoch 5: {'exact_match': 0.8609271523178808, 'f1': 0.8607573442010529}
epoch 7: {'exact_match': 0.8703878902554399, 'f1': 0.8735414695679595}
Run open-ended question answering on the SQuAD v2 dataset with a T5 model in huggingfaceSequence5.py.
epoch 0, evaluation metric: squad_v2
Evaluation result: {'exact': 75.42, 'f1': 81.2122560694999, 'total': 5000, 'HasAns_exact': 73.17148125384142, 'HasAns_f1': 82.07169033420416, 'HasAns_total': 3254, 'NoAns_exact': 79.61053837342497, 'NoAns_f1': 79.61053837342497, 'NoAns_total': 1746, 'best_exact': 75.36, 'best_exact_thresh': 0.0, 'best_f1': 81.15225606949971, 'best_f1_thresh': 0.0}
......
epoch 15, evaluation metric: squad_v2
Evaluation result: {'exact': 76.38, 'f1': 82.8391234679352, 'total': 5000, 'HasAns_exact': 73.26367547633681, 'HasAns_f1': 83.18857324513708, 'HasAns_total': 3254, 'NoAns_exact': 82.18785796105384, 'NoAns_f1': 82.18785796105384, 'NoAns_total': 1746, 'best_exact': 76.32, 'best_exact_thresh': 0.0, 'best_f1': 82.77912346793502, 'best_f1_thresh': 0.0}
Use huggingfaceSequence5.py for translation training
python nlp/huggingfaceSequence5.py --data_name="wmt19" --subset=100000 --model_checkpoint="t5-base" --task="translation" --target_lang="zh" --source_prefix="translate English to Chinese: " --traintag="1206" --pretrained="/data/cmpe249-fa23/trainoutput/huggingface/t5-base/wmt19_1124/savedmodel.pth" --usehpc --gpuid=1 --total_epochs=80 --batch_size=64
Fine-tune "liam168/trans-opus-mt-en-zh" on a 5000-sample wmt19 subset
epoch 15, evaluation metric: sacrebleu
Evaluation result: {'score': 45.11294317735652, 'counts': [200677, 140858, 104286, 81486], 'totals': [283438, 274938, 266464, 258142], 'precisions': [70.8010217402042, 51.232641541001975, 39.1369941155278, 31.566347204251922], 'bp': 0.9805091347328498, 'sys_len': 283438, 'ref_len': 289017}
Signal processing, time series analysis, and AI-driven signal intelligence
Perform audio classification via hfclassify1.py:
{'eval_loss': 0.8612403869628906, 'eval_accuracy': 0.8342342342342343, 'eval_runtime': 244.7536, 'eval_samples_per_second': 9.07, 'eval_steps_per_second': 0.568, 'epoch': 8.0}
{'train_runtime': 26052.4526, 'train_samples_per_second': 6.133, 'train_steps_per_second': 0.384, 'train_loss': 1.0575269655821498, 'epoch': 8.0}
***** eval metrics *****
epoch = 8.0
eval_accuracy = 0.8342
eval_loss = 0.8612
eval_runtime = 0:04:41.22
eval_samples_per_second = 7.894
eval_steps_per_second = 0.494
Dataset: Common Language Dataset
Fine-tuning Notebook: signalAI/hfwave2vec2_finetune.ipynb
Pre-training Script: signalAI/hfwave2vec2.py
Training Command:
python signalAI/hfwave2vec2.py