Comprehensive collection of data mining, machine learning, and deep learning sample codes and tutorials
This repository contains educational materials and practical implementations for:
- SJSU CMPE255 - Data Mining
- SJSU CMPE257 - Machine Learning
- SJSU CMPE258 - Deep Learning
- SJSU CMPE249 - Intelligent Autonomous Systems
- Comprehensive tutorials from basic Python to advanced deep learning
- Large Language Models (LLMs) - newly added section
- Google Colab integration (some examples require SJSU Google account)
- Modern documentation with Sphinx and Furo theme
- Multi-platform support (local, HPC, cloud)
Documentation: ReadTheDocs
- DeepDataMiningLearning
- Features
- Table of Contents
- Setup & Installation
- Sphinx Documentation
- Python Data Analytics
- Cloud Data Analytics
- Machine Learning Algorithms
- Deep Learning with PyTorch
- Deep Learning with TensorFlow
- Unsupervised Learning
- NLP and Text Mining
- SignalAI
Install this Python package via:
git clone https://github.com/lkk688/DeepDataMiningLearning.git
cd DeepDataMiningLearning
pip install flit
flit install --symlink
After installation, you can import the package in Python:
import DeepDataMiningLearning
Windows Permission Error: If you encounter OSError: [WinError 1314] A required privilege is not held by the client, enable Developer Mode:
- Go to Settings → System → For developers → Turn on Developer Mode
Detailed Documentation: See docs/python.rst for a comprehensive package description.
Launch Jupyter Lab:
jupyter lab --ip 0.0.0.0 --no-browser --allow-root
Foundation tutorials covering Python basics, NumPy, Pandas, data visualization, and exploratory data analysis
- Colab Tutorials:
- Colab basic features: colabfeatures
- Tutorial of Colab working with external data: colabexternaldata
- Python tutorial code: Python_tutorial.ipynb -- colablink
- Python NumPy tutorial code: Python NumPy tutorial -- colablink
- Data Mining introduction code:
- Python data apps based on streamlit:
- Streamlit tutorial: streamlittest
- Streamlit connect to data sources: streamlitdata
- Streamlit connect to Google Big Query: streamlitbigquery
- Deploy Streamlit to Google Cloud App Engine: streamlitappengine
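The Streamlit items above cover the basics; for orientation, a minimal Streamlit data app looks roughly like the sketch below (the file name app.py and the simulated data are placeholders, not files from this repo). Run it with `streamlit run app.py`.

```python
# app.py -- minimal Streamlit data-app sketch (hypothetical example)
import numpy as np
import pandas as pd
import streamlit as st

st.title("Demo data app")

# Simulated data standing in for a real data source (CSV, BigQuery, ...)
df = pd.DataFrame(
    np.random.randn(100, 2).cumsum(axis=0), columns=["series_a", "series_b"]
)

# Interactive widget: pick which column to plot
column = st.selectbox("Column to plot", df.columns)
st.line_chart(df[column])

# Show the raw table below the chart
st.dataframe(df.head(20))
```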
Google Cloud Platform integration for scalable data mining and analytics
- Data Mining based on Google Cloud:
- Google Cloud access via Colab: colablink
- Configure Gcloud, Google Cloud Storage, Compute Engine, Colab Terminal
- Google BigQuery with Colab/Jupyter introduction BigQuery-intro.ipynb -- colablink
- BigQuery setup, create and check BigQuery datasets
- Pandas EDA and visualization based on Natality dataset from BigQuery
- Weather data from Google BigQuery (revised based on Google's official sample), curve fitting via scipy
- COVID19 Data EDA and Visualization based on Google BigQuery (Fall 2022 updated): colablink
- COVID NYT data from BigQuery: prediction of CA cases via fbprophet, states with highest cases, cases (moving average) over time curve, heatmap of confirmed cases, statewise mask-usage habits, joined with population data from the Census, zipcode data, and the impact of mask usage.
- COVID-19 JHU data: case map view, top10 states, moving average, ARIMA model, save dataframe back to BigQuery
- Additional Google BigQuery examples: colablink
- Chicago Crime Dataset, Austin Waste Dataset, COVID Racial Dataset (race graph)
- Student Samples
- BigQuery ML examples: colablink
- COVID, CREDIT_CARD_FRAUD, Predict penguin weight, Natality, US Census Dataset Classification, time-series forecasting from Google Analytics data
- BigQuery Bigframe and ML examples (2024 update): colablink
- BigQuery DataFrames: bigframes
- BigQuery ML for supervised and unsupervised learning with an SKLearn-style API
- BigQuery LLM for text generation and code generation (Pandas API code)
- Text embedding, K-means clustering of text embeddings
- Use the PaLM 2 LLM to summarize text/complaints
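The BigQuery notebooks above share a common pattern: authenticate, submit SQL through the BigQuery client, and load the result into a Pandas DataFrame for EDA. A minimal sketch of that pattern (the project ID is a placeholder; the public NYT COVID table is one of the datasets referenced above and its exact path may differ):

```python
# Minimal BigQuery-to-Pandas sketch (project/table names are placeholders)
from google.cloud import bigquery

# In Colab you would typically authenticate first:
# from google.colab import auth; auth.authenticate_user()
client = bigquery.Client(project="your-gcp-project-id")

sql = """
SELECT date, SUM(confirmed_cases) AS cases
FROM `bigquery-public-data.covid19_nyt.us_states`
GROUP BY date
ORDER BY date
"""

df = client.query(sql).to_dataframe()  # needs pandas (and db-dtypes) installed
print(df.tail())
```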
Classical machine learning implementations with scikit-learn and custom algorithms
- Machine Learning introduction:
- MLIntro-Regression -- colablink
- Linear Regression via Normal Equation and own SGD implementation
- Diamond dataset from Kaggle, EDA, and predict the price of diamond via normal equation, numpy.linalg
- MLIntro-RegressionSKLearn -- colablink
- Simulation data: linear regression, polynomial regression, Ridge and Lasso regression
- Diamond dataset: Categorical Feature Encoding, Linear Regression and other regressors
- Boston housing dataset: Regression Model for All the variables, Ridge and Lasso Regression
- Diabetes Dataset: Linear regression, Ridge and Lasso Regression
- AUTO MPG Dataset: 1) Linear regression; 2) Polynomial Regression via SKlearn PolynomialFeatures; 3) Multiple Linear Regression; 4) SKLearn Pipelines combine feature engineering stages and the model into a single "model"; 5) ColumnTransformer and FunctionTransformer; 6) Cross Validation (see the Pipeline sketch after this list)
- MLIntro2-classification.ipynb --colablink
- Breast Cancer Dataset, Iris Dataset, BigQuery US Census Income Dataset, multiple classifiers.
- DecisionTree -- colablink
- SKlearn DecisionTree algorithm on the Iris dataset, Breast Cancer dataset, make_moons dataset, and DecisionTreeRegressor. A brief discussion of Gini impurity.
- GradientBoosting -- colablink
- Gradient boosting process, Gradient boosting regressor with scikit-learn, Gradient boosting classifier with scikit-learn
- XGBoost -- colablink
- XGBoost introduction, US Census Income Dataset from Big Query, UCI Dermatology dataset
- Naive Bayes Classifier -- colablink
- Bayesian Classification, Naive Bayes in SKlearn, Gaussian Naive Bayes, Multinomial Naive Bayes for classifying text (20 Newsgroups)
- Naive Bayes for MNIST from Torchvision
- NVIDIA RAPIDS installation, NVIDIA Cuml Naive Bayes vs Sklearn Naive Bayes based on News Aggregator dataset
- Support Vector Machine -- colablink
- Support Vector Machines introduction, SVM for Breast Cancer Dataset, SVM for Face Recognition
- SVM for MNIST in Pytorch, Neural Network MNIST Pytorch
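The AUTO MPG item in the list above combines feature engineering and the model with SKLearn Pipelines and ColumnTransformer; a minimal sketch of that pattern (column names here are illustrative, not the actual AUTO MPG schema):

```python
# Sketch of the Pipeline + ColumnTransformer pattern used in the regression notebooks
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

numeric_cols = ["weight", "horsepower"]      # illustrative numeric features
categorical_cols = ["origin"]                # illustrative categorical feature

preprocess = ColumnTransformer(
    transformers=[
        ("num", Pipeline([("scale", StandardScaler()),
                          ("poly", PolynomialFeatures(degree=2, include_bias=False))]),
         numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ]
)

# The whole pipeline behaves like a single estimator:
# model.fit(X_train, y_train); model.predict(X_test)
model = Pipeline([("preprocess", preprocess), ("regressor", LinearRegression())])
```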
Modern deep learning tutorials and implementations using PyTorch framework
Complete tutorial series with detailed documentation: ReadTheDocs
Tutorial 1: PyTorch Basics - CMPE_pytorch1
CMPE_pytorch1 Colab link
- PyTorch installation, Tensors, Tensor functions, Arithmetic operations
Tutorial 2: Regression & Classification - CMPE_pytorch2
CMPE_pytorch2 Colab link
- Polynomial fitting with NumPy and PyTorch
- Linear Regression, Binary/Multi-class Classification
- Logistic Regression with scikit-learn datasets, PyTorch Logistic Regression
Tutorial 3: Automatic Differentiation - CMPE_pytorch3_autograd
CMPE_pytorch3 Colab link
- Autograd system, Computing Gradients, Jacobian Product
- Sine wave fitting example
- PyTorch NN package, Optimizer package, nn.Module usage
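Tutorial 3's sine-fitting example boils down to letting autograd compute the gradients and applying manual SGD updates; a minimal sketch of that loop:

```python
# Minimal autograd sketch: fit y = sin(x) with a cubic polynomial via gradient descent
import math
import torch

x = torch.linspace(-math.pi, math.pi, 2000)
y = torch.sin(x)

# Polynomial coefficients, tracked by autograd
a, b, c, d = (torch.randn((), requires_grad=True) for _ in range(4))

lr = 1e-6
for step in range(2000):
    y_pred = a + b * x + c * x**2 + d * x**3
    loss = (y_pred - y).pow(2).sum()
    loss.backward()                      # gradients of loss w.r.t. a, b, c, d
    with torch.no_grad():                # manual SGD update, then reset gradients
        for p in (a, b, c, d):
            p -= lr * p.grad
            p.grad = None
print(f"final loss: {loss.item():.3f}")
```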
Tutorial 4: Neural Networks & MNIST - CMPE_pytorch4_MNIST
CMPE_pytorch4 Colab link
- MNIST Dataset handling
- torch.nn.Module and torch.nn.functional
- nn.Sequential and OrderedDict
- Loss functions (NLLLoss, CrossEntropyLoss)
- Regularization techniques (L1, L2, Elastic Net)
- Compare CrossEntropyLoss, NLLLoss, and weight decay
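A minimal sketch of how Tutorial 4's pieces fit together: an nn.Sequential classifier, CrossEntropyLoss, and L2 regularization via the optimizer's weight_decay (layer widths and the random stand-in batch are arbitrary):

```python
# nn.Sequential + CrossEntropyLoss + weight_decay (L2) sketch
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),                # 28x28 MNIST image -> 784-dim vector
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10),          # 10 digit classes (raw logits)
)

criterion = nn.CrossEntropyLoss()                               # LogSoftmax + NLLLoss
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

# One illustrative training step on random data standing in for an MNIST batch
images, labels = torch.randn(64, 1, 28, 28), torch.randint(0, 10, (64,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(loss.item())
```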
Tutorial 5: CNN Fundamentals - CMPE_pytorch5_imageclassification
CMPE_pytorch5 Colab link
- Multi-Layer Perceptron for MNIST and FashionMNIST
- CNN Filters and Feature Maps
- CNN architecture for CIFAR Dataset
- Comparison of MLP and CNN for MNIST, FashionMNIST, and CIFAR
Tutorial 6: Classic CNN Architectures - CMPE_pytorch6_mlp2resnet
- Multi-layer Perceptron baseline
- LeNet architecture implementation
- AlexNet for large-scale image classification
- VGG network variations
- ResNet and residual connections
Tutorial 7: Transfer Learning & TorchVision - CMPE_pytorch7_torchvision
- Flower dataset processing and handling
- VGGNet Transfer Learning implementation
- Pre-trained models (EfficientNet, ResNet, DenseNet, ViT)
- Feature extraction vs fine-tuning strategies
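Tutorial 7 contrasts feature extraction (freeze the pre-trained backbone, train only a new head) with fine-tuning (update all weights at a lower learning rate); a minimal torchvision sketch of the feature-extraction side (assumes the torchvision >= 0.13 weights API; the 5-class head is arbitrary):

```python
# Feature-extraction sketch with a pre-trained torchvision ResNet
# (fine-tuning would instead keep requires_grad=True and use a smaller LR)
import torch
import torch.nn as nn
from torchvision import models

num_classes = 5  # e.g., a small flower dataset

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():          # freeze the pre-trained backbone
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new trainable head

# Only the new head's parameters are handed to the optimizer
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```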
Tutorial 8: Advanced Training with TIMM - CMPE_pytorch8_timm
- Oxford-IIIT Pet Dataset and Imagenette
- TIMM library integration and model customization
- Advanced data augmentation (RandAugment, CutMix, Mixup)
- Learning rate scheduling and optimization
- Exponential Moving Average (EMA) models
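Tutorial 8 builds on the timm library; a minimal sketch of creating a pre-trained timm model with a custom head and the matching input transforms (the model name and class count are just examples):

```python
# Minimal timm sketch: pre-trained backbone with a new classification head
import timm
import torch

# Any name from timm.list_models(pretrained=True) works here
model = timm.create_model("resnet50", pretrained=True, num_classes=37)  # e.g., Oxford-IIIT Pet

# Build the transforms that match the model's pretraining configuration
cfg = timm.data.resolve_data_config({}, model=model)
transform = timm.data.create_transform(**cfg)

x = torch.randn(2, 3, 224, 224)
print(model(x).shape)   # -> torch.Size([2, 37])
```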
Tutorial 9: Inference Optimization - CMPE_pytorch9_inferenceoptimization
- Model export to TorchScript
- LibTorch installation and usage
- TIMM model conversion to TorchScript
- ONNX export for cross-platform deployment
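Tutorial 9's export steps look roughly like this sketch, tracing a model to TorchScript for LibTorch and exporting the same model to ONNX (the untrained ResNet-18 and file names are placeholders):

```python
# Export sketch: TorchScript (trace) and ONNX from the same model
import torch
from torchvision import models

model = models.resnet18(weights=None).eval()
example = torch.randn(1, 3, 224, 224)

# TorchScript: trace the model and save it for LibTorch / C++ deployment
scripted = torch.jit.trace(model, example)
scripted.save("resnet18_traced.pt")

# ONNX: export for cross-platform runtimes (ONNX Runtime, TensorRT, ...)
torch.onnx.export(
    model, example, "resnet18.onnx",
    input_names=["input"], output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}, "logits": {0: "batch"}},
)
```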
Tutorial 10: Hugging Face Integration - CMPE_pytorch10_huggingfaceimage
- Hugging Face Transformers for computer vision
- Training on Food-101, CIFAR-10, and Flower datasets
- Pre-trained vision models from Hugging Face Hub
- Pytorch introduction code (archived):
- Pytorch Introductions, Tensors, and Autograd: colablink
- Pytorch Regression and Logistic Regression: colablink
- Pytorch Simple Neural Networks: colab
- Pytorch NN modules: pytorch_nn.ipynb
- Pytorch advanced image classification:
- Pytorch image classification introduction (MNIST, CNN filters, CIFAR, VGGNet, Flowers): colablink
- Pytorch Single GPU image classification with/without automatic mixed precision (AMP) training: singleGPU (a minimal AMP sketch follows this list)
- Pytorch Multi-GPU image classification: multiGPU
- Pytorch Multi-GPU DDP test: testTorchDDP
- Pytorch Torchvision image classification (Efficientnet) notebook on HPC: torchvisionHPC.ipynb
- Pytorch Torchvision vision transformer (ViT) notebook on HPC: torchvisionvitHPC.ipynb
- Pytorch ViT implement from scratch on HPC: ViTHPC.ipynb
- Pytorch ImageNet classification example: imagenet
- Pytorch inference example for top-k class: inference.py
- TIMM models: testtimm.ipynb
- Huggingface Transformers for Image: hfvisionmain.py
- Huggingface Images via Transformers: huggingfaceimage.ipynb
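The single-GPU script above trains with and without automatic mixed precision (AMP); the core AMP training-step pattern is roughly this sketch (the tiny model and synthetic batch are stand-ins for a real classifier and dataloader):

```python
# Automatic mixed precision (AMP) training-step sketch with synthetic data
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

for step in range(10):
    images = torch.randn(32, 3, 32, 32, device=device)   # stand-in for a CIFAR batch
    labels = torch.randint(0, 10, (32,), device=device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=(device == "cuda")):
        loss = criterion(model(images), labels)           # forward pass in mixed precision
    scaler.scale(loss).backward()     # scaled backward avoids fp16 underflow
    scaler.step(optimizer)            # unscales gradients, then steps the optimizer
    scaler.update()
print(loss.item())
```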
- Advanced Multi-Modal Image Classification: githubrepo
- General-purpose framework for all-in-one image classification for Tensorflow and Pytorch
- Support for multiple datasets: imagenet_blurred, tiny-imagenet-200, hymenoptera_data, CIFAR10, MNIST, flower_photos
- Support for multiple custom models ('mlpmodel1', 'lenet', 'alexnet', 'resnetmodel1', 'customresnet', 'vggmodel1', 'vggcustom', 'cnnmodel1'), as well as all models from Torchvision and TorchHub
- Support for HPC training and evaluation
Other Deep Learning sample code based on Pytorch (under the "DeepDataMiningLearning" folder)
- Siamese network: siamese_network
- TensorRT example: tensorrt.ipynb
- Object detection (other repo)
- MultiModalDetector
- myyolov7: Adds YOLOv5 models to YOLOv7; trained on the COCO and WaymoCOCO datasets.
- myyolov5: My fork of YOLOv5; converts COCO to YOLO format and restructures the code as the base for YOLOv4, YOLOv5, and ScaledYOLOv4; trained on the COCO and WaymoCOCO datasets.
- WaymoObjectDetection
- Waymo Dataset Conversion to COCO format: WaymoCOCO
- torchvision_waymococo_train.py: performs Pytorch FasterRCNN training on the converted Waymo COCO-format data; this version can be applied to any dataset with COCO-format annotations
- WaymoCOCODetectron2train.py: WaymoCOCO training based on Detectron2
- mymmdetection2dtrain.py: Object Detection training and evaluation based on MMdetection2D
- CustomDetectron2
Legacy TensorFlow implementations for reference and learning. Deep learning notebooks (the Colab links are preferred):
- Tensorflow introduction code: CMPE-Tensorflow1.ipynb -- colablink
- Tensorflow image classification:
- Road sign data from Kaggle example: Tensorflow-Roadsignclassification.ipynb, colablink
- Flower dataset example with TF Dataset, TFRecord, Google Cloud Storage, TPU/GPU acceleration: colablink
Dimensionality reduction, clustering, and manifold learning techniques
- Unsupervised Learning Jupyter notebooks
- PCA: colablink
- Affine Transformation via Matrix Application, eigenvalues and eigenvectors, eigendecomposition
- Numpy/SKlearn SVD, SVD for linear regression
- Principal Component Analysis in SKLearn, PCA as dimensionality reduction, PCA for SkLearn Iris dataset
- PCA for Image Compression, PCA for digits and noise filtering, Grey Image Example, Color Image Example, eigenfaces, PCA vs LDA vs NCA
- Manifold Learning: colablink
- Multidimensional Scaling (MDS), Locally Linear Embedding (LLE), Isomap Embedding, T-distributed Stochastic Neighbor Embedding (t-SNE) for the HELLO, S-Curve, and Swiss roll datasets; Isomap on faces; Regression with Manifold Learning
- Clustering: colablink
- K-Means, Gaussian Mixture Models, Spectral Clustering, DBSCAN
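A minimal sketch of the PCA-as-dimensionality-reduction idea from the notebook above, projecting the SKLearn Iris dataset onto two principal components:

```python
# PCA dimensionality-reduction sketch on the SKLearn Iris dataset
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)       # PCA is scale-sensitive

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)                 # 4 features -> 2 components

print("explained variance ratio:", pca.explained_variance_ratio_)
print("projected shape:", X_2d.shape)              # (150, 2)
```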
Natural Language Processing and Large Language Models (LLMs) implementations
Text Representations - Colab Notebook
- One-Hot encoding, Bag-of-Words, TF-IDF
- Word2Vec with Gensim (Wikipedia and Shakespeare examples)
- Data gathering from Google and WordCloud visualization
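A minimal sketch of the Bag-of-Words and TF-IDF representations listed above (the three-sentence corpus is made up):

```python
# Bag-of-Words and TF-IDF sketch on a toy corpus
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "data mining finds patterns in data",
    "deep learning learns representations from data",
    "text mining applies data mining to text",
]

bow = CountVectorizer()
X_bow = bow.fit_transform(corpus)          # sparse term-count matrix
print(bow.get_feature_names_out())

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)      # counts reweighted by inverse document frequency
print(X_tfidf.shape)                       # (3 documents, vocabulary size)
```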
Text Extraction & NLTK - Colab Notebook
- Text extraction using Textract
- NLTK preprocessing pipelines
TensorFlow Text Processing - Colab Notebook
- Keras embedding layers
- Sentiment classification
- Skip-gram Word2Vec implementation
Text Classification Models - Colab Notebook
- RNN, LSTM, Transformer, BERT architectures
Twitter NLP Pipeline - Colab Notebook
- NLTK preprocessing
- LSTM, Bi-LSTM, GRU models
- BERT fine-tuning
- Recommendation
State-of-the-art NLP models using Hugging Face Transformers library
Foundation Tutorials:
- HuggingfaceTransformers.ipynb - Introduction notebook
- huggingfacetest.py - Basic usage examples
- hfdataset.py - Dataset handling
- huggingfaceHPC.ipynb - HPC training setup
- huggingfaceHPCdata.py - HPC data processing
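For orientation, the quickest way to exercise the Transformers library used by these scripts is the pipeline API; a minimal sketch (the sentiment task and default model are illustrative, not something the repo prescribes):

```python
# Minimal Hugging Face Transformers sketch using the pipeline API
from transformers import pipeline

# Downloads a default sentiment model on first use; pass model=... to pin one
classifier = pipeline("sentiment-analysis")
print(classifier("Data mining with deep learning is surprisingly fun."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```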
BERT Applications:
- BERTMTLfakehate.py - Multi-task learning for fake news and hate speech
- MLTclassifier.py - Multi-label text classification
- huggingfaceClassifierNER.ipynb - Classification and NER
Multi-modal Classification:
- huggingfaceclassifier2.py - Advanced multi-modal classifier
- huggingfaceclassifier.py - Basic multi-modal classifier
Translation & Summarization:
- huggingfaceSequence.ipynb - Translation and summarization models
Question Answering:
- huggingfaceQA.py - Q&A system implementation
Conversational AI:
- huggingfacechatbot.ipynb - Chatbot development
Custom Implementation:
- torchtransformer.py - Transformer from scratch in PyTorch
Open Source LLMs
- BERTLM.ipynb
- Masked Language Modeling: huggingfaceLM.ipynb
- llama2
LLMs Apps based on OpenAI API
LLMs Apps based on LangChain
Train a basic language model with plain Pytorch and the Torchtext WikiText2 dataset on HPC.
python nlp/torchtransformer.py
| epoch 1 | 2800/ 2928 batches | lr 5.00 | ms/batch 5.58 | loss 3.31 | ppl 27.49
-----------------------------------------------------------------------------------------
| end of epoch 1 | time: 24.00s | valid loss 1.96 | valid ppl 7.08
-----------------------------------------------------------------------------------------
| epoch 2 | 200/ 2928 batches | lr 4.75 | ms/batch 5.84 | loss 3.07 | ppl 21.57
| epoch 2 | 2800/ 2928 batches | lr 4.75 | ms/batch 5.49 | loss 2.58 | ppl 13.26
-----------------------------------------------------------------------------------------
| end of epoch 2 | time: 1655.94s | valid loss 1.52 | valid ppl 4.57
-----------------------------------------------------------------------------------------
| epoch 3 | 200/ 2928 batches | lr 4.51 | ms/batch 5.04 | loss 2.41 | ppl 11.15
-----------------------------------------------------------------------------------------
| end of epoch 3 | time: 15.41s | valid loss 1.44 | valid ppl 4.22
-----------------------------------------------------------------------------------------
=========================================================================================
| End of training | test loss 1.40 | test ppl 4.06
=========================================================================================
Train Masked Language model:
(mycondapy310) [010796032@cs001 DeepDataMiningLearning]$ python nlp/huggingfaceLM2.py --data_name="eli5" --model_checkpoint="distilroberta-base" --task="CLM" --subset=5000 --traintag="1115CLM" --usehpc=True --gpuid=1 --batch_size=32 --learningrate=2e-5
python nlp/huggingfaceLM2.py
data_type=huggingface data_name=eli5 dataconfig= subset=0 data_path=/data/cmpe249-fa23/Huggingfacecache model_checkpoint=distilroberta-base task=MLM unfreezename= outputdir=./output traintag=1116MLM training=True usehpc=False gpuid=0 total_epochs=8 save_every=2 batch_size=32 learningrate=2e-05
Trainoutput folder: ./output\distilroberta-base\eli5_1116MLM
....
Epoch 1: Perplexity: 12.102828644322578 | 6590/26360 [16:06:09<34:50:06, 6.34s/it]
Epoch 2: Perplexity: 15.187707787848385 | 9885/26360 [24:00:07<28:57:25, 6.33s/it]
Epoch 3: Perplexity: 15.063201196763071 | 13180/26360 [31:52:08<23:12:51, 6.34s/it]
Epoch 4: Perplexity: 16.583895970053355 | 16475/26360 [39:44:32<17:23:28, 6.33s/it]
Epoch 5: Perplexity: 16.27479412837067 | 19770/26360 [47:36:46<11:34:43, 6.33s/it]
Epoch 6: Perplexity: 16.424729093343636 | 23065/26360 [55:28:38<5:47:18, 6.32s/it]
Epoch 7: Perplexity: 17.22636450834783
Train GPT2 language models
(mycondapy310) [010796032@cs001 DeepDataMiningLearning]$ python nlp/huggingfaceLM2.py --model_checkpoint="gpt2" --task="CLM" --traintag="1115gpt2" --usehpc=True --gpuid=2 --batch_size=16
Train the Llama-2 7B model, unfreezing only the last layer: "model.layers.31" (needs 500GB) or "lm_head" (needs 40GB)
(mycondapy310) [010796032@cs001 DeepDataMiningLearning]$ python nlp/huggingfaceLM2.py --model_checkpoint="Llama-2-7b-chat-hf" --task="CLM" --unfreezename="lm_head" --traintag="1115llama2" --usehpc=True --gpuid=2 --batch_size=8
python nlp/huggingfaceLM2.py --model_checkpoint="Llama-2-7b-chat-hf" --pretrained=="/data/cmpe249-fa23/trainoutput/huggingface/Llama-2-7b-chat-hf/eli5_1115llama2/savedmodel.pth" --task="CLM" --unfreezename="lm_head" --traintag="1119llama2" --usehpc=True --gpuid=0 --batch_size=8
.....
Epoch 0: Perplexity: 9.858825392857694 | 2627/2627 [12:39:17<00:00, 3.30s/it]
Epoch 1: Perplexity: 10.051054027867561 | 21014/84056 [22:50:31<56:09:49, 3.21s/it]
Epoch 2: Perplexity: 10.181400762228291 | 31521/84056
Epoch 0: Perplexity: 9.289763256151375
Epoch 1: Perplexity: 9.530650993830372
Epoch 2: Perplexity: 9.692566051540275
Train translation models based on huggingfaceSequence
python nlp/huggingfaceSequence.py --data_name="kde4" --model_checkpoint="Helsinki-NLP/opus-mt-en-fr" --task="Seq2SeqLM" --traintag="1116" --usehpc=True --gpuid=0 --batch_size=8
epoch 0, BLEU score: 51.78
epoch 1, BLEU score: 52.73
epoch 2, BLEU score: 54.
epoch 3, BLEU score: 54.
epoch 4, BLEU score: 55.
epoch 5, BLEU score: 55.
epoch 6, BLEU score: 54.
epoch 7, BLEU score: 55.
python nlp/huggingfaceSequence.py --data_name="opus100" --model_checkpoint="facebook/wmt21-dense-24-wide-en-x" --task="Seq2SeqLM" --traintag="1121" --usehpc=True --gpuid=1 --batch_size=8
(mycondapy310) [010796032@cs001 DeepDataMiningLearning]$ python nlp/huggingfaceSequence2.py --data_name="opus100" --subset=0 --model_checkpoint="Helsinki-NLP/opus-mt-en-zh" --task="Seq2SeqLM" --traintag="1122" --evaluate="" --usehpc=True --gpuid=1 --batch_size=32
python nlp/huggingfaceSequence2.py --data_name="opus100" --subset=0 --model_checkpoint="Helsinki-NLP/opus-mt-en-zh" --pretrained="/data/cmpe249-fa23/trainoutput/huggingface/Helsinki-NLP/opus-mt-en-zh/opus100_1122/savedmodel.pth" --task="Seq2SeqLM" --target_lang="zh" --traintag="1122" --evaluate=True --usehpc=True --gpuid=1 --total_epochs=16 --batch_size=32
.....
epoch 14, BLEU score: 48.55
epoch 15, BLEU score: 48.53
python nlp/huggingfaceSequence2.py --data_name="wmt19" --subset=0 --model_checkpoint="Helsinki-NLP/opus-mt-en-zh" --task="Seq2SeqLM" --target_lang="zh" --traintag="1123" --evaluate="localevaluate" --usehpc=True --gpuid=2 --total_epochs=16 --batch_size=32
(mycondapy310) [010796032@cs002 DeepDataMiningLearning]$ python nlp/huggingfaceSequence2.py --data_name="wmt19" --subset=10000 --model_checkpoint="t5-base" --task="Seq2SeqLM" --target_lang="zh" --traintag="1123" --useHFaccelerator=True --evaluate="localevaluate" --usehpc=True --gpuid=2 --total_epochs=16 --batch_size=64
.....
Trainoutput folder: /data/cmpe249-fa23/trainoutput/huggingface/t5-base/wmt19_1123
epoch 0, BLEU score: 2.66
epoch 1, BLEU score: 3.99
epoch 2, BLEU score: 5.31
epoch 3, BLEU score: 6.64
epoch 4, BLEU score: 8.07
epoch 5, BLEU score: 9.51
epoch 9, BLEU score: 14.39
epoch 14, BLEU score: 18.99
epoch 15, BLEU score: 19.77
(mycondapy310) [010796032@cs002 DeepDataMiningLearning]$ python nlp/huggingfaceSequence2.py --data_name="wmt19" --subset=50000 --model_checkpoint="t5-base" --task="Seq2SeqLM" --target_lang="zh" --traintag="1124" --pretrained="/data/cmpe249-fa23/trainoutput/huggingface/t5-base/wmt19_1123/savedmodel.pth" --useHFaccelerator=True --evaluate="localevaluate" --usehpc=True --gpuid=2 --total_epochs=32 --batch_size=64
epoch 16, BLEU score: 50.83
epoch 31, BLEU score: 56.57
python nlp/huggingfaceSequence2.py --data_name="wmt19" --subset=50000 --model_checkpoint="t5-base" --task="Seq2SeqLM" --target_lang="zh" --source_prefix="translate English to Chinese: " --traintag="1124" --pretrained="/data/cmpe249-fa23/trainoutput/huggingface/t5-base/wmt19_1124/savedmodel.pth" --useHFaccelerator=True --evaluate="localevaluate" --usehpc=True --total_epochs=48 --batch_size=64
Trainoutput folder: /data/cmpe249-fa23/trainoutput/huggingface/t5-base/wmt19_1124
epoch 32, BLEU score: 9.67
epoch 33, BLEU score: 13.95
epoch 35, BLEU score: 20.78
epoch 45, BLEU score: 37.60
epoch 47, BLEU score: 39.47
python nlp/huggingfaceSequence2.py --data_name="wmt19" --subset=50000 --model_checkpoint="t5-base" --task="Seq2SeqLM" --target_lang="zh" --source_prefix="translate English to Chinese: " --traintag="1124" --pretrained="/data/cmpe249-fa23/trainoutput/huggingface/t5-base/wmt19_1124/savedmodel.pth" --useHFaccelerator=True --evaluate="localevaluate" --usehpc=True --total_epochs=64 --batch_size=64
epoch 48, BLEU score: 60.04
epoch 63, BLEU score: 59.26
$ python nlp/huggingfaceSequence2.py --data_name="wmt19" --subset=0 --model_checkpoint="t5-base" --task="Seq2SeqLM" --target_lang="zh" --source_prefix="translate English to Chinese: " --traintag="1124" --pretrained="/data/cmpe249-fa23/trainoutput/huggingface/t5-base/wmt19_1124/savedmodel.pth" --useHFaccelerator=False --evaluate="localevaluate" --usehpc=True --gpuid=2 --total_epochs=80 --batch_size=64
Train T5-base on a local computer
/nlp/huggingfaceSequence2.py
data_type=huggingface data_name=opus100 dataconfig= subset=0 data_path=/data/cmpe249-fa23/Huggingfacecache model_checkpoint=t5-base task=Seq2SeqLM evaluate=True source_lang=en target_lang=zh source_prefix=None pretrained= unfreezename= outputdir=./output traintag=1122 training=True usehpc=False useHFaccelerator=False gpuid=0 total_epochs=8 save_every=2 batch_size=16 learningrate=2e-05 lr_scheduler_type=linear weight_decay=0.0 gradient_accumulation_steps=1 pad_to_max_length=True max_source_length=128 max_target_length=128 num_beams=1
Trainoutput folder: ./output\t5-base\opus100_1122
epoch 0, BLEU score: 48.70
epoch 1, BLEU score: 50.81
epoch 2, BLEU score: 50.24
epoch 3, BLEU score: 51.93
epoch 4, BLEU score: 52.34
epoch 5, BLEU score: 52.54
epoch 6, BLEU score: 52.71
epoch 7, BLEU score: 52.91
(mycondapy39) PS C:\Users\lkk68\Documents\GitHub\DeepDataMiningLearning> cat .\output\t5-base\opus100_1122\eval_results.json
{"eval_bleu": 52.909234849408264}
nlp/huggingfaceSequence2.py
data_type=huggingface data_name=opus_books dataconfig= subset=0 data_path=/data/cmpe249-fa23/Huggingfacecache model_checkpoint=t5-base task=Seq2SeqLM evaluate=localevaluate source_lang=en target_lang=fr source_prefix=None pretrained= unfreezename= outputdir=./output traintag=1124 training=True usehpc=False useHFaccelerator=False gpuid=0 total_epochs=16 save_every=2 batch_size=16 learningrate=2e-05 lr_scheduler_type=linear weight_decay=0.0 gradient_accumulation_steps=1 pad_to_max_length=True max_source_length=128 max_target_length=128 num_beams=1
Trainoutput folder: ./output\t5-base\opus_books_1124
HF evaluator: 24.46
epoch 0, BLEU score: 24.47
HF evaluator: 26.00
epoch 14, BLEU score: 26.00
HF evaluator: 25.89
epoch 15, BLEU score: 25.90
Train a summarization model on the "cnn_dailymail" dataset
(mycondapy310) [010796032@cs001 DeepDataMiningLearning]$ python nlp/huggingfaceSequence3.py --data_name="cnn_dailymail" --subset=0 --model_checkpoint="t5-base" --training --usehpc --task="summarization" --source_prefix="summarize: " --traintag="1125" --gpuid=1 --total_epochs=8 --batch_size=32
useHFevaluator: False
dualevaluator: False
data_type=huggingface data_name=cnn_dailymail dataconfig= subset=0.0 data_path=/data/cmpe249-fa23/Huggingfacecache model_checkpoint=t5-base task=summarization hfevaluate=False dualevaluate=False source_lang=en target_lang=fr source_prefix=summarize: pretrained= unfreezename= outputdir=./output traintag=1125 training=True usehpc=True useHFaccelerator=False gpuid=1 total_epochs=8 save_every=2 batch_size=32 learningrate=2e-05 lr_scheduler_type=linear weight_decay=0.0 gradient_accumulation_steps=1 pad_to_max_length=True max_source_length=128 max_target_length=128 num_beams=1
Trainoutput folder: /data/cmpe249-fa23/trainoutput/huggingface/t5-base/cnn_dailymail_1125
epoch 0, evaluation metric: rouge
Evaluation result: {'rouge1': AggregateScore(low=Score(precision=0.41197173103605056, recall=0.3119407931701639, fmeasure=0.3419831177812576), mid=Score(precision=0.41485220723500893, recall=0.3142588821867073, fmeasure=0.3441474199740058), high=Score(precision=0.417600698303463, recall=0.3165054333064982, fmeasure=0.3464475097686251)), 'rouge2': AggregateScore(low=Score(precision=0.17079493326536863, recall=0.12954017819776567, fmeasure=0.14168846158539045), mid=Score(precision=0.1732732332588367, recall=0.13143856889123576, fmeasure=0.14365393002489873), high=Score(precision=0.17556212138667857, recall=0.1333017602307039, fmeasure=0.14553710285584648)), 'rougeL': AggregateScore(low=Score(precision=0.2965678275511713, recall=0.22464618504694173, fmeasure=0.24595299928001602), mid=Score(precision=0.2989426482943214, recall=0.22661167405971072, fmeasure=0.24784869235990342), high=Score(precision=0.30157116940828527, recall=0.22857528928797705, fmeasure=0.24990879681324848)), 'rougeLsum': AggregateScore(low=Score(precision=0.29642059763679396, recall=0.22452637387862667, fmeasure=0.2458545440066565), mid=Score(precision=0.298870626902608, recall=0.22656017215162805, fmeasure=0.24780842037777753), high=Score(precision=0.3013588263275077, recall=0.22874002983645225, fmeasure=0.24986052268174044))}
.....
epoch 7, evaluation metric: rouge
Evaluation result: {'rouge1': AggregateScore(low=Score(precision=0.4106620116159846, recall=0.3359116786022911, fmeasure=0.3571395048710354), mid=Score(precision=0.4113719796401594, recall=0.3365233125614775, fmeasure=0.3577153054159934), high=Score(precision=0.41210309679230933, recall=0.3371437303658093, fmeasure=0.3583162407770864)), 'rouge2': AggregateScore(low=Score(precision=0.17094983781529516, recall=0.140263702081492, fmeasure=0.1487497359572444), mid=Score(precision=0.17152532221996442, recall=0.14077116290613728, fmeasure=0.1492659697570951), high=Score(precision=0.17217316198104332, recall=0.14133598283522728, fmeasure=0.14982901584825192)), 'rougeL': AggregateScore(low=Score(precision=0.29437670375737585, recall=0.241532114907138, fmeasure=0.2562224943410107), mid=Score(precision=0.295004497912973, recall=0.24202153711108693, fmeasure=0.256706856828388), high=Score(precision=0.29558396252033226, recall=0.2425496385594932, fmeasure=0.25720842817590284)), 'rougeLsum': AggregateScore(low=Score(precision=0.2943661396563572, recall=0.24149928244505112, fmeasure=0.2561939698587155), mid=Score(precision=0.29501206428652565, recall=0.24205477312718807, fmeasure=0.25673184483277667), high=Score(precision=0.2956055704727816, recall=0.2425905810654754, fmeasure=0.25723630538221925))}
Train a summarization model on the "billsum" dataset
python nlp/huggingfaceSequence3.py --data_name="billsum" --subset=0 --model_checkpoint="t5-base" --training --usehpc --task="summarization" --source_prefix="summarize: " --traintag="1125" --gpuid=2 --total_epochs=8 --batch_size=64
epoch 0, evaluation metric: rouge
Evaluation result: {'rouge1': AggregateScore(low=Score(precision=0.47626942983509496, recall=0.25432053054933895, fmeasure=0.3104459523870372), mid=Score(precision=0.48118215557348354, recall=0.25758086912668254, fmeasure=0.31351244843801257), high=Score(precision=0.48633004345160047, recall=0.2610377756856256, fmeasure=0.31666293379599264)), 'rouge2': AggregateScore(low=Score(precision=0.22553581027732506, recall=0.1163356040224796, fmeasure=0.1426990222648773), mid=Score(precision=0.22980182467297072, recall=0.11916438705516172, fmeasure=0.14559973720411068), high=Score(precision=0.23448683629864495, recall=0.12186100986547159, fmeasure=0.14852517085299424)), 'rougeL': AggregateScore(low=Score(precision=0.37475945811360656, recall=0.20056913183887007, fmeasure=0.24410014882029277), mid=Score(precision=0.3791703619589465, recall=0.2035902141913645, fmeasure=0.24684120728025757), high=Score(precision=0.38361974259418913, recall=0.20659199136456374, fmeasure=0.24957227859403597)), 'rougeLsum': AggregateScore(low=Score(precision=0.3748793629512802, recall=0.200607724566369, fmeasure=0.24403997308680897), mid=Score(precision=0.379214986503034, recall=0.20357105315951063, fmeasure=0.24678602914830827), high=Score(precision=0.38368049681408073, recall=0.20677635875973946, fmeasure=0.2497749989951042))}
.....
epoch 7, evaluation metric: rouge
Evaluation result: {'rouge1': AggregateScore(low=Score(precision=0.4762322399523528, recall=0.3612750168693367, fmeasure=0.3859539337308937), mid=Score(precision=0.47741449894622395, recall=0.36223979353915375, fmeasure=0.3867387680371833), high=Score(precision=0.47857623413135664, recall=0.36315881475914685, fmeasure=0.387529008724504)), 'rouge2': AggregateScore(low=Score(precision=0.23795846378692775, recall=0.17739298212614543, fmeasure=0.1895061898823578), mid=Score(precision=0.23901236113717203, recall=0.17821036280792862, fmeasure=0.1902598464683951), high=Score(precision=0.24012950529243016, recall=0.1789825350527308, fmeasure=0.19103010704737328)), 'rougeL': AggregateScore(low=Score(precision=0.38008605052931843, recall=0.28951287826449035, fmeasure=0.307987126768515), mid=Score(precision=0.3811360590765148, recall=0.29038102311603803, fmeasure=0.30870601375792006), high=Score(precision=0.38221337267972383, recall=0.2911927830141367, fmeasure=0.3094361849338897)), 'rougeLsum': AggregateScore(low=Score(precision=0.3800857948090643, recall=0.2895777984554257, fmeasure=0.3080188262526704), mid=Score(precision=0.38110308526126024, recall=0.2903936893613658, fmeasure=0.3087163027737505), high=Score(precision=0.3821455084830249, recall=0.29123147942293837, fmeasure=0.3094164695199439))}
Train a summarization model on the "xsum" dataset
(mycondapy310) [010796032@cs001 DeepDataMiningLearning]$ python nlp/huggingfaceSequence3.py --data_name="xsum" --subset=0 --model_checkpoint="t5-base" --training --usehpc --task="summarization" --source_prefix="summarize: " --traintag="1125" --gpuid=1 --total_epochs=8 --batch_size=64
epoch 0, evaluation metric: rouge
Evaluation result: {'rouge1': AggregateScore(low=Score(precision=0.2206634425318806, recall=0.27398080027290334, fmeasure=0.23345078599469463), mid=Score(precision=0.2228479631573893, recall=0.2758314532199041, fmeasure=0.2352387063491203), high=Score(precision=0.22497474810005727, recall=0.27769139479498234, fmeasure=0.23719704189066906)), 'rouge2': AggregateScore(low=Score(precision=0.04855549563663333, recall=0.053939101970654914, fmeasure=0.04875180161702199), mid=Score(precision=0.049879370509959026, recall=0.05516244414389607, fmeasure=0.04995842715629054), high=Score(precision=0.05118590600673647, recall=0.05640622198150646, fmeasure=0.05111651526817903)), 'rougeL': AggregateScore(low=Score(precision=0.1618905690819521, recall=0.19780661742397618, fmeasure=0.16987613018217562), mid=Score(precision=0.16369315145103713, recall=0.19940773965682818, fmeasure=0.17143884336484203), high=Score(precision=0.16554847486855678, recall=0.20099105646119897, fmeasure=0.17303926810412723)), 'rougeLsum': AggregateScore(low=Score(precision=0.16171036371143727, recall=0.19775741367206912, fmeasure=0.16972996097941762), mid=Score(precision=0.16364819272318623, recall=0.19936460332720274, fmeasure=0.1713745081258375), high=Score(precision=0.16540503402328743, recall=0.20093144657877907, fmeasure=0.17297056660820934))}
....
epoch 7, evaluation metric: rouge
Evaluation result: {'rouge1': AggregateScore(low=Score(precision=0.3157722202905392, recall=0.30514034915305777, fmeasure=0.30098661341663313), mid=Score(precision=0.3164460427364083, recall=0.30573870098459877, fmeasure=0.3015941401174891), high=Score(precision=0.31710221504502506, recall=0.30627353921164463, fmeasure=0.3021231398386561)), 'rouge2': AggregateScore(low=Score(precision=0.09545526361853192, recall=0.08805500161867533, fmeasure=0.08929094188051703), mid=Score(precision=0.0959138137320461, recall=0.08846040412480724, fmeasure=0.08970127642926173), high=Score(precision=0.09638196118158575, recall=0.08887820756655522, fmeasure=0.09011808402709783)), 'rougeL': AggregateScore(low=Score(precision=0.2439907968202854, recall=0.23420304190160135, fmeasure=0.23192429294508443), mid=Score(precision=0.2445882204337065, recall=0.23468338778240327, fmeasure=0.23243230200308637), high=Score(precision=0.2451722573264371, recall=0.23517123971453874, fmeasure=0.2329239687124055)), 'rougeLsum': AggregateScore(low=Score(precision=0.24400824561132867, recall=0.23421060757686038, fmeasure=0.2319498933428253), mid=Score(precision=0.24459781572922448, recall=0.23469061202638752, fmeasure=0.2324341602958753), high=Score(precision=0.24514334749025868, recall=0.23516229230273153, fmeasure=0.23293317648718587))}
Run question answering on the SQuAD dataset with huggingfaceSequence4.py.
nlp/huggingfaceSequence4.py
HF evaluator: {'exact_match': 0.22, 'f1': 6.222554522104021}
Start training, total steps: 79696
epoch 0, evaluation metric: squad
Evaluation result: {'exact_match': 63.36, 'f1': 77.10714274394753}
epoch 15, evaluation metric: squad
Evaluation result: {'exact_match': 62.08, 'f1': 75.95170387159816}
Run question answering on the SQuAD dataset with a custom BERT model in huggingfaceSequence4.py.
nlp/huggingfaceSequence4.py
epoch 0: {'exact_match': 0.7663197729422895, 'f1': 0.8230842005676446}
epoch 1: {'exact_match': 0.7947019867549668, 'f1': 0.8360138757489753}
epoch 5: {'exact_match': 0.8609271523178808, 'f1': 0.8607573442010529}
epoch 7: {'exact_match': 0.8703878902554399, 'f1': 0.8735414695679595}
Run open-ended question answering on the SQuAD v2 dataset with a T5 model in huggingfaceSequence5.py.
epoch 0, evaluation metric: squad_v2
Evaluation result: {'exact': 75.42, 'f1': 81.2122560694999, 'total': 5000, 'HasAns_exact': 73.17148125384142, 'HasAns_f1': 82.07169033420416, 'HasAns_total': 3254, 'NoAns_exact': 79.61053837342497, 'NoAns_f1': 79.61053837342497, 'NoAns_total': 1746, 'best_exact': 75.36, 'best_exact_thresh': 0.0, 'best_f1': 81.15225606949971, 'best_f1_thresh': 0.0}
......
epoch 15, evaluation metric: squad_v2
Evaluation result: {'exact': 76.38, 'f1': 82.8391234679352, 'total': 5000, 'HasAns_exact': 73.26367547633681, 'HasAns_f1': 83.18857324513708, 'HasAns_total': 3254, 'NoAns_exact': 82.18785796105384, 'NoAns_f1': 82.18785796105384, 'NoAns_total': 1746, 'best_exact': 76.32, 'best_exact_thresh': 0.0, 'best_f1': 82.77912346793502, 'best_f1_thresh': 0.0}
Use huggingfaceSequence5.py for translation training
python nlp/huggingfaceSequence5.py --data_name="wmt19" --subset=100000 --model_checkpoint="t5-base" --task="translation" --target_lang="zh" --source_prefix="translate English to Chinese: " --traintag="1206" --pretrained="/data/cmpe249-fa23/trainoutput/huggingface/t5-base/wmt19_1124/savedmodel.pth" --usehpc --gpuid=1 --total_epochs=80 --batch_size=64
Fine-tune "liam168/trans-opus-mt-en-zh" on a 5000-sample wmt19 subset
epoch 15, evaluation metric: sacrebleu
Evaluation result: {'score': 45.11294317735652, 'counts': [200677, 140858, 104286, 81486], 'totals': [283438, 274938, 266464, 258142], 'precisions': [70.8010217402042, 51.232641541001975, 39.1369941155278, 31.566347204251922], 'bp': 0.9805091347328498, 'sys_len': 283438, 'ref_len': 289017}
Signal processing, time series analysis, and AI-driven signal intelligence
Perform audio classification via hfclassify1.py:
{'eval_loss': 0.8612403869628906, 'eval_accuracy': 0.8342342342342343, 'eval_runtime': 244.7536, 'eval_samples_per_second': 9.07, 'eval_steps_per_second': 0.568, 'epoch': 8.0}
{'train_runtime': 26052.4526, 'train_samples_per_second': 6.133, 'train_steps_per_second': 0.384, 'train_loss': 1.0575269655821498, 'epoch': 8.0}
***** eval metrics *****
epoch = 8.0
eval_accuracy = 0.8342
eval_loss = 0.8612
eval_runtime = 0:04:41.22
eval_samples_per_second = 7.894
eval_steps_per_second = 0.494
Dataset: Common Language Dataset
Fine-tuning Notebook: signalAI/hfwave2vec2_finetune.ipynb
Pre-training Script: signalAI/hfwave2vec2.py
Training Command:
python signalAI/hfwave2vec2.py