Term Project for COMP7305 Cluster and Cloud Computing.
Developed By:
.
├── CloudWeb
├── Collector
├── HBaser
├── StreamProcessorFlink
├── StreamProcessorSpark
│
└── pom.xml (Maven parent POM)
- CloudWeb:
  - Shows the statistics and sentiment analysis results.
- Collector:
  - Collects real-time data via the Twitter Python API.
  - Transports the data to Kafka via Flume.
- StreamProcessorFlink:
  - Computes statistics across different dimensions.
- StreamProcessorSpark:
  - Trains and generates the Naive Bayes model.
  - Analyzes tweet sentiment.
Note: you need to connect to the CS VPN.
Spark-MLlib-Twitter-Sentiment-Analysis
Data flow:

Flume -> Kafka -> Spark Streaming -> Kafka
               -> Flink -> Kafka
               -> DL4J -> Kafka

Flume transports the Twitter data into topic alex1 for Spark, Flink, and DL4J to subscribe to.

Spark Streaming reads topic alex1, performs sentiment analysis, and stores the results in topic twitter-result1 for the web frontend to subscribe to.

The CloudWeb DL4J component reads topic alex1 and performs DL4J sentiment analysis; the results are not stored but are pushed straight to the WebSocket route for the web frontend to subscribe to.

Flink reads topic alex1 and performs statistical analysis:
- Twitter language statistics are stored in topic twitter-flink-lang for the web frontend to subscribe to.
- Twitter user fans statistics are stored in topic twitter-flink-fans for the web frontend to subscribe to.
- Twitter user geo statistics are stored in topic twitter-flink-geo for the web frontend to subscribe to.
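To spot-check any stage of the pipeline, attach a console consumer to the corresponding topic. A minimal sketch; the localhost:9092 broker address is an assumption, substitute a broker from the cluster:

```bash
# Tail the Spark sentiment results as they are produced
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic twitter-result1

# The same check works for the Flink output topics
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic twitter-flink-lang
```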
Data formats:
- Twitter metadata
  twitter4j.Status
- Sentiment analysis result
  ID¦Name¦Text¦NLP¦MLlib¦DeepLearning¦Latitude¦Longitude¦Profile¦Date
- Language statistics result
  {"pt":2,"ot":26,"ja":3,"en":453,"fr":12,"es":4}
- Fans statistics result
  200|800|500~1000|above 1000
- Map statistics result
  Latitude|Longitude|time
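Since the sentiment record is ¦-delimited, individual fields can be eyeballed with GNU awk in a UTF-8 locale using a custom field separator. A sketch with made-up sample values:

```bash
# Fields: 1=ID 2=Name 3=Text 4=NLP 5=MLlib 6=DeepLearning 7=Lat 8=Lon 9=Profile 10=Date
echo '123¦alice¦great day¦1¦1¦1¦22.28¦114.14¦http://img¦2019-04-01' \
  | awk -F'¦' '{print "text:", $3, "| mllib:", $5, "| dl4j:", $6}'
```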
HDFS configuration items:
- Naive Bayes model path
  /tweets_sentiment/NBModel/
- Naive Bayes training/test data path
  /data/training.1600000.processed.noemoticon.csv
- Stanford CoreNLP model path: stanford-corenlp-models in the Maven dependencies
- Deep Learning model & word vectors path
  /tweets_sentiment/dl4j/
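Before launching the jobs it is worth verifying that these paths actually exist on HDFS:

```bash
# Check the configured model and data paths
hdfs dfs -ls /tweets_sentiment/NBModel/
hdfs dfs -ls /data/training.1600000.processed.noemoticon.csv
hdfs dfs -ls /tweets_sentiment/dl4j/
```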
- Prepare the Kafka environment. We have set up Kafka on 9 machines in total. Create the Kafka topic and make sure we can produce to and consume from it, e.g. as sketched below.
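A minimal sketch of the topic setup and smoke test; the broker address, replication factor, and partition count are assumptions to be tuned for the 9-node cluster:

```bash
# Create the input topic (Kafka >= 2.2 syntax; older releases use --zookeeper instead)
kafka-topics.sh --create --bootstrap-server localhost:9092 \
  --replication-factor 3 --partitions 3 --topic alex1

# Produce a test message in one shell...
echo 'hello kafka' | kafka-console-producer.sh --broker-list localhost:9092 --topic alex1

# ...and make sure it comes back in another
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic alex1 --from-beginning
```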
- Prepare the Flume environment. We have set up Flume on GPU7; a sketch of the agent configuration follows.
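The real agent definition lives in Collector/TwitterToKafka.conf; the sketch below only illustrates the shape of such a config, assuming an exec source tailing the Python collector's output (the source type and file path are assumptions):

```properties
# Agent "a1": one source, one memory channel, one Kafka sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Assumption: the Python collector appends raw tweets to this file
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/spark-twitter/tweets.log
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Kafka sink writing into topic alex1
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = localhost:9092
a1.sinks.k1.kafka.topic = alex1
a1.sinks.k1.channel = c1
```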
- Generate the model for the machine learning lib: run the following command and put the model on HDFS.
spark-submit --class "hk.hku.spark.mllib.SparkNaiveBayesModelCreator" --master local[3] /opt/spark-twitter/7305CloudProject/StreamProcessorSpark/target/StreamProcessorSpark-jar-with-dependencies.jar
- The model for CoreNLP is stored in NLP-jars, so we only need to make sure the Maven dependency is complete.
- The model for DL4J is stored in the project, so we only need to make sure the word vectors (stored on HDFS) are complete.
- The CloudWeb Spring Boot project includes DL4J, which makes it need about 6 GB of memory. Keep this in mind.
- Log in to the GPU machine, cd /opt/spark-twitter/7305CloudProject, git pull the latest code, and mvn compile the project.
- Start Flume to collect Twitter data and transport it into Kafka.
# read boot_flume_sh
nohup flume-ng agent -f /opt/spark-twitter/7305CloudProject/Collector/TwitterToKafka.conf -Dflume.root.logger=DEBUG,console -n a1 >> flume.log 2>&1 &
# make sure data has been produced into the Kafka topic successfully (see the check below)
We can start 3 or more Flume processes to test the cluster performance.
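To confirm Flume is actually delivering, tail the topic or count its offsets (broker address assumed):

```bash
# Raw tweet JSON should scroll by if Flume is writing into alex1
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic alex1

# Or check the latest offsets per partition
kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic alex1 --time -1
```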
- Start Spark Streaming to analyze tweet sentiment using Stanford NLP & Naive Bayes.
Local mode:
spark-submit --class "hk.hku.spark.TweetSentimentAnalyzer" --master local[3] /opt/spark-twitter/7305CloudProject/StreamProcessorSpark/target/StreamProcessorSpark-jar-with-dependencies.jar
Cluster mode:
spark-submit --class "hk.hku.spark.TweetSentimentAnalyzer" --master yarn --deploy-mode cluster --num-executors 2 --executor-memory 4g --executor-cores 4 --driver-memory 4g --conf spark.kryoserializer.buffer.max=2048 --conf spark.yarn.executor.memoryOverhead=2048 /opt/spark-twitter/7305CloudProject/StreamProcessorSpark/target/StreamProcessorSpark-jar-with-dependencies.jar
Watch the Spark status on the Spark History Server website.
- Start CloudWeb to show the result on the website.
cd /opt/spark-twitter/7305CloudProject/CloudWeb/target
nohup java -Xmx3072m -jar /opt/spark-twitter/7305CloudProject/CloudWeb/target/CloudWeb-1.0-SNAPSHOT.jar &
- Start Flink
flink run /opt/spark-twitter/7305CloudProject/StreamProcessorFlink/target/StreamProcessorFlink-1.0-SNAPSHOT.jar
Watch the Flink status on the Flink History Server website.
- Modify Zsh Environment
# When using zsh, custom environment variables need to be modified here:
vi ~/sh/env_zsh
# then restart the shell
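The contents of ~/sh/env_zsh are cluster-specific; a hypothetical example of the kind of variables it sets (all paths are assumptions):

```bash
# Hypothetical example: real paths depend on the cluster layout
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export KAFKA_HOME=/opt/kafka
export FLUME_HOME=/opt/flume
export PATH=$PATH:$KAFKA_HOME/bin:$FLUME_HOME/bin
```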
- CloudWeb fails to start
The most likely reason is that the machine does not have enough memory. Use the top command to watch the status.
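A few quick memory checks with standard Linux/JDK tools:

```bash
free -h        # overall memory headroom; CloudWeb alone wants ~6 GB
jps -l         # list the JVMs currently running and holding memory
top -o %MEM    # processes sorted by memory usage
```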