Commit 4a565c4

Authored by assayire
Website Scraper (#12)
* Scrape rockthejvm.com articles
* Added scraper project
* Updated README
* Using Ethereal email instead of SendGrid
* Updated documentation
* Updated documentation
* Updated documentation

---------

Co-authored-by: assayire <[email protected]>
1 parent 82d0d87 commit 4a565c4

16 files changed: 239 additions, 15 deletions

build.sbt

Lines changed: 4 additions & 5 deletions
@@ -1,7 +1,5 @@
-name := "scala-projects-playground"
-
-version := "0.1"
-
+name := "scala-projects-playground"
+version := "0.1"
 scalaVersion := "3.3.4"

 libraryDependencies ++= Seq(
@@ -13,7 +11,8 @@ libraryDependencies ++= Seq(
   "com.lihaoyi" %% "fastparse" % "3.1.1",
   // Java libraries
   // scraping
-  "org.jsoup" % "jsoup" % "1.19.1",
+  "org.jsoup" % "jsoup" % "1.20.1",
+  "org.scala-lang.modules" %% "scala-parallel-collections" % "1.2.0",
   // markdown
   "org.commonmark" % "commonmark" % "0.24.0",
   // http apis

chat-app/README.md

Lines changed: 4 additions & 3 deletions
@@ -10,8 +10,9 @@
 ## Running the application

 1. Run the client with `sbt "~appJS/fastOptJS"` to keep the client files up to date with the changes you make to `js (appJS)` project.
-2. TODO: Figure out how to run the server from command line. `sbt runMain com.rtjvm.chat.backend.Server` or `sbt run` does not launch the server. **For now, you should be able to run it from IntelliJ**. Go to `Server.scala` and run the file.
-3. Chat application should be accessible at http://localhost:8080/static/index.html
+2. You can run the server from the IDE by loading the entire project and running the `Server` class.
+3. Or you can run the server from the command line from _within the `chat-app` folder_: `sbt runMain com.rtjvm.chat.backend.Server`.
+4. Chat application should be accessible at http://localhost:8080/static/index.html

 ## Project Info

@@ -37,7 +38,7 @@ Not using `synchronized` but using a `ConcurrentHashMap`.

 **The online examples so far provide a simple test suite, that uses `String.contains` .... Use the Jsoup library we saw Chapter 11: Scraping Websites to make ... tag**

-Not Implemented! TBD!
+Not Implemented! As we discussed, we are not doing any tests.

 **Keep track each message's send time and date in the database, and display it in the user interface**

chat-app/build.sbt

Lines changed: 1 addition & 0 deletions
@@ -12,6 +12,7 @@ lazy val app =
     .in(file("."))
     .settings(
       name := "chat-app",
+      fork := true,
       libraryDependencies ++=
         "com.lihaoyi" %%% "upickle" % "4.1.0" ::
         "com.lihaoyi" %%% "scalatags" % "0.12.0" ::

chat-app/jvm/src/main/scala/com/rtjvm/chat/backend/Server.scala

Lines changed: 10 additions & 5 deletions
@@ -34,10 +34,10 @@ object Server extends cask.MainRoutes {

   @cask.postJson("/chat")
   def postChatMsg(
-    sender: String,
-    msg: String,
-    parent: Option[Long] = None,
-    timestamp: Option[Long] = None
+    sender: String,
+    msg: String,
+    parent: Option[Long] = None,
+    timestamp: Option[Long] = None
   ): ujson.Value =
     (sender.trim, msg.trim) match
       case ("", _) => writeJs(ChatResponse.error("Name cannot be empty"))
@@ -84,7 +84,12 @@ object Server extends cask.MainRoutes {
     write(Greeting(s"Hello $name, from Scala.js backend! $token"))

   @cask.staticFiles("/static")
-  def staticFileRoutes() = "chat-app/js/static"
+  def staticFileRoutes(): String =
+    val userDir = System.getProperty("user.dir")
+    val staticPath = os.Path(userDir) / "chat-app" / "js" / "static"
+
+    if os.exists(staticPath) then staticPath.toString // when running from IDE
+    else (os.Path(userDir) / ".." / "js" / "static").toString // when running from chat-app folder on the command line

   private def createDataDir(): String =
     val dataDir = os.home / "pgdata"
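
The `staticFileRoutes` change above chooses the static directory from the JVM's working directory: the repository root when launched from the IDE, or a module folder when launched through sbt (presumably related to the `fork := true` setting added in `chat-app/build.sbt`, since a forked run starts in the project's base directory). A minimal sketch of the same fallback, written against `java.nio.file` instead of os-lib so it stands alone. `StaticDir`, `resolve`, and the injectable `exists` parameter are hypothetical names for illustration and are not part of the commit.

```scala
import java.nio.file.{Files, Path, Paths}

object StaticDir {
  // Prefer <userDir>/chat-app/js/static (launch from the repository root);
  // otherwise fall back to <userDir>/../js/static (launch from a module folder).
  // `exists` is a parameter so the decision can be checked without real files.
  def resolve(userDir: Path, exists: Path => Boolean = p => Files.exists(p)): Path = {
    val fromRepoRoot = userDir.resolve("chat-app").resolve("js").resolve("static")
    if (exists(fromRepoRoot)) fromRepoRoot
    else userDir.resolve("..").resolve("js").resolve("static").normalize()
  }

  // Convenience entry point reading the same system property as the commit.
  def fromWorkingDir(): Path = resolve(Paths.get(System.getProperty("user.dir")))
}
```

Normalizing the fallback path removes the `..` segment, which keeps log output and error messages readable.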

filesync/README.md

Lines changed: 6 additions & 2 deletions
@@ -6,11 +6,15 @@ Also, when running the app from IntelliJ, configure the run configuration for `s

 ## Exercises

-- Syncing folders/sub-folders
+- **Syncing folders/sub-folders**

 Track `Rpc.CreateFolder` case class

-- Syncing deleted files/folders
+![etc/create_folder.png](etc/create_folder.png)
+
+- **Syncing deleted files/folders**

 Track `Rpc.DeletePath` case class

+![etc/delete_path.png](etc/delete_path.png)
+

filesync/etc/create_folder.png

79.3 KB

filesync/etc/delete_path.png

74.2 KB

scrape/README.md

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+# Scrape
+
+News headlines scraper using Jsoup, Quartz scheduler and Ethereal email.
+
+P.S: There is also another scraper for scraping Rock the JVM blog posts under [`scraping`](../src/main/scala/scraping).

scrape/build.sbt

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+lazy val scrape =
+  project
+    .in(file("."))
+    .settings(
+      name := "scrape",
+      version := "0.1.0-SNAPSHOT",
+      scalaVersion := "3.7.0",
+      libraryDependencies ++=
+        "org.jsoup" % "jsoup" % "1.20.1" ::
+        "org.scala-lang.modules" %% "scala-parallel-collections" % "1.2.0" ::
+        "org.quartz-scheduler" % "quartz" % "2.5.0" ::
+        "org.quartz-scheduler" % "quartz-jobs" % "2.5.0" ::
+        "com.sun.mail" % "javax.mail" % "1.6.2" ::
+        Nil
+    )

scrape/project/build.properties

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+sbt.version=1.10.11

Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
+package scrape
+
+import java.util.{Properties, UUID}
+import javax.mail.*
+import javax.mail.internet.*
+
+object Ethereal:
+  def sendEmail(to: String, subject: String, body: String): Unit = {
+    val session =
+      smtpSession(
+        System.getenv("SMTP_USERNAME"),
+        System.getenv("SMTP_PASSWORD"),
+        smtpProperties()
+      )
+
+    try {
+      Transport.send {
+        val msg = new MimeMessage(session)
+        msg.setFrom(new InternetAddress("[email protected]"))
+        msg.setRecipients(Message.RecipientType.TO, to)
+        msg.setSubject(subject)
+        msg.setContent(body, "text/html")
+        msg.setHeader("Message-ID", UUID.randomUUID().toString)
+        msg
+      }
+
+      println("Email sent successfully!")
+    } catch {
+      case e: MessagingException =>
+        e.printStackTrace()
+    }
+  }
+
+  private def smtpSession(email: String, password: String, props: Properties): Session = {
+    Session.getInstance(
+      props,
+      new Authenticator {
+        override def getPasswordAuthentication: PasswordAuthentication = {
+          new PasswordAuthentication(email, password)
+        }
+      }
+    )
+  }
+
+  private def smtpProperties(): Properties = {
+    val props = new Properties()
+    props.put("mail.smtp.host", "smtp.ethereal.email")
+    props.put("mail.smtp.port", "587")
+    props.put("mail.smtp.auth", "true")
+    props.put("mail.smtp.starttls.enable", "true") // For TLS
+    props
+  }

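
`Ethereal.sendEmail` reads `SMTP_USERNAME` and `SMTP_PASSWORD` with `System.getenv`, which returns `null` when a variable is unset. One possible way to surface a missing variable before any SMTP session is built is sketched below. `SmtpCredentials`, `Credentials`, and `fromEnv` are hypothetical names introduced here, not part of the commit, and the error messages are illustrative.

```scala
object SmtpCredentials {
  final case class Credentials(username: String, password: String)

  // Looks up both variables in `env` (defaults to the process environment)
  // and reports the first one that is missing or empty.
  def fromEnv(env: Map[String, String] = sys.env): Either[String, Credentials] =
    for {
      user <- env.get("SMTP_USERNAME").filter(_.nonEmpty).toRight("SMTP_USERNAME is not set")
      pass <- env.get("SMTP_PASSWORD").filter(_.nonEmpty).toRight("SMTP_PASSWORD is not set")
    } yield Credentials(user, pass)
}
```

Taking the environment as a parameter keeps the lookup testable without changing real environment variables.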
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+package scrape
+
+import org.jsoup.Jsoup
+
+import scala.collection.parallel.CollectionConverters.*
+import scala.jdk.CollectionConverters.*
+
+case class Headline(title: String, url: String)
+
+object Guardian:
+  private final val UserAgent =
+    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.101.76 Safari/537.36"
+
+  private val pageSelectorMap =
+    "https://www.theguardian.com/us" -> "#container-news>ul>li a" ::
+    "https://www.theguardian.com/world" -> "div[id *= container-]>ul>li a" ::
+    "https://www.theguardian.com/us/sport" -> "div#container-sports>ul>li a" ::
+    Nil
+
+  def scrapeHeadlines(): Seq[Headline] =
+    pageSelectorMap.par.flatMap { case (url, selector) =>
+      Jsoup
+        .connect(url)
+        .userAgent(UserAgent)
+        .get()
+        .select(selector)
+        .asScala
+        .map { a =>
+          val title = if a.text().isEmpty then a.attr("aria-label") else a.text
+          Headline(title, a.attr("href"))
+        }
+    }.seq

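
`Guardian.scrapeHeadlines` stores the raw `href` attribute of each anchor. If some of those values turn out to be site-relative (an assumption; the commit does not say either way), links in the resulting email would not be clickable. A minimal sketch of resolving an `href` against the page it came from, using only `java.net.URI`. `Links.absolute` is a hypothetical helper, not part of the commit. Jsoup's own `absUrl("href")` is an alternative when the element is still at hand.

```scala
import java.net.URI

object Links {
  // Resolves `href` against the URL of the page it was found on.
  // Absolute hrefs come back unchanged; URI.create throws
  // IllegalArgumentException on strings that are not valid URIs.
  def absolute(pageUrl: String, href: String): String =
    URI.create(pageUrl).resolve(href).toString
}
```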
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+package scrape
+
+import org.quartz.{Job, JobExecutionContext}
+
+class NewsAlertJob extends Job:
+  def execute(context: JobExecutionContext): Unit = {
+    val body =
+      Guardian
+        .scrapeHeadlines()
+        .map(h => s"<li><a href=\"${h.url}\">${h.title}</a></li>")
+        .mkString("<div><ul>", "\n\n", "</ul></div>")
+
+    Ethereal.sendEmail(
+
+      "News Headlines",
+      body
+    )
+  }

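
`NewsAlertJob` interpolates scraped titles and URLs straight into HTML. Headlines containing `&`, `<`, or quotes could then produce malformed markup. The sketch below shows one way to build the same list with escaping applied. `HeadlineEmail`, `escape`, and `body` are hypothetical names, and the local `Headline` case class only mirrors the one in the commit so that the block stands alone.

```scala
object HeadlineEmail {
  final case class Headline(title: String, url: String)

  // Escapes the characters that are significant in HTML text and
  // double-quoted attribute values.
  def escape(s: String): String =
    s.flatMap {
      case '&' => "&amp;"
      case '<' => "&lt;"
      case '>' => "&gt;"
      case '"' => "&quot;"
      case c   => c.toString
    }

  // Renders the headlines as an unordered list of links.
  def body(headlines: Seq[Headline]): String =
    headlines
      .map(h => s"""<li><a href="${escape(h.url)}">${escape(h.title)}</a></li>""")
      .mkString("<div><ul>", "\n", "</ul></div>")
}
```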
Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+package scrape
+
+import org.quartz._
+import org.quartz.impl.StdSchedulerFactory
+
+object NewsAlertScheduler:
+  private final val JobGroup = "newsAlertJobGroup"
+
+  def main(args: Array[String]): Unit =
+    val scheduler = StdSchedulerFactory.getDefaultScheduler
+    scheduler.start()
+
+    val job = JobBuilder
+      .newJob(classOf[NewsAlertJob])
+      .withIdentity("newsAlertJob", JobGroup)
+      .build()
+
+    val trigger = TriggerBuilder
+      .newTrigger()
+      .withIdentity("newsAlertTrigger", JobGroup)
+      .startNow()
+      .withSchedule(
+        SimpleScheduleBuilder
+          .simpleSchedule()
+          .withIntervalInSeconds(10)
+          .repeatForever()
+      )
+      .build()
+
+    scheduler.scheduleJob(job, trigger)

src/main/scala/blog/README.md

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+# Blog
+
+Publishing the blog is done using GitHub Actions. See [publish-blog.yml](../../../../.github/workflows/publish-blog.yml)

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
+package scraping
+
+import org.jsoup.Jsoup
+
+import scala.collection.parallel.CollectionConverters.*
+import scala.jdk.CollectionConverters.*
+
+case class Article(title: String, url: String, tags: Seq[String])
+
+/**
+ * Crawls the RockTheJVM blog posts by chunking and scraping in parallel, and returns a map of tags to articles
+ */
+object RockTheJVM extends App {
+  private val noOfPages = scrapNoOfPages()
+  println(s"NoOfPages: $noOfPages")
+
+  private val tagArticlesMap: Map[String, List[Article]] =
+    (1 to noOfPages)
+      .grouped(5)
+      .toVector
+      .par
+      .flatMap { group =>
+        println(s"Processing batch: ${group.min} to ${group.max}")
+
+        group.flatMap { page =>
+          println(s"Processing page: https://rockthejvm.com/articles/$page")
+          Jsoup
+            .connect(s"https://rockthejvm.com/articles/$page")
+            .get()
+            .select("article")
+            .asScala
+            .map { article =>
+              val title = article.select("h2").text()
+              val url = article.select("a[href^=\"/articles/\"]").attr("href")
+              val tags = article.select("div>a[href^=\"/tags/\"]").asScala.map(_.text()).toList
+              Article(title, url, tags)
+            }
+        }
+      }
+      .flatMap(article => article.tags.map(tag => (tag, article)))
+      .seq // Convert back to a sequential collection
+      .groupMap(_._1)(_._2)
+      .view
+      .mapValues(_.toList)
+      .toMap
+
+  private def scrapNoOfPages(): Int =
+    Jsoup
+      .connect("https://rockthejvm.com/articles/1")
+      .get()
+      .select("footer>nav>div.hidden")
+      .first()
+      .select("a[href*=\"/articles/\"]:last-child")
+      .text()
+      .toInt
+
+  println(tagArticlesMap.mkString("\n"))
+}
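
The last stage of `RockTheJVM` turns a flat list of articles into a map from tag to articles. That step can be tried in isolation, with no network access. The sketch below uses hand-written sample data. `TagIndex` and `byTag` are hypothetical names, and the local `Article` case class only mirrors the one in the commit so that the block stands alone.

```scala
object TagIndex {
  final case class Article(title: String, url: String, tags: Seq[String])

  // Emits one (tag, article) pair per tag, then groups the pairs by tag.
  // An article with several tags appears under each of them.
  def byTag(articles: Seq[Article]): Map[String, List[Article]] =
    articles
      .flatMap(a => a.tags.map(tag => tag -> a))
      .groupMap(_._1)(_._2)
      .view
      .mapValues(_.toList)
      .toMap
}
```

Within each tag, articles keep the order in which they appeared in the input, because `groupMap` preserves encounter order inside a group.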
