Douban Movie Top250 Comment Scraper

Uses Requests to fetch HTML pages and BeautifulSoup4 to extract movie metadata and the latest comments for each film in the Douban Movie Top250 list, saving one JSON file per movie.

Installing dependencies

.\.venv\Scripts\python.exe -m pip install -r requirements.txt

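Where to set the Cookie

If you need to scrape comment pages that are only accessible after logging in, use one of the following two options:

Set it in douban_movie_scraper/config.py:

DOUBAN_COOKIE = "your Douban Cookie"

Or pass it at run time:

.\.venv\Scripts\python.exe main.py --cookie "your Douban Cookie"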
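
The two options above can be combined in one lookup. The following is a minimal sketch, not the project's actual code: it assumes the --cookie flag takes precedence over the value in config.py, and the helper names resolve_cookie and build_headers, as well as the User-Agent string, are hypothetical.

```python
import argparse

# Stand-in for douban_movie_scraper.config.DOUBAN_COOKIE (empty means "not set").
DOUBAN_COOKIE = ""


def resolve_cookie(argv=None, config_cookie=DOUBAN_COOKIE):
    """Return the --cookie value if given, otherwise the config value.

    The precedence (CLI over config) is an assumption for illustration.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--cookie", default=None)
    # parse_known_args ignores other flags such as --top-m or --pages.
    args, _unknown = parser.parse_known_args(argv)
    return args.cookie if args.cookie else config_cookie


def build_headers(cookie):
    """Build request headers, adding a Cookie header only when one is set."""
    headers = {"User-Agent": "Mozilla/5.0"}  # placeholder value
    if cookie:
        headers["Cookie"] = cookie
    return headers
```

The resulting dictionary could then be passed as the headers argument of requests.get.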

Usage example

Scrape the top 10 movies, with the first 2 pages of latest comments for each:

.\.venv\Scripts\python.exe main.py --top-m 10 --pages 2

Latest comments are fetched with --comment-sort time by default. To use one of Douban's other sort parameters, pass it via --comment-sort.
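
To illustrate how --pages and --comment-sort might translate into requests, here is a rough sketch. The URL pattern, the start/limit query parameters, and the page size of 20 are assumptions about Douban's comment pages and are not confirmed by this README; comment_page_urls is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumed URL pattern and page size; neither is confirmed by this README.
COMMENTS_URL = "https://movie.douban.com/subject/{movie_id}/comments"
PAGE_SIZE = 20


def comment_page_urls(movie_id, pages, sort="time"):
    """Return one comment-page URL per requested page for a movie."""
    urls = []
    for page in range(pages):
        query = urlencode(
            {"start": page * PAGE_SIZE, "limit": PAGE_SIZE, "sort": sort}
        )
        urls.append(COMMENTS_URL.format(movie_id=movie_id) + "?" + query)
    return urls
```

Under these assumptions, --pages 2 would produce two URLs per movie, with start offsets 0 and 20.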

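The default output directory is data/douban_movie_top250. Each movie is saved to its own {movie_id}.json file, and a movies_index.json file is also generated so that a visualization cloud service can read the whole dataset at once.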
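
A downstream script could read the per-movie files back as sketched below. This is only an illustration: load_movies is a hypothetical helper, and it skips movies_index.json because the layout of that file is not described here.

```python
import json
from pathlib import Path


def load_movies(output_dir="data/douban_movie_top250"):
    """Load every {movie_id}.json file in output_dir, keyed by movie ID."""
    movies = {}
    for path in sorted(Path(output_dir).glob("*.json")):
        if path.name == "movies_index.json":
            continue  # index layout is not documented, so it is skipped
        with path.open(encoding="utf-8") as fh:
            movies[path.stem] = json.load(fh)
    return movies
```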

Data model

JSON format for a single movie:

{
  "movie_id": "1292052",
  "movie_title": "肖申克的救赎",
  "movie_rating": 9.7,
  "comment_list": [
    {
      "movie_comment_cid": "123456",
      "movie_comment_timestamp": 1710000000,
      "movie_comment_rating": 5,
      "movie_comment_content": "comment text"
    }
  ]
}
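
For consumers that prefer typed objects over raw dictionaries, the JSON above maps naturally onto two dataclasses. This is a sketch only: the class names Movie and Comment are hypothetical, the field types are inferred from the sample values, and the rating is typed as optional because it may be null under one of the rating strategies.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Comment:
    movie_comment_cid: str
    movie_comment_timestamp: int  # appears to be a Unix timestamp (assumption)
    movie_comment_rating: Optional[int]  # may be null in the JSON
    movie_comment_content: str


@dataclass
class Movie:
    movie_id: str
    movie_title: str
    movie_rating: float
    comment_list: List[Comment] = field(default_factory=list)

    @classmethod
    def from_dict(cls, data):
        """Build a Movie from a dict parsed from a {movie_id}.json file."""
        comments = [Comment(**c) for c in data.get("comment_list", [])]
        return cls(
            data["movie_id"], data["movie_title"], data["movie_rating"], comments
        )
```

Calling Movie.from_dict(json.load(fh)) on an opened per-movie file would then give attribute access such as movie.comment_list[0].movie_comment_rating.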

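Cleaning strategy for missing ratings

Some comments have no rating field. By default, --rating-strategy drop discards those comments. The alternatives are:

--rating-strategy zero: fill the missing rating with 0
--rating-strategy none: keep the rating as null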
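
The three strategies can be expressed as a small filter over the comment list. The sketch below is one possible reading, not the project's implementation: clean_comments is a hypothetical function, and it assumes a missing rating shows up as an absent key or a None value.

```python
def clean_comments(comments, strategy="drop"):
    """Apply a missing-rating strategy to a list of comment dicts."""
    if strategy not in {"drop", "zero", "none"}:
        raise ValueError(f"unknown rating strategy: {strategy}")
    cleaned = []
    for comment in comments:
        if comment.get("movie_comment_rating") is not None:
            cleaned.append(comment)  # rating present: keep unchanged
        elif strategy == "zero":
            cleaned.append({**comment, "movie_comment_rating": 0})
        elif strategy == "none":
            cleaned.append({**comment, "movie_comment_rating": None})
        # strategy == "drop": the comment is left out
    return cleaned
```

Note that zero makes an unrated comment indistinguishable from a genuine rating of 0, which may skew averages; none keeps that distinction but requires consumers to handle null.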
