Commit 02ede37

v0.4.0: More Algorithms
Author: ProgrammingIncluded
1 parent 419f13c commit 02ede37

5 files changed

+206
-15
lines changed

CHANGELOG.md

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,11 @@
11
# CHANGELOG
22

3+
## 0.3.1: User README Updates
4+
5+
* Add example README
6+
* Update README
7+
* Add scroll algorithm selection
8+
39
## 0.3.0: Remove Ads and Average Scrolling
410

511
* Add average scrolling to compensate for scroll heights

README.md

Lines changed: 106 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -1,15 +1,34 @@
1-
# Birdwatch: A Twitter Profile Snapshot Tool
21

3-
Birdwatch snapshots a profile page when given a URL or an exported list of `following` from the official Twitter exporter.
4-
This script is purely for the purposes of archival use-only.
2+
<p align="center">
3+
<img height="418" src="logo.png" alt="Birdwatch Logo">
4+
</p>
55

6-
**Note, without logging in, you can only fetch a few posts from the profile.**
6+
[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)
7+
8+
# Birdwatch: A Twitter Profile Archival Tool
9+
10+
11+
Birdwatch snapshots a profile page when given a URL (or an exported `.js` list of `following` from the official Twitter exporter).
12+
Supports UTF-8 text JSON files and image snapshots of each Twitter post!
13+
14+
This script is purely for the purposes of archival use only.
15+
16+
![Birdwatch](demo.gif)
17+
18+
## Features
19+
20+
* Stores metadata in JSON format for each specified Twitter profile.
21+
* Neatly organizes tweets by user and takes a snapshot of each tweet.
22+
* Marks tweets that are potentially self-retweeted.
23+
* Removes Tweet Ads.
24+
* Allows for manual login (use at your own risk).
725

826
## Usage
927

1028
```bash
29+
# Install the requirements. Once only.
1130
python -m pip install -r requirements.py
12-
python birdwatch.py
31+
python birdwatch.py --url www.twitter.com/<profile>
1332

1433
# For more help use:
1534
python birdwatch.py --help
@@ -32,3 +51,85 @@ Birdwatch generates the following in the snapshots folder:
3251
9.png
3352
tweets.json # Metadata of each screen-capped tweet.
3453
```
54+
55+
### Self Boosted Tweet Detection
56+
57+
A self-boosted tweet is a tweet that its original author retweets.
58+
These types of tweets are marked with `potential_boost` as true in `tweets.json`.
59+
The script detects these by looking for exactly matching metadata, i.e. duplicate posts.
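As a rough illustration of duplicate-based detection, the sketch below flags every tweet whose identifying fields occur more than once. The function name `mark_potential_boosts` and the choice of `handle`, `timestamp`, and `tweet_text` as the comparison key are assumptions made for this example; the fields the script actually compares are not stated here.

```python
from collections import Counter

# Hypothetical comparison key: which fields count as "the same post" is an
# assumption for this sketch, not something the README specifies.
KEY_FIELDS = ("handle", "timestamp", "tweet_text")


def _key(tweet):
    """Build a hashable identity for a tweet from the chosen fields."""
    return tuple(tweet.get(field) for field in KEY_FIELDS)


def mark_potential_boosts(tweets):
    """Set potential_boost=True on every tweet whose key occurs more than once."""
    counts = Counter(_key(tweet) for tweet in tweets)
    for tweet in tweets:
        tweet["potential_boost"] = counts[_key(tweet)] > 1
    return tweets
```

Counting keys first and marking in a second pass flags both copies of a duplicated post, rather than only the later one.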
60+
61+
## Schemas
62+
63+
Assume all data is UTF-8 compliant.
64+
65+
### Input File
66+
67+
These files are what the Twitter exporter should generate (`.js` file) from the users you are following:
68+
69+
```json
70+
window.* = [
71+
{
72+
"following": {
73+
"accountId": <id>,
74+
"userLink": <url>
75+
}
76+
...
77+
}
78+
]
79+
```
80+
81+
You can rename the file to `.json` or pass it via the input flags. Twitter adds the `window.* =` prefix by default, and the script strips it automatically. You can also remove the prefix manually so that the file parses as plain JSON.
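One way the prefix handling could look is sketched below. This is not the script's own parser: the helper name `load_following_links` and the regular expression are assumptions, and the sketch only relies on the `following` / `userLink` layout shown in the schema above.

```python
import json
import re

# Matches an optional leading "window.<name> =" assignment (assumed shape).
_PREFIX = re.compile(r"^\s*window\.[\w.]+\s*=\s*")


def load_following_links(text):
    """Return the userLink of every entry, with or without the JS prefix."""
    payload = _PREFIX.sub("", text, count=1)
    entries = json.loads(payload)
    return [entry["following"]["userLink"] for entry in entries]
```

Because the prefix is optional in the pattern, the same function accepts both the raw `.js` export and a file that has already been trimmed to plain JSON.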
82+
83+
### tweets.json
84+
85+
```json
86+
[
87+
{
88+
"id": int,
89+
"tag_text": str,
90+
"name": str,
91+
"handle": str,
92+
"timestamp": str,
93+
"tweet_text": str,
94+
"retweet_count": str,
95+
"like_count": str,
96+
"reply_count": str,
97+
"potential_boost": bool
98+
}
99+
]
100+
```
101+
102+
Invalid string entries will be marked as "NULL".
103+
104+
### metadata.json
105+
106+
```json
107+
{
108+
"bio": str,
109+
"name": str,
110+
"username": str,
111+
"location": str,
112+
"website": str,
113+
"join_date": str,
114+
"following": str,
115+
"followers": str
116+
}
117+
```
118+
119+
Invalid string entries will be marked as "NULL".
120+
121+
122+
## Troubleshoot
123+
124+
* My scraper terminates early?
125+
126+
Your images may be taking some time to load. Consider using `-s` to increase the load time.
127+
Or your scroll height may be too low or too high. Consider using `--scroll-algorithm` to change the algorithm,
128+
then passing a value to it with `--scroll-value`.
129+
130+
`--help` has more information on what `--scroll-value` encodes.
131+
132+
## Future Updates
133+
134+
* Support Running Multiple Sessions to Resume Per-Profile
135+
* Expand Images and Attachments to Archive Images

birdwatch.py

Lines changed: 94 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -26,6 +26,59 @@
2626
from webdriver_manager.chrome import ChromeDriverManager
2727
from selenium.webdriver.support.ui import WebDriverWait
2828

29+
LOGO = """
30+
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
31+
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
32+
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
33+
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
34+
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,'''''''',,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
35+
,,,,,,,,,,,,,,,,,,,,,,,,,''',;;:;;:cllollllccc:;,'.',,,,,,,,,,,,,,,,,,,,,,,,,,,,
36+
,,,,,,,,,,,,,,,,,,,,,,'';:codxkkkkkkkkkkkkkkkkkkxocc:;',,,,,,,,,,,,,,,,,,,,,,,,,
37+
,,,,,,,,,,,,,,,,,;,'';coxkkkkkkkkkkkkkkkkkkkkkkkkkkkkxl:,',,,,,,,,,,,,,,,,,,,,,,
38+
,,,,,,,,,,,,,,,,'.,:dkkkkkkkkkkkkkkkkkxocdkkkkkkkkkkxdxxd:,',,,,,,,,,,,,,,,,,,,,
39+
,,,,,,,,,,,,,'.';lxkkkkkkkkkolxkkkkkkkkl;:okkkkkkkkxxxddxxo;.',,,,,,,,,,,,,,,,,,
40+
,,,,,,,,'...,:cdxxxkkkkkkkkx;;xkkkkkkkko;cooxkkkkkkkxdocldxdl,',,,,,,,,,,,,,,,,,
41+
,,,,,,,,,. .',,;:oxkkkkkkkxl::dkkkkkkkko,cOkcdkxxxkkkxl;,;:lol:,,,,,,,,,,,,,,,,,
42+
,,,,,,,,,,,''..:oxkkkkxdxkocl,:kkkkkxxkl,:ONklokxodkkkxc,;;;:coc,,,,,,,,,,,,,,,,
43+
,,,,,,,,,,,,,;oxkkkkkkolxx:dk::okkkkxxkl,cONNOccddcokkkl':lcccldc,,,,,,,,,,,,,,,
44+
,,,,,,,,,,,.:xkxxkkkkxllo:;kkodcoxkkxddc';ccll:..,,'lkkl,:lcclloo,',,,,,,,,,,,,,
45+
,,,,,,,,,'.,dxxxxkkkkdlxd,':;':c:oxkkc,,.;:,'....;. .cxc,lo:;cloo;.',,,,,,,,,,,,
46+
,,,,,,,,'.;ooldxxxkkkoll;;c:'. ,l:cx:.'.;' 'Ox.':;;:oo:;ollol;',,,,,,,,,,,,
47+
,,,,,,,'..''''cxdxkkko,.lXx. ;00xo:oKOx' ;XO'lk:cooo;dNdcoodl,,,;,,,,,,,,
48+
,,,,,,,,''',,',oc;dkkl..kWo ;XWWWXNWMNo .xWOdOocoool;d0oloodxdc,',,,,,,,,
49+
,,,,,,,,,,,,,,',..;dkl..dWO. lNMWWWWWWWXl..,dNWWXdcloooc,:lloooodxxdc'.,,,,,,
50+
,,,,,,,,,,,,,,;,.',,::coONNc :KWWWWWWWWWWWXKKXNNXdcloooo;,looooooooddxd:.',,,,
51+
,,,,,,,,,,,,,,,;,.,;'.'kWWWKdokXWWWWNNWWWWWWWWNNNXd:looooc,;clooolcooo::ox:.,;,,
52+
,,,,,,,,,,,,,,,'.';;;.;0NXXNNWWWWWWWXNWWWWWWWWWXKd:looooc,:c,:doo;.,ll',oo;.,;,,
53+
,,,,,,,,,,,,''''',;:;..cxk0KXNNNNNXXNNNKOkxollcc,'cooooc,:oc.'odl'.',,.;o;.',,,,
54+
,,,,,,,,,,,...''.':c;';c;,,:ool:,,,;ldxl,',;,;lc';oool;;:c;...:l,.,;;'',,',,,,,,
55+
,,,,,,,,,,,,,,,,,',;',::,..,lc'.......;oddkXKd;..;l:,.',,'',;,''.'''...',,,,,,,,
56+
,,,,,,,,,,,,,,,,,,'... ..,dxd:,.. .:xkkONWWWO;'clc'.......,;::cll. .,,,,,,,
57+
,,,,,,,,,,,,,,,,,,.,ooc'...'O0:'... ':okOXWWWWl;o;. ....kOddo0K, .,,,,,,,
58+
,,,,,,,,,,,,,,,,'.'xWX0Od, .d:.. ......;dk0kxOO'.'. ...:xc,;;k0, ',,,,,,
59+
,,,,,,,,,,,,,,,. ;xONWXOkl..d:. ,::,. ,oodd,.. . .xXX0O00XX; .;,,,,,
60+
,,,,,,,,,,,,,,,. ;OOXWXk:'..dk,. .. .,kXOOO, . .kWNWWWWWNc .,,,,,,
61+
,,,,,,,,,,,,,,,. .xkOK0k, .'lOkc,'..;oO00xOK, . dWWWWWWNWd. .,,,,,,
62+
,,,,,,,,,,,,,,,'. ;kkxxko.'xOOOOOOkOOOOOOdOK:. .;;.. :XWWWWNXKx. .,,,,,,
63+
,,,,,,,,,,,,,,,,,..;odk00d,'d00KKKK0KKK0OkKKdxxoo, . .;lccc:;,'....',,,,,,,
64+
,,,,,,,,,,,,,,,,,,;';oxkxxl..looooooooollll:'.. ''........''',,,;;,,,,,,,,,,,,
65+
,,,,,,,,,,,,,,,,,,,,,,;,'',. ..lXO;,;,;,,,,,,,,,,,,,,,,,,,,,,,
66+
,,,,,,,,,,,,,,,,,,,,,,,,,,,,...... ...',,;xKo',,,,,,,,,,,,,,,,,,,,,,,,,,
67+
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,;;,,,,,''',,,,,,,,,,;;'',,,,,,,,,,,,,,,,,,,,,,,,,
68+
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,;,,,,,,,,,,,,,,,,,,,,,,,,,,,,
69+
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
70+
71+
"""
72+
TITLE = """
73+
____ _ _ _ _
74+
| _ \(_) | | | | | |
75+
| |_) |_ _ __ __| |_ ____ _| |_ ___| |__
76+
| _ <| | '__/ _` \ \ /\ / / _` | __/ __| '_ \
77+
| |_) | | | | (_| |\ V V / (_| | || (__| | | |
78+
|____/|_|_| \__,_| \_/\_/ \__,_|\__\___|_| |_|
79+
80+
"""
81+
2982
SCRAPE_N_TWEETS = 20
3083
IS_DEBUG = False
3184

@@ -77,13 +130,26 @@ def remove_elements(driver, elements, remove_parent=True):
77130
}}
78131
""".format(",".join(elements)))
79132

80-
def calc_average(lst):
81-
if len(lst) < 4:
82-
return sum(lst) / len(lst)
83-
84-
cut_off = int(len(lst) * 0.25)
85-
s = sorted(lst)[cut_off:len(lst) - cut_off]
86-
return sum(s) / len(s)
133+
def calc_average(percentage):
134+
    def _func(lst):
135+
        if len(lst) < 4:
136+
            return sum(lst) / len(lst)
137+
138+
        cut_off = int(len(lst) * percentage)
139+
        s = sorted(lst)[cut_off:len(lst) - cut_off]
140+
        return sum(s) / len(s)
141+
    return _func
142+
143+
def window_average(window):
144+
    def _func(lst):
145+
        v = lst[-min(int(window), len(lst)):]  # most recent entries; window may arrive as a float
146+
        return sum(v) / len(v)
147+
    return _func
148+
149+
def constant(const):
150+
    def _func(lst):
151+
        return const
152+
    return _func
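To make the three offset strategies easier to compare, here is a small standalone sketch. The names `trimmed_mean` and `window_mean` are hypothetical and the functions are simplified relative to the diff (for example, the short-list shortcut is replaced by a fallback to the full list), so treat this as an illustration of the closure pattern rather than the committed code.

```python
def trimmed_mean(fraction):
    """Mean after dropping `fraction` of the sorted values from each end."""
    def _func(values):
        cut = int(len(values) * fraction)
        # Fall back to all values if trimming would leave nothing to average.
        kept = sorted(values)[cut:len(values) - cut] or values
        return sum(kept) / len(kept)
    return _func


def window_mean(window):
    """Mean of the most recent `window` values (a float window is truncated)."""
    def _func(values):
        recent = values[-max(1, int(window)):]
        return sum(recent) / len(recent)
    return _func


def constant(value):
    """Ignore the history and always return the same offset."""
    return lambda _values: value


diffs = [100, 120, 110, 900, 130]   # one outlier among the height differences
print(trimmed_mean(0.25)(diffs))    # 120.0: the outlier is trimmed away
print(window_mean(2)(diffs))        # 515.0: the outlier is inside the window
print(constant(300)(diffs))         # 300
```

Each factory returns a function with the same one-argument signature, which is why the caller can swap strategies without changing the scrolling loop.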
87153

88154
def remove_ads(driver):
89155
return driver.execute_script("""
@@ -99,7 +165,7 @@ def remove_ads(driver):
99165
""")
100166

101167

102-
def fetch_html(driver, url, fpath, load_times, force=False, number_posts_to_cap=SCRAPE_N_TWEETS, bio_only=False):
168+
def fetch_html(driver, url, fpath, load_times, offset_func, force=False, number_posts_to_cap=SCRAPE_N_TWEETS, bio_only=False):
103169
driver.get(url)
104170
state = ""
105171
while state != "complete":
@@ -253,7 +319,7 @@ def fetch_html(driver, url, fpath, load_times, force=False, number_posts_to_cap=
253319
break
254320

255321
# Scroll!
256-
driver.execute_script("window.scrollTo(0, {});".format(estimated_height + calc_average(height_diffs)))
322+
driver.execute_script("window.scrollTo(0, {});".format(estimated_height + offset_func(height_diffs)))
257323
time.sleep(random.uniform(load_times, load_times + 2))
258324
new_height = driver.execute_script("return document.body.scrollHeight")
259325
if new_height == last_height:
@@ -277,8 +343,13 @@ def parse_args():
277343
parser.add_argument("--posts", "-p", help="Max number of posts to screenshot.", default=SCRAPE_N_TWEETS, type=int)
278344
parser.add_argument("--bio-only", "-b", help="Only store bio, no snapshots or tweets.", action="store_true")
279345
parser.add_argument("--debug", help="Print debug output.", action="store_true")
280-
parser.add_argument("--login", help="Prompt user login to remove tweet limit..", action="store_true")
346+
parser.add_argument("--login", help="Prompt user login to remove limits / default filters. USE AT OWN RISK.", action="store_true")
281347
parser.add_argument("--scroll-load-time", "-s", help="Number of seconds (float). The higher, the more stable the fetch.", default=5, type=float)
348+
parser.add_argument("--scroll-algorithm", help="Type of algorithm to calculate scroll offset.", choices=["percentile", "window", "constant"], default="window")
349+
parser.add_argument("--scroll-value", default=5, type=float, help=("Value used by --scroll-algorithm. "
350+
"If percentile, fraction (0.0 to 1.0) trimmed from each end before averaging. "
351+
"If window, the size of the averaging window. "
352+
"If constant, number of pixels to scroll by."))
282353

283354
group = parser.add_mutually_exclusive_group()
284355
group.add_argument("--input-json", "-i", help="Input json file", default="input.json")
@@ -290,9 +361,22 @@ def main():
290361
global IS_DEBUG
291362
IS_DEBUG = args.debug
292363

364+
print(LOGO)
365+
print(TITLE)
366+
293367
output_folder = "snapshots"
294368
os.makedirs(output_folder, exist_ok=True)
295369
extra_args = {"force": args.force, "bio_only": args.bio_only, "load_times": args.scroll_load_time, "number_posts_to_cap": args.posts}
370+
f = None
371+
if args.scroll_algorithm == "percentile":
372+
assert args.scroll_value <= 1.0 and args.scroll_value >= 0.0
373+
f = calc_average(args.scroll_value)
374+
elif args.scroll_algorithm == "window":
375+
f = window_average(args.scroll_value)
376+
else:
377+
f = constant(args.scroll_value)
378+
379+
extra_args["offset_func"] = f
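A possible alternative to the bare `assert` for range checking is to report the problem through the parser itself. The sketch below assumes only the two scroll flags shown in this diff; the helper names `build_parser` and `parse_scroll_args` are hypothetical.

```python
import argparse


def build_parser():
    """Parser with only the two scroll-related flags from the diff."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--scroll-algorithm",
                        choices=["percentile", "window", "constant"],
                        default="window")
    parser.add_argument("--scroll-value", default=5, type=float)
    return parser


def parse_scroll_args(argv):
    """Parse argv and reject out-of-range percentile values with a usage error."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.scroll_algorithm == "percentile" and not 0.0 <= args.scroll_value <= 1.0:
        # parser.error prints the usage message and exits with status 2.
        parser.error("--scroll-value must be between 0.0 and 1.0 for percentile")
    return args
```

Unlike `assert`, this check still runs when Python is started with `-O`, and the user sees a usage message instead of a traceback.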
296380

297381
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
298382
if args.login:

demo.gif

3.75 MB

logo.png

520 KB
