- (internal) restructure custom WikipediaQL selectors into "attrs" and "functions" (CSS pseudo-selectors), paving the way for more consistency;
- add `sentence:first` and `section:first` functions;
- improve default YAML output formatting;
- improve `table-data` quasi-selector behavior to handle mid-table full-row `th` (it gets attached as a prefix to subsequent rows).
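The full-row `th` handling above can be pictured with a small standalone sketch. This is not the library's code: the `(tag, text, colspan)` row representation and the `attach_group_headers` name are made up for illustration, and the real `table-data` output may be shaped differently.

```python
def attach_group_headers(rows, width):
    """Prefix data rows with the text of the latest full-row <th>.

    rows  -- list of rows, each a list of (tag, text, colspan) tuples
             (a hypothetical representation, chosen for this sketch)
    width -- number of columns in the table
    """
    result = []
    prefix = None
    for row in rows:
        # A single <th> spanning every column acts as a group heading
        is_full_row_header = (
            len(row) == 1 and row[0][0] == "th" and row[0][2] >= width
        )
        if is_full_row_header:
            prefix = row[0][1]
            continue
        texts = [text for _tag, text, _span in row]
        # Rows before the first group heading are kept unchanged
        result.append(texts if prefix is None else [prefix, *texts])
    return result


rows = [
    [("th", "Year", 1), ("th", "Title", 1)],
    [("th", "1990s", 2)],
    [("td", "1996", 1), ("td", "Bound", 1)],
    [("td", "1999", 1), ("td", "The Matrix", 1)],
    [("th", "2000s", 2)],
    [("td", "2003", 1), ("td", "The Animatrix", 1)],
]
regular = attach_group_headers(rows, width=2)
```

Here the "1990s" and "2000s" rows disappear as rows of their own and show up as the first cell of the data rows that follow them.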
- Add executable script `wikipedia_ql` to run from command-line!
- Streamline and simplify the language.
  - There is now only one way of nesting: `selector1 >> selector2`; `{}` is left only for grouping purposes;
  - Follow the CSS selectors intuition: `text["pattern"]` is now `text:matches("pattern")`, `sentence["pattern"]` is `sentence:contains("pattern")`, and `text-group[1]` is `text-group[group=1]`
- A number of internal bugfixes and simplifications
- Further development is covered in near-real-time on my Substack
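For old queries, the three selector renames in this release are mechanical enough to sketch as a rewrite. The helper below is hypothetical (nothing like it ships with `wikipedia_ql`), it only covers the three renames listed in this release, and it does not convert `{}` nesting into `>>`.

```python
import re

# Hypothetical migration helper, NOT part of wikipedia_ql.
# Each entry: (pattern for the old form, replacement in the new form).
_REWRITES = [
    # text["pattern"] -> text:matches("pattern")
    (re.compile(r'(?<![\w-])text\[("(?:[^"\\]|\\.)*")\]'), r'text:matches(\1)'),
    # sentence["pattern"] -> sentence:contains("pattern")
    (re.compile(r'(?<![\w-])sentence\[("(?:[^"\\]|\\.)*")\]'), r'sentence:contains(\1)'),
    # text-group[1] -> text-group[group=1]
    (re.compile(r'(?<![\w-])text-group\[(\d+)\]'), r'text-group[group=\1]'),
]


def migrate_selector(query: str) -> str:
    """Rewrite old-style selectors in a query string to the new syntax."""
    for pattern, replacement in _REWRITES:
        query = pattern.sub(replacement, query)
    return query


old = r'sentence["Rotten Tomatoes"] >> text["\d+%"]'
new = migrate_selector(old)
```

Ordinary attribute selectors such as `section[heading="Films"]` are left untouched, since only the three forms named above are rewritten.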
- The largest highlight is the `table-data` quasi-selector/filter, which regularizes tables and lets you fetch data by columns/rows. See Tables.md for details;
- Improve results joining to be (a bit) more reasonable and to allow fetching multiple hashes;
- Add support for fetching style attributes (naive!) with selectors like `@style-background`;
- Support `$=` and `^=` attribute operators.
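In CSS, `^=` matches an attribute value by prefix and `$=` by suffix. Assuming WikipediaQL follows the same convention, the semantics can be sketched in plain Python (the `attr_matches` function is illustrative only, not the library's code):

```python
def attr_matches(value: str, op: str, expected: str) -> bool:
    """CSS-style attribute comparison, as a plain-Python illustration."""
    if op == "=":
        return value == expected            # exact match
    if op == "^=":
        return value.startswith(expected)   # prefix match
    if op == "$=":
        return value.endswith(expected)     # suffix match
    raise ValueError(f"unsupported operator: {op}")


heading = "Critical response"
starts = attr_matches(heading, "^=", "Critical")
ends = attr_matches(heading, "$=", "response")
```

Under that assumption, a selector along the lines of `section[heading^="Critical"]` would pick the "Critical response" section without spelling out the full heading.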
The largest highlight of the release is link following: now you can traverse the page graph in a single query!
    from pprint import pprint

    from wikipedia_ql import media_wiki

    wikipedia = media_wiki.Wikipedia()

    # "a -> { ... }" follows each film link and runs the inner selectors on the linked page
    pprint(wikipedia.query(r'''
      from "The Wachowskis" {
        section[heading="Films"] {
          table >> tr >> td:nth-child(2) {
            a -> {
              @title as "film";
              section[heading="Critical response"]
                >> sentence["Rotten Tomatoes"]
                >> text["\d+%"] as "rotten-tomatoes";
            }
          }
        }
      }
    '''))
This prints:
    [{'film': 'Assassins (1995 film)'},
     {'film': 'Bound (1996 film)', 'rotten-tomatoes': '89%'},
     {'film': 'The Matrix', 'rotten-tomatoes': '88%'},
     {'film': 'The Matrix Revisited'},
     {'film': 'The Animatrix'},
     {'film': 'V for Vendetta (film)', 'rotten-tomatoes': '73%'},
     {'film': 'The Invasion (film)'},
     {'film': 'Speed Racer (film)', 'rotten-tomatoes': '41%'},
     {'film': 'Ninja Assassin', 'rotten-tomatoes': '26%'},
     {'film': 'Cloud Atlas (film)', 'rotten-tomatoes': '66%'},
     {'film': 'Jupiter Ascending', 'rotten-tomatoes': '27%'},
     {'film': 'The Matrix Resurrections'}]
(Yeah, not all movies are there... Table parsing is still to be enhanced)
Other changes in query language:
- `text` without pattern (the text of entire element);
- `text-group["group name"]` named groups from text patterns;
- `sentence` without pattern (all sentences in the scope);
- standalone `@attribute` (attribute of the current selected element);
- for `@href` and `img@src` attributes URLs are now expanded to fully-qualified ones.
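What "expanded to fully-qualified" means for `@href` and `img@src` can be shown with the standard library. This is a sketch of the idea, not the library's implementation; the base URL and the sample links are assumptions picked for the example.

```python
from urllib.parse import urljoin

# Assumed base: the URL of the page the query runs against
PAGE_URL = "https://en.wikipedia.org/wiki/The_Wachowskis"


def expand(url: str) -> str:
    """Resolve a relative or protocol-relative URL against the page URL."""
    return urljoin(PAGE_URL, url)


relative = expand("./The_Matrix")                    # page-relative link
rooted = expand("/wiki/Bound_(1996_film)")           # site-rooted link
protocol_relative = expand("//upload.wikimedia.org/a.png")
absolute = expand("https://example.org/x")           # already fully qualified
```

Already absolute URLs pass through unchanged, so applying the expansion to every link is safe.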
Changes to HTTP query engine:
- Using Parsoid API instead of default Wikipedia api.php to fetch HTML content — #2;
- Providing sensible User-Agent string (and allow library user to rewrite it) — #1.
The release that actually works :)
Initial release