There is an unresolved issue when parsing URLs that bleed into the surrounding text (often because of rich-text features such as tables).
For example,
https://www.example.com/index.html.Beginning_of_following_paragraph which could be resolved by accepting only one period after the URL, except that
https://www.example.com/index.htmlBeginning_of_following_paragraph would still not be resolved.
I think an easier solution might be to offer some optional cleaning functions for the dataframes that archivr produces, but there could be other ideas.
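
To make the cleaning idea concrete, here is a rough sketch of one possible heuristic, written in Python purely for illustration (it is not archivr code, and it would need porting). It assumes a URL most likely ends at a known file extension, and that an optional period followed by a capital letter right after that extension is the start of the next paragraph. The function name `trim_bled_url` and the extension list are hypothetical.

```python
import re

# Hypothetical list of extensions treated as the likely end of a URL path.
# Longer alternatives come first so "html" is tried before "htm".
KNOWN_EXTENSIONS = ("html", "htm", "aspx", "asp", "php", "pdf")

# Match the shortest prefix ending in a known extension, provided it is
# followed by an optional period and then a capital letter (assumed to be
# the first letter of the text that bled into the URL).
_BLEED = re.compile(
    r"^(?P<url>https?://\S+?\.(?:%s))\.?(?=[A-Z])" % "|".join(KNOWN_EXTENSIONS)
)


def trim_bled_url(url):
    """Return the URL cut at the extension if bleed is detected, else unchanged."""
    match = _BLEED.match(url)
    return match.group("url") if match else url


if __name__ == "__main__":
    samples = [
        "https://www.example.com/index.html.Beginning_of_following_paragraph",
        "https://www.example.com/index.htmlBeginning_of_following_paragraph",
        "https://www.example.com/index.html",
    ]
    for sample in samples:
        print(trim_bled_url(sample))
```

Both examples above reduce to https://www.example.com/index.html, and a URL with no bleed is returned unchanged. The heuristic can still misfire, for instance on a legitimate path that continues with a capital letter right after the extension, or on URLs with no file extension at all, which is one reason to keep any such cleaning optional.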