@fgregg fgregg commented May 23, 2016

This PR enables wildcard url fetches from wayback machine.


- filepath = os.path.join(filedir, path_tail)
+ filepath = os.path.join(filedir,
+                         ','.join((path_tail, asset.timestamp)))
Author

Most of these changes to the path are because I found it more convenient, for my own purposes, to have different versions of the same resource name in the same directory. I can make a version that keeps your current, separate-directory behavior.
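The two layouts under discussion can be sketched roughly as follows. This is only an illustration: the function names and the example timestamp are hypothetical, and the "separate directory" layout is an assumption about the existing behavior (one sub-directory per snapshot timestamp). Only the comma-joined form comes from the diff above.

```python
import os


def separate_directory_path(filedir, timestamp, path_tail):
    # Assumed existing layout: each snapshot timestamp gets its own sub-directory.
    return os.path.join(filedir, timestamp, path_tail)


def timestamp_suffix_path(filedir, path_tail, timestamp):
    # Layout from the diff: the timestamp is appended to the file name after a comma,
    # so every version of a resource sits side by side in one directory.
    return os.path.join(filedir, ",".join((path_tail, timestamp)))


if __name__ == "__main__":
    # Hypothetical snapshot timestamp, for illustration only.
    print(separate_directory_path("out", "20160523000000", "index.html"))
    print(timestamp_suffix_path("out", "index.html", "20160523000000"))
```

With the suffix layout, listing one directory shows every captured version of a resource, which may be the convenience described above.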

Owner

Ah, interesting. I'd prefer to keep the current, separate-directory behavior. But I'm curious: What makes the other way more convenient for your purposes?

Owner

p.s., Thanks! This seems to be a really elegant solution.

@fgregg fgregg mentioned this pull request May 23, 2016
Author

fgregg commented Jun 12, 2016

K, this should be good to merge.

@jsvine jsvine mentioned this pull request Sep 17, 2016
Author

fgregg commented Nov 3, 2017

Hi @jsvine, from the discussion in #8, it sounds like this is not the approach you want to take. Should I close this PR?

Owner

jsvine commented Nov 3, 2017

Hi @fgregg and apologies for the radio silence. I actually do like this approach! I've been working on a new version that integrates the bulk of your PR (with a few tweaks), but have gotten distracted by other things. Thanks for the nudge!

@jsvine jsvine mentioned this pull request Aug 2, 2019
ksadov added a commit to ksadov/waybackpack that referenced this pull request Mar 18, 2024
@Joey-Einerhand

@jsvine Any updates on this?

> Hi @fgregg and apologies for the radio silence. I actually do like this approach! I've been working on a new version that integrates the bulk of your PR (with a few tweaks), but have gotten distracted by other things. Thanks for the nudge!

Any updates on this? Would you like it implemented in a different way?

Owner

jsvine commented Apr 16, 2024

Hi @Joey-Einerhand, unfortunately there are no major updates on this. I ran into some operating-system-appeasing issues, got stumped, and haven't fully revisited.

@tomcardoso

Hey @jsvine, would love to help get this PR over the line if I can. I wonder if some of those OS issues are no longer factors with the passage of time / newer Python versions? I'm using a version of the repo checked out at the PR, but at this point it's missing some nice-to-have features like --no-clobber and --delay (think I just got IP banned from Wayback Machine for a few minutes as a result, oops). Would also be interesting to allow for wildcards inside URLs. For instance, the following would only grab pages ending in .html:

waybackpack http://newyork.craigslist.org/**/*.html --from-date 2005 --to-date 2008 --delay 1 --no-clobber -d .
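One way such in-URL wildcards might be handled is to filter client-side after a prefix listing has been fetched. The sketch below assumes a list of archived URLs is already in hand; `filter_urls_by_glob` and `strip_scheme` are hypothetical helpers, not part of waybackpack. It uses Python's `fnmatch`, where `*` also matches `/`, so `**` behaves the same as `*` here.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def strip_scheme(url):
    # Compare on host + path only, so http:// and https:// captures match alike.
    parts = urlsplit(url)
    return parts.netloc + parts.path


def filter_urls_by_glob(urls, pattern):
    # Hypothetical helper: keep URLs whose host + path match the glob pattern.
    # fnmatchcase is case-sensitive, which mirrors case-sensitive servers.
    target = strip_scheme(pattern)
    return [url for url in urls if fnmatchcase(strip_scheme(url), target)]


if __name__ == "__main__":
    archived = [
        "http://newyork.craigslist.org/about/help.html",
        "http://newyork.craigslist.org/about/logo.png",
    ]
    print(filter_urls_by_glob(archived, "http://newyork.craigslist.org/**/*.html"))
```

Note that with this pattern a page directly under the site root would not match, because the pattern requires at least one intermediate path segment; whether that is the desired semantics is an open design question.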

Owner

jsvine commented Apr 1, 2025

@tomcardoso Many thanks for the offer and the enthusiasm! I'm open to merging this PR / something like it, with the main caveat that I'd prefer it not be the default behavior. I.e., it would require a flag like --wildcard to operate.

My main reason for that is that the issues with case-insensitivity on some OSes, including macOS, remain. But the documentation for --wildcard could clearly indicate that limitation, as could a warning message on execution. (This seems to be a clever way to test programmatically whether an OS is case-(in)sensitive.)

Would you be up for resolving the merge conflicts and adding that warning?
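A minimal sketch of such a check and warning, assuming a probe file can be written to the target directory. This is not necessarily the approach linked above; both function names and the warning text are hypothetical. Case sensitivity is a property of the filesystem rather than the OS, so the probe runs in the directory where files would be saved.

```python
import os
import tempfile
import warnings


def filesystem_is_case_insensitive(directory=None):
    # Create a lower-case probe file, then look it up under an upper-case name.
    # `directory` defaults to the system temp directory when None.
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        probe = os.path.join(tmp, "case_probe.txt")
        with open(probe, "w") as handle:
            handle.write("probe")
        # On a case-insensitive filesystem both names resolve to the same file.
        return os.path.exists(os.path.join(tmp, "CASE_PROBE.TXT"))


def warn_if_case_insensitive(directory=None):
    # Hypothetical hook that a --wildcard code path could call before downloading.
    if filesystem_is_case_insensitive(directory):
        warnings.warn(
            "Destination filesystem is case-insensitive; URLs that differ "
            "only by capitalization may overwrite each other."
        )
        return True
    return False
```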


tomcardoso commented Apr 2, 2025

@jsvine makes sense to me. Do you have a preference on globbing/no globbing? Not sure how annoying that is to implement, but I think it would be valuable. (I'm also not even sure the Archive.org server would support that kind of behaviour easily – I'll need to read the docs.) Re: casing, what was the issue originally? That some OSes are case sensitive and others aren't, so the functionality would have to default to case-insensitive? I think I'd prefer case-insensitivity be the default behaviour anyway, so it's no big deal.

I'll take a stab over the next week or two. Evenings and weekends, you know how it goes!

Owner

jsvine commented Apr 20, 2025

Hiya, and apologies for the delay in answering these questions:

> Do you have a preference on globbing/no globbing? Not sure how annoying that is to implement, but I think it would be valuable. (I'm also not even sure the Archive.org server would support that kind of behaviour easily – I'll need to read the docs.)

If globbing is possible, I agree that it'd be a handy feature.

> Re: casing, what was the issue originally?

The situation in this comment attempts to illustrate it: #8 (comment)

But, in short, the issue is that some OSes don't distinguish between differently-capitalized paths. E.g., PATH/TO/file.txt and path/to/FILE.tXt are considered the same. This becomes a problem if/when a server/website's paths are case-sensitive, and different files live at PATH/TO/file.txt vs. path/to/FILE.tXt.
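The collision described here can be detected ahead of time. The sketch below groups paths by a case-folded key and reports groups with more than one distinct spelling; `find_case_collisions` is a hypothetical helper, and `str.casefold()` only approximates how a real case-insensitive filesystem compares names.

```python
from collections import defaultdict


def find_case_collisions(paths):
    # Group paths that differ only by capitalization.
    groups = defaultdict(set)
    for path in paths:
        groups[path.casefold()].add(path)
    # Keep only groups where two or more distinct spellings would collide.
    return [sorted(group) for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    print(find_case_collisions([
        "PATH/TO/file.txt",
        "path/to/FILE.tXt",
        "other/page.html",
    ]))
```

On a case-insensitive filesystem, saving both colliding paths would leave only one file on disk, which is why a warning (or a renaming scheme) matters for wildcard fetches.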
