Skip to content

Conversation

absurdlylongusername
Copy link
Member

@absurdlylongusername absurdlylongusername commented Jul 8, 2025

  • I carefully read the contribution guidelines and agree to them.
  • I have tested the API against NewPipe.
  • I agree to create a pull request for NewPipe as soon as possible to make it compatible with the changed API.

Validate HTTP response codes for network requests throughout the extractor

EDIT: Repurposing this PR to be the first of a series of PRs that will refactor the extractor codebase to add HTTP response checks for all network requests to improve logging and error handling/reporting.

Almsot all code that uses the Downloader does not do any checking on the response code: therefore errors and exceptions thrown due to error response code give false information (e.g. tests failing due to 404 were giving error about clientId when the error was 404).

Based on discussions you can read below, I have initially added some methods in Response.java for validating response codes and used it in places for soundcloud extraction. It will throw exception with information about the error code, which will propagate upwards to the call site which will print a more useful stack trace for debugging.

As we discuss more on how we want to design and refactor this, further changes will be made in other places

@absurdlylongusername absurdlylongusername changed the title Soundcloud validate http response [SoundCloud] Validate HTTP response codes Jul 8, 2025
@TobiGr TobiGr added bug Issue is related to a bug soundcloud service, https://soundcloud.com/ tests Issues and PR related to unit tests labels Jul 8, 2025
Copy link
Contributor

@TobiGr TobiGr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thaank you. Changes LGTM, but conflicts need to be resolved first.

Copy link
Member

@litetex litetex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this can be merged in the current state:

  • There are Todo comments
  • The changes in the test will be irrelevant after #1332
  • Response codes that are ok are 2XX response codes
  • I'm really not sure about another HttpUtils method and a new Exception class

I think the most simple approach would be to have a boolean Response#isOk() method and then use this method to control the underlying use cases accordingly.

@absurdlylongusername
Copy link
Member Author

@litetex

  • There are Todo comments

The comment was kind of intentionally left in there in the hopes that whoever reviewed it would provide an explanation and then I would remove it.

  • Response codes that are ok are 2XX response codes

Not exactly sure what you mean by this? Do you mean that 2xx is the only "success" code range all network calls throughout the codebase?

  • I'm really not sure about another HttpUtils method and a new Exception class
    I think the most simple approach would be to have a boolean Response#isOk() method and then use this method to control the underlying use cases accordingly.

Do you mean another as in having two validateResponseCode methods, one in Response and one in HttpUtils, or do you mean that there's another HttpUtils class somewhere (because I can't seem to find one)?

And does this also mean we shouldn't have a HttpResponseException? If so, why? I think having a specific exception for Http errors is better than throwing a generic IOException, and also makes for better logging and diagnostics, because right now our code does not give any useful information at all for what causes an error if the cause is an invalid response.


My idea was to have a generic method we can use to check for expected response codes instead of just checking it's 2xx or not 2xx, because I was of the assumption that:

  1. We are definitely not validating response codes anywhere in the codebase (I can't see anywhere we do this)
  2. There may be scenarios where we want to treat a specific 3xx or 4xx as not an error and it's not efficient to catch an exception in these cases and then check the response code (and then throw another exception with info about the http error that will get propagated upwards), as I believe we should never throw exception if we don't have to, and exceptions should only be thrown for unexpected behaviour, or in the same scenarios we already throw exceptions.

The end goal is that if a network call ever fails, we want to know why, using code that is:

  • As small as possible
  • Easy to understand
  • Easy to use
  • Easy to extend

If there's only one place in code where non-2xx codes shouldn't be considered an error then we could handle that case explicitly at that call site, but if it's at least 3 times then I think we should have more than a simple isOk method.
Could perhaps have other methods like Response#isRedirect for 3xx, Response#isClientError for 4xx, Response#isServerError instead of a single generic method.

Since I am almost certain 1. is true, I am more than happy to trawl through the entire codebase for every http call we make (not jusr for get, but for post and head too) and add response code validation and see which approach makes the most sense.

I could update this PR with these changes and/or split it into multiple PRs if it is too big.

Thoughts?

Copy link
Member

@Stypox Stypox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! We do need to check response codes more often. For example, PeerTube sometimes throws 429 Too Many Requests (TeamNewPipe/NewPipe#2979), but that shouldn't be treated as a recaptcha by the Downloader (while that's what happens now). It should be the PeerTube service's job to throw a normal ExtractorException for 429s, and instead the YouTube service's job to throw a RecaptchaException (along with a proper URL that users can use to solve the captcha, instead of having the Downloader guess that URL).

It would be good if you could take the above paragraph into consideration for this PR, and properly validate the HTTP response codes of all requests made by all services in the extractor, so we don't have to change the design once again later.


return null;
}
// CHECKSTYLE:OFF
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why checkstyle off?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, checkstyle was complaining about line being too long in the comment below, so I had to disable it

Copy link
Member

@Stypox Stypox Oct 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really important, but why don't you make the lines shorter then? Here you don't have long URLs

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

@absurdlylongusername absurdlylongusername force-pushed the soundcloud-validate-http-response branch from 5a3f458 to ae4e8fc Compare October 5, 2025 17:34
* or when {@code validResponseCodes} is empty and the code is a 4xx or 5xx error.
*/
// CHECKSTYLE:ON
public static void validateResponseCode(final Response response,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not inline this directly within the non-static Response::validateResponseCode? This is basically a method on the Response class since it takes Response as the first argument, so you might as well make it non-static.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intention was to allow for response.validateResponseCode and validateResponseCode(response) depending on the format of the call site.

But in hindsight, I now recall that in C# (my main language) there is EnsureSuccessStatusCode which can be used on the response itself, rather than a separate method.

So yeah I think you're right, validation can just be done inline like how you have mentioned in your other comment

* or when {@code validResponseCodes} is empty and the code is a 4xx or 5xx error.
*/
// CHECKSTYLE:ON
public Response validateResponseCode(final int... validResponseCodes)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to support the needs of other services (like youtube), I'd add the following methods:

public Response throwIfResponseCode(int errorCode, final Function<Response, HttpResponseException> errorSupplier);
public Response throwIfResponseCodeInRange(final int errorCodeMin, final int errorCodeMax, final Function<Response, HttpResponseException> errorSupplier);

I'd also rename this method to throwIfInvalidResponseCode()

Then e.g. for YouTube we'd be able to do:

response
    // "https://www.youtube.com" is the URL to solve the captcha
    .throwIfResponseCode(429, r -> new ReCaptchaException(r, "https://www.youtube.com"))
    .throwIfInvalidResponseCode()

While in PeerTube

response
    .throwIfResponseCode(429, r -> new TooManyRequestsException(r))
    .throwIfInvalidResponseCode()

And we'd remove the ReCaptchaException checks in the Downloader. Also we'd have ReCaptchaException and TooManyRequestsException extend HttpResponseException.

What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this.

However, I think throwIfInvalidResponseCode should be called ensureValidResponseCode or ensureSuccessResponseCode if it's going to be the same as validateResponseCode which takes in a list of valid response codes. Consider these:

  • throwIfInvalidResponseCode(200, 201)
  • ensureValidResponseCode(200, 201)

Reading the second, it is easier to understand that 200 and 201 are the valid response codes, but one could presume with the first that 200 and 201 are invalid response codes and it should throw if one of those are returned.

Thoughts?

Copy link
Member

@Stypox Stypox Oct 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ensureValidResponseCode is a better name indeed

@absurdlylongusername
Copy link
Member Author

Thanks! We do need to check response codes more often. For example, PeerTube sometimes throws 429 Too Many Requests (TeamNewPipe/NewPipe#2979), but that shouldn't be treated as a recaptcha by the Downloader (while that's what happens now). It should be the PeerTube service's job to throw a normal ExtractorException for 429s, and instead the YouTube service's job to throw a RecaptchaException (along with a proper URL that users can use to solve the captcha, instead of having the Downloader guess that URL).

It would be good if you could take the above paragraph into consideration for this PR, and properly validate the HTTP response codes of all requests made by all services in the extractor, so we don't have to change the design once again later.

@Stypox Some problems.

In BandcampChannelLinkHandlerFactory.getId if I want to add HTTP response code validation, then this means that I need to handle the HttpResponseException.

Now, LinkHandlerFactory.getId and other base methods all throw ParsingException, but not ExtractionException. getId in the bandcamp class catches IOException | ReCaptchaException | ArrayIndexOutOfBoundsException | JsonParserException and swallows them all into a ParsingException.

Semantically, only JsonParserException is a parsing exception, and I don't know the rationale behind catching ArrayIndexOutOfBoundsException.

Now, what exactly is the difference between an ExtractionException and a ParsingException?

LinkHandlerFactory and pretty much all the other abstract base class extractors and similar classes all throw ParsingException, and they do that by swallowing up all other exceptions and wrapping it in a ParsingException.

If we add http response validation throughout the extractor, should the HttpResponseException be swallowed and wrapped in a ParsingException? Because as far as I can see that is what happens throughout the codebase regardless of the type of exception thrown when we make network requests.

How exactly do we want to do this, so that we can have correct semantic exceptions?

@absurdlylongusername absurdlylongusername changed the title [SoundCloud] Validate HTTP response codes Validate HTTP response codes across the extractor Oct 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Issue is related to a bug soundcloud service, https://soundcloud.com/ tests Issues and PR related to unit tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants