@jsanchez07

ASO-33

This audit was producing many errors in the logs, most of them 403 responses returned when the audit tried to fetch pages.

  • Implemented rate limiting for page fetches
  • Set a user agent to avoid bot detection
  • Configured VirtualConsole to suppress CSS parsing errors from JSDOM
  • Enhanced logging

jsanchez_adobe and others added 3 commits October 15, 2025 16:43
- Fetch the pages through a rate limiter at 10 pages per second instead of fetching all 200 at once
- Suppress JSDOM console errors, which are not useful
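
A minimal sketch of the rate limiting described in these commits, assuming a simple batch-per-interval approach. The helper name `fetchInBatches`, its options, and the injected `fetchPage` callback are hypothetical and are not taken from the PR; the actual implementation may differ.

```javascript
// Hypothetical helper: names and defaults are illustrative, not from the PR.
const sleep = (ms) => new Promise((resolve) => { setTimeout(resolve, ms); });

async function fetchInBatches(urls, fetchPage, { batchSize = 10, intervalMs = 1000 } = {}) {
  const results = [];
  for (let i = 0; i < urls.length; i += batchSize) {
    const batch = urls.slice(i, i + batchSize);
    const startedAt = Date.now();
    // allSettled keeps one failing page (for example a 403) from aborting the batch.
    const settled = await Promise.allSettled(batch.map(async (url) => fetchPage(url)));
    results.push(...settled);
    const elapsed = Date.now() - startedAt;
    const isLastBatch = i + batchSize >= urls.length;
    // Wait out the rest of the interval so at most `batchSize` requests start per interval.
    if (!isLastBatch && elapsed < intervalMs) {
      await sleep(intervalMs - elapsed);
    }
  }
  return results;
}
```

With the defaults, 200 URLs are requested in 20 batches of 10, one batch per second. Each entry in the result is a settled-promise record, so callers can count rejected pages without a try/catch per URL.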
@github-actions

This PR will trigger no release when merged.

@iuliag left a comment

URL deduplication

Why is URL deduplication necessary? Please explain this and document it in the PR description or, if it makes more sense, in the import worker where the top pages are actually imported and stored in DynamoDB, or in the data access layer.

User agent
spacecat-shared already provides functionality for this. It's important to reuse a single user agent, so that we don't have to ask customers to allowlist n variants of the same user agent; it also makes issues easier to debug in the future.

Using Cursor in a workspace where you have all spacecat repositories checked out works quite well for finding existing reusable constructs or patterns in the codebase:

export const SPACECAT_USER_AGENT = 'Mozilla/5.0 (Linux; Android 11; moto g power (2022)) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Mobile Safari/537.36 Spacecat/1.0';
function getSpacecatRequestHeaders() {
  return {
    Accept: 'text/html,application/xhtml+xml,application/xml,text/css,application/javascript,text/javascript;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Cache-Control': 'no-cache',
    Pragma: 'no-cache',
    Referer: 'https://www.adobe.com/',
    'User-Agent': SPACECAT_USER_AGENT,
  };
}

The tracingFetch function (imported as fetch) automatically adds the SPACECAT_USER_AGENT if no user agent is provided:

// find user-agent header in headers case insensitively
let hasUserAgent = false;
Object.keys(options.headers).forEach((key) => {
  if (key.toLowerCase() === 'user-agent') {
    hasUserAgent = true;
  }
});

if (!hasUserAgent) {
  options.headers['User-Agent'] = SPACECAT_USER_AGENT;
}
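
To see the effect of that check in isolation, here is a small sketch that wraps the same case-insensitive lookup in a pure function. The name `withDefaultUserAgent` is hypothetical and not part of spacecat-shared; only the user agent string is copied from the constant above.

```javascript
// Copied from the snippet above; everything else in this block is illustrative.
const SPACECAT_USER_AGENT = 'Mozilla/5.0 (Linux; Android 11; moto g power (2022)) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Mobile Safari/537.36 Spacecat/1.0';

// Hypothetical helper: returns a copy of `headers` with the default user agent
// added only when the caller has not supplied one under any casing.
function withDefaultUserAgent(headers = {}) {
  const hasUserAgent = Object.keys(headers)
    .some((key) => key.toLowerCase() === 'user-agent');
  return hasUserAgent
    ? { ...headers }
    : { ...headers, 'User-Agent': SPACECAT_USER_AGENT };
}
```

A caller that passes `{ 'user-agent': 'custom' }` keeps its own value, which is why setting a separate user agent in the audit would bypass the shared default rather than complement it.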

Logging
Before merging, please review the log statements and lower to debug level the ones that are not actually critical in production.
Overly verbose logging will cause us to hit our Coralogix quota.
For example: https://github.com/adobe/spacecat-audit-worker/pull/1435/files#diff-6427df96e5f667a262860db38ab879781599ad1ba254a7c459edcfb4023782a6R67
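
One way this could look in practice, as a rough sketch: per-page detail goes to debug, and each run emits a single summary line at info. The function `checkPages`, the `checkPage` callback, and the choice of levels are assumptions for illustration; `log` is any logger exposing `debug`, `info`, and `warn` (the built-in `console` works).

```javascript
// Hypothetical audit step; names and log levels are illustrative only.
async function checkPages(urls, checkPage, log) {
  let failures = 0;
  for (const url of urls) {
    try {
      await checkPage(url);
      // Per-page detail is only useful while debugging, so it stays at debug level.
      log.debug(`Checked ${url}`);
    } catch (error) {
      failures += 1;
      log.warn(`Failed to check ${url}: ${error.message}`);
    }
  }
  // One summary line per run is enough at info level.
  log.info(`Checked ${urls.length} pages, ${failures} failed`);
  return failures;
}
```

For a 200-page run with no failures this writes one info line instead of 200, which keeps the volume sent to the log backend roughly constant per audit.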

@iuliag changed the title from "Aso 33 hreflang 403 errors" to "fix: Fix hreflang 403 errors (ASO-33)" on Oct 23, 2025