-
Notifications
You must be signed in to change notification settings - Fork 13
fix: Fix hreflang 403 errors (ASO-33) #1435
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- I'm fetching the pages with a rate limiter of 10 pages per second instead of all 200 - I'm suppressing console errors for JSDOM errors which are not useful
|
This PR will trigger no release when merged. |
…/spacecat-audit-worker into ASO-33-hreflang-403-errors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
URL deduplication
Why is URL deduplication necessary? Please detail this and document it in the PR description or, if it makes more sense, in the import worker where the top pages are actually imported and stored in DynamoDB, or in the data access layer.
User agent
There's existing functionality to use available in spacecat-shared. It's important to reuse a single user agent, so that we don't have to ask customers to allowlist n variants of the same user agent, and it also makes it easier to debug issues in the future.
Using Cursor in a workspace where you have all spacecat repositories checked out works quite decently to search for existing reusable constructs or patterns in the codebase:
- SPACECAT_USER_AGENT Constant
Location:@adobe/spacecat-shared-utils/src/tracing-fetch.js:19
export const SPACECAT_USER_AGENT = 'Mozilla/5.0 (Linux; Android 11; moto g power (2022)) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Mobile Safari/537.36 Spacecat/1.0';- getSpacecatRequestHeaders() Function
Location:@adobe/spacecat-shared-utils/src/url-helpers.js:124-133
function getSpacecatRequestHeaders() {
return {
Accept: 'text/html,application/xhtml+xml,application/xml,text/css,application/javascript,text/javascript;q=0.9,image/avif,image/webp,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'Cache-Control': 'no-cache',
Pragma: 'no-cache',
Referer: 'https://www.adobe.com/',
'User-Agent': SPACECAT_USER_AGENT,
};
}- tracingFetch Automatic User Agent
Location:@adobe/spacecat-shared-utils/src/tracing-fetch.js:164-174
The tracingFetch function (imported as fetch) automatically adds the SPACECAT_USER_AGENT if no user agent is provided:
// find user-agent header in headers case insensitively
let hasUserAgent = false;
Object.keys(options.headers).forEach((key) => {
if (key.toLowerCase() === 'user-agent') {
hasUserAgent = true;
}
});
if (!hasUserAgent) {
options.headers['User-Agent'] = SPACECAT_USER_AGENT;
}Logging
At the end of development and before merging, please revise log statements and lower to debug the ones that are not actually critical for production.
Too verbose logging will lead to hitting our Coralogix quota.
An e.g. https://github.com/adobe/spacecat-audit-worker/pull/1435/files#diff-6427df96e5f667a262860db38ab879781599ad1ba254a7c459edcfb4023782a6R67
- called the spacecat-shared user agent - removed the deduplication code - made logs less verbose
ASO-33
We were getting a lot of errors in the logs with this audit, most of them being 403 errors when the audit would try to fetch the pages.