-
Notifications
You must be signed in to change notification settings - Fork 910
Add Node.js version selection strategy framework with EOL policy support #5422
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Add Node.js version selection strategy framework with EOL policy support #5422
Conversation
…or node strategy and response data models for each strategy result
…github.com/microsoft/azure-pipelines-agent into users/rishabhmalikMS/NodehandlerStrategies
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
| nameof(EnableEOLNodeVersionPolicy), | ||
| "When enabled, automatically upgrades tasks using end-of-life Node.js versions (6, 10, 16) to supported versions (Node 20.1 or Node 24). Throws error if no supported versions are available on the agent.", | ||
| new PipelineFeatureSource("AGENT_RESTRICT_EOL_NODE_VERSIONS"), | ||
| new EnvironmentKnobSource("AGENT_RESTRICT_EOL_NODE_VERSIONS"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DO we need both sources for toggle to work ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are we planning to do E2E test for this ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One knob is used as a feature flag to enable/disable the feature all together. Another is used to read the value of toggle from UI injected from server side. The same feature flag can be used to enable/disable Ui toggle & agent side functionality to maintain consistency on both ends.
Regarding E2E testing: Planning to use devfabric for updated server-side logic along with updated agent logic to test integrated functionality.
| /// <summary> | ||
| /// True if Node20 has glibc compatibility errors (requires glibc 2.17+). | ||
| /// </summary> | ||
| public bool Node20HasGlibcError { get; set; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO These are node specific and should not be part of the common context I
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense but from implementation perspective. If we remove glibc compatibility flags from NodeContext and try to move inside individual node 24 and 20 strategies, it will cause following issues:
- Duplicate runtime checks for these: these are required in both legacy and new approach as well in both node 24 and node 20 strategies. We would end up having the same runtime checks at 3 places instead of 1.
- Sames checks would run multiple times in both node 24 and node 20 handling. And if we somehow try to reuse by having it at single place we would need to pass them as context to node 24 and node 20 strategies along with legacy code. Which means this will end up outside the individual strategy code boundry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer a clean design here instead of optimizing for implementation. Three checks should not be an issue IMO.
| /// <summary> | ||
| /// The selected node version determined by CanHandle(). | ||
| /// Examples: "node24", "node20_1", "node16", "node10", "node" | ||
| /// </summary> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are output and I assumed taht this class is imput context so we shold not merge these 2 into single object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the proposed fix for this:
Removing 3 fields from NodeContext: SelectedNodeVersion, SelectionReason and SelectionWarning
Since these are used in NoeRunnerInfo, will remove these from NodeContext and reuse from NodeRunnerInfo in strategy code.
As this will require updates in Orchestrator to run the strategies which utilizes results from strategies functions, will incorporate this change in next PR for orchestrator + node handler integration immediately following this PR closure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes for this are in this next PR #5425
src/Agent.Worker/NodeVersionStrategies/IUnifiedNodeVersionStrategy.cs
Outdated
Show resolved
Hide resolved
| /// </summary> | ||
| /// <param name="context">Context with environment, task, and glibc information</param> | ||
| /// <returns>True if this strategy can handle the context, false otherwise</returns> | ||
| bool CanHandle(UnifiedNodeContext context); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Input should be taskcontext and output should NodePathResult.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently canHandler function determines if the corresponding node can be selected or not based on boolean value. We have separate function getNodePath in each strategy which frames the node path based on the node selected. This is done to keep 2 operations separate:
- Whether a node can be used or not
- If it can be used if have getNodePath function which generate the path based on host/container environment and returns the result containing the same
If we aim to return NodePathResult (NodeRunnerInfo now after naming convention updates), this would not allow us to have single responsibilty for the canHandle and getNodePath functions.
| /// Unified context for both host and container node selection. | ||
| /// Contains runtime data - strategies read their own knobs via ExecutionContext. | ||
| /// </summary> | ||
| public sealed class UnifiedNodeContext |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am assuming the node is selcted based on task.json, so this should be TaskContext IMO ? also does it need anything else apart from what is defined in task.json
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TaskContext might be misleading here, as the data model contains details required in node selection process and not general task execution.
Regarding "I am assuming the node is selected based on task.json", yes, node is selected based on the handler we get from task.json processed by task runner and passed to node handler which is used in both existing logic and the new node strategy classes for taking decisions.
|
|
||
| if (eolPolicyEnabled) | ||
| { | ||
| throw new NotSupportedException(StringUtil.Loc("NodeEOLFallbackBlocked", "Node20", "Node16")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are deliberately throwing an exception from the worker process, have we validated this through a pipeline run to confirm whether the worker actually crashes or returns a failure code? My main concern is whether all failure scenarios have been thoroughly tested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Failure scenarios are the ones where we are unable to find any node available, either due to compatibility issues, end of life nodes being disabled. In these, we are throwing not supported exception with relevant message. Currently L0 tests are available. Will be performing E2E tests and actual pipeline run as well for further testing related work.
| string systemType = context.IsContainer ? "container" : "agent"; | ||
| context.SelectedNodeVersion = "node16"; | ||
| context.SelectionReason = $"{baseReason}, fallback to Node16 due to Node20 glibc compatibility issue"; | ||
| context.SelectionWarning = StringUtil.Loc("NodeGlibcFallbackWarning", systemType, "Node20", "Node16"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are using Node 20.1, there are references to both Node20 and Node20_1 in various places. can you please check if we can make these references consistent across all locations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will take this as a part of next PR.
|
@rishabhmalikMS ,can you please check if we can add additional Kusto telemetry to capture the customer’s Node version? This would allow us to diagnose issues more quickly without relying on log checks, improving investigation speed and accuracy. |
Will check on which information would be relevant and helpful to publish in telemetry. Will take this as part of future pull request. Created a work item (https://mseng.visualstudio.com/AzureDevOps/_workitems/edit/2340945) for the same. |
| /// <summary> | ||
| /// The handler data from the task definition. | ||
| /// </summary> | ||
| public BaseNodeHandlerData HandlerData { get; set; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This contains the handler type available to task.
| /// <summary> | ||
| /// Host context for directory lookups. | ||
| /// </summary> | ||
| public IHostContext HostContext { get; set; } | ||
|
|
||
| /// <summary> | ||
| /// Execution context for logging, warnings, and knob reading. | ||
| /// </summary> | ||
| public IExecutionContext ExecutionContext { get; set; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this in NodeContext ? The decision for which node to use is based on Container/OS and task.json, so ideally we should not have anything beyond these in this class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HostContext and ExecutionContext are removed from TaskContext.
| /// Context for both host and container node selection. | ||
| /// Contains runtime data - strategies read their own knobs via ExecutionContext. | ||
| /// </summary> | ||
| public sealed class NodeContext |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it called Node context? this is created per task if I am not wrong, can it be called Task context in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NodeContext is renamed as TaskContext
| /// Explanation of why this version was selected. | ||
| /// Used for debugging and telemetry. | ||
| /// </summary> | ||
| public string Reason { get; set; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make this variable name more relevant?
|
|
||
| public sealed class Node6Strategy : INodeVersionStrategy | ||
| { | ||
| public string Name => "Node6"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion, we should move these string literals to a constants file. This way, we can reuse them consistently, as I see they are being used here as well as in the logs. It would make tracking and maintenance much easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This name field is removed. We are now using strategy.GetType().name instead of this. This was essentially used for logging purposes. No more needed.
| return false; | ||
| } | ||
|
|
||
| private bool DetermineNodeVersionAndSetContext(NodeContext context, bool eolPolicyEnabled, string baseReason) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me know your thoughts on whether we should move this to a common unified location. It is currently declared four times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DetermineNodeVersionAndSetContext will be required in each strategy as it has custom code for each strategy. This cannot be moved to another place
… logic - Passing execution and host context to constructors of strategies - Add GlibcCompatibilityChecker for unified glibc compatibility testing across host and container environments - Introduce GlibcCompatibilityInfo data model to encapsulate Node20/24 glibc compatibility results - Simplified NodeContext (now TaskContext) to provide runtime context (handler data, container info and step target only) to strategies - Support both host glibc testing and container pre-computed glibc flags (NeedsNode20Redirect/NeedsNode16Redirect) - Remove redundant IsContainer field, use Container != null for cleaner logic - Moved GetNodePath to orchestrator as CreateNodeRunnerInfoWithPath
|
|
||
| GlibcCompatibilityInfo glibcInfo = GlibcCompatibilityInfo.Compatible; | ||
|
|
||
| if (context.Container == null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NodeVersionOrchestrator.cs is selector class, responsible for selecting an appropriate strategy. can the creation of glibcInfo moved to strategy concrete class (or a parent class) to have this class single responsibility
| bool node20HasGlibcError = false; | ||
| bool node24HasGlibcError = false; | ||
|
|
||
| if (!useNode20InUnsupportedSystem) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't these checks be reversed first validate Node24 then Node20?
| { | ||
| if (_supportsNode20.HasValue) | ||
| { | ||
| node20HasGlibcError = !_supportsNode20.Value; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where is this __supportsNode20 initialized?
| return DetermineNodeVersionSelection(context, eolPolicyEnabled, "Selected for Node20 task handler", glibcInfo); | ||
| } | ||
|
|
||
| if (eolPolicyEnabled) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't eolpolicy be chedked before hasnode20handler? or it should be first check?
| _strategies.Add(new Node20Strategy()); | ||
| _strategies.Add(new Node16Strategy()); | ||
| _strategies.Add(new Node10Strategy()); | ||
| _strategies.Add(new Node6Strategy()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in my understanding I think from the priority in the strategy , we moved to this order
shouldn't we add a comment here to make sure new node versions are added in proper priority order?
…run using a feature flag knob value - Added test integration to run Test scenarios on both original and strategy code
Context
This PR introduces the Node.js version selection strategy framework as part of modernizing Azure Pipelines Agent's Node.js handling. This is the first implementation step following comprehensive test specifications established in previous work. The strategy pattern provides a foundation for implementing EOL (End-of-Life) Node.js version policies andnode handler selection across both host and container environments.
Related Work: This builds on test specifications for Node Handler behavior and enables future EOL policy enforcement and node selection logic.
Workitem: AB#2339020
Description
Implemented comprehensive strategy pattern for Node.js version selection with the following components:
Core Strategy Framework:
INodeVersionStrategy.cs- Strategy interface definingCanHandle()andGetNodePath()contractsNodeContext.cs- Context model containing environment state, glibc compatibility flags, and handler dataNodePathResult.cs- Result model with node path, version, selection reason, and optional warningsStrategy Implementations:
Key Features:
AGENT_ENABLE_EOL_NODE_VERSION_POLICYRisk Assessment (Low / Medium / High)
Low Risk - This PR introduces only new strategy implementation code without modifying existing NodeHandler logic. The strategies remain dormant until integrated via the feature flag in subsequent PRs. All existing NodeHandler behavior continues unchanged.
Unit Tests Added or Updated (Yes / No)
Additional Testing Performed
List manual or automated tests performed beyond unit tests (e.g., integration, scenario, regression).
Change Behind Feature Flag (Yes / No)
Yes - The strategy framework will be activated via a feature flag in the integration PR (to be shared). This allows for safe rollback and testing between legacy and new strategy approache.
Tech Design / Approach
Documentation Changes Required (Yes/No)
Logging Added/Updated (Yes/No)
Telemetry Added/Updated (Yes/No)
Rollback Scenario and Process (Yes/No)
Dependency Impact Assessed and Regression Tested (Yes/No)