Skip to content

Conversation

@rishabhmalikMS
Copy link
Contributor

@rishabhmalikMS rishabhmalikMS commented Dec 8, 2025

Context

This PR introduces the Node.js version selection strategy framework as part of modernizing Azure Pipelines Agent's Node.js handling. This is the first implementation step following comprehensive test specifications established in previous work. The strategy pattern provides a foundation for implementing EOL (End-of-Life) Node.js version policies andnode handler selection across both host and container environments.

Related Work: This builds on test specifications for Node Handler behavior and enables future EOL policy enforcement and node selection logic.

Workitem: AB#2339020


Description

Implemented comprehensive strategy pattern for Node.js version selection with the following components:

Core Strategy Framework:

  • INodeVersionStrategy.cs - Strategy interface defining CanHandle() and GetNodePath() contracts
  • NodeContext.cs - Context model containing environment state, glibc compatibility flags, and handler data
  • NodePathResult.cs - Result model with node path, version, selection reason, and optional warnings

Strategy Implementations:

  1. Node24Strategy.cs - Latest Node24 support with glibc fallback logic and EOL policy compliance
  2. Node20Strategy.cs - Current default Node20 with glibc error handling
  3. Node16Strategy.cs - Legacy Node16 support with EOL warnings
  4. Node10Strategy.cs - Node10 support for Alpine compatibility with EOL policies
  5. Node6Strategy.cs - Legacy Node6 support with EOL enforcement

Key Features:

  • EOL Policy Support - Configurable blocking of end-of-life Node.js versions via AGENT_ENABLE_EOL_NODE_VERSION_POLICY
  • Glibc Compatibility - Automatic fallback handling for Node20/24 glibc errors (requires glibc 2.17+/2.28+)
  • Container/Host Unification - Single strategy pattern works for both container and host execution
  • Priority-Based Selection - Strategies evaluated in optimal order: Node24 → Node20 → Node16 → Node10 → Node6
  • Comprehensive Logging - Detailed debug output for selection decisions and fallback reasoning

Risk Assessment (Low / Medium / High)

Low Risk - This PR introduces only new strategy implementation code without modifying existing NodeHandler logic. The strategies remain dormant until integrated via the feature flag in subsequent PRs. All existing NodeHandler behavior continues unchanged.


Unit Tests Added or Updated (Yes / No)

  • No unit tests added in this PR. Test infrastructure and comprehensive scenario validation will be implemented in subsequent integration PR to validate equivalence between legacy and strategies.
  • Previous PR #5421 has covered all scenarios for both legacy testing. The same tests will be run on the new implementation once integration is completed to ensure behavior is intact and as per expectations where is it suppoed to differ.
  • Further, additional tests will be added for equivalency and divergence to ensure legacy and new strategy appraoch works as per expectations. These will utilize the exisitng test scenarios we have added as a part of PR: #5421

Additional Testing Performed

List manual or automated tests performed beyond unit tests (e.g., integration, scenario, regression).


Change Behind Feature Flag (Yes / No)

Yes - The strategy framework will be activated via a feature flag in the integration PR (to be shared). This allows for safe rollback and testing between legacy and new strategy approache.


Tech Design / Approach


Documentation Changes Required (Yes/No)


Logging Added/Updated (Yes/No)

  • Yes. Appropriate logs have been added.

Telemetry Added/Updated (Yes/No)

  • Custom telemetry (e.g., counters, timers, error tracking) is added as needed.
  • Events are tagged with proper metadata for filtering and analysis.
  • Telemetry is validated in staging or test environments.

Rollback Scenario and Process (Yes/No)

  • Rollback plan is documented.

Dependency Impact Assessed and Regression Tested (Yes/No)

  • All impacted internal modules, APIs, services, and third-party libraries are analyzed.
  • Results are reviewed and confirmed to not break existing functionality.

…or node strategy and response data models for each strategy result
@rishabhmalikMS rishabhmalikMS changed the title Adding Strategies for node 24 to node 6 | Adding Interfaces created f… Add unified Node.js version selection strategy framework with EOL policy support Dec 8, 2025
@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@microsoft microsoft deleted a comment from azure-pipelines bot Dec 9, 2025
@microsoft microsoft deleted a comment from azure-pipelines bot Dec 9, 2025
@microsoft microsoft deleted a comment from azure-pipelines bot Dec 9, 2025
@microsoft microsoft deleted a comment from azure-pipelines bot Dec 9, 2025
@rishabhmalikMS
Copy link
Contributor Author

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@rishabhmalikMS
Copy link
Contributor Author

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@rishabhmalikMS rishabhmalikMS marked this pull request as ready for review December 9, 2025 05:02
@rishabhmalikMS rishabhmalikMS requested review from a team as code owners December 9, 2025 05:02
nameof(EnableEOLNodeVersionPolicy),
"When enabled, automatically upgrades tasks using end-of-life Node.js versions (6, 10, 16) to supported versions (Node 20.1 or Node 24). Throws error if no supported versions are available on the agent.",
new PipelineFeatureSource("AGENT_RESTRICT_EOL_NODE_VERSIONS"),
new EnvironmentKnobSource("AGENT_RESTRICT_EOL_NODE_VERSIONS"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DO we need both sources for toggle to work ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are we planning to do E2E test for this ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One knob is used as a feature flag to enable/disable the feature all together. Another is used to read the value of toggle from UI injected from server side. The same feature flag can be used to enable/disable Ui toggle & agent side functionality to maintain consistency on both ends.

Regarding E2E testing: Planning to use devfabric for updated server-side logic along with updated agent logic to test integrated functionality.

/// <summary>
/// True if Node20 has glibc compatibility errors (requires glibc 2.17+).
/// </summary>
public bool Node20HasGlibcError { get; set; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO These are node specific and should not be part of the common context I

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense but from implementation perspective. If we remove glibc compatibility flags from NodeContext and try to move inside individual node 24 and 20 strategies, it will cause following issues:

  1. Duplicate runtime checks for these: these are required in both legacy and new approach as well in both node 24 and node 20 strategies. We would end up having the same runtime checks at 3 places instead of 1.
  2. Sames checks would run multiple times in both node 24 and node 20 handling. And if we somehow try to reuse by having it at single place we would need to pass them as context to node 24 and node 20 strategies along with legacy code. Which means this will end up outside the individual strategy code boundry.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer a clean design here instead of optimizing for implementation. Three checks should not be an issue IMO.

Comment on lines 66 to 69
/// <summary>
/// The selected node version determined by CanHandle().
/// Examples: "node24", "node20_1", "node16", "node10", "node"
/// </summary>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are output and I assumed taht this class is imput context so we shold not merge these 2 into single object.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the proposed fix for this:
Removing 3 fields from NodeContext: SelectedNodeVersion, SelectionReason and SelectionWarning
Since these are used in NoeRunnerInfo, will remove these from NodeContext and reuse from NodeRunnerInfo in strategy code.

As this will require updates in Orchestrator to run the strategies which utilizes results from strategies functions, will incorporate this change in next PR for orchestrator + node handler integration immediately following this PR closure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes for this are in this next PR #5425

/// </summary>
/// <param name="context">Context with environment, task, and glibc information</param>
/// <returns>True if this strategy can handle the context, false otherwise</returns>
bool CanHandle(UnifiedNodeContext context);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Input should be taskcontext and output should NodePathResult.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently canHandler function determines if the corresponding node can be selected or not based on boolean value. We have separate function getNodePath in each strategy which frames the node path based on the node selected. This is done to keep 2 operations separate:

  1. Whether a node can be used or not
  2. If it can be used if have getNodePath function which generate the path based on host/container environment and returns the result containing the same

If we aim to return NodePathResult (NodeRunnerInfo now after naming convention updates), this would not allow us to have single responsibilty for the canHandle and getNodePath functions.

/// Unified context for both host and container node selection.
/// Contains runtime data - strategies read their own knobs via ExecutionContext.
/// </summary>
public sealed class UnifiedNodeContext
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am assuming the node is selcted based on task.json, so this should be TaskContext IMO ? also does it need anything else apart from what is defined in task.json

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TaskContext might be misleading here, as the data model contains details required in node selection process and not general task execution.
Regarding "I am assuming the node is selected based on task.json", yes, node is selected based on the handler we get from task.json processed by task runner and passed to node handler which is used in both existing logic and the new node strategy classes for taking decisions.


if (eolPolicyEnabled)
{
throw new NotSupportedException(StringUtil.Loc("NodeEOLFallbackBlocked", "Node20", "Node16"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are deliberately throwing an exception from the worker process, have we validated this through a pipeline run to confirm whether the worker actually crashes or returns a failure code? My main concern is whether all failure scenarios have been thoroughly tested.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failure scenarios are the ones where we are unable to find any node available, either due to compatibility issues, end of life nodes being disabled. In these, we are throwing not supported exception with relevant message. Currently L0 tests are available. Will be performing E2E tests and actual pipeline run as well for further testing related work.

string systemType = context.IsContainer ? "container" : "agent";
context.SelectedNodeVersion = "node16";
context.SelectionReason = $"{baseReason}, fallback to Node16 due to Node20 glibc compatibility issue";
context.SelectionWarning = StringUtil.Loc("NodeGlibcFallbackWarning", systemType, "Node20", "Node16");
Copy link
Contributor

@raujaiswal raujaiswal Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are using Node 20.1, there are references to both Node20 and Node20_1 in various places. can you please check if we can make these references consistent across all locations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will take this as a part of next PR.

@raujaiswal
Copy link
Contributor

raujaiswal commented Dec 10, 2025

@rishabhmalikMS ,can you please check if we can add additional Kusto telemetry to capture the customer’s Node version? This would allow us to diagnose issues more quickly without relying on log checks, improving investigation speed and accuracy.

@rishabhmalikMS
Copy link
Contributor Author

@rishabhmalikMS ,can you please check if we can add additional Kusto telemetry to capture the customer’s Node version? This would allow us to diagnose issues more quickly without relying on log checks, improving investigation speed and accuracy.

Will check on which information would be relevant and helpful to publish in telemetry. Will take this as part of future pull request. Created a work item (https://mseng.visualstudio.com/AzureDevOps/_workitems/edit/2340945) for the same.

@rishabhmalikMS rishabhmalikMS changed the title Add unified Node.js version selection strategy framework with EOL policy support Add Node.js version selection strategy framework with EOL policy support Dec 10, 2025
/// <summary>
/// The handler data from the task definition.
/// </summary>
public BaseNodeHandlerData HandlerData { get; set; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This contains the handler type available to task.

Comment on lines 51 to 59
/// <summary>
/// Host context for directory lookups.
/// </summary>
public IHostContext HostContext { get; set; }

/// <summary>
/// Execution context for logging, warnings, and knob reading.
/// </summary>
public IExecutionContext ExecutionContext { get; set; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this in NodeContext ? The decision for which node to use is based on Container/OS and task.json, so ideally we should not have anything beyond these in this class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HostContext and ExecutionContext are removed from TaskContext.

/// Context for both host and container node selection.
/// Contains runtime data - strategies read their own knobs via ExecutionContext.
/// </summary>
public sealed class NodeContext
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it called Node context? this is created per task if I am not wrong, can it be called Task context in that case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NodeContext is renamed as TaskContext

/// Explanation of why this version was selected.
/// Used for debugging and telemetry.
/// </summary>
public string Reason { get; set; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this variable name more relevant?


public sealed class Node6Strategy : INodeVersionStrategy
{
public string Name => "Node6";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, we should move these string literals to a constants file. This way, we can reuse them consistently, as I see they are being used here as well as in the logs. It would make tracking and maintenance much easier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name field is removed. We are now using strategy.GetType().name instead of this. This was essentially used for logging purposes. No more needed.

return false;
}

private bool DetermineNodeVersionAndSetContext(NodeContext context, bool eolPolicyEnabled, string baseReason)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know your thoughts on whether we should move this to a common unified location. It is currently declared four times.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DetermineNodeVersionAndSetContext will be required in each strategy as it has custom code for each strategy. This cannot be moved to another place

… logic

- Passing execution and host context to constructors of strategies
- Add GlibcCompatibilityChecker for unified glibc compatibility testing across host and container environments
- Introduce GlibcCompatibilityInfo data model to encapsulate Node20/24 glibc compatibility results
- Simplified NodeContext (now TaskContext) to provide runtime context (handler data, container info and step target only) to strategies
- Support both host glibc testing and container pre-computed glibc flags (NeedsNode20Redirect/NeedsNode16Redirect)
- Remove redundant IsContainer field, use Container != null for cleaner logic
- Moved GetNodePath to orchestrator as CreateNodeRunnerInfoWithPath

GlibcCompatibilityInfo glibcInfo = GlibcCompatibilityInfo.Compatible;

if (context.Container == null)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NodeVersionOrchestrator.cs is selector class, responsible for selecting an appropriate strategy. can the creation of glibcInfo moved to strategy concrete class (or a parent class) to have this class single responsibility

bool node20HasGlibcError = false;
bool node24HasGlibcError = false;

if (!useNode20InUnsupportedSystem)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't these checks be reversed first validate Node24 then Node20?

{
if (_supportsNode20.HasValue)
{
node20HasGlibcError = !_supportsNode20.Value;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is this __supportsNode20 initialized?

return DetermineNodeVersionSelection(context, eolPolicyEnabled, "Selected for Node20 task handler", glibcInfo);
}

if (eolPolicyEnabled)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't eolpolicy be chedked before hasnode20handler? or it should be first check?

_strategies.Add(new Node20Strategy());
_strategies.Add(new Node16Strategy());
_strategies.Add(new Node10Strategy());
_strategies.Add(new Node6Strategy());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in my understanding I think from the priority in the strategy , we moved to this order
shouldn't we add a comment here to make sure new node versions are added in proper priority order?

…run using a feature flag knob value

- Added test integration to run Test scenarios on both original and strategy code
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants