@mesmorkalov

Issue description:

The 'DONT_CARE' locality enforcement option is ignored when a client specifies it in a resource request. The cause is that, after YARN allocates the resources, Llama matches them against the request queue, and the matching pattern includes node names.

Patch description:

Note: Addresses only the configuration with both resource caching and normalization disabled.

  1. Introduce an internal collection for ‘DONT_CARE’-attributed requests that have not been granted yet.
  2. After the resources allocated by YARN have been ‘strongly’ matched against the existing requests, match the remaining resources against the new collection in a ‘weak’ fashion (only by number of vCPUs and amount of memory).
  3. Matched resources are taken by Llama and removed from the new collection, while the rest are returned to the YARN pool.
  4. Fix the inconsistency between the Llama source code and the Llama config in the name of the property responsible for resource caching (llama.am.caching.enabled vs llama.am.cache.enabled); currently, changes to the property are ignored because of this mismatch.
  5. Add a new config property (llama.am.resource.normalizing.enabled.#QUEUE#) that makes it possible to disable normalization for a particular queue (rather than disabling normalization globally for all queues).
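
Steps 1 to 3 can be illustrated with a minimal, self-contained sketch. It is not the code from this patch: the class and field names (WeakMatchSketch, Request, Allocation, Result) are hypothetical, and the pending list is filtered by locality instead of keeping a separate 'DONT_CARE' collection, which is a simplification.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical, simplified model; none of these types are Llama's real classes.
final class WeakMatchSketch {
    enum Locality { MUST, PREFERRED, DONT_CARE }

    static final class Request {
        final String id;
        final String node;
        final int vcores;
        final int memoryMb;
        final Locality locality;

        Request(String id, String node, int vcores, int memoryMb, Locality locality) {
            this.id = id;
            this.node = node;
            this.vcores = vcores;
            this.memoryMb = memoryMb;
            this.locality = locality;
        }
    }

    static final class Allocation {
        final String node;
        final int vcores;
        final int memoryMb;

        Allocation(String node, int vcores, int memoryMb) {
            this.node = node;
            this.vcores = vcores;
            this.memoryMb = memoryMb;
        }
    }

    static final class Result {
        final List<String> granted = new ArrayList<>();            // entries look like "requestId@node"
        final List<Allocation> returnedToYarn = new ArrayList<>(); // allocations nobody claimed
    }

    static Result match(List<Request> pending, List<Allocation> allocated) {
        List<Request> open = new ArrayList<>(pending);
        List<Allocation> leftovers = new ArrayList<>();
        Result result = new Result();

        // Pass 1: 'strong' match on node name, vcores and memory.
        for (Allocation a : allocated) {
            Request hit = take(open, a, true);
            if (hit != null) {
                result.granted.add(hit.id + "@" + a.node);
            } else {
                leftovers.add(a);
            }
        }

        // Pass 2: 'weak' match on vcores and memory only, DONT_CARE requests only.
        for (Allocation a : leftovers) {
            Request hit = take(open, a, false);
            if (hit != null) {
                result.granted.add(hit.id + "@" + a.node);
            } else {
                result.returnedToYarn.add(a);
            }
        }
        return result;
    }

    // Removes and returns the first open request that matches, or null if none does.
    private static Request take(List<Request> open, Allocation a, boolean strong) {
        for (Iterator<Request> it = open.iterator(); it.hasNext();) {
            Request r = it.next();
            boolean sameSize = r.vcores == a.vcores && r.memoryMb == a.memoryMb;
            boolean matches = strong
                    ? sameSize && r.node.equals(a.node)
                    : sameSize && r.locality == Locality.DONT_CARE;
            if (matches) {
                it.remove();
                return r;
            }
        }
        return null;
    }
}
```

In this sketch, two 'DONT_CARE' requests naming the same node are both granted when YARN hands back one allocation on that node and one of the same size elsewhere; an allocation of a different size is returned.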

Signed-off-by: Mikhail Smorkalov [email protected]


  public static final String CACHING_ENABLED_KEY =
-     PREFIX_KEY + "caching.enabled";
+     PREFIX_KEY + "cache.enabled";
Contributor

Thanks for updating this. Can we also update the reference to "caching.enabled" in llama-site.xml as well?

Author

Done

@kambatla
Contributor

kambatla commented Apr 6, 2016

Can we add unit tests to verify this gives us the desired behavior? If that turns out to be hard, it would be nice to know why. Also, it would be good to hear about any manual validation.

@mesmorkalov
Author

The problem with an additional unit test is that the existing test infrastructure seems to lack the ability to generate different names for the fake nodes. When I create a miniYarn configuration with several nodes, all of them have the same name (the name of the host where I am running the tests). Modifying the test infrastructure does not seem trivial and may take significantly more effort than fixing the actual issue, so your ideas on how to overcome this are very much appreciated.
As for the manual validation, I've run the following test on a real cluster:

  • disable resource caching and normalization
  • two requests are sent by the Llama client to LlamaAM with the same node name specified as the 'location' and 'DONT_CARE' specified as the 'locality'.
  • both requests are granted, and the second one gets its resources from a node different from the originally specified one.

I have also checked that the fix doesn't affect the default scenario with resource caching and normalization enabled (in this case we simply don't have requests with DONT_CARE or PREFERRED locality, since MUST is hard-coded when creating a normalized request).
Please let me know if I missed some scenarios which should be taken into account.
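
For reference, the per-queue normalization switch from the patch description could be resolved roughly as in the sketch below. This is an illustration only, not the patch's code: the class name is hypothetical, a plain Map stands in for the real configuration object, and both the global key (llama.am.resource.normalizing.enabled) and the fallback order (per-queue value, then global value, then enabled by default) are assumptions.

```java
import java.util.Map;

// Hypothetical helper; a plain Map stands in for the real configuration object.
final class NormalizingFlagSketch {
    // Assumed global key; the per-queue key appends "." + queue name,
    // i.e. the #QUEUE# placeholder from the patch description.
    static final String NORMALIZING_ENABLED_KEY = "llama.am.resource.normalizing.enabled";

    static boolean isNormalizingEnabled(Map<String, String> conf, String queue) {
        // A per-queue setting, when present, wins over the global one.
        String perQueue = conf.get(NORMALIZING_ENABLED_KEY + "." + queue);
        if (perQueue != null) {
            return Boolean.parseBoolean(perQueue);
        }
        // Otherwise fall back to the global setting; assumed enabled when unset.
        String global = conf.get(NORMALIZING_ENABLED_KEY);
        return global == null || Boolean.parseBoolean(global);
    }
}
```

With such a lookup, setting the per-queue key to false for one queue leaves normalization untouched for every other queue.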