Use shorter hashes with CPM_SOURCE_CACHE #624

Open
codeinred opened this issue Dec 9, 2024 · 3 comments

@codeinred

Some systems (e.g., Windows) limit the length of a pathname. If a project has multiple levels of nested directories, it can be easy to hit those limits, and having a 40-character hash in the pathname contributes to this problem.

Shorter hashes do increase the chance of a collision, so I propose the following solution:

  1. Take an input hash
  2. Take gradually longer substrings until we've identified the shortest substring that uniquely identifies the hash (e.g., trying a substring of length 4, then 8, then 12, and so on)

In most cases, we can use a much shorter hash (typically 4 characters). Occasionally, a longer hash (8 or 12 characters) may be used. Very rarely will anything longer be used.
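
As a rough sanity check on how rarely a longer prefix should be needed, a birthday-style estimate can be sketched. It assumes that SHA1 prefixes behave like uniformly random hex strings and that n distinct origins of one package share a cache directory; both n and the sample values below are illustrative assumptions, not figures from the proposal.

```latex
% Approximate probability that at least two of n hashes share
% the same prefix of k hexadecimal characters (valid when n^2 << 16^k)
\[
  P_{\text{collision}}(n, k) \approx \frac{n(n-1)}{2 \cdot 16^{k}}
\]

% Illustrative values for n = 10 cached origins of one package
\[
  P_{\text{collision}}(10, 4) \approx \frac{45}{65536} \approx 6.9 \times 10^{-4},
  \qquad
  P_{\text{collision}}(10, 8) \approx \frac{45}{4294967296} \approx 1.0 \times 10^{-8}
\]
```

Under these assumptions a 4-character prefix is enough in the vast majority of cases, and falling back to 8 characters makes a further collision extremely unlikely.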

In the event of multiple collisions, where the pathname becomes too long, the user can clear the cache directory and try again (this ensures that the version they're building with is assigned the shortest hash).

Code

We can do this with the following CMake:

# Find the shortest hash that can be used
# eg, if origin_hash is cccb77ae9609d2768ed80dd42cec54f77b1f1455
# the following files will be checked, until one is found that
# is either empty (allowing us to assign origin_hash), or whose contents match
# ${origin_hash}
#
# - .../cccb.hash
# - .../cccb77ae.hash
# - .../cccb77ae9609.hash
# - .../cccb77ae9609d276.hash
# etc
# We will be able to use a shorter path with very high probability, but in the
# (rare) event that the first couple characters collide, we will check
# longer and longer substrings.
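function(cpm_get_shortest_hash source_cache_dir origin_hash short_hash_output_var)
  # Try prefixes of length 4, 8, ..., 40 (the full SHA1 hash)
  foreach(len RANGE 4 40 4)
    string(SUBSTRING "${origin_hash}" 0 ${len} short_hash)
    set(hash_lock "${source_cache_dir}/${short_hash}.lock")
    set(hash_fp "${source_cache_dir}/${short_hash}.hash")

    # Take a lock so concurrent CMake runs cannot claim the same prefix.
    # GUARD FUNCTION also releases it if we leave the function on an error.
    file(LOCK "${hash_lock}" GUARD FUNCTION)

    # Load the contents of .../${short_hash}.hash, creating it if needed
    file(TOUCH "${hash_fp}")
    file(READ "${hash_fp}" hash_fp_contents)

    if(hash_fp_contents STREQUAL "")
      # The prefix is unclaimed: record the full hash that now owns it
      file(WRITE "${hash_fp}" "${origin_hash}")
      file(LOCK "${hash_lock}" RELEASE)
      break()
    elseif(hash_fp_contents STREQUAL origin_hash)
      # The prefix already belongs to this hash
      file(LOCK "${hash_lock}" RELEASE)
      break()
    else()
      # The prefix belongs to a different hash: try a longer one
      file(LOCK "${hash_lock}" RELEASE)
    endif()
  endforeach()
  set(${short_hash_output_var} "${short_hash}" PARENT_SCOPE)
endfunction()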
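
To make the collision handling concrete, here is a small usage sketch. It assumes `cpm_get_shortest_hash` from above is already defined in the same file and is run in script mode (`cmake -P`). The scratch directory name and the second hash are made up for illustration: the second hash is not a real SHA1 digest, only a 40-character string chosen to share its first four characters with the first.

```cmake
# Hypothetical scratch directory standing in for ${CPM_SOURCE_CACHE}/<package>
set(demo_cache_dir "${CMAKE_CURRENT_BINARY_DIR}/cpm_hash_demo")
file(REMOVE_RECURSE "${demo_cache_dir}")

# Two 40-character hashes that collide on their first four characters
set(hash_a "cccb77ae9609d2768ed80dd42cec54f77b1f1455")
set(hash_b "cccb0123456789abcdef0123456789abcdef0123")

# hash_a claims the 4-character prefix, so hash_b has to fall back to 8
cpm_get_shortest_hash("${demo_cache_dir}" "${hash_a}" short_a)
cpm_get_shortest_hash("${demo_cache_dir}" "${hash_b}" short_b)

# Asking again for hash_a finds its existing .hash file and reuses the prefix
cpm_get_shortest_hash("${demo_cache_dir}" "${hash_a}" short_a_again)

message(STATUS "short_a=${short_a}")             # cccb
message(STATUS "short_b=${short_b}")             # cccb0123
message(STATUS "short_a_again=${short_a_again}") # cccb
```

The assignment depends on the order in which hashes are first seen, which is why clearing the cache directory lets the version currently being built take the shortest prefix.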

Then, we can update CPMAddPackage to use the shorter hash. This change is minimal:

    # ...
    elseif(CPM_USE_NAMED_CACHE_DIRECTORIES)
      string(SHA1 origin_hash "${origin_parameters};NEW_CACHE_STRUCTURE_TAG")
      cpm_get_shortest_hash("${CPM_SOURCE_CACHE}/${lower_case_name}" "${origin_hash}" origin_hash)
      set(download_directory ${CPM_SOURCE_CACHE}/${lower_case_name}/${origin_hash}/${CPM_ARGS_NAME})
    else()
      string(SHA1 origin_hash "${origin_parameters}")
      cpm_get_shortest_hash("${CPM_SOURCE_CACHE}/${lower_case_name}" "${origin_hash}" origin_hash)
      set(download_directory ${CPM_SOURCE_CACHE}/${lower_case_name}/${origin_hash})
    endif()

@Avus-c
Contributor

Avus-c commented Dec 10, 2024

I've encountered this issue multiple times and support implementing a better solution on Windows machines than simply "moving the project to another location."

If the proposed solution turns out to be too complex, I wouldn't mind a simpler, straightforward approach. For example, limiting the hash to a fixed shorter length (e.g., 8) with a cache variable (CPM_USE_SHORT_HASH).
While this might not resolve the issue for every project, it would likely cover most of the cases.
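
A minimal sketch of what such a fixed-length option might look like is shown here. Only the name `CPM_USE_SHORT_HASH` comes from the suggestion above; `CPM_SHORT_HASH_LENGTH` and `cpm_maybe_shorten_hash` are hypothetical names introduced for illustration and are not part of CPM.

```cmake
# CPM_USE_SHORT_HASH is the variable name suggested above.
# CPM_SHORT_HASH_LENGTH is a hypothetical companion setting.
option(CPM_USE_SHORT_HASH "Truncate the origin hash used in cache paths" OFF)
set(CPM_SHORT_HASH_LENGTH 8 CACHE STRING
    "Number of hash characters to keep when CPM_USE_SHORT_HASH is ON")

# Hypothetical helper: returns the first CPM_SHORT_HASH_LENGTH characters of
# origin_hash when the option is ON, and the full hash otherwise.
function(cpm_maybe_shorten_hash origin_hash output_var)
  if(CPM_USE_SHORT_HASH)
    string(SUBSTRING "${origin_hash}" 0 ${CPM_SHORT_HASH_LENGTH} result)
  else()
    set(result "${origin_hash}")
  endif()
  set(${output_var} "${result}" PARENT_SCOPE)
endfunction()
```

Unlike the first proposal, this variant does nothing to detect collisions: two origins whose hashes share the same prefix would silently map to the same cache directory.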

@TheLartians
Member

Thanks for raising the issue. I definitely see how including the full hash can easily exceed OS path-length limits. As collisions can create unexpected and hard-to-debug issues, I currently prefer something like the first solution presented, even though it adds a bunch of code.

@codeinred
Author

@TheLartians I can create a PR and add some tests for the change.
