Define most of Pathname in Ruby (redo) #57

Open: eregon wants to merge 16 commits into base: master

@eregon eregon commented Jul 16, 2025

Same as #53, but that was reverted in 593f030.
This should be reviewed commit-by-commit, which makes it much clearer which parts of the code are new and which come from the original pathname.rb before the translation to C began.

Please review it.
If there is no review within a week I'll assume everyone agrees with the PR in the current state.

I cherry-picked the commits to make it easier to review.

Description from the original PR, reordered to have the most important first:

Once upon a time, Pathname was pure-Ruby: https://github.com/ruby/ruby/blob/95bc02237635d3fe42532bfe53038257575cee75/lib/pathname.rb

This PR goes back to that, but keeps the C extension implementation of <=> as that one is significantly faster.
The other Pathname methods are actually faster in Ruby than in C: they mostly just call rb_funcall() and rb_ivar_get(), which have no inline cache in C code, whereas the corresponding method calls and @path accesses do have inline caches in Ruby code.
https://railsatscale.com/2023-08-29-ruby-outperforms-c/ is an explanation of that (though it was known well before that).
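To illustrate the shape of code being compared, here is a minimal sketch of a Pathname-like wrapper written in pure Ruby. The class name SketchPath is hypothetical and this is not the gem's code; it only shows that each method is an instance variable read plus a method call, the two operations that get inline caches when written in Ruby.

```ruby
# Hypothetical stand-in for illustration only; not the pathname gem's code.
class SketchPath
  def initialize(path)
    @path = path.to_s
  end

  # An @path read: Ruby code gets an inline cache for this access.
  # A C version would go through rb_ivar_get() instead.
  def to_s
    @path.dup
  end

  # A method call on File: Ruby code gets an inline method cache at this
  # call site. A C version would dispatch through rb_funcall() instead.
  def directory?
    File.directory?(@path)
  end

  def basename(*args)
    self.class.new(File.basename(@path, *args))
  end
end

puts SketchPath.new(Dir.pwd).directory?
puts SketchPath.new("/a/b.rb").basename.to_s
```

Whether a given method ends up faster in Ruby still depends on the Ruby implementation and JIT in use, which is what the benchmarks in this description measure.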

I have discussed this with @akr several times (notably in https://bugs.ruby-lang.org/issues/17473) and the last time he said it was OK to do this change.
The main goals are:

  • Simplify the implementation, e.g. the Ruby version is 3 times smaller in terms of lines and is much easier to read and maintain.
  • Share more of the Pathname implementation between Ruby implementations. Other Ruby implementations can then easily be added in CI later. Currently the pathname gem does not work on JRuby (no C ext support) or on TruffleRuby (some Ruby C API functions that this gem uses are not supported), so this will be a huge help towards supporting both.

I worked to make the diff really clean: it only adds lines in lib/pathname.rb and only removes lines in ext/pathname/pathname.c. That way it should be easy to review.
I restored the Ruby implementation of the methods from ed9270a, the commit just before methods started being migrated to the C extension.
I then fixed things to make the test suite pass and implemented the few missing methods based on their C definition.
The individual commits and their messages make it clear what exactly happened, so I recommend reviewing commit-by-commit.


From my discussions with @akr, IIRC, the original motivation for rewriting pathname.rb in C, besides the optimization for <=>, was apparently to use *at functions like openat (see man openat, "Rationale for openat() and other directory file descriptor APIs"). However, these functions are not portable, that work never happened, and it would only be useful in very rare edge cases.
The Ruby Dir class could potentially support some of that, but it seems it has never been important enough for someone to implement it.
The API of Pathname would in any case need to change to take advantage of a working directory different from the process CWD, e.g. Pathname methods would need to take an extra "Pathname to use as working directory" argument
(if one just uses Pathname("relative/path").open(...), there is no point in using *at() functions).
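As a small illustration of the point about the process CWD, the sketch below shows that a relative Pathname resolves against whatever the current working directory is when it is used. The file name data.txt and the temporary directories are made up for the example; joining onto an explicit base Pathname is shown only as the existing way to avoid depending on the CWD, not as a replacement for *at() functions.

```ruby
require "pathname"
require "tmpdir"

relative = Pathname("data.txt") # hypothetical file name for this sketch
results = []
explicit = nil

Dir.mktmpdir do |a|
  Dir.mktmpdir do |b|
    File.write(File.join(a, "data.txt"), "from a")
    File.write(File.join(b, "data.txt"), "from b")

    # The same relative Pathname reads a different file depending on the CWD.
    Dir.chdir(a) { results << relative.read }
    Dir.chdir(b) { results << relative.read }

    # Joining onto an explicit base does not depend on the CWD.
    explicit = (Pathname(a) + relative).read
  end
end

p results
p explicit
```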


It's significantly faster with this PR (first line is this branch, second line is master):

Structure:
benchmark name
command line
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [x86_64-linux]
this branch
master
truffleruby 24.2.1, like ruby 3.3.7, Oracle GraalVM JVM [x86_64-linux]
this branch
master

Pathname.new(".")
$ ruby --yjit -Ilib -rpathname -rbenchmark/ips -e 'Benchmark.ips { it.report { Pathname.new(".") } }'
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [x86_64-linux]
2.093M (± 0.9%) i/s  (477.76 ns/i) -     10.622M in   5.075014s
1.762M (± 0.6%) i/s  (567.54 ns/i) -      8.858M in   5.027444s
truffleruby 24.2.1, like ruby 3.3.7, Oracle GraalVM JVM [x86_64-linux]
 32.078Q (±15.6%) i/s    (0.00 ns/i) -     39.570Q (optimizes away)
720.391k (±17.3%) i/s    (1.39 μs/i) -      3.522M in   5.050059s

Pathname#directory?
$ ruby --yjit -Ilib -rpathname -rbenchmark/ips -e 'P = Pathname.pwd; Benchmark.ips { it.report { P.directory? } }'
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [x86_64-linux]
388.278k (± 0.2%) i/s    (2.58 μs/i) -      1.945M in   5.009322s
366.325k (± 0.2%) i/s    (2.73 μs/i) -      1.843M in   5.030526s
truffleruby 24.2.1, like ruby 3.3.7, Oracle GraalVM JVM [x86_64-linux]
448.926k (± 1.1%) i/s    (2.23 μs/i) -      2.244M in   4.998573s
314.099k (± 2.9%) i/s    (3.18 μs/i) -      1.574M in   5.015517s

Pathname#to_s
$ ruby --yjit -Ilib -rpathname -rbenchmark/ips -e 'P = Pathname.pwd; Benchmark.ips { it.report { P.to_s } }'
ruby 3.4.2 (2025-02-15 revision d2930f8e7a) +YJIT +PRISM [x86_64-linux]
9.480M (± 0.4%) i/s  (105.49 ns/i) -     48.196M in   5.084075s
3.977M (± 1.4%) i/s  (251.46 ns/i) -     20.029M in   5.037328s
truffleruby 24.2.1, like ruby 3.3.7, Oracle GraalVM JVM [x86_64-linux]
31.854Q (±15.5%) i/s    (0.00 ns/i) -     39.901Q (optimizes away)
 1.184M (±13.7%) i/s  (844.61 ns/i) -      5.805M in   5.006740s

eregon added 11 commits July 16, 2025 09:35
* This is just before methods started to be moved from Ruby code to the C extension.
* BTW, in the ruby/pathname repository there was no pathname.rb before that commit.

(cherry picked from commit 16e97a5)
* This means it's only additions in lib/pathname.rb and zero removals.

(cherry picked from commit 3736eab)
(cherry picked from commit 955186c)
* The <=> implementation in the extension is much faster, so is kept.
* The other methods are actually faster in Ruby than in C,
  because rb_funcall() and rb_ivar_get() in C code have no inline cache,
  but method calls and `@path` have inline caches in Ruby code.
  https://railsatscale.com/2023-08-29-ruby-outperforms-c/ is an explanation
  of that (though it was known well before that).

(cherry picked from commit c8c2210)
* Avoids a MatchData allocation.

(cherry picked from commit 643585a)
@eregon eregon requested review from akr, hsbt, nobu and byroot July 16, 2025 07:43

@byroot byroot left a comment

Looks mostly fine to me besides a few nitpicks, and I'm very much in favor of migrating things to pure Ruby when it makes sense.

Not sure how this works with Pathname having been made a core class though.

Comment on lines +288 to +294
begin
  old = Thread.current[:pathname_sub_matchdata]
  Thread.current[:pathname_sub_matchdata] = $~
  eval("$~ = Thread.current[:pathname_sub_matchdata]", block.binding)
ensure
  Thread.current[:pathname_sub_matchdata] = old
end

I don't get what this does, but it's super scary. Perhaps it's outdated and no longer necessary?


@eregon eregon Jul 16, 2025


It sets $~ in the block, removing it causes this test to fail:

def test_sub_matchdata
  result = Pathname("abc.gif").sub(/\..*/) {
    assert_not_nil($~)
    assert_equal(".gif", $~[0])
    ".png"
  }
  assert_equal("abc.png", result.to_s)
end

It's a bit hacky but that's the original code and I don't see a better solution.
I could restore the C version of #sub if desired, but I'd need to keep the pure-Ruby version for e.g. JRuby.
(I didn't know one can set $~, TIL)
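For readers unfamiliar with the trick, here is a standalone sketch of the same mechanism outside Pathname. The helper name sub_with_matchdata and the thread-local key :sketch_sub_matchdata are hypothetical. $~ is local to a method frame, so the match produced by String#sub is visible in the helper's frame but not in the frame where the caller's block was defined; evaluating an assignment in block.binding copies it across.

```ruby
# Hypothetical helper for illustration; not part of Pathname.
def sub_with_matchdata(string, pattern, &block)
  string.sub(pattern) do |*args|
    begin
      old = Thread.current[:sketch_sub_matchdata]
      # Here $~ is the match that String#sub set in this method's frame.
      Thread.current[:sketch_sub_matchdata] = $~
      # Assign it to $~ in the frame where the caller's block was defined.
      eval("$~ = Thread.current[:sketch_sub_matchdata]", block.binding)
    ensure
      # Restore the previous thread-local value so nested calls do not leak.
      Thread.current[:sketch_sub_matchdata] = old
    end
    block.call(*args)
  end
end

seen = nil
result = sub_with_matchdata("abc.gif", /\..*/) do
  seen = $~[0] # readable here only because of the eval above
  ".png"
end

p result
p seen
```

The thread-local is only a vehicle for passing the MatchData into the eval string, since the string is evaluated in the block's binding and cannot see the helper's local variables.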

@eregon eregon force-pushed the pure-ruby-pathname2 branch from 37ebd64 to 834cc54 Compare July 16, 2025 20:14

eregon commented Jul 16, 2025

@byroot Thank you for the review, I think I addressed all of it.

@hsbt and/or @nobu Could you review this PR as well?


eregon commented Jul 18, 2025

@hsbt Let's discuss your concerns and suggestions here.
You said in https://bugs.ruby-lang.org/issues/17473#note-27

Please separate the small PRs. I want to reduce the side effect like ruby/ruby#13906.

Can you make a concrete suggestion of what you mean by small PRs for this change?

I could make a PR with fewer commits, but every commit up to "Handle Windows NTFS edge case in Pathname#sub_ext" is strictly necessary, otherwise the CI doesn't pass.
That leaves only "Optimize Pathname#initialize to avoid extra send" and "Optimize Pathname#initialize to avoid extra ivar accesses", which are trivial, and then the commits that address @byroot's review.

If you are asking for a smaller diff in general, I think that is not feasible: making a PR per method, for example, would take months of work and still give the exact same end result. The approach here, as detailed in the first commit message, "Restore lib/pathname.rb from ext/pathname/lib/pathname.rb at ed9270a", is to use the Ruby code of pathname.rb from before the translation to C. There is no meaningful way to break that into smaller changes. And that code has already been reviewed: it was exactly the code in Pathname before the translation to C started.

Please take the time to read the commit messages; they should make it very clear what I did and what needs deeper review (e.g. code imported as-is from the gem does not).
Just browsing through the commit messages should also make it clear I took great care to have a very clear git history of the changes here with not a single extra line of diff.
