
Add peeks for idents #367

Open
Daniil159x wants to merge 5 commits into master from feature/peek_for_ident

Conversation

Daniil159x

Hi!
I find the project very useful. My usual workflow looks like this: click on some identifier and open N new tabs for its references and definitions. But other code-viewing apps let me "peek" at the source (e.g. https://code.visualstudio.com/docs/editor/editingevolved#_peek, or GitHub).
I have added this feature both in the single page and in a dynamic popup:

[screenshots: sb_bread, adfs_error]

The changes have a performance impact, but I hope caching handles it.
You can try it on my instance: https://elixir.ravin.gs.

@tleb
Member

tleb commented Dec 25, 2024

Thanks for this rather substantial feature! I'll be able to give feedback and discuss it next week.

You might have seen we faced some annoying performance issues recently (#365), so we'll have to let this rest a bit before adding features that worsen performance.

@fstachura
Collaborator

fstachura commented Jan 30, 2025

Hello, it seems that we have fixed most of the performance issues and Elixir has more resources now.
Internal feedback about this feature is very positive.

The changes have a performance impact, but I hope caching handles it.

The problem is, the HTTP frontend cache won't really help us here. Most requests stay in it only for a very short time (especially now that Elixir is being DDoSed by bots scraping for AI datasets). We would probably need to introduce a caching layer just for this, or merge the feature as-is.
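
A minimal sketch of what such a caching layer could look like, assuming a hypothetical fetch_peek_lines(project, version, path) helper rather than the project's actual API:

from functools import lru_cache

def fetch_peek_lines(project, version, path):
    # Placeholder for the real lookup (e.g. shelling out to git cat-file).
    raise NotImplementedError

@lru_cache(maxsize=4096)
def cached_peek_lines(project, version, path):
    # Keyed by (project, version, path); version tags are immutable, so
    # entries never go stale, and the LRU bound caps memory use.
    return tuple(fetch_peek_lines(project, version, path))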

Do you maybe have any benchmarks that compare load and request times of mainline Elixir and your version?

@Daniil159x
Author

Hello!
I have a benchmark comparing the current state of this PR against its base (7be0bfe), so it doesn't take into account the caching from commit 05f7ad4. I also measured without the Varnish server.

I made 200 requests per endpoint:

/api/ident/linux/loopback?version=v5.18.19&family=C
/api/ident/linux/devm_register_reboot_notifier?version=v5.18.19&family=C
/api/ident/linux/notrace?version=v5.18.19&family=C
/api/ident/linux/arch_local_irq_restore?version=v5.18.19&family=C
/api/ident/linux/blk_queue_dma_alignment?version=v5.18.19&family=C
/api/ident/linux/spinlock_t?version=v5.18.19&family=C
/api/ident/linux/max?version=v5.18.19&family=C
/api/ident/linux/task_struct?version=v5.18.19&family=C
/api/ident/linux/eth_header?version=v5.18.19&family=C
/api/ident/linux/sk_buff?version=v5.18.19&family=C
/linux/v5.18.19/A/ident/loopback
/linux/v5.18.19/A/ident/devm_register_reboot_notifier
/linux/v5.18.19/A/ident/notrace
/linux/v5.18.19/A/ident/arch_local_irq_restore
/linux/v5.18.19/A/ident/blk_queue_dma_alignment
/linux/v5.18.19/A/ident/spinlock_t
/linux/v5.18.19/A/ident/max
/linux/v5.18.19/A/ident/task_struct
/linux/v5.18.19/A/ident/eth_header
/linux/v5.18.19/A/ident/sk_buff

And I got the following statistics (request times in ms), for vanilla Elixir:

mean       73.568000
std        25.159807
min        51.000000
25%        54.000000
50%        56.000000
75%        96.000000
max       183.000000

and for my PR:

mean      261.172000
std       321.407392
min        90.000000
25%       108.000000
50%       143.500000
75%       158.000000
max      1257.000000

And this is the distribution I got:
[image: distribution of request times]

ident/loopback and ident/arch_local_irq_restore are special because they have many references, and therefore many peeks (see the links).
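
For context, a harness along these lines could look as follows (a sketch, not the script actually used; the base URL and the endpoint subset are assumptions):

import time
import requests
import pandas as pd

BASE = 'http://localhost:8000'          # assumed local Elixir instance
ENDPOINTS = [
    '/linux/v5.18.19/A/ident/loopback',
    '/linux/v5.18.19/A/ident/task_struct',
]

timings_ms = []
for endpoint in ENDPOINTS:
    for _ in range(200):                # 200 requests per endpoint, as above
        start = time.monotonic()
        requests.get(BASE + endpoint).raise_for_status()
        timings_ms.append((time.monotonic() - start) * 1000)

print(pd.Series(timings_ms).describe())  # mean/std/min/quartiles/max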

@fstachura
Collaborator

fstachura commented Feb 5, 2025

It seems that the worst case can be measured in seconds, and that could be a problem. I tried to think about what can be done here. @Daniil159x If you have any ideas, please post them.
So far I tried running git gc with a low gc.aggressiveDepth value, but it actually made things a bit worse (although just by a millisecond per access). Earlier I had an idea to put the lines in a database, but that would most likely massively blow up the database size. What I tried in the end was dulwich, as I suspect forking to bash and git for each file (up to 200 times per request) does not work in our favor.

In my tests, it decreased the max request time by ~2-4 times (on the set of URLs you posted) and the mean by ~2, depending on whether the repository was recently gc'd or not. Anecdotally, get-file takes around 10 ms on my machine, while accessing the file with dulwich usually takes around 1-3 ms.

script.sh (times in ms):

mean 253.165632
max 994.465
min 43.989

dulwich:

mean 133.854528
max 421.05
min 68.806

Calling git from subprocess:

mean 155.652749
max 890.089
min 67.825

I used this script (but with 50 as the range argument). Results vary depending on the size of the repo and whether it was recently GC'd, but the worst case with dulwich over many files was always better by at least ~250 ms compared with just calling git (without script.sh). To be honest, though, the first access after opening the repo can take ~100 ms, while the library is in read_packed_refs_with_peeled. I suspect this can be fixed, since git does not seem to have that problem. We could switch between implementations depending on how many idents have to be scanned.

content = get_object_by_path(self.repo, '/' + sym.path, self.repo[b'refs/tags/' + version.encode()].object[1]).data.split(b"\n")
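
For reference, a self-contained version of that lookup might look like this (a sketch; the clone path and file path are hypothetical, and it assumes dulwich's porcelain.get_object_by_path keeps this signature):

from dulwich.repo import Repo
from dulwich.porcelain import get_object_by_path

repo = Repo('/path/to/linux')              # assumed local clone
tag = repo[b'refs/tags/v5.18.19']          # annotated tag object
commit_sha = tag.object[1]                 # Tag.object is (type, sha); take the sha
blob = get_object_by_path(repo, 'kernel/fork.c', committish=commit_sha)
lines = blob.data.split(b'\n')             # blob contents as a list of lines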

@tleb I know you are against using alternative git implementations, and I get why, but maybe we could make an exception here. I understand that calling git is much preferred in update, for example, but just for querying tags and blobs it may make sense. The Git repository format is documented here, packfiles here. The first link suggests that git actually tries to keep backward compatibility with old implementations; see "Git Repository Format Versions".

elixir/api.py Outdated
@@ -46,14 +46,15 @@ def on_get(self, req, resp, project, ident):
     if version == 'latest':
         version = query.query('latest')

-    symbol_definitions, symbol_references, symbol_doccomments = query.query('ident', version, ident, family)
+    symbol_definitions, symbol_references, symbol_doccomments, peaks = query.query('ident', version, ident, family)
Collaborator


Typo: peaks

elixir/query.py Outdated
@@ -402,10 +405,37 @@ def get_idents_defs(self, version, ident, family):
symbol_doccomments.append(SymbolInstance(path, docline))

return symbol_definitions, symbol_references, symbol_doccomments

Collaborator


Nit: redundant whitespace (trailing spaces on the added line).

@@ -66,7 +66,14 @@ function generateSymbolDefinitionsHTML(symbolDefinitions, project, version) {
     result += `<li><a href="/${project}/${version}/source/${sd.path}#L${ln[0]}"><strong>${sd.path}</strong> <em>(as a ${sd.type})</em></a>`;
     result += '<ul>';
     for(let n of ln) {
-        result += `<li><a href="/${project}/${version}/source/${sd.path}#L${n}">line ${n}</a></li>`;
+        result += `<li><a href="/${project}/${version}/source/${sd.path}#L${n}"><span>line ${n}</span>`;
+        let srcLine = peeks?.[sd.path]?.[n];
Collaborator


Optional chaining is supported by 95.5% of browsers. What do you think @tleb? Our current support cutoff is closer to 97%, I think (I still don't have a good way to evaluate this).

@Daniil159x
Author

@fstachura I have an idea.
Currently we call script.sh get-file <version> <path> for each file; however, git cat-file has a batch mode (--batch).

I've done some experiments:

  1. 100 calls to script.sh:
real    0m0.585s
user    0m0.201s
sys     0m0.038s
  2. one batch call:
real    0m0.061s
user    0m0.001s
sys     0m0.051s

Batch mode is roughly 10 times faster, so I think we can bring the worst case down to ~100 ms.
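
A sketch of how an endpoint could drive a single cat-file process (a hypothetical helper, not the PR's actual code; it assumes every requested object exists, so the "<object> missing" case is not handled):

import subprocess

def cat_files_batch(repo_dir, specs):
    # specs are object names accepted by cat-file, e.g. 'v5.18.19:kernel/fork.c'.
    proc = subprocess.Popen(
        ['git', '-C', repo_dir, 'cat-file', '--batch'],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    out, _ = proc.communicate(b'\n'.join(s.encode() for s in specs) + b'\n')
    blobs = {}
    pos = 0
    for spec in specs:
        # Each record is '<sha> <type> <size>\n<contents>\n'.
        header_end = out.index(b'\n', pos)
        _sha, _type, size = out[pos:header_end].split()
        size = int(size)
        blobs[spec] = out[header_end + 1:header_end + 1 + size]
        pos = header_end + 1 + size + 1   # skip the newline after the contents
    return blobs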

@fstachura
Collaborator

fstachura commented Feb 5, 2025

Yeah, batch mode sounds like a pretty good idea too; I didn't know it existed. I suspect a lot of the overhead comes from forking so many times during a request. If the endpoint forks only once, maybe performance won't suffer that much.

@tleb
Member

tleb commented Feb 6, 2025

That --batch is a great find indeed!

⟩ cd linux/
⟩ git ls-tree -r HEAD | awk '$2=="blob"{print $3}' | head -n1000 > /tmp/blobs
⟩ hyperfine --warmup 3 'git cat-file --batch </tmp/blobs' 'xargs -a /tmp/blobs -n1 git cat-file blob'
Benchmark 1: git cat-file --batch </tmp/blobs
  Time (mean ± σ):      52.2 ms ±   4.3 ms    [User: 36.1 ms, System: 15.8 ms]
  Range (min … max):    47.4 ms …  67.1 ms    43 runs

Benchmark 2: xargs -a /tmp/blobs -n1 git cat-file blob
  Time (mean ± σ):      2.386 s ±  0.082 s    [User: 0.780 s, System: 1.571 s]
  Range (min … max):    2.286 s …  2.521 s    10 runs

Summary
  git cat-file --batch </tmp/blobs ran
   45.72 ± 4.12 times faster than xargs -a /tmp/blobs -n1 git cat-file blob

Daniil159x force-pushed the feature/peek_for_ident branch from a270c3c to 1c05d62 on February 15, 2025
@Daniil159x
Author

Hi! I rewrote the code to use the batch mode of git cat-file.
And got the following results:
[image: benchmark results]
The slowdown is now mostly ~30 ms, but the loopback identifier is special anyway.
@tleb and @fstachura, I'm waiting for your comments.

Daniil159x requested a review from fstachura on February 22, 2025