Online library filter/search API calls are not optimal #957

Open
kelson42 opened this issue Jul 9, 2023 · 7 comments

@kelson42
Collaborator

kelson42 commented Jul 9, 2023

For the moment, a new request of the form "https://library.kiwix.org:443/catalog/search?lang=amh&count=-1&tag=_category:gutenberg&notag=_sw:yes" is made each time a filter is changed in the sidebar. Everything (book details, no pagination) is then downloaded from https://library.kiwix.org. The "search" (currently the library top bar) is then applied a posteriori to the local result list to find matches.

I'm not satisfied because that generates a lot of data transferred (slow and expensive) between the local computer and library.kiwix.org.

I would like to propose:

  • Free text search is handled like any other filter and generates a new API request when it changes
  • We don't download more book details than what we can display on the screen (but we need to know how many results we have in total). We should use result pagination.
  • Libkiwix should cache things properly
    • We should have a cache of requests/responses so that if the same request is made twice, we detect it and avoid downloading the results twice
    • Result lists should be handled differently from books. Result responses should only contain the list of book IDs, and book details should be requested (and cached) separately. That way we can cache book details independently from the API request/response (and save bandwidth).
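
To make the first two points concrete, here is a minimal sketch of how a client could build one paginated request per filter change, with the free-text search sent as just another parameter. Only `lang`, `count`, `tag` and `notag` appear in the URL quoted above; the `start` and `q` parameter names, the `CatalogQuery` struct and both helper functions are assumptions made for illustration, not the confirmed libkiwix or kiwix-serve API.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical filter state: sidebar filters plus the search bar.
struct CatalogQuery {
    std::string lang;      // e.g. "amh"
    std::string tag;       // e.g. "_category:gutenberg"
    std::string notag;     // e.g. "_sw:yes"
    std::string freeText;  // free-text search, sent like any other filter
};

// Percent-encode everything except RFC 3986 unreserved characters and ':'
// (':' is legal inside a query and is left as-is in the URL quoted above).
std::string percentEncode(const std::string& in) {
    static const char* hex = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : in) {
        const bool safe = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                          (c >= '0' && c <= '9') || c == '-' || c == '_' ||
                          c == '.' || c == '~' || c == ':';
        if (safe) {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        }
    }
    return out;
}

// Build the URL for one page of results. "start" and "q" are assumed names.
std::string buildSearchUrl(const std::string& base, const CatalogQuery& query,
                           int start, int count) {
    assert(start >= 0 && count > 0);
    const std::vector<std::pair<std::string, std::string>> params = {
        {"lang", query.lang}, {"tag", query.tag},
        {"notag", query.notag}, {"q", query.freeText}};
    std::string url = base + "?start=" + std::to_string(start) +
                      "&count=" + std::to_string(count);
    for (const auto& param : params) {
        if (!param.second.empty()) {
            url += "&" + param.first + "=" + percentEncode(param.second);
        }
    }
    return url;
}
```

Because the whole filter state ends up in one URL string, that string could also serve as the key of a request/response cache.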
@kelson42
Collaborator Author

kelson42 commented Jul 9, 2023

@mgautierfr @veloman-yunkan @juuz0 Does that make sense? Is libkiwix ready for that?

@kelson42 kelson42 changed the title Online library filter/search are not optimal Online library filter/search API calls are not optimal Jul 9, 2023
@kelson42 kelson42 pinned this issue Aug 12, 2023
@juuz0
Collaborator

juuz0 commented Aug 19, 2023

We don't download more book details than what we can display on the screen (but we need to know how many results we have in total). We should use result pagination.

One interesting problem here is to handle the sorting of books.
Currently kiwix-desktop applies the sort to the whole list. But obviously, if we use pagination, the sort will only apply to the part of the list available locally and give wrong results (e.g. sorting in descending order by size will not put the biggest book of the global catalog first).
libkiwix should probably support a parameter to return sorted data, if we need this.
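
A small worked example of the pitfall, using made-up sizes: the biggest book the client can find after fetching only the first page is not necessarily the biggest book in the catalog. Both function names are hypothetical and exist only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Largest size among the first `pageSize` entries only, which is all a
// paginated client has in hand after fetching one page.
std::uint64_t largestOnFirstPage(const std::vector<std::uint64_t>& catalogSizes,
                                 std::size_t pageSize) {
    assert(pageSize > 0 && !catalogSizes.empty());
    const std::size_t n = std::min(pageSize, catalogSizes.size());
    return *std::max_element(catalogSizes.begin(),
                             catalogSizes.begin() + static_cast<std::ptrdiff_t>(n));
}

// Largest size in the whole catalog, which only the server (or a client
// holding the full list) can compute.
std::uint64_t largestInCatalog(const std::vector<std::uint64_t>& catalogSizes) {
    assert(!catalogSizes.empty());
    return *std::max_element(catalogSizes.begin(), catalogSizes.end());
}
```

With sizes `{120, 45, 300, 9000, 15}` and a page size of 3, the first function yields 300 while the second yields 9000, so a local "sort by size, descending" would show the wrong book on top.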

@kelson42
Collaborator Author

libkiwix should probably support a parameter to give out sorted data, if we need this.

Yes, we have multiple tickets regarding sorting of results. But we could handle this with kiwix/libkiwix#702. Please put your requirements in a comment there so you can make sure Kiwix-Desktop features continue to work.

@juuz0
Collaborator

juuz0 commented Sep 6, 2023

Does the cache here have to be implemented in libkiwix?
Qt has QNetworkDiskCache, which can help with this.

@kelson42
Collaborator Author

kelson42 commented Sep 6, 2023

@mgautierfr We urgently need your feedback here

@mgautierfr
Member

Free text search is handled like any other filter and generates a new API request when it changes

I am somewhat against that. By definition, free-text filtering can only reduce the number of books, so we could do the filtering locally and avoid data transfer between the client and the server. That is why we implemented it this way.

BUUUTTT if we go in the direction of using partial entries and pagination, we cannot do the filtering locally as we don't have the list of all books (and their details). So ... yes, by implementation need, all search/filtering/sorting have to be done on the server.

We don't download more book details than what we can display on the screen (but we need to know how many results we have in total). We should use result pagination.

Agree

Libkiwix should cache things properly
We should have a cache of requests/responses so that if the same request is made twice, we detect it and avoid downloading the results twice

Not so easy. And maybe not so useful.
First of all, catalog content may change, so we have to invalidate the cache somehow.
We should measure real use cases. How many (duplicated) requests are made by real users (not us testing the catalog)?
And if we go with partial entries, even if we have duplicated requests, we may not need to spend our time on caching ALL entries.
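
One possible answer to the invalidation question is a simple time-to-live: a cached response is reused only for a bounded period. The sketch below is an assumption made for illustration; the thread does not settle on any invalidation strategy, and `ResponseCache` is a hypothetical class, not part of libkiwix. The current time is passed in explicitly so the behaviour can be checked without waiting.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>

// Hypothetical request/response cache keyed by URL, with a fixed TTL.
class ResponseCache {
public:
    using Clock = std::chrono::steady_clock;

    explicit ResponseCache(std::chrono::seconds ttl) : ttl_(ttl) {
        assert(ttl.count() > 0);
    }

    // Store (or overwrite) the body for `url`, valid until now + ttl.
    void put(const std::string& url, const std::string& body,
             Clock::time_point now) {
        entries_[url] = Entry{body, now + ttl_};
    }

    // Returns true and fills `body` when a fresh entry exists.
    // Stale entries are dropped so the caller downloads again.
    bool get(const std::string& url, Clock::time_point now, std::string& body) {
        auto it = entries_.find(url);
        if (it == entries_.end()) return false;
        if (now >= it->second.expires) {
            entries_.erase(it);
            return false;
        }
        body = it->second.body;
        return true;
    }

private:
    struct Entry {
        std::string body;
        Clock::time_point expires;
    };
    std::chrono::seconds ttl_;
    std::map<std::string, Entry> entries_;
};
```

Whether such a cache pays off still depends on the measurement asked for above: if real users rarely repeat a request, the TTL logic buys little.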

Result lists should be handled differently from books. Result responses should only contain the list of book IDs, and book details should be requested (and cached) separately. That way we can cache book details independently from the API request/response (and save bandwidth).

That's the purpose of partial_entries (to get list of books only) and entry (to get information about a specific book)


As suggested somewhere (I cannot find where), I think we should introduce a "RemoteLibrary" in libkiwix.
This remote library should share (a subset of) the Library API:

  • getBookById
  • getBookCount
  • getBooksLanguages
  • getBooksLanguagesWithCounts
  • getBooksCategories
  • getBooksCreators
  • getBooksPublishers
  • getBooksIds
  • filter (?)
  • sort (?)
  • filter_sorted_paginated (to create)

However, the RemoteLibrary would make a request to the server for each method (we already have a specific endpoint for almost every method).

Request and caching strategy:

We will use the partial_entries endpoint to get the list of books (getBooksIds and filter_sorted_paginated).
These methods already return book IDs only (like the partial_entries endpoint, so we are good).

When the user calls getBookById with a bookId, we make a request to the entry endpoint, get the book information and create a Book. The Book itself is cached, so no duplicate request to entry is made.
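
The caching behaviour of getBookById described above could look roughly like this. It is a sketch under assumptions: the `Book` struct is a stand-in for the real class, the `/entry/<id>` path layout is guessed from the endpoint name, and the download function is injected so the example runs without a network.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical, minimal stand-in for a catalog entry.
struct Book {
    std::string id;
    std::string rawEntry;  // unparsed response of the entry endpoint
};

class RemoteLibrary {
public:
    // Downloads a URL and returns the response body.
    using Fetch = std::function<std::string(const std::string&)>;

    RemoteLibrary(std::string baseUrl, Fetch fetch)
        : baseUrl_(std::move(baseUrl)), fetch_(std::move(fetch)) {
        assert(static_cast<bool>(fetch_));
    }

    // The first call for an id hits the entry endpoint; later calls for the
    // same id are served from the in-memory cache.
    const Book& getBookById(const std::string& id) {
        auto it = books_.find(id);
        if (it == books_.end()) {
            Book book{id, fetch_(baseUrl_ + "/entry/" + id)};  // assumed path
            it = books_.emplace(id, std::move(book)).first;
        }
        return it->second;
    }

private:
    std::string baseUrl_;
    Fetch fetch_;
    std::map<std::string, Book> books_;
};
```

Keeping the Book cache separate from any result-list cache means a book that shows up in several filtered lists is downloaded only once.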

About illustrations:
OPDS already gives a URL to the illustration instead of the data (which already reduces the size of the OPDS stream).
The Illustration class itself downloads the data when we call Illustration::getData(). This somewhat duplicates the ThumbnailDownloader implemented in kiwix-desktop by @juuz0 (sorry, I missed that).

This leads to download management:

The problem here is that we don't want the download to freeze the UI.

  • Easy solution:
    Don't care. If we do multiple small requests (a few kB each), the I/O latency may be negligible, and if it is not, we fix it later.
  • Middle solution:
    Libkiwix doesn't care. It is up to kiwix-desktop (and other clients?) to call the remote library methods in a different thread and not block their UI.
  • Difficult solution:
    Build a two-step system: methods don't return the results directly but return a tuple (url, callback), url being the URL to request (by threaded requests) and callback being a function to call with the response content, which actually returns the "useful results".
    A blocking application would use:
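// Pseudocode: library, download() and Request are placeholders.
auto [url, callback] = library.getBooksIds();
auto bookIds = callback(download(url));
do_stuff_with_bookids(bookIds);

An event-driven application would use:

auto [url, callback] = library.getBooksIds();
auto request = Request(url);
// Copy the callback into the lambda so it outlives this scope.
request.on_completion([cb = callback](const auto& response) {
  auto bookIds = cb(response.content);
  do_stuff_with_bookids(bookIds);
});
request.start();  // renamed from do(): "do" is a reserved word in C++
do_other_things_without_waiting_request_to_finish();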
  • Hard solution:
    Design a full async/callback api on libkiwix.

I would go with the easy solution and, if that is not enough, move to the difficult solution.
(The "easy remote library" would then become a wrapper around the "difficult remote library", using the "blocking application" code.)
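
The layering described here, where the easy blocking API is a thin wrapper around the two-step one, can be sketched as follows. All names (`PendingRequest`, `prepareGetBooksIds`, `getBooksIdsBlocking`) are hypothetical, the `/partial_entries` path is assumed from the endpoint name, and the parser is a placeholder that reads one ID per line instead of a real OPDS feed.

```cpp
#include <cassert>
#include <functional>
#include <sstream>
#include <string>
#include <vector>

using BookIds = std::vector<std::string>;

// Step one of the two-step API: what to download, and how to turn the
// response body into the useful result.
struct PendingRequest {
    std::string url;
    std::function<BookIds(const std::string&)> parse;
};

// "Difficult" layer: never touches the network itself.
PendingRequest prepareGetBooksIds(const std::string& baseUrl) {
    PendingRequest req;
    req.url = baseUrl + "/partial_entries";  // assumed path layout
    // Placeholder parser: one id per line. A real one would parse OPDS.
    req.parse = [](const std::string& body) {
        BookIds ids;
        std::istringstream lines(body);
        for (std::string line; std::getline(lines, line);) {
            if (!line.empty()) ids.push_back(line);
        }
        return ids;
    };
    return req;
}

// "Easy" layer: a blocking wrapper built on top of the two-step one.
BookIds getBooksIdsBlocking(
    const std::string& baseUrl,
    const std::function<std::string(const std::string&)>& download) {
    assert(static_cast<bool>(download));
    const PendingRequest req = prepareGetBooksIds(baseUrl);
    return req.parse(download(req.url));
}
```

An event-driven client would skip the wrapper, hand `req.url` to its own network stack, and call `req.parse` from the completion handler.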


I have played a bit with the api:

  • full catalog OPDS, full entry content: about 5MB
  • full catalog OPDS, partial entries only (only the list of books): 1.4MB
  • 50 entries OPDS, full entry content: 60K
  • 50 entries OPDS, partial entries only: 18K
  • one full entry: 1.2K

(All done without filter)
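
These figures allow a rough estimate of what one page costs under the partial_entries plus entry scheme. The sketch below assumes a page of 50 books, takes "about 5MB" as 5000 kB, and ignores HTTP overhead; `pageCostKb` is a hypothetical helper.

```cpp
#include <cassert>

// Sizes in kB, taken from the measurements above.
constexpr double kPartialPageKb = 18.0;    // 50 partial entries
constexpr double kFullEntryKb = 1.2;       // one full entry
constexpr double kFullCatalogKb = 5000.0;  // "about 5MB", assuming 1 MB = 1000 kB

// kB transferred to show one page of 50 books when `uncached` of them still
// need their details fetched through the entry endpoint.
double pageCostKb(int uncached) {
    assert(uncached >= 0 && uncached <= 50);
    return kPartialPageKb + uncached * kFullEntryKb;
}
```

With nothing cached, a page costs about 18 + 50 × 1.2 = 78 kB, slightly more than the 60 kB of a 50-entry full feed but under 2% of the full catalog. Once the books on a page are cached, showing it again costs only the 18 kB list.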

@kelson42
Collaborator Author

@mgautierfr It seems we agree on the first part, and I also agree with the second part of your comment, but this is something I would like to handle separately (though it may be a prerequisite to this issue). Therefore I have copied it to a dedicated issue at kiwix/libkiwix#1171.

@kelson42 kelson42 modified the milestones: 2.5.0, 2.6.0 Dec 31, 2024