Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Be polite with external services #60

Open
ptagliolato opened this issue Jun 28, 2022 · 12 comments
Open

Be polite with external services #60

ptagliolato opened this issue Jun 28, 2022 · 12 comments

Comments

@ptagliolato
Copy link
Collaborator

consider to create a caching mechanism for queries against external services in order to reduce the amount of requests.
This could be done by incapsulating the logics of do_Q and subsequent statements (e.g functions like "get_activity_info") in one function accepting e.g. the id and type of an entity: the function could implement the caching mechanism.
We could also consider packages like rcache to implement the mechanism.

@oggioniale
Copy link
Collaborator

please consider also this https://blog.r-hub.io/2021/07/30/cache/

@ptagliolato
Copy link
Collaborator Author

Two points on this issue.
The caching mechanism could be better implemented in the get_id function. This requires some work for managing internal and system variable and exactly define the flow. I stopped myself on this way after first steps in a branch on my fork. The point is that this sort of local cache could be useful to speed up one's work, but it does not fix the current issue,

e.g. during workshops.

A different approach would be that of implementing a reverse-proxy with caching capabilities. I installed on CNR-IREA internal swarm server a version (with some customization) of the docker image from https://github.com/djmaze/docker-caching-proxy.

I made some benchmark by setting the ReLTER::set_deims_base_url("") and invoking ReLTER::get_ilter_observedProperties(sitesNum = 100) and first results seem promising (50% time of the first -not cached- request).
I'm not an nginx expert and I'm sure it should be fine tuned, but it seems to me a possible strategy to ameliorate the workshop scenario.

@micha-silver
Copy link
Contributor

Hi Paolo:
Thanks for revisiting this. Maybe I am not following, but what is the use case for caching queries? Do we expect multiple queries from one user for the same data? This might happen in a workshop framework, as you mention, but that's a special case, no?

In a broader view, the audience for the ReLTER package would be researchers working on a particular site. They will generally get the site metadata from DEIMS once, then go on to search for real data, elsewhere. Is that your view also?
If that is indeed the intention, then why do we need, i.e. get_ilter_generalinfo() to return a full list of all sites on DEIMS? (> 1000 sites). Why should we allow that query to run at all without at least a country_name or a site_name ??

And more importantly, following Christoph's request, we need to always use his API to get site data and never request the whole list of sites. Unless I am missing something, we have not yet implemented that. If I am wrong, please correct me!

@stopopol
Copy link

stopopol commented Jan 9, 2023

Hi,

I agree with Micha as I struggling to see the use case here. I think that if we optimise the way the API is queried as well as add additional filter options for the API, then we should have covered most (potential) performance issues.

Setting up a dedicated caching layers seems to be a little too much. Most users are probably not interested in querying all site records and their contents at once.

We can, of course, discuss this during the upcoming eLTER Software workshop.

Best,
Christoph

@oggioniale
Copy link
Collaborator

Dear @ptagliolato, @micha-silver and @stopopol.
The caching mechanism is not only for the large query but for improve the performances of all the queries from those dedicated to a single site (get_site_info()) to those dedicated to a network (get_network_observedProperties(), get_network_sites()). I thinking, for example, to have 100 simultaneous calls to obtain information from the list of sites of the LTER-Italia network.

The function get_ilter_generalinfo() was designed to get an overall idea of the global network and to answer network management questions (e.g. what is the altimetric distribution of sites in the ILTER network, what is the distribution of sites with respect to biogeographical regions, etc.). These queries will be made with low temporal frequency and during the training could be made simultaneously by a high number (20?) of people. In both cases, caching could be a good solution.

If we feel that the functions dedicated to ILTER are not important, we can also delete them, but before the workshop we cannot touch them.

@stopopol
Copy link

stopopol commented Jan 9, 2023

I've just had a look at the get_ilter_generalinfo function, but the description is not entirely correct (https://docs.ropensci.org/ReLTER/reference/get_ilter_generalinfo.html). There are only ~750 ILTER sites. If the function returns much more than that, it's actually querying all of DEIMS. Looking at the source code (https://github.com/ropensci/ReLTER/blob/main/R/get_ilter_generalinfo.R) the function is actually doing that. The query parameter should be something like this: https://deims.org/api/sites?network=1aa7ccb2-a14b-43d6-90ac-5e0a6bc1d65b&verified=true

EDIT: This also applies to the other ILTER functions, e.g. https://github.com/ropensci/ReLTER/blob/main/R/get_ilter_observedProperties.R

The functions are querying all of DEIMS instead of only querying the ILTER sites. Please update these functions.

EDIT 2: From what I can see, the functions also don't capitalise on the filter functionality of the API. The API can be filtered by site name (https://deims.org/api/sites?name=lago) and country (https://deims.org/api/sites?country=it).

@ptagliolato
Copy link
Collaborator Author

@stopopol Last commits (see #83) exploits the DEIMS API filter functionality you suggested (get_ilter_generalinfo). Moreover: it is now mandatory to set at least one of the two filters; the function now avoids any other request per-site (with the exception of cases where the first response from /sites/? returns only 3 sites, this is for temporary compatibility of old tests). I hope this solves the issue on the first function. Other get_ilter* functions must be modified or discarded (their status "questioning" accounts for this).

@micha-silver
Copy link
Contributor

@ptagliolato : mentioning here so I won't forget :-)
I think we need a function similar to get_ilter_generalinfo() to query for networks. IIUC, There is, so far, no way to get the DEIMS ID of a network. Maybe the same function with an additional parameter?

@ptagliolato
Copy link
Collaborator Author

In order not to do the same mistake in communication, I would involve @stopopol and ask him for an appropriate API to retrieve network-only-related info (networkId, name, related networks for example). We can further discuss this online.

@micha-silver
Copy link
Contributor

@ptagliolato Good idea. But it will be after Lyon in any case.

@stopopol
Copy link

We can discuss feature requests in detail in Lyon or other dedicated (virtual) meetings. In principle, if you need additional filter and query functionality for the rest-api I'll be happy to implement but it depends on the field/entity type. networks should be more straightforward to do but there are different ways to realise it.

@micha-silver
Copy link
Contributor

@stopopol Looking at the workshop agenda, the best window for a meeting among us will be Wednesday.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants