Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get_*() consistency discussion #193

Closed
Aariq opened this issue Jan 8, 2020 · 23 comments · Fixed by #242
Closed

get_*() consistency discussion #193

Aariq opened this issue Jan 8, 2020 · 23 comments · Fixed by #242
Assignees
Labels
consistent api Uniformity across functions and data sources
Milestone

Comments

@Aariq
Copy link
Collaborator

Aariq commented Jan 8, 2020

Currently webchem has get_ functions for 4 databases to retrieve database-specific IDs that can then be used as input to other functions. These all have different interfaces.

get_cid(query, from = "name", first = FALSE, verbose = TRUE,
  arg = NULL, ...)

get_csid(query, apikey, control = cs_control())

get_etoxid(query, match = c("best", "all", "first", "ask", "na"),
  verbose = TRUE)

get_wdid(query, language = "en", match = c("best", "first", "all",
  "ask", "na"), verbose = TRUE)

Let's use this as a starting point to discuss consistency of the API across databases. Personally, my favorite arguments that I think should be in all of the get_ functions are:

  • query for obvious reasons should always be the first argument.
  • the from argument let's you specify whether the input is a name, inchikey, etc. and the documentation specifies what kinds of queries are valid for that particular database.
  • the first argument lets you get a vector out, even if multiple IDs are returned. Vectors of IDs are integral for using in other webchem functions that return dataframes. I like the match argument in get_wdid too, but how often are you really going to use "ask" for ID numbers? How are you supposed to know which is the right one beforehand?
  • ..., although I think they should pass arguments to an underlying control function like cs_control() (that is, I don't think get_cid should have both arg and ...)
  • database-specific arguments like language for get_wdid or apikey for get_csid should go at the end (but before ...)

Anyway, happy to hear what y'all think. Hopefully this will be an easy-ish jumping off point for making the API more consistent across databases.

@Aariq
Copy link
Collaborator Author

Aariq commented Jan 8, 2020

related: #191, #157

@Aariq Aariq added the consistent api Uniformity across functions and data sources label Jan 8, 2020
@stitam stitam added this to the v0.6.0 milestone Jan 23, 2020
@Aariq
Copy link
Collaborator Author

Aariq commented Jan 29, 2020

Also related: #206

@andschar
Copy link
Contributor

We could also turn chebi_lite_entity() into a get_ function, since it is basically one, returning possible matches as e.g. get_cid().

@andschar
Copy link
Contributor

andschar commented Jan 29, 2020

I also think that from = and first = are important arguments. #206

@Aariq
Copy link
Collaborator Author

Aariq commented Jan 29, 2020

Another (maybe bad) idea is to get rid of get_ functions entirely, or just use them internally. Unless we think users will want to get database specific IDs without any other data, why use them? Instead, functions like cs_compinfo() could accept multiple types of queries rather than just CSIDs, and return a dataframe with the query and the CSID as columns.

@andschar
Copy link
Contributor

andschar commented Jan 31, 2020

I think the get_ functions are not bad at all. For the last big project, I was first collecting different identifiers with these functions and then used the ids to query additional data.

@stitam
Copy link
Contributor

stitam commented Feb 2, 2020

I like the way etoxids are currently returned, the function will always returns a data frame, but the contents of the dataframe will depend on match type. This way if you want a vector of ids, you can just reference the id column of the dataframe, however if you want to know more about your results, you can view the whole data frame. If you want only one id per query, you can adjust match to "best", "first" or "ask". I think this gives more control than the first argument, what do you think? Match types are also implemented in get_wdid(), only there if you use match = "all" you will get a list.

@Aariq
Copy link
Collaborator Author

Aariq commented Feb 5, 2020

I like the match argument as well, but not all functions have a "best" option (or it wouldn't be different than the "first" option). I also agree that dataframes are a lot easier to work with than lists, and I think we should aim to have all the get_ functions return dataframes. Maybe with a warning or message when more than one result is returned per element of the query. Warning: More than one ID returned for some queries. Use match = "first" or "ask" to return only one result per query

@stitam
Copy link
Contributor

stitam commented Feb 6, 2020

I saw the functionality of the match argument duplicated in multiple functions. Maybe we could modify the get_ functions to always return a data frame of "all" results, but print the warning message when multiple results are returned and in the meantime we could export the functionality of the match argument in a separate function that takes a data frame as its input and removes duplicates based on various strategies, "best", "first", etc.

@Aariq
Copy link
Collaborator Author

Aariq commented Feb 6, 2020

Yeah, that was kind of the idea with the chooser() function in utils.R. I was hoping that it could be called by any function where you might want all the results, only the first result, or pick a result. I'd be open to changing the behavior of chooser() to be more like match = c("best", "first", "all") though.

I wanted this functionality to be in pc_synonyms() rather than a separate step because it lets you use functions inside of mutate() because the output is always the same length as the input if match = "ask" or match = "first". I actually do things like this quite a bit:

df %>%
  mutate(name = pc_synonyms(CAS_number, choices = 1))

But I suppose if get_ functions returned dataframes, this kind of mutate would be unnecessary. If you get multiple results for each CAS_number, you could just filter things out using dplyr() tools after the fact.

@Aariq Aariq closed this as completed Feb 6, 2020
@Aariq Aariq reopened this Feb 6, 2020
@stitam
Copy link
Contributor

stitam commented Feb 7, 2020

In PR #203 I used chooser(), but only for match == "ask", although I see that I could have used it for match == "first" as well. I think it would be great to extend chooser() to handle all filtering strategies and use it as a generic function. Should it be exported? I think if we use it only within other functions then no, otherwise yes. Related to #194.

I think the benefit of making the get_ functions always return "all" results and then filtering with a separate function is that it allows us to put less pressure on the servers. We might run a search with match = "best" then decide match = "first" would be better, then switch back to match = "best", etc. Each time we call the function, we query the server, which is not polite. With a separate filtering function we could query the server only once for "all" results, and then filter the results however we like. It wouldn't make a difference in your example @Aariq, but I think it could keep your code simple, e.g.

df %>% mutate(name = pc_synonyms(CAS_number) %>% chooser("first"))

Even if filtering the get_ results can be performed by dplyr() tools, I think it would be great to put them into a dedicated filtering function like chooser(), that takes a get_ result and accepts a set of filtering strategies. Of course if query results are unique then chooser() would return the same thing, regardless of the choice. Would this work in your workflows? Maybe chooser could be set up to return either a data frame or a vector of ids, with ids as default?

@stitam stitam mentioned this issue Feb 10, 2020
2 tasks
@andschar
Copy link
Contributor

Have you decided on whether data.frame's should be returned by the get_*() functions? Sorry it's a bit late but I would opt for list of data.frame's. They are easy to bind with data.table::rbindlist(l), dplyr::bind_rows(l) or base::do.call(rbind, l) and would directly show the user which query returned only a single result and which one returned multiple ones.

What is bad in my opinion, and I think we all agree is to return a list of vectors, b/c it'd need some tweeking to get it into a data.frame.

So, instead of directly a data.frame, I would prefer a list of data.frames.

@andschar
Copy link
Contributor

Concerning the chooser() function, I personally would always just return all the results and "choose" how to filter them with my own R data wrangling tools. So @stitam I disagree to have a separate chooser() function. I think for users it would just overhead for tools that are already there.

@Aariq
Copy link
Collaborator Author

Aariq commented Feb 26, 2020

Have you decided on whether data.frame's should be returned by the get_*() functions? Sorry it's a bit late but I would opt for list of data.frame's.

I think I will start a separate discussion on this because there is a huge variation in what is returned by get_ and query functions across the package.

edit: continued in #218

@Aariq
Copy link
Collaborator Author

Aariq commented Mar 6, 2020

How's this for a get_* archetype:

get_*(query, from = c("name", "cas", "inchikey"), match = c("best", "first", "all", "ask", "na"), verbose = TRUE, other_args, ... )

Where from may or may not be appropriate (e.g. if there is no advanced search for a database, or you can only search by CAS), match might not include "best", other_args are any database-specific arguments like language = for wikidata, and ... is arguments passed to a control function like cs_control()

I realize that match= is controversial (see also discussion over at #218 by @andschar). I don't really see how match= makes any difference for server use. Functions are already doing filtering after downloading the data (I think), and if users want to do their own filtering, that's what match = "all" is for. Also, match = "best", "first" or "ask" is very convenient for users who only want one ID per compound. I vote we keep it.

EDIT: I included match = "na" as well. This returns NA when there is more than one match, which seems potentially useful for all get_*() functions.

@stitam
Copy link
Contributor

stitam commented Mar 16, 2020

if we only use get_ once, it doesn't matter for the server which match we use. It will always query all results and process them according to match type. However, we might not be happy with the output, so we might run the query again with a different match type. And that's where it makes a difference for the server. On the other hand, if the function always queries all results, then we will always query the server once and play with processing locally. Maybe we don't have to decide now. We can keep match in our get_ type functions, but also create a utility function that processes match="all" outputs and aim to make that function universal. After this alternative is implemented, we can discuss if we want to keep both or not. This would be better from the CI perspective as well. What do you think?

@Aariq
Copy link
Collaborator Author

Aariq commented Mar 17, 2020

@stitam But this only applies to match = "first", right? If you aren't happy with the first match, then you might run the function a second time with match = "ask". You shouldn't be needing to run get_* functions more than twice if match = "ask" and match = "all exist. I think benefits of having all the match arguments outweigh the costs of possibly querying the server a couple of extra times for the same compound.

@stitam
Copy link
Contributor

stitam commented Mar 20, 2020

You are probably right and I of course agree that the match functionality is quite useful. The aspect which I haven't emphasized yet is that for every get_* function we have to implement a separate solution to handle all the match alternatives, which makes the functions more complex and may open up opportunities for errors and inconsistencies. I really don't know if outsourcing the match functionality to a separate function would help, but I think it would be worth to give it a try. This separate function could be called from the get_* functions through the match argument, this way the user interface would remain unchanged.

@Aariq
Copy link
Collaborator Author

Aariq commented Mar 20, 2020

Let's re-purpose chooser() to do what we want match = to do and then we can re-use that utility for all get_* functions and possibly other functions with match-like arguments.

@stitam
Copy link
Contributor

stitam commented Mar 25, 2020

While adding guidelines for handling invalid queries in PR #225, I also added some guidelines for the get_*() functions as I believe we have reached consensus on most of this topic. The only thing left to discuss is the columns of the tibbles and the re-purposing of chooser().

@Aariq
Copy link
Collaborator Author

Aariq commented Mar 26, 2020

How about this:

get_exampleid(query = "110-93-0", from = "cas", match = "all")

# A tibble: 2 x 2
  query    exampleid
  <chr>    <chr>    
1 110-93-0 12345    
2 110-93-0 12345-1  

The query type (i.e. from) could be stored as an attribute.

For anything that might not have an exact match we should also show the match to the query:

# A tibble: 1 x 3
  query     match               etoxid
  <chr>     <chr>               <chr> 
1 Triclosan Triclosan ( 20179 ) 20179 

Other columns, can exist after these, for example the searchscore and stars returned by get_chebiid().

@stitam
Copy link
Contributor

stitam commented Mar 26, 2020

I think this is perfect.

@Aariq Aariq self-assigned this Apr 15, 2020
@Aariq
Copy link
Collaborator Author

Aariq commented Apr 15, 2020

get functions to fix:

  • get_cid

  • get_csid

  • get_etoxid

  • get_wdid

  • get_chebiid

  • Change all tests and examples including readme to reflect changes in workflows

@Aariq Aariq mentioned this issue Apr 22, 2020
1 task
@Aariq Aariq linked a pull request Apr 23, 2020 that will close this issue
1 task
stitam added a commit that referenced this issue Apr 28, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
consistent api Uniformity across functions and data sources
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants