get_*() consistency discussion #193

Aariq · 2020-01-08T21:52:10Z

Currently webchem has get_ functions for 4 databases to retrieve database-specific IDs that can then be used as input to other functions. These all have different interfaces.

get_cid(query, from = "name", first = FALSE, verbose = TRUE,
  arg = NULL, ...)

get_csid(query, apikey, control = cs_control())

get_etoxid(query, match = c("best", "all", "first", "ask", "na"),
  verbose = TRUE)

get_wdid(query, language = "en", match = c("best", "first", "all",
  "ask", "na"), verbose = TRUE)

Let's use this as a starting point to discuss consistency of the API across databases. Personally, my favorite arguments that I think should be in all of the get_ functions are:

query for obvious reasons should always be the first argument.
the from argument let's you specify whether the input is a name, inchikey, etc. and the documentation specifies what kinds of queries are valid for that particular database.
the first argument lets you get a vector out, even if multiple IDs are returned. Vectors of IDs are integral for using in other webchem functions that return dataframes. I like the match argument in get_wdid too, but how often are you really going to use "ask" for ID numbers? How are you supposed to know which is the right one beforehand?
..., although I think they should pass arguments to an underlying control function like cs_control() (that is, I don't think get_cid should have both arg and ...)
database-specific arguments like language for get_wdid or apikey for get_csid should go at the end (but before ...)

Anyway, happy to hear what y'all think. Hopefully this will be an easy-ish jumping off point for making the API more consistent across databases.

The text was updated successfully, but these errors were encountered:

Aariq · 2020-01-08T21:53:49Z

related: #191, #157

Aariq · 2020-01-29T13:27:00Z

Also related: #206

andschar · 2020-01-29T15:22:02Z

We could also turn chebi_lite_entity() into a get_ function, since it is basically one, returning possible matches as e.g. get_cid().

andschar · 2020-01-29T15:23:59Z

I also think that from = and first = are important arguments. #206

Aariq · 2020-01-29T21:37:15Z

Another (maybe bad) idea is to get rid of get_ functions entirely, or just use them internally. Unless we think users will want to get database specific IDs without any other data, why use them? Instead, functions like cs_compinfo() could accept multiple types of queries rather than just CSIDs, and return a dataframe with the query and the CSID as columns.

andschar · 2020-01-31T15:12:59Z

I think the get_ functions are not bad at all. For the last big project, I was first collecting different identifiers with these functions and then used the ids to query additional data.

stitam · 2020-02-02T09:51:47Z

I like the way etoxids are currently returned, the function will always returns a data frame, but the contents of the dataframe will depend on match type. This way if you want a vector of ids, you can just reference the id column of the dataframe, however if you want to know more about your results, you can view the whole data frame. If you want only one id per query, you can adjust match to "best", "first" or "ask". I think this gives more control than the first argument, what do you think? Match types are also implemented in get_wdid(), only there if you use match = "all" you will get a list.

Aariq · 2020-02-05T14:47:56Z

I like the match argument as well, but not all functions have a "best" option (or it wouldn't be different than the "first" option). I also agree that dataframes are a lot easier to work with than lists, and I think we should aim to have all the get_ functions return dataframes. Maybe with a warning or message when more than one result is returned per element of the query. Warning: More than one ID returned for some queries. Use match = "first" or "ask" to return only one result per query

stitam · 2020-02-06T21:17:05Z

I saw the functionality of the match argument duplicated in multiple functions. Maybe we could modify the get_ functions to always return a data frame of "all" results, but print the warning message when multiple results are returned and in the meantime we could export the functionality of the match argument in a separate function that takes a data frame as its input and removes duplicates based on various strategies, "best", "first", etc.

Aariq · 2020-02-06T22:49:37Z

Yeah, that was kind of the idea with the chooser() function in utils.R. I was hoping that it could be called by any function where you might want all the results, only the first result, or pick a result. I'd be open to changing the behavior of chooser() to be more like match = c("best", "first", "all") though.

I wanted this functionality to be in pc_synonyms() rather than a separate step because it lets you use functions inside of mutate() because the output is always the same length as the input if match = "ask" or match = "first". I actually do things like this quite a bit:

df %>%
  mutate(name = pc_synonyms(CAS_number, choices = 1))

But I suppose if get_ functions returned dataframes, this kind of mutate would be unnecessary. If you get multiple results for each CAS_number, you could just filter things out using dplyr() tools after the fact.

stitam · 2020-02-07T15:22:57Z

In PR #203 I used chooser(), but only for match == "ask", although I see that I could have used it for match == "first" as well. I think it would be great to extend chooser() to handle all filtering strategies and use it as a generic function. Should it be exported? I think if we use it only within other functions then no, otherwise yes. Related to #194.

I think the benefit of making the get_ functions always return "all" results and then filtering with a separate function is that it allows us to put less pressure on the servers. We might run a search with match = "best" then decide match = "first" would be better, then switch back to match = "best", etc. Each time we call the function, we query the server, which is not polite. With a separate filtering function we could query the server only once for "all" results, and then filter the results however we like. It wouldn't make a difference in your example @Aariq, but I think it could keep your code simple, e.g.

df %>% mutate(name = pc_synonyms(CAS_number) %>% chooser("first"))

Even if filtering the get_ results can be performed by dplyr() tools, I think it would be great to put them into a dedicated filtering function like chooser(), that takes a get_ result and accepts a set of filtering strategies. Of course if query results are unique then chooser() would return the same thing, regardless of the choice. Would this work in your workflows? Maybe chooser could be set up to return either a data frame or a vector of ids, with ids as default?

andschar · 2020-02-13T14:09:00Z

Have you decided on whether data.frame's should be returned by the get_*() functions? Sorry it's a bit late but I would opt for list of data.frame's. They are easy to bind with data.table::rbindlist(l), dplyr::bind_rows(l) or base::do.call(rbind, l) and would directly show the user which query returned only a single result and which one returned multiple ones.

What is bad in my opinion, and I think we all agree is to return a list of vectors, b/c it'd need some tweeking to get it into a data.frame.

So, instead of directly a data.frame, I would prefer a list of data.frames.

andschar · 2020-02-13T14:18:35Z

Concerning the chooser() function, I personally would always just return all the results and "choose" how to filter them with my own R data wrangling tools. So @stitam I disagree to have a separate chooser() function. I think for users it would just overhead for tools that are already there.

Aariq · 2020-02-26T22:19:04Z

Have you decided on whether data.frame's should be returned by the get_*() functions? Sorry it's a bit late but I would opt for list of data.frame's.

I think I will start a separate discussion on this because there is a huge variation in what is returned by get_ and query functions across the package.

edit: continued in #218

Aariq · 2020-03-06T21:27:45Z

How's this for a get_* archetype:

get_*(query, from = c("name", "cas", "inchikey"), match = c("best", "first", "all", "ask", "na"), verbose = TRUE, other_args, ... )

Where from may or may not be appropriate (e.g. if there is no advanced search for a database, or you can only search by CAS), match might not include "best", other_args are any database-specific arguments like language = for wikidata, and ... is arguments passed to a control function like cs_control()

I realize that match= is controversial (see also discussion over at #218 by @andschar). I don't really see how match= makes any difference for server use. Functions are already doing filtering after downloading the data (I think), and if users want to do their own filtering, that's what match = "all" is for. Also, match = "best", "first" or "ask" is very convenient for users who only want one ID per compound. I vote we keep it.

EDIT: I included match = "na" as well. This returns NA when there is more than one match, which seems potentially useful for all get_*() functions.

stitam · 2020-03-16T16:39:10Z

if we only use get_ once, it doesn't matter for the server which match we use. It will always query all results and process them according to match type. However, we might not be happy with the output, so we might run the query again with a different match type. And that's where it makes a difference for the server. On the other hand, if the function always queries all results, then we will always query the server once and play with processing locally. Maybe we don't have to decide now. We can keep match in our get_ type functions, but also create a utility function that processes match="all" outputs and aim to make that function universal. After this alternative is implemented, we can discuss if we want to keep both or not. This would be better from the CI perspective as well. What do you think?

Aariq · 2020-03-17T20:22:32Z

@stitam But this only applies to match = "first", right? If you aren't happy with the first match, then you might run the function a second time with match = "ask". You shouldn't be needing to run get_* functions more than twice if match = "ask" and match = "all exist. I think benefits of having all the match arguments outweigh the costs of possibly querying the server a couple of extra times for the same compound.

stitam · 2020-03-20T10:57:02Z

You are probably right and I of course agree that the match functionality is quite useful. The aspect which I haven't emphasized yet is that for every get_* function we have to implement a separate solution to handle all the match alternatives, which makes the functions more complex and may open up opportunities for errors and inconsistencies. I really don't know if outsourcing the match functionality to a separate function would help, but I think it would be worth to give it a try. This separate function could be called from the get_* functions through the match argument, this way the user interface would remain unchanged.

Aariq · 2020-03-20T19:46:01Z

Let's re-purpose chooser() to do what we want match = to do and then we can re-use that utility for all get_* functions and possibly other functions with match-like arguments.

stitam · 2020-03-25T12:59:46Z

While adding guidelines for handling invalid queries in PR #225, I also added some guidelines for the get_*() functions as I believe we have reached consensus on most of this topic. The only thing left to discuss is the columns of the tibbles and the re-purposing of chooser().

Aariq · 2020-03-26T14:57:09Z

How about this:

get_exampleid(query = "110-93-0", from = "cas", match = "all")

# A tibble: 2 x 2
  query    exampleid
  <chr>    <chr>    
1 110-93-0 12345    
2 110-93-0 12345-1

The query type (i.e. from) could be stored as an attribute.

For anything that might not have an exact match we should also show the match to the query:

# A tibble: 1 x 3
  query     match               etoxid
  <chr>     <chr>               <chr> 
1 Triclosan Triclosan ( 20179 ) 20179

Other columns, can exist after these, for example the searchscore and stars returned by get_chebiid().

stitam · 2020-03-26T15:37:15Z

I think this is perfect.

Aariq · 2020-04-15T02:28:39Z

get functions to fix:

Closes #193.

Aariq added the consistent api Uniformity across functions and data sources label Jan 8, 2020

stitam added this to the v0.6.0 milestone Jan 23, 2020

stitam mentioned this issue Feb 2, 2020

Upgrade get_csid() to accept more input types, etc. #208

Merged

4 tasks

Aariq closed this as completed Feb 6, 2020

Aariq reopened this Feb 6, 2020

stitam mentioned this issue Feb 10, 2020

copyediting documentation #210

Merged

2 tasks

Aariq mentioned this issue Feb 26, 2020

option in get_csid() to return named vector rather than list #191

Closed

andschar mentioned this issue Feb 29, 2020

Standardizing what is returned by functions #218

Open

andschar mentioned this issue Mar 24, 2020

Functions handle NAs in input incorrectly #224

Closed

Aariq self-assigned this Apr 15, 2020

stitam mentioned this issue Apr 18, 2020

New functions for the PubChem PUG-REST API #239

Merged

4 tasks

Aariq mentioned this issue Apr 22, 2020

get_*() consistency #242

Merged

1 task

Aariq linked a pull request Apr 23, 2020 that will close this issue

get_*() consistency #242

Merged

1 task

stitam closed this as completed in #242 Apr 28, 2020

stitam added a commit that referenced this issue Apr 28, 2020

Merge pull request #242 from Aariq/get_consistency

73dd2cb

Closes #193.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

get_*() consistency discussion #193

get_*() consistency discussion #193

Aariq commented Jan 8, 2020 •

edited

Loading

Aariq commented Jan 8, 2020

Aariq commented Jan 29, 2020

andschar commented Jan 29, 2020

andschar commented Jan 29, 2020 •

edited

Loading

Aariq commented Jan 29, 2020

andschar commented Jan 31, 2020 •

edited

Loading

stitam commented Feb 2, 2020

Aariq commented Feb 5, 2020

stitam commented Feb 6, 2020

Aariq commented Feb 6, 2020

stitam commented Feb 7, 2020

andschar commented Feb 13, 2020

andschar commented Feb 13, 2020

Aariq commented Feb 26, 2020 •

edited

Loading

Aariq commented Mar 6, 2020 •

edited

Loading

stitam commented Mar 16, 2020

Aariq commented Mar 17, 2020

stitam commented Mar 20, 2020 •

edited

Loading

Aariq commented Mar 20, 2020

stitam commented Mar 25, 2020

Aariq commented Mar 26, 2020

stitam commented Mar 26, 2020

Aariq commented Apr 15, 2020 •

edited

Loading

get_*() consistency discussion #193

get_*() consistency discussion #193

Comments

Aariq commented Jan 8, 2020 • edited Loading

Aariq commented Jan 8, 2020

Aariq commented Jan 29, 2020

andschar commented Jan 29, 2020

andschar commented Jan 29, 2020 • edited Loading

Aariq commented Jan 29, 2020

andschar commented Jan 31, 2020 • edited Loading

stitam commented Feb 2, 2020

Aariq commented Feb 5, 2020

stitam commented Feb 6, 2020

Aariq commented Feb 6, 2020

stitam commented Feb 7, 2020

andschar commented Feb 13, 2020

andschar commented Feb 13, 2020

Aariq commented Feb 26, 2020 • edited Loading

Aariq commented Mar 6, 2020 • edited Loading

stitam commented Mar 16, 2020

Aariq commented Mar 17, 2020

stitam commented Mar 20, 2020 • edited Loading

Aariq commented Mar 20, 2020

stitam commented Mar 25, 2020

Aariq commented Mar 26, 2020

stitam commented Mar 26, 2020

Aariq commented Apr 15, 2020 • edited Loading

Aariq commented Jan 8, 2020 •

edited

Loading

andschar commented Jan 29, 2020 •

edited

Loading

andschar commented Jan 31, 2020 •

edited

Loading

Aariq commented Feb 26, 2020 •

edited

Loading

Aariq commented Mar 6, 2020 •

edited

Loading

stitam commented Mar 20, 2020 •

edited

Loading

Aariq commented Apr 15, 2020 •

edited

Loading