Optimize /api/internal/add_nodes/ #1088

Closed

aronasorman opened this issue Nov 6, 2018 · 7 comments
@aronasorman
Collaborator

This one currently slows down @ralphiee22 when he uploads KA.

add_nodes is one of the slowest endpoints in the sushi chef upload process.
Let's make it faster!

Step 1 is to write a benchmark script inside deploy/chaos/ that pings this
endpoint continuously.
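
A minimal sketch of what such a benchmark script could look like, assuming a token-authenticated POST against a dev server; the URL, header, and payload shape below are placeholders for illustration, not the real chef payload:

```python
# Sketch of a benchmark script for deploy/chaos/ (hypothetical: the URL,
# auth header, and payload shape are placeholders, not the real chef payload).
import time

import requests

ADD_NODES_URL = "http://127.0.0.1:8080/api/internal/add_nodes"
HEADERS = {"Authorization": "Token <dev-token>"}

def time_one_request(session, payload):
    start = time.perf_counter()
    response = session.post(ADD_NODES_URL, headers=HEADERS, json=payload)
    response.raise_for_status()
    return time.perf_counter() - start

def run_benchmark(payload, iterations=100):
    session = requests.Session()
    timings = sorted(time_one_request(session, payload) for _ in range(iterations))
    print("median %.3fs, p95 %.3fs" % (
        timings[len(timings) // 2],
        timings[int(len(timings) * 0.95)],
    ))
```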

Once we've established a baseline, we can commence optimization. @jayoshih
recommends parallelizing the convert_data_to_nodes loop (see the sketch below).
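
One hedged sketch of what that parallelization could look like, assuming the per-node work inside convert_data_to_nodes can be pulled out into an independent, thread-safe function (create_one_node below is a hypothetical stand-in, not an existing helper):

```python
# Hypothetical sketch: create_one_node stands in for the per-node body of
# convert_data_to_nodes and is assumed to be safe to run from worker threads
# (Django gives each thread its own DB connection).
from concurrent.futures import ThreadPoolExecutor

def create_one_node(node_data, parent):
    # Placeholder for the real per-node creation logic.
    raise NotImplementedError

def convert_data_to_nodes_parallel(node_dicts, parent, max_workers=8):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(create_one_node, data, parent) for data in node_dicts]
        return [future.result() for future in futures]
```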

Category

ENHANCEMENT

@ivanistheone
Contributor

ivanistheone commented Nov 6, 2018

@lyw07 @aronasorman @ralphiee22 Here are some notes about possible optimizations for the ricecooker-->studio process.

Here is a time diagram from sushibar about a Khan Academy chef run:
[screenshot: sushibar time diagram of the KA chef run, 2018-11-06]

Not optimizable

There isn't much we can do about the download stage—the first time the chef runs, we have to download and compress the videos so it's normal if that takes a long time. Subsequent runs will not have to go through this whole process.

Optimizable

The real bottleneck is the /api/internal/add_nodes calls, of which there are as many as there are topic nodes in the channel. The code works by creating nodes level-by-level:

  1. first upload the root topic (studio returns the studio_id for the root topic created, which is known as channel.ricecooker_tree.id on Studio)
  2. upload the first level of children, by specifying the parent id obtained in step 1. Studio responds with the studio_ids for all first level children.
  3. upload the second level of children, by setting parent to studio_id obtained in step 2.
  4. continue for rest of levels

The reason for this level-by-level upload is that studio_ids are generated on Studio at the time the ContentNode object is created (see the default id generator), and ricecooker needs to know the IDs of the parent nodes in order to "attach" the child nodes.
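
To make the constraint concrete, here is a minimal sketch of that breadth-first flow (the endpoint payload keys and response keys are assumptions for illustration, not the exact wire format):

```python
# Sketch of the current level-by-level upload; "root_id", "content_data" and
# "studio_ids" are assumed key names, not the exact API contract.
import requests

ADD_NODES_URL = "https://studio.learningequality.org/api/internal/add_nodes"

def add_nodes(session, parent_studio_id, children):
    response = session.post(ADD_NODES_URL, json={
        "root_id": parent_studio_id,   # studio_id of the parent topic
        "content_data": children,      # serialized child node dicts
    })
    response.raise_for_status()
    return response.json()["studio_ids"]   # one id per child, in order (assumed)

def upload_level_by_level(session, root_dict, root_studio_id):
    # Each level's request must complete before the next level can start,
    # because the children need their parents' studio_ids.
    level = [(root_studio_id, root_dict.get("children", []))]
    while level:
        next_level = []
        for parent_id, children in level:
            if not children:
                continue
            studio_ids = add_nodes(session, parent_id, children)
            next_level.extend(
                (studio_id, child.get("children", []))
                for studio_id, child in zip(studio_ids, children)
            )
        level = next_level
```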

This suggests the following optimization: make ricecooker generate the studio_ids before uploading the content tree (we're using uuid.uuid4s so there shouldn't be a problem with conflicts).

If ricecooker knows the studio_ids up front, we wouldn't need to do the upload level-by-level, and could instead upload the entire tree data as a single file (~1 GB of JSON unzipped, ~50 MB zipped) and then let Studio create the tree "in one shot", possibly using something like delayed mptt updates.
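
A rough sketch of what that could look like on the ricecooker side, assuming we flatten the tree with pre-assigned uuid4 hex ids and compress the JSON before a single upload (the upload endpoint itself is out of scope here):

```python
# Sketch: pre-assign studio_ids with uuid4 so parent references are known
# before anything is uploaded, then ship the whole tree in one compressed blob.
import json
import uuid
import zlib

def flatten_with_ids(node, parent_id=None, out=None):
    out = [] if out is None else out
    node_id = uuid.uuid4().hex                 # pre-generated studio_id
    out.append({"id": node_id, "parent": parent_id, "data": node["data"]})
    for child in node.get("children", []):
        flatten_with_ids(child, parent_id=node_id, out=out)
    return out

def build_upload_blob(root):
    flat_tree = flatten_with_ids(root)         # parents already resolved
    raw = json.dumps(flat_tree).encode("utf-8")
    return zlib.compress(raw)                  # ~1 GB of JSON shrinks to tens of MB
```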

I wanted to flag this as the most viable long-term optimization, because each call to /api/internal/add_nodes does a bunch of operations that take time, and it would be hard to optimize them if those operations continue to be done node-by-node.

@aronasorman
Collaborator Author

make ricecooker generate the studio_ids before uploading the content tree (we're using uuid.uuid4s so there shouldn't be a problem with conflicts).

I agree in general with this, but I'm skeptical of having the client generate the canonical ID to be saved on the DB.

It might be better to have ricecooker generate ricecooker-local IDs that Studio uses only to refer to the relationships between nodes; Studio can then generate the UUIDs on its side, rather than depend on ricecooker.
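
For illustration, a payload along those lines might look something like this (the field names are hypothetical):

```python
# Hypothetical payload: local_id/parent express relationships only; Studio
# would mint the real UUIDs and return a local_id -> studio_id mapping.
payload = {
    "nodes": [
        {"local_id": "n1", "parent": None, "title": "Root topic"},
        {"local_id": "n2", "parent": "n1", "title": "Subtopic A"},
        {"local_id": "n3", "parent": "n2", "title": "Video 1"},
    ]
}
# Example response shape: {"n1": "<uuid>", "n2": "<uuid>", "n3": "<uuid>"}
```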

@ivanistheone
Contributor

I agree in general with this, but I'm skeptical of having the client generate the canonical ID to be saved on the DB.

Isn't that the whole point of UUIDs? (reducing the need for centralized id-generation)

It might be better to have ricecooker generate ricecooker-local IDs that Studio uses only to refer to the relationships between nodes; Studio can then generate the UUIDs on its side, rather than depend on ricecooker.

Yeah, I thought of that too, but that would require building some sort of mapping from local-ids to studio_ids, which is probably not worth it. Also, with this approach the level-by-level problem would still exist: we'd need to save nodes at level X before we can create nodes at level X+1, which closes the door on bulk-create style operations.

@kollivier
Contributor

I can't think of any issues that could be introduced by generating UUIDs from ricecooker. We already create db objects with pre-generated or hardcoded ids, such as the garbage collection node root id.

@aronasorman
Collaborator Author

I see this as a possible security bug. The main issue is overwriting already existing nodes on Studio. We could have a check first on all IDs to make sure they're not overwriting nodes on other channels. But then we'd need to check (in KA's case) whether the IDs we're uploading already exist on Studio. That smells like a performance issue to me.

Happy to discuss, maybe I'm just paranoid about having ricecooker clients generate IDs.

@kollivier
Contributor

kollivier commented Nov 7, 2018

If a UUID function is generating an ID that already exists, then by definition it's not generating a Universally Unique ID. :( With UUIDs, the database is not tracking what UUIDs have already been used, because it doesn't have to - the same UUID will never be generated twice.
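
As a back-of-envelope check of that claim (my numbers, not from the thread): with 122 random bits per uuid4, the birthday-bound probability of any collision among n ids is roughly n² / 2¹²³:

```python
# Rough birthday-bound estimate: probability of any collision among n uuid4s.
n = 10**9                    # a billion nodes, far more than any single channel
p = n**2 / 2.0**123          # ~9.4e-20
print(p)
```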

@rtibbles
Member

Superseded by #3041
