Optimize /api/internal/add_nodes/ #1088

Closed

aronasorman opened this issue Nov 6, 2018 · 7 comments
@aronasorman
Collaborator

This one currently slows down @ralphiee22 when he uploads KA.

add_nodes is one of the slowest endpoints in the sushi chef upload process.
Let's make it faster!

Step 1 is to write a benchmark script inside deploy/chaos/ that pings this
endpoint continuously.
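
A minimal sketch of what such a benchmark script could look like, assuming a token-authenticated POST against a dev server; the URL, header, and payload shape below are placeholders for illustration, not the real chef payload:

```python
# Sketch of a benchmark script for deploy/chaos/ (hypothetical: the URL,
# auth header, and payload shape are placeholders, not the real chef payload).
import time

import requests

ADD_NODES_URL = "http://127.0.0.1:8080/api/internal/add_nodes"
HEADERS = {"Authorization": "Token <dev-token>"}

def time_one_request(session, payload):
    start = time.perf_counter()
    response = session.post(ADD_NODES_URL, headers=HEADERS, json=payload)
    response.raise_for_status()
    return time.perf_counter() - start

def run_benchmark(payload, iterations=100):
    session = requests.Session()
    timings = sorted(time_one_request(session, payload) for _ in range(iterations))
    print("median %.3fs, p95 %.3fs" % (
        timings[len(timings) // 2],
        timings[int(len(timings) * 0.95)],
    ))
```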

Once we've established a baseline, we can commence optimization. @jayoshih
recommends parallelizing the convert_data_to_nodes loop (see the sketch below).
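
One hedged sketch of what that parallelization could look like, assuming the per-node work inside convert_data_to_nodes can be pulled out into an independent, thread-safe function (create_one_node below is a hypothetical stand-in, not an existing helper):

```python
# Hypothetical sketch: create_one_node stands in for the per-node body of
# convert_data_to_nodes and is assumed to be safe to run from worker threads
# (Django gives each thread its own DB connection).
from concurrent.futures import ThreadPoolExecutor

def create_one_node(node_data, parent):
    # Placeholder for the real per-node creation logic.
    raise NotImplementedError

def convert_data_to_nodes_parallel(node_dicts, parent, max_workers=8):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(create_one_node, data, parent) for data in node_dicts]
        return [future.result() for future in futures]
```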

Category

ENHANCEMENT

@ivanistheone
Contributor

ivanistheone commented Nov 6, 2018

@lyw07 @aronasorman @ralphiee22 Here are some notes about possible optimizations for the ricecooker-->studio process.

Here is a time diagram from sushibar about a Khan Academy chef run:
[screenshot: sushibar time diagram of the KA chef run, 2018-11-06]

Not optimizable

There isn't much we can do about the download stage—the first time the chef runs, we have to download and compress the videos so it's normal if that takes a long time. Subsequent runs will not have to go through this whole process.

Optimizable

The real bottleneck is the /api/internal/add_nodes calls, of which there are as many as there are topic nodes in the channel. The code works by creating nodes level-by-level:

  1. first upload the root topic (studio returns the studio_id for the root topic created, which is known as channel.ricecooker_tree.id on Studio)
  2. upload the first level of children, by specifying the parent id obtained in step 1. Studio responds with the studio_ids for all first level children.
  3. upload the second level of children, by setting parent to studio_id obtained in step 2.
  4. continue for rest of levels

The reason for this level-by-level upload is that studio_ids are generated on Studio at the time the ContentNode object is created (see the default id generator), and ricecooker needs to know the IDs of the parent nodes in order to "attach" the child nodes.
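
To make the constraint concrete, here is a minimal sketch of that breadth-first flow (the endpoint payload keys and response keys are assumptions for illustration, not the exact wire format):

```python
# Sketch of the current level-by-level upload; "root_id", "content_data" and
# "studio_ids" are assumed key names, not the exact API contract.
import requests

ADD_NODES_URL = "https://studio.learningequality.org/api/internal/add_nodes"

def add_nodes(session, parent_studio_id, children):
    response = session.post(ADD_NODES_URL, json={
        "root_id": parent_studio_id,   # studio_id of the parent topic
        "content_data": children,      # serialized child node dicts
    })
    response.raise_for_status()
    return response.json()["studio_ids"]   # one id per child, in order (assumed)

def upload_level_by_level(session, root_dict, root_studio_id):
    # Each level's request must complete before the next level can start,
    # because the children need their parents' studio_ids.
    level = [(root_studio_id, root_dict.get("children", []))]
    while level:
        next_level = []
        for parent_id, children in level:
            if not children:
                continue
            studio_ids = add_nodes(session, parent_id, children)
            next_level.extend(
                (studio_id, child.get("children", []))
                for studio_id, child in zip(studio_ids, children)
            )
        level = next_level
```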

This suggests the following optimization: make ricecooker generate the studio_ids before uploading the content tree (we're using uuid.uuid4s so there shouldn't be a problem with conflicts).

If ricecooker knows the studio_ids up front, we wouldn't need to do the upload level-by-level, and could instead upload the entire tree data as a single file (~1 GB of JSON unzipped, ~50 MB zipped) and then let Studio create the tree "in one shot", possibly using something like delayed mptt updates.
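
A rough sketch of what that could look like on the ricecooker side, assuming we flatten the tree with pre-assigned uuid4 hex ids and compress the JSON before a single upload (the upload endpoint itself is out of scope here):

```python
# Sketch: pre-assign studio_ids with uuid4 so parent references are known
# before anything is uploaded, then ship the whole tree in one compressed blob.
import json
import uuid
import zlib

def flatten_with_ids(node, parent_id=None, out=None):
    out = [] if out is None else out
    node_id = uuid.uuid4().hex                 # pre-generated studio_id
    out.append({"id": node_id, "parent": parent_id, "data": node["data"]})
    for child in node.get("children", []):
        flatten_with_ids(child, parent_id=node_id, out=out)
    return out

def build_upload_blob(root):
    flat_tree = flatten_with_ids(root)         # parents already resolved
    raw = json.dumps(flat_tree).encode("utf-8")
    return zlib.compress(raw)                  # ~1 GB of JSON shrinks to tens of MB
```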

I wanted to flag this as the most viable long-term optimization, because each call to /api/internal/add_nodes does a bunch of operations that take time, and it would be hard to optimize them if those operations continue to be done node-by-node.

@aronasorman
Collaborator Author

make ricecooker generate the studio_ids before uploading the content tree (we're using uuid.uuid4s so there shouldn't be a problem with conflicts).

I agree in general with this, but I'm skeptical of having the client generate the canonical ID to be saved on the DB.

It might be better to have ricecooker generate ricecooker-local IDs that Studio uses only to refer to the relationships between nodes; Studio can then generate the UUIDs on its side, rather than depend on ricecooker.
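
For illustration, a payload along those lines might look something like this (the field names are hypothetical):

```python
# Hypothetical payload: local_id/parent express relationships only; Studio
# would mint the real UUIDs and return a local_id -> studio_id mapping.
payload = {
    "nodes": [
        {"local_id": "n1", "parent": None, "title": "Root topic"},
        {"local_id": "n2", "parent": "n1", "title": "Subtopic A"},
        {"local_id": "n3", "parent": "n2", "title": "Video 1"},
    ]
}
# Example response shape: {"n1": "<uuid>", "n2": "<uuid>", "n3": "<uuid>"}
```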

@ivanistheone
Contributor

I agree in general with this, but I'm skeptical of having the client generate the canonical ID to be saved on the DB.

Isn't that the whole point of UUIDs? (reducing the need for centralized id-generation)

It might be better to have ricecooker generate ricecooker-local IDs that Studio uses only to refer to the relationships between nodes; Studio can then generate the UUIDs on its side, rather than depend on ricecooker.

Yeah, I thought of that too, but that would require building some sort of mapping from local-ids to studio_ids, which is probably not worth it. Also, with this approach the level-by-level problem would still exist: we'd need to save nodes at level X before we can create nodes at level X+1, which closes the door on bulk-create style operations.

@kollivier
Contributor

I can't think of any issues that could be introduced by generating UUIDs from ricecooker. We already create db objects with pre-generated or hardcoded ids, such as the garbage collection node root id.

@aronasorman
Collaborator Author

I see this as a possible security bug. The main issue is overwriting already existing nodes on Studio. We could have a check first on all IDs to make sure they're not overwriting nodes on other channels. But then we'd need to check (in KA's case) whether the IDs we're uploading already exist on Studio. That smells like a performance issue to me.

Happy to discuss, maybe I'm just paranoid about having ricecooker clients generate IDs.

@kollivier
Contributor

kollivier commented Nov 7, 2018

If a UUID function is generating an ID that already exists, then by definition it's not generating a Universally Unique ID. :( With UUIDs, the database is not tracking what UUIDs have already been used, because it doesn't have to - the same UUID will never be generated twice.
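
As a back-of-envelope check of that claim (my numbers, not from the thread): with 122 random bits per uuid4, the birthday-bound probability of any collision among n ids is roughly n² / 2¹²³:

```python
# Rough birthday-bound estimate: probability of any collision among n uuid4s.
n = 10**9                    # a billion nodes, far more than any single channel
p = n**2 / 2.0**123          # ~9.4e-20
print(p)
```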

@rtibbles
Member

Superseded by #3041
