Ricecooker Studio upload optimizations #231

Closed
ralphiee22 opened this issue Nov 14, 2019 · 5 comments
@ralphiee22
Contributor

  • ricecooker version: v1.0.0?

Description

This issue serves as a reminder and a place to discuss possible avenues for ricecooker optimizations.
Some things suggested already were:

  • compressing/gzipping the tree metadata before uploading it to Studio (an alternative to chunking requests)
  • generating content databases on the ricecooker side and uploading them to Studio, which Studio can then import into its main database
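The first suggestion can be sketched with the standard library alone. The function name `compress_tree_metadata` is illustrative, not part of ricecooker's API:

```python
import gzip
import json

def compress_tree_metadata(tree):
    """Serialize a channel tree (a JSON-serializable dict of node
    metadata) and gzip it, so one compressed blob can be uploaded
    instead of many chunked requests."""
    raw = json.dumps(tree, separators=(",", ":")).encode("utf-8")
    return gzip.compress(raw)

payload = compress_tree_metadata({"title": "Sample Channel", "children": []})
```

JSON metadata is highly repetitive, so gzip tends to compress it by roughly an order of magnitude, which is why a single compressed upload looks attractive.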
@ivanistheone
Contributor

+1 for this. There is no reason for it to take so long...

I know @lyw07 had previously thought about performance improvements for ricecooker along these lines.

The current implementation makes repeated calls to /api/internal/add_nodes, which takes hours for large channels. The node metadata is intentionally uploaded in small chunks to avoid network timeouts.
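For context, the existing chunked approach amounts to splitting the flat node list into small batches and POSTing each one. A minimal sketch (the batch size here is illustrative):

```python
def chunks(nodes, size=100):
    """Yield successive batches of node metadata; in the current scheme,
    each batch becomes one POST to /api/internal/add_nodes."""
    for i in range(0, len(nodes), size):
        yield nodes[i:i + size]

batches = list(chunks(list(range(250)), size=100))  # batches of 100, 100, 50
```

Each batch is one sequential round trip, which is where the hours go on large channels.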

Here is one possible way we could implement a "bulk upload" path:

  1. The ricecooker run proceeds as usual until it reaches this line, which gets replaced with a conditional:

         if num_nodes < 2000:
             self.add_nodes(root, self.channel)
         else:
             self.bulk_add_nodes(root, self.channel)

  2. In bulk_add_nodes, ricecooker creates a JSON tree of the entire channel (e.g. a 200MB channel.json), compresses it with gzip (~20MB channel.json.gz), and uploads it to a new endpoint, /api/internal/bulk_add_nodes. The endpoint responds immediately with a task_id of some sort.

  3. Studio then unzips the channel JSON and creates the Studio channel tree (which can take ~10 minutes). Meanwhile, ricecooker is still blocked in the bulk_add_nodes function and polls a task status endpoint to check on the task's progress.

  4. Once the bulk_add_nodes task completes, ricecooker continues on to finish_channel as usual.

I'm not sure how the Studio tasks API works, so maybe @kollivier can comment on feasibility/suitability for this purpose. The main thing is that processing the channel tree after the POST to /api/internal/bulk_add_nodes will take longer than one minute, so it needs to happen outside the request-response cycle.
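The upload-then-poll flow could look roughly like this. The endpoint path follows the proposal, but the response shapes (`task_id`, `status` fields) are assumptions, and the HTTP calls are injected as plain functions to keep the sketch self-contained:

```python
import gzip
import json
import time

def bulk_add_nodes(channel_tree, post, get_status, poll_interval=5):
    """Compress the whole channel tree, POST it once, then poll until
    the server-side task finishes.  `post(path, payload)` and
    `get_status(task_id)` stand in for real HTTP helpers."""
    payload = gzip.compress(json.dumps(channel_tree).encode("utf-8"))
    task_id = post("/api/internal/bulk_add_nodes", payload)["task_id"]
    while True:
        state = get_status(task_id)["status"]
        if state in ("SUCCESS", "FAILURE"):
            return state
        time.sleep(poll_interval)
```

Because the server hands back a task id immediately, the long tree-processing step never has to fit inside a single request-response cycle.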

@jayoshih
Contributor

Alternatively, we could write to a sqlite DB and read that (which might fit in nicely if we ever wanted to integrate the tools with Kolibri directly).
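A sketch of that direction, with an illustrative table layout (not Kolibri's actual schema):

```python
import json
import sqlite3

def write_nodes_to_sqlite(nodes, db_path):
    """Write flat channel node metadata into a sqlite file that Studio
    (or Kolibri) could ingest wholesale instead of in per-request
    chunks.  The schema here is illustrative only."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS contentnode ("
        " node_id TEXT PRIMARY KEY,"
        " parent_id TEXT,"
        " title TEXT,"
        " extra_fields TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO contentnode VALUES (?, ?, ?, ?)",
        [
            (n["node_id"], n.get("parent_id"), n["title"],
             json.dumps(n.get("extra_fields", {})))
            for n in nodes
        ],
    )
    conn.commit()
    conn.close()
```

The appeal is that a single sqlite file is both the transfer format and something the server can attach and bulk-import from directly.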

@ivanistheone
Contributor

Here is proof of concept for how a bulk_add_nodes could work: ivanistheone@8f66b0a

and session log:

Creating tree on Kolibri Studio...
   Creating channel Sample Ricecooker Channel
	Preparing fields...

1. Saved studio json tree to chefdata/trees/studio_json_tree.json
2. Compressing .... chefdata/trees/studio_json_tree.json
3. Bulk uploading chefdata/trees/studio_json_tree.json.gz to Studio /api/internal/bulk_add_nodes
4.     checking task status...
       checking task status...
       checking task status...
5. Done. (continuing as ricecooker process as usual)

@ralphiee22
Contributor Author

@kollivier kollivier changed the title Discuss Ricecooker optimizations Ricecooker Studio upload optimizations Dec 26, 2019
@kollivier kollivier added this to the 0.7 milestone Dec 26, 2019
@rtibbles
Member

Closing in favour of #321
