Handle multi-Tasks in input plugins by `ServiceDataSplitter` and related classes. #76
Conversation
@dmikurube The framework looks good to me 👍 Two comments for now.
@muga Thanks.
Good catch. It is not actually a dispatcher. :) Will rename.
Of course! Will do.
…ters in `hintPerTask`. It refactors the inner class of `DefaultServiceDataSplitterBuilder` out as `DefaultServiceDataSplitter`.
@muga Addressed the former of your comments in 19d2fbd. I then thought of adding examples, but I found that to provide an example, we may need to assume a certain specific web service and make a plugin for it. Or, do you have good ideas for examples?
I have one example for REST API based input plugins. Several external REST APIs can export accounts' data filtered by a time range like `'2000-01-01' <= created_at < '2017-01-01'`. If the range is too large, it might take a long time to ingest the data, because the ingestion runs as one single input task, and that one task is executed by one thread. If we can divide the large time range into monthly sub-ranges, those multiple smaller input tasks are executed in parallel by multiple threads (a rough sketch follows after this comment). How about this? About the name of …: … As for output plugins, I thought that this feature doesn't need to care about them, because an output plugin's …
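To make the monthly-split idea concrete, here is a minimal sketch in Java. All class and method names below are hypothetical illustrations, not part of this PR: it cuts the overall `created_at` range into per-month sub-ranges, one per input task.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not from this PR): split [from, to) into per-month
// sub-ranges so that each sub-range can back its own input task.
public class MonthlyRangeSplitter {
    public static List<LocalDate[]> split(LocalDate from, LocalDate to) {
        List<LocalDate[]> ranges = new ArrayList<>();
        LocalDate cursor = from;
        while (cursor.isBefore(to)) {
            // First day of the next month, clamped to the overall end.
            LocalDate next = cursor.withDayOfMonth(1).plusMonths(1);
            if (next.isAfter(to)) {
                next = to;
            }
            ranges.add(new LocalDate[] { cursor, next });  // [cursor, next)
            cursor = next;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // '2000-01-01' <= created_at < '2017-01-01' from the example above
        // yields 204 monthly sub-ranges, one per input task.
        System.out.println(split(LocalDate.of(2000, 1, 1), LocalDate.of(2017, 1, 1)).size());
    }
}
```

Each returned sub-range could then back one input task, so the smaller tasks run in parallel across threads.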
I understand the datetime example conceptually, but how do we implement that as a concrete, practical example? There are many kinds of datetime-based services in the world, and plugins work against actual data in the target service. In many cases, the plugin needs to get the actual range of datetimes first, and then split. Even if the configuration covers 2014-01-01 through 2016-12-31, should we still split the input per month if the actual data is only in 2015-04? Also, service A may explicitly take the beginning and the end as dates in UTC formatted as "20170418", while service B may take dates/times with timezones formatted as "Apr 18, 2017 4:12pm". There are many variations. Finally, even if we assume the entire range is given in the configuration, do we know which parameter contains the range? We can pass just … What is good for examples: that's my question. It will depend on the very specific service; it cannot be general. Examples need an assumption about how the service deals with datetimes. (I mentioned "output" just to compare.) For the name: "what to split" is the entire data. We're splitting the entire data into (for) tasks. We're not splitting a big …
I agree with you; I also don't think that we can provide a general example. But it's OK for me to provide an example for a specific service. By seeing it, plugin developers could at least understand how to split a datetime range. It's also OK for me not to prepare an example for now. If we find a good example later, we can write it then.
In my example above, yes, a user needs to split the datetime range per month. As another idea for this case, we can split the datetime range on a daily basis and share 2015-04 across all tasks to avoid a kind of skew issue.
Yes, I think that plugins need to have that calculation. The base library should not support it.
In my understanding, … For example, if we have an Embulk config like the one sketched below:
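As a hypothetical sketch of such a config (the plugin type and every field name here are made up for illustration):

```yaml
in:
  type: some_rest_service        # hypothetical input plugin
  auth_token: XXXXXXXX
  # The overall range to ingest; the plugin would split this into monthly tasks.
  from_datetime: "2000-01-01"
  to_datetime: "2017-01-01"
out:
  type: stdout
```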
Here, … So, … How about this? It seems that the role of … is …
@muga Okay, we won't have such an example very soon. Examples are to be considered later in #78. For the usage:
It can be implemented like that, but that is not my intention. We've seen memory consumption issues when listing all indexes in input-s3. We don't want to embed all sub-indexes in the global … For your case: …
Or, just to share the range size: … (a rough sketch of this approach follows below)
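A minimal Java sketch of the "share only the range size" idea (names are hypothetical; this is not the actual `ServiceDataSplitter` API): the global task carries only the overall range, and each task derives its own monthly sub-range from its task index.

```java
import java.time.LocalDate;

// Hypothetical splitter (names are illustrative, not the actual
// ServiceDataSplitter API): the global task keeps only the overall
// [from, to) range, and each task recomputes its own monthly sub-range
// from its index instead of embedding every sub-range in the global task.
public class SharedRangeSplitter {
    private final LocalDate from;
    private final LocalDate to;

    public SharedRangeSplitter(LocalDate from, LocalDate to) {
        this.from = from;
        this.to = to;
    }

    // Number of tasks == number of calendar months touched by [from, to).
    public int numberOfTasks() {
        int months = (to.getYear() - from.getYear()) * 12
                + (to.getMonthValue() - from.getMonthValue());
        return to.getDayOfMonth() > 1 ? months + 1 : months;
    }

    // Deterministic sub-range for the taskIndex-th task, clamped to [from, to).
    public LocalDate[] subRangeFor(int taskIndex) {
        LocalDate start = from.withDayOfMonth(1).plusMonths(taskIndex);
        LocalDate end = start.plusMonths(1);
        if (start.isBefore(from)) {
            start = from;
        }
        if (end.isAfter(to)) {
            end = to;
        }
        return new LocalDate[] { start, end };  // [start, end)
    }
}
```

Because a task's sub-range is derived deterministically from its index, a retried task (as under the MapReduce executor mentioned in the next comment) computes the same sub-range again.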
Yes, your idea works as long as the same input task can always take the same datetime sub-range, because input tasks are sometimes retried on Hadoop when using the MapReduce executor. In your case I think that holds, so you can merge this PR 👍
Thanks! Merging this PR, and releasing v0.5.0 soon. Let's start an (experimental) implementation with …
Yes, it's good to me 👍
@muga Kind of WIP, but it compiles.