Feature/memory profiler#7983
kuke
left a comment
Two things need to be discussed: 1) whether the way memory is counted is reasonable; 2) there are too many outputs in the profiling report, and some of them may need to be removed.
    std::max(event_time, event_items[index].max_time);

    // total memory used
    event_items[index].total_time += event_memory_used;
total_time -> total_memory_used
    for (size_t j = 0; j < events_table[i].size(); ++j) {
      EventItem& event_item = events_table[i][j];

      app_total_time += event_item.total_time;
It is not correct to count the total time like this. Different events may overlap, and there may also be gaps between the end time of one event and the start time of the next.
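To make the overlap and gap concern concrete, here is a minimal sketch comparing three ways of aggregating event times. The `TimeSpan` struct and the three helper functions are hypothetical names used only for illustration; they are not part of the profiler in this PR, and timestamps are assumed to be in milliseconds.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical event record (not the PR's EventItem): start/end in ms.
struct TimeSpan {
  double start_ms;
  double end_ms;
};

// Sum of individual durations. Overlapping events are counted twice.
double SumOfDurations(const std::vector<TimeSpan>& spans) {
  double total = 0.0;
  for (const TimeSpan& s : spans) {
    assert(s.end_ms >= s.start_ms);
    total += s.end_ms - s.start_ms;
  }
  return total;
}

// Time covered by at least one event: merge overlapping intervals first.
double CoveredTime(std::vector<TimeSpan> spans) {
  if (spans.empty()) return 0.0;
  std::sort(spans.begin(), spans.end(),
            [](const TimeSpan& a, const TimeSpan& b) {
              return a.start_ms < b.start_ms;
            });
  double covered = 0.0;
  double cur_start = spans[0].start_ms;
  double cur_end = spans[0].end_ms;
  for (std::size_t i = 1; i < spans.size(); ++i) {
    if (spans[i].start_ms <= cur_end) {
      // Overlaps the current merged interval: extend it.
      cur_end = std::max(cur_end, spans[i].end_ms);
    } else {
      // Gap found: close the current interval and start a new one.
      covered += cur_end - cur_start;
      cur_start = spans[i].start_ms;
      cur_end = spans[i].end_ms;
    }
  }
  return covered + (cur_end - cur_start);
}

// Earliest start to latest end. Gaps between events are included.
double WallClockSpan(const std::vector<TimeSpan>& spans) {
  if (spans.empty()) return 0.0;
  double first = spans[0].start_ms;
  double last = spans[0].end_ms;
  for (const TimeSpan& s : spans) {
    first = std::min(first, s.start_ms);
    last = std::max(last, s.end_ms);
  }
  return last - first;
}
```

For the spans [0, 4], [2, 6], and [10, 12], the summed durations give 10 ms, the merged coverage gives 8 ms, and the wall-clock span gives 12 ms, so the three notions of "total time" disagree as soon as events overlap or leave gaps.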
      EventItem& event_item = events_table[i][j];

      app_total_time += event_item.total_time;
      app_total_memory += event_item.total_memory_used;
Could you please explain how the memory occupation of each operator is obtained? I suspect that memory is frequently allocated and freed, and often reused, in which case total_memory_used calculated this way would be meaningless.
I also have doubts about this. It would be better to provide some results and a reasonable analysis. You can use the
The time profiler now works with multithreading. Is this memory counting also suitable for multithreading?
chengduoZH
left a comment
It is great to measure memory usage with the profiler!
But counting memory usage is different from measuring elapsed time. Timing does not care about what happens inside the operator, whereas memory counting should record the peak memory used inside the operator.
Consider conv_op: it creates a col buffer to store the result of im2col, and when conv_op finishes, that memory is released.
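The difference between memory still held after an operator and the peak reached inside it can be sketched as follows. `PeakMemoryTracker` and `SimulateConvOp` are hypothetical names invented for this illustration, and the byte counts are made up; this is not how the PR's profiler or conv_op is implemented.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical tracker: records bytes currently in use and the peak so far.
class PeakMemoryTracker {
 public:
  void OnAlloc(std::size_t bytes) {
    current_ += bytes;
    peak_ = std::max(peak_, current_);
  }

  void OnFree(std::size_t bytes) {
    assert(bytes <= current_);  // cannot free more than is in use
    current_ -= bytes;
  }

  std::size_t current() const { return current_; }
  std::size_t peak() const { return peak_; }

 private:
  std::size_t current_ = 0;
  std::size_t peak_ = 0;
};

// Mimics the allocation pattern described for conv_op: an output buffer
// that outlives the operator, plus a temporary col buffer for im2col.
void SimulateConvOp(PeakMemoryTracker* tracker, std::size_t output_bytes,
                    std::size_t col_bytes) {
  tracker->OnAlloc(output_bytes);  // output stays allocated
  tracker->OnAlloc(col_bytes);     // temporary col buffer
  tracker->OnFree(col_bytes);      // released when the operator finishes
}
```

With an output of 100 bytes and a col buffer of 400 bytes, a measurement taken after the operator returns sees only 100 bytes in use, while the peak inside the operator was 500 bytes. Sampling memory only at event boundaries would miss the temporary buffer entirely.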
    double event_memory_used = rit->MemoryUsed(events[i][j]);
    double total_memory_used =
        static_cast<double>(rit->GetMemoryUsed()) / (1024 * 1024);
double total_memory_used = static_cast<double>(rit->GetMemoryUsed()) / (1024 * 1024);
==>
double total_memory_used = static_cast<double>(rit->GetMemoryUsed() + event_memory_used) * kMegabyte;
where kMegabyte equals 1.0 / 1024 / 1024.
event_memory_used is the memory allocated by this operator.
total_memory_used is the total memory that has been used up to now, so it should include event_memory_used.
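The suggested arithmetic can be sketched in isolation as below. `TotalMemoryUsedMB` is a hypothetical helper, `bytes_before_event` stands in for the value returned by `rit->GetMemoryUsed()`, and both inputs are assumed to be byte counts, which the thread does not state explicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Conversion factor from bytes to megabytes, as proposed in the review.
constexpr double kMegabyte = 1.0 / 1024 / 1024;

// Hypothetical helper: running total in MB, including the current event.
// Assumes both arguments are byte counts.
double TotalMemoryUsedMB(std::size_t bytes_before_event,
                         std::size_t event_bytes) {
  // Guard against unsigned overflow when adding the two counts.
  assert(bytes_before_event <=
         std::numeric_limits<std::size_t>::max() - event_bytes);
  return static_cast<double>(bytes_before_event + event_bytes) * kMegabyte;
}
```

For example, 1 MB already in use plus a 2 MB event yields a running total of 3 MB, whereas the original expression would have reported only the 1 MB that was in use before the event.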
We must overload the placement new operator to achieve this.
    // average time and memory used
    for (auto& item : event_items) {
      item.ave_time = item.total_time / item.calls;
      item.ave_memory_used = item.total_memory_used / item.calls;
I don't think item.ave_memory_used is necessary.
Average memory used is confusing.
I lean toward counting the memory used in the most common case.
Please use the latest profiler code.
I added the profiler to the uniform random op and ran only that one op; the results are below. They match expectations: an extra 0.002 MB is allocated because the allocator stores magic numbers that it uses to validate alloc and free.

-------------------------> Profiling Report <-------------------------
Place: CPU    Total Time: 9.31106ms    Total Memory: 2.99219MB    Sorted by total time in descending order in the same thread

| Event | Calls | Total | Min. | Max. | Ave. | Total Memory. | Min Memory. | Max Memory. |
|---|---|---|---|---|---|---|---|---|
| thread0::uniform_random | 1 | 5.17344 | 5.17344 | 5.17344 | 5.17344 | 0 | 2.99219 | 2.99219 |
| thread0::fetch | 1 | 1.42045 | 1.42045 | 1.42045 | 1.42045 | 2.99219 | 2.99219 | 2.99219 |
For comparison, to check whether the profiler accumulates correctly over a whole model, here are the test results from before the profiler was added: @QiJune dzhwinter/benchmark#67
maximum memory usage: 50622464 --> 43061248
maximum memory usage: 1729540096 --> 1132953600
maximum memory usage: 1275125760 --> 663941120
Results measured with the profiler, which can also be seen in the benchmark table: mnist batch=64, 21M; mnist batch=128, 41M.

Hack the memory profiler for memory debugging and benchmarking. While doing this work, I found that the structure of the profiler needs to be reworked to make it readable.
Use `with profiler.profiler('CPU', 'total') as prof` to wrap the code to be profiled. `profiler.reset_profiler()` can be used to clear the previous records. A simple usage is as follows:
Please see this for demo usage
dzhwinter/benchmark#80