Fix SingleStream and MultiStream reciprocal score comparison for TEST04 #1491

Open
wants to merge 5 commits into master

4 changes: 0 additions & 4 deletions compliance/nvidia/TEST04/verify_performance.py
@@ -55,13 +55,11 @@ def main():
if ref_mode == "SingleStream":
if re.match(".*Early stopping 90th percentile estimate", line):
ref_score = line.split(": ", 1)[1].strip()
ref_score = 1e9 / float(ref_score)
Contributor:

This change can potentially break the check below, as that check assumes a throughput metric for all scenarios:
https://github.com/mlcommons/inference/pull/1491/files#diff-5c101fd75c9062a7dec72722d5f4aafe66e7d55fbc32f97378b74c08034c272cR141
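
To make the concern concrete, here is a minimal Python sketch of a one-sided, higher-is-better tolerance check (a stand-in for the check at the R141 link above, not the actual code): comparing raw latencies against such a bound lets a slowdown pass, while the reciprocal throughput form catches it.

    def passes(ref_score, test_score, tolerance=0.90):
        # One-sided check in the higher-is-better (throughput) sense,
        # assumed to mirror the check referenced at R141 above.
        return test_score >= ref_score * tolerance

    # As throughput (with the 1e9 / latency_ns reciprocal), a 2x slowdown fails:
    passes(1e9 / 1_000_000, 1e9 / 2_000_000)  # False: regression caught
    # As a raw latency in ns (reciprocal removed), the same slowdown passes,
    # since a larger latency trivially clears a higher-is-better lower bound:
    passes(1_000_000, 2_000_000)              # True: regression missed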

Contributor:

Potentially, but then why does a similar approach (not using reciprocals) not cause any issue for TEST01 and TEST05?

Contributor:

TEST01 checks both the upper and lower bounds, so no reciprocal is needed there, though the check is stricter than required.

TEST05 also had a similar check, but it was removed in a PR before 3.1, so TEST05 is currently broken for the Offline scenario. This PR should fix that.
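
For contrast, a minimal sketch of the two-sided style of check described for TEST01 (an illustration, not the actual TEST01 source): because both bounds are enforced, it behaves the same whether the score is a latency or a throughput, so no reciprocal is needed.

    def passes_two_sided(ref_score, test_score, tolerance=0.10):
        # Bound the test score on both sides of the reference score; this
        # works unchanged for latency (lower is better) and for throughput.
        lower = ref_score * (1 - tolerance)
        upper = ref_score * (1 + tolerance)
        return lower <= test_score <= upper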

Contributor:

Since the TEST05-related PR has been merged, should we handle TEST04 the same way?

Contributor:

That looks fine to me, but I don't know the full reasoning behind taking this reciprocal. @nv-ananjappa, any suggestions here?
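
For context on what the removed lines computed: the early-stopping estimate parsed from the log is a per-query latency in nanoseconds, and the reciprocal converted it to a queries-per-second throughput. A minimal illustration with a hypothetical value:

    latency_ns = 12_345_678.0          # hypothetical 90th percentile estimate
    throughput_qps = 1e9 / latency_ns  # ~81.0 queries per second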

             continue

     if ref_mode == "MultiStream":
         if re.match(".*Early stopping 99th percentile estimate", line):
             ref_score = line.split(": ", 1)[1].strip()
-            ref_score = 1e9 / float(ref_score)
             continue

     if ref_mode == "Server":
@@ -94,13 +92,11 @@ def main():
if test_mode == "SingleStream":
if re.match(".*Early stopping 90th percentile estimate", line):
test_score = line.split(": ", 1)[1].strip()
test_score = 1e9 / float(test_score)
continue

if test_mode == "MultiStream":
if re.match(".*Early stopping 99th percentile estimate", line):
test_score = line.split(": ", 1)[1].strip()
test_score = 1e9 / float(test_score)
continue

if test_mode == "Server":
12 changes: 6 additions & 6 deletions loadgen/issue_query_controller.cc
@@ -459,8 +459,8 @@ void IssueQueryController::IssueQueriesInternal(size_t query_stride,
 #if USE_NEW_LOGGING_FORMAT
       std::stringstream ss;
       ss << "IssueQueryThread " << thread_idx
-         << " Ending early: Too many outstanding queries." << " issued "
-         << queries_issued_total << " outstanding "
+         << " Ending early: Too many outstanding queries."
+         << " issued " << queries_issued_total << " outstanding "
          << queries_outstanding;
       MLPERF_LOG_ERROR(detail, "error_runtime", ss.str());
 #else
@@ -499,8 +499,8 @@ void IssueQueryController::IssueQueriesInternal(size_t query_stride,
 #if USE_NEW_LOGGING_FORMAT
       std::stringstream ss;
       ss << "IssueQueryThread " << thread_idx
-         << " Ending early: Max query count reached." << " query_count "
-         << queries_issued;
+         << " Ending early: Max query count reached."
+         << " query_count " << queries_issued;
       MLPERF_LOG_ERROR(detail, "error_runtime", ss.str());
 #else
       detail.Error("IssueQueryThread ", std::to_string(thread_idx),
@@ -519,8 +519,8 @@ void IssueQueryController::IssueQueriesInternal(size_t query_stride,
 #if USE_NEW_LOGGING_FORMAT
       std::stringstream ss;
       ss << "IssueQueryThread " << thread_idx
-         << " Ending early: Max test duration reached." << " duration_ns "
-         << duration.count();
+         << " Ending early: Max test duration reached."
+         << " duration_ns " << duration.count();
       MLPERF_LOG_ERROR(detail, "error_runtime", ss.str());
 #else
       detail.Error("IssueQueryThread ", std::to_string(thread_idx),
3 changes: 2 additions & 1 deletion loadgen/logging.cc
@@ -812,7 +812,8 @@ void Logger::CollectTlsLoggerStats(TlsLogger* tls_logger) {
   if (max_entry_vector_size > kTlsLogReservedEntryCount) {
 #if USE_NEW_LOGGING_FORMAT
     std::stringstream msg;
-    msg << "Logging allocation detected:" << " tid: " << tls_logger->Tid()
+    msg << "Logging allocation detected:"
+        << " tid: " << tls_logger->Tid()
         << " reserved_entries: " << kTlsLogReservedEntryCount
         << " max_entries: " << max_entry_vector_size;
     MLPERF_LOG_WARNING((*this), "warning_generic_message", msg.str());
51 changes: 33 additions & 18 deletions loadgen/logging.h
@@ -119,20 +119,26 @@ class ChromeTracer {
   void AddCompleteEvent(const std::string& name, uint64_t pid, uint64_t tid,
                         PerfClock::time_point start, PerfClock::time_point end,
                         const Args... args) {
-    *out_ << "{\"name\":\"" << name << "\"," << "\"ph\":\"X\","
-          << "\"pid\":" << pid << "," << "\"tid\":" << tid << ","
+    *out_ << "{\"name\":\"" << name << "\","
+          << "\"ph\":\"X\","
+          << "\"pid\":" << pid << ","
+          << "\"tid\":" << tid << ","
           << "\"ts\":" << Micros(start - origin_).count() << ","
-          << "\"dur\":" << Micros(end - start).count() << "," << "\"args\":{";
+          << "\"dur\":" << Micros(end - start).count() << ","
+          << "\"args\":{";
     AddArgs(args...);
     *out_ << "}},\n";
   }

   template <typename... Args>
   void AddAsyncBeginEvent(const std::string& name, uint64_t pid, uint64_t id,
                           PerfClock::time_point time, const Args... args) {
-    *out_ << "{\"name\":\"" << name << "\"," << "\"cat\":\"default\","
-          << "\"ph\":\"b\"," << "\"pid\":" << pid << "," << "\"id\":" << id
-          << "," << "\"ts\":" << Micros(time - origin_).count() << ","
+    *out_ << "{\"name\":\"" << name << "\","
+          << "\"cat\":\"default\","
+          << "\"ph\":\"b\","
+          << "\"pid\":" << pid << ","
+          << "\"id\":" << id << ","
+          << "\"ts\":" << Micros(time - origin_).count() << ","
           << "\"args\":{";
     AddArgs(args...);
     *out_ << "}},\n";
@@ -141,9 +147,12 @@ class ChromeTracer {
   template <typename... Args>
   void AddAsyncInstantEvent(const std::string& name, uint64_t pid, uint64_t id,
                             PerfClock::time_point time, const Args... args) {
-    *out_ << "{\"name\":\"" << name << "\"," << "\"cat\":\"default\","
-          << "\"ph\":\"n\"," << "\"pid\":" << pid << "," << "\"id\":" << id
-          << "," << "\"ts\":" << Micros(time - origin_).count() << ","
+    *out_ << "{\"name\":\"" << name << "\","
+          << "\"cat\":\"default\","
+          << "\"ph\":\"n\","
+          << "\"pid\":" << pid << ","
+          << "\"id\":" << id << ","
+          << "\"ts\":" << Micros(time - origin_).count() << ","
           << "\"args\":{";
     AddArgs(args...);
     *out_ << "}},\n";
@@ -152,15 +161,19 @@ class ChromeTracer {
   template <typename... Args>
   void AddAsyncEndEvent(const std::string& name, uint64_t pid, uint64_t id,
                         PerfClock::time_point time) {
-    *out_ << "{\"name\":\"" << name << "\"," << "\"cat\":\"default\","
-          << "\"ph\":\"e\", " << "\"pid\":" << pid << "," << "\"id\":" << id
-          << "," << "\"ts\":" << Micros(time - origin_).count() << "},\n";
+    *out_ << "{\"name\":\"" << name << "\","
+          << "\"cat\":\"default\","
+          << "\"ph\":\"e\", "
+          << "\"pid\":" << pid << ","
+          << "\"id\":" << id << ","
+          << "\"ts\":" << Micros(time - origin_).count() << "},\n";
   }

   template <typename... Args>
   void AddCounterEvent(const std::string& name, uint64_t pid,
                        PerfClock::time_point time, const Args... args) {
-    *out_ << "{\"name\":\"" << name << "\"," << "\"ph\": \"C\","
+    *out_ << "{\"name\":\"" << name << "\","
+          << "\"ph\": \"C\","
           << "\"pid\":" << pid << ","
           << "\"ts\":" << Micros(time - origin_).count() << ","
           << "\"args\":{ ";
@@ -720,13 +733,15 @@ void AsyncLog::LogDetail(const std::string& key, const T& value,
   }
   auto time_ns = (log_detail_time_ - log_origin_).count();
   for (auto os : detail_streams) {
-    *os << ":::MLLOG {" << "\"key\": " << ArgValueTransform(key) << ", "
+    *os << ":::MLLOG {"
+        << "\"key\": " << ArgValueTransform(key) << ", "
         << "\"value\": " << ArgValueTransform(value) << ", "
         << "\"time_ms\": " << ArgValueTransform(time_ns / 1000000ULL) << "."
         << std::setfill('0') << std::setw(6)
         << ArgValueTransform(time_ns % 1000000ULL) << ", "
         << "\"namespace\": \"mlperf::logging\", "
-        << "\"event_type\": \"POINT_IN_TIME\", " << "\"metadata\": {"
+        << "\"event_type\": \"POINT_IN_TIME\", "
+        << "\"metadata\": {"
         << "\"is_error\": " << ArgValueTransform(error_flagged_) << ", "
         << "\"is_warning\": " << ArgValueTransform(warning_flagged_) << ", "
         << "\"file\": \"" << file_name << "\", "
@@ -755,9 +770,9 @@ void AsyncLog::LogDetail(const std::string& message, const Args... args) {
     detail_streams.pop_back();
   }
   for (auto os : detail_streams) {
-    *os << "\"pid\": " << current_pid_ << ", " << "\"tid\": " << current_tid_
-        << ", " << "\"ts\": " << (log_detail_time_ - log_origin_).count()
-        << "ns : ";
+    *os << "\"pid\": " << current_pid_ << ", "
+        << "\"tid\": " << current_tid_ << ", "
+        << "\"ts\": " << (log_detail_time_ - log_origin_).count() << "ns : ";
     if (error_flagged_) {
       *os << "ERROR : ";
     } else if (warning_flagged_) {