-
Notifications
You must be signed in to change notification settings - Fork 529
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
systemtest/gencorpora: generate corpus metadata #8937
Conversation
This pull request does not have a backport label. Could you fix it @lahsivjar? 🙏
NOTE: |
8ecaf8f
to
23f64c4
Compare
23f64c4
to
272d911
Compare
📚 Go benchmark reportDiff with the
report generated with https://pkg.go.dev/golang.org/x/perf/cmd/benchstat |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few early comments. Looks good overall.
systemtest/gencorpora/catbulk.go
Outdated
for count := 0; scanner.Scan(); count++ { | ||
fmt.Fprintln(writer, scanner.Text()) | ||
if _, err := fmt.Fprintln(&buf, scanner.Text()); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if _, err := fmt.Fprintln(&buf, scanner.Text()); err != nil { | |
if _, err := buf.Write(scanner.Bytes()); err != nil { |
suggestion: I think we can avoid a byte->string conversion and use the buffer directly. WDYT ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will not work since the bulk API requires newline delimited JSON and scanner strips EOL markers. However, it is possible to optimize performance here by customizing the scanner -- I was being lazy here since it is not in the critical code path. Let me update this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
I have updated the code to split input expecting NDJSON doc with metadata followed by source. This should cleanup the code a bit and have less memory footprint. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice. LGTM.
Motivation/summary
A
corpora
is required for defining the rally track. The corpora schema requires a list of document files and each document has the following requirements:document-count
andsource-file
are mandatorycompressed-bytes
anduncompressed-bytes
is recommendedThis PR adds support to generate the document file with the above fields (other than
compressed-bytes
).Checklist
- [ ] Update CHANGELOG.asciidoc- [ ] Update package changelog.yml (only if changes toapmpackage
have been made)- [ ] Documentation has been updatedHow to test these changes
N/A
Related issues
Part of: #8754