Conversation


@jglick jglick commented Feb 12, 2018

JEP-206

Extracted from #29. Downstream of #78.

private static final String COOKIE = "JENKINS_SERVER_COOKIE";

/**
* Charset name to use for transcoding, or the empty string for node system default, or null for no transcoding.
Member

Wait, empty string and null have different handling?

Uh..... 🐛 If the behavior is different, that should be a distinct string value, e.g. 'DEFAULT' or 'PRESERVE', because otherwise this is going to trip us up somewhere.

Member Author

It is internal only, but sure a constant could be introduced for clarity.
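For illustration, the suggested constant might look like this; the name is hypothetical and this is only a sketch of the idea, not necessarily what the PR ends up with:

// Named constant for the empty-string case, so the three modes (explicit
// charset name, node system default, no transcoding) read clearly.
/** Value of the internal charset field meaning "decode using the node's system default charset". */
private static final String NODE_DEFAULT_CHARSET = "";
// null continues to mean "no transcoding"; any other value is a Charset name.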

if (cs.equals(StandardCharsets.UTF_8)) { // transcoding unnecessary as output was already UTF-8
return null;
} else { // decode output in specified charset and reëncode in UTF-8
return StandardCharsets.UTF_8.encode(cs.decode(ByteBuffer.wrap(data)));
Member

🐛 We should do everything we can to allocate and reuse a buffer here, e.g. keep a transient byte[]/Buffer for storing the encoded output rather than calling StandardCharsets.UTF_8.encode each time; otherwise we're going to generate massive amounts of memory garbage.

Member Author

Well, hardly “massive”, as this is typically dwarfed by the unrelated overhead of launching and joining processes, but I can check whether some premature optimization is possible here if you care.

Member

These are separate paths for optimization: process launch/join has some fixed overheads and some scaling issues that are addressed by changes in the algorithm.

If we're introducing new APIs, though, we should ensure the API is friendly to efficient implementations; coupling directly to a newly-allocated-and-not-reused byte[] will bite us.
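For reference, a buffer-reusing variant of the transcoding step could look roughly like the following. This is only a sketch of the approach svanoort describes, not the PR's code; the method name and chunk size are arbitrary:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Decodes {@code data} from {@code cs} and re-encodes it to UTF-8 a chunk at a time into {@code sink}. */
static void transcodeToUtf8(byte[] data, Charset cs, OutputStream sink) throws IOException {
    Reader reader = new InputStreamReader(new ByteArrayInputStream(data), cs); // streaming decode
    Writer writer = new OutputStreamWriter(sink, StandardCharsets.UTF_8);      // streaming encode
    char[] chunk = new char[8192]; // one fixed buffer reused for every chunk
    int n;
    while ((n = reader.read(chunk)) != -1) {
        writer.write(chunk, 0, n);
    }
    writer.flush();
}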

@Override public byte[] getOutput(FilePath workspace, Launcher launcher) throws IOException, InterruptedException {
return getOutputFile(workspace).act(new MasterToSlaveFileCallable<byte[]>() {
@Override public byte[] invoke(File f, VirtualChannel channel) throws IOException, InterruptedException {
byte[] buf = FileUtils.readFileToByteArray(f);
Member

🐛 This allocates and reads the whole file rather than streaming it (or processing a chunk at a time with a buffer). That is a problem because it will generate tons of memory garbage and may cause excessive memory spikes that impact stability.

Also, where possible, streaming APIs are just a more natural fit with our normal logging model.

Member Author

This is only used for returnStdout: true, where we are returning a String to the program anyway. There is no reason to even attempt to stream anything.

Member

+1 to @jglick. Streaming would be reasonable if the operation supported parameters (e.g. a tail option), but as it stands the change does not make the code any worse.
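To make the point concrete, the consumption path looks roughly like this (simplified sketch, not the plugin's exact code):

import hudson.FilePath;
import hudson.Launcher;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.jenkinsci.plugins.durabletask.Controller;

/** Sketch of the returnStdout path: the whole output becomes one String anyway, so streaming buys nothing. */
static String readResult(Controller controller, FilePath workspace, Launcher launcher)
        throws IOException, InterruptedException {
    byte[] out = controller.getOutput(workspace, launcher); // entire captured output, already UTF-8
    return new String(out, StandardCharsets.UTF_8);         // value handed back to the Pipeline program
}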

if (toRead > Integer.MAX_VALUE) { // >2Gb of output at once is unlikely
throw new IOException("large reads not yet implemented");
}
// TODO is this efficient for large amounts of output? Would it be better to stream data, or return a byte[] from the callable?
Member

I agree with this comment... really really this is the place to stream it.

Member Author

See the other set of PRs which deprecate this code path anyway.

@svanoort svanoort left a comment

Main issue: allocating and using large byte arrays rather than streaming data. This was bad when we were doing it once, but now we're potentially doing it multiple times... I have no objection if we use limited buffers/byte arrays to process a chunk at a time in a streaming fashion.


jglick commented Feb 19, 2018

process a chunk at a time in a streaming fashion

That happens in #60. Prior to that, the current output processing mode of FileMonitoringController is not streaming. This PR does not change that.


jglick commented Feb 19, 2018

So to summarize: if this is incorporated into #62, which performs streaming transcoding where necessary, then memory buffers are allocated only if you use the relatively uncommon mode returnStdout: true (in which case we are already allocating a String, and doing more with it besides, typically saving it to program.dat); and even then only when the process was actually using a non-UTF-8 output encoding, which in practice means stuff running on Windows. This does not seem like a valuable target for optimization.

@jglick jglick requested review from svanoort and removed request for abayer and rsandell February 19, 2018 18:54

jglick commented Feb 19, 2018

sorry @abayer & @rsandell, you cannot be reviewers any more

@oleg-nenashev oleg-nenashev left a comment

LGTM, excepting the test code. Some refactoring would be helpful there.

🐝 anyway

* @param cs the character set in which process output is expected to be
*/
public void charset(@Nonnull Charset cs) {
// by default, ignore
Member

Add some FINE/DEBUG logging to indicate that the method is not implemented?

Member Author

Could do that. In practice there are only three implementations anyway, all in this plugin.
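If that logging were added, it could be as small as the following (a sketch only; the LOGGER field is assumed, and DurableTask may not actually declare one):

import java.nio.charset.Charset;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.annotation.Nonnull;

private static final Logger LOGGER = Logger.getLogger(DurableTask.class.getName());

/** Default implementation: the charset request is still ignored, but now visible at FINE. */
public void charset(@Nonnull Charset cs) {
    LOGGER.log(Level.FINE, "charset({0}) ignored by {1}", new Object[] {cs, getClass().getName()});
}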

* If not called, no translation is performed.
*/
public void defaultCharset() {
// by default, ignore
Member

same

}

@Issue("JENKINS-31096")
@Test public void encoding() throws Exception {
Member

This test is unreadable IMHO. Would it be possible to split it into several tests, or at least explicitly indicate the test stages? The assert logic could also be refactored into a separate method.

Member Author

Yes, it could be refactored to use a helper assertion method with parameters.
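Such a helper might take a shape like this (hypothetical signature; the fixtures and the actual assertions in the test differ):

import hudson.EnvVars;
import hudson.FilePath;
import hudson.Launcher;
import hudson.model.TaskListener;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import org.jenkinsci.plugins.durabletask.Controller;
import org.jenkinsci.plugins.durabletask.DurableTask;
import static org.junit.Assert.assertEquals;

private static void assertOutputDecodedAs(DurableTask task, Charset charset, String expected,
        FilePath ws, Launcher launcher, TaskListener listener) throws Exception {
    if (charset != null) {
        task.charset(charset);     // request transcoding from this charset
    } else {
        task.defaultCharset();     // request transcoding from the node system default
    }
    Controller c = task.launch(new EnvVars(), ws, launcher, listener);
    while (c.exitStatus(ws, launcher) == null) {
        Thread.sleep(100);         // poll until the durable task finishes
    }
    assertEquals(expected, new String(c.getOutput(ws, launcher), StandardCharsets.UTF_8));
}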

Member Author

(resolved)

@oleg-nenashev

@jglick Will defaultCharset() need to be invoked in other Durable Task implementations apart from Pipeline in order to retain the default behavior? IIUC yes. There are other durable task implementations, so it would need an explicit changelog entry.


jglick commented Apr 3, 2018

There are other durable task implementations

Well, they have a combined total of <200 installations, and those plugins are probably mistakes that should be deleted, but yes.

so it would need explicit changelog

Sure.

@Issue("JENKINS-31096")
@Test public void encoding() throws Exception {
JavaContainer container = dockerUbuntu.get();
DumbSlave s = new DumbSlave("docker", "/home/test", new SSHLauncher(container.ipBound(22), container.port(22), "test", "test", "", "-Dfile.encoding=ISO-8859-1"));
Member Author

Note that this test can run without jenkinsci/docker-fixtures#19 since it is only concerned with file contents, not name. (Passing -Dsun.jnu.encoding=… to the SSH launcher does not seem to work for that.)

@jglick jglick requested a review from dwnusbaum June 22, 2018 21:38

jglick commented Jun 26, 2018

In jenkinsci/jep#128 (comment) @svanoort asked what it would take to ensure that newly produced non-ASCII content from durable tasks is correctly transcoded into builds which were started using a non-UTF-8 character set but continued running across the upgrade. AFAICT FileMonitoringController.transcodingCharset and .maybeTranscode and (in #62) Watcher.run would need to know the correct target encoding. For Watcher.run, this information is available from TaskListener, but only as of Jenkins 2.91 (jenkinsci/jenkins#3122), and not even then from this context since it is ProtectedExternally. For other usages (writeLog, getOutput), we do not even have a TaskListener. So the target encoding would need to be injected explicitly, probably via the DurableTask.charset & .defaultCharset methods, with the caller (DurableTaskStep.Execution.start) looking up Run.getCharset when a Run is available. Doable, but it makes the code more complicated, and hardly seems worth the bother for a very minor bug affecting a small minority of users, and even then at most once.
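Concretely, the explicit injection would amount to something like this in the caller (a sketch of the idea only; the two-argument charset overload shown here is hypothetical, and this PR deliberately does not add it):

import hudson.model.Run;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import org.jenkinsci.plugins.durabletask.DurableTask;
import org.jenkinsci.plugins.workflow.steps.StepContext;

/** Hypothetical helper: choose the transcoding target from the build's historical charset, else UTF-8. */
static void configureTranscoding(DurableTask task, StepContext context, Charset processCharset) throws Exception {
    Run<?, ?> run = context.get(Run.class);
    Charset target = (run != null) ? run.getCharset() : StandardCharsets.UTF_8;
    task.charset(processCharset, target); // hypothetical overload; the real charset(Charset) takes only the source
}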

@svanoort svanoort left a comment

As far as I can tell this is fine -- I just need to double-check the interaction with the CountingOutputStream and the transcoding and then it'll be blessed.

Although I know we're going to have some blowback from the incompatible aspects of this change no matter what we do (so I want to release after the CountingOutputStream fix).

@jglick jglick left a comment

Possible bug noticed in #74 but actually applicable only here.

if (transcoded == null) {
sink.write(buf);
} else {
Channels.newChannel(sink).write(transcoded);
Member Author

This may write a number of bytes to the CountingOutputStream which is different from toRead (even when the full write succeeds). The transcoding probably needs to be done on the master side, which will be a little trickier since any defaultCharset call needs to happen on the agent side.

Member Author

Note that the bug, if there is one, is rendered moot by #62, which makes this dead code.

@svanoort svanoort left a comment

I think we do indeed need to handle differences in byte counts introduced by transcoding, or we may have some odd and complex-to-solve bugs here.

Thanks to Jesse for confirming that my understanding there is correct.


jglick commented Jul 17, 2018

I suspect so, but plan to try to reproduce such a problem in a test to confirm.

…transcoding must be done on the master side.
@jglick jglick dismissed svanoort’s stale review July 25, 2018 21:16

should be addressed

@jglick jglick requested a review from svanoort July 25, 2018 21:16

jglick commented Jul 25, 2018

Reproduced the suspected problem (some characters missing from output when doing transcoding) in the test and fixed it. I now suspect that #62 suffers from an analogous bug, but that is at least further down the road.

@svanoort svanoort left a comment

@jglick Correct me if I'm wrong here, but if we do the transcoding master-side, won't this run into issues with external log storage that expects all content in UTF-8?

I may be misinterpreting something here (reviewing these already takes longer than it ought to due to tracing all the transformations), but I think we can resolve this in the following way:

  1. Keep the CountingOutputStream on the remote and still report its byte counts.
  2. Do the transcoding on the remote agent (wrapping or wrapped by the CountingOutputStream) and still send UTF-8 back.

Alternatively, we can keep a count of remote byte offsets plus a local character or byte offset after transcoding.

Encodings are hard.
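A rough sketch of what options 1 and 2 could look like on the agent side (the helper name and wiring are hypothetical, and a partial multi-byte sequence at the end of a read would still need care): count raw bytes with the counting stream, transcode, and send only UTF-8 downstream.

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.input.CountingInputStream;

/** Transcodes agent-side while counting raw bytes, so the reported offset stays in pre-transcode bytes. */
static long sendTranscoded(InputStream rawLog, long startOffset, Charset nodeCharset, OutputStream sink) throws IOException {
    CountingInputStream counted = new CountingInputStream(rawLog);        // counts bytes of the original log
    Reader reader = new InputStreamReader(counted, nodeCharset);          // decode the node charset
    Writer writer = new OutputStreamWriter(sink, StandardCharsets.UTF_8); // send only UTF-8 back to the master
    char[] chunk = new char[8192];
    int n;
    while ((n = reader.read(chunk)) != -1) {
        writer.write(chunk, 0, n);
    }
    writer.flush();
    return startOffset + counted.getByteCount(); // the next read resumes at this raw-byte offset
}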


jglick commented Jul 31, 2018

won't this run into issues with external log storage

No, because as mentioned before (#61 (comment)), the affected code is no longer used (it is a deprecated code path) when external log storage is in use.

@jglick jglick requested a review from svanoort July 31, 2018 22:08
@jglick jglick changed the title from "[JENKINS-31096] Define API for gathering command output in a local encoding" to "[JEP-206] Define API for gathering command output in a local encoding" on Aug 6, 2018
@svanoort svanoort left a comment

I think it's okay after taking a look at that, but honestly I don't have 100% confidence, because there's such a maze of code changes in different places now and so many different plugin combinations possible.

On the plus side, we can trust that if this results in regressions, @jglick will make it his highest priority to resolve them.

@jglick @oleg-nenashev You guys absolutely MUST have test cases for character encoding situations around external log storage, though, if you don't already (enough to exercise UTF-8, UTF-16, and single-byte encodings well enough to generate mojibake if something is off in the encoding handling). Hopefully you already have that.

@svanoort svanoort merged commit e1ef73b into jenkinsci:master Aug 6, 2018

jglick commented Aug 7, 2018

test cases for character encoding situations around external log storage

The interesting logic is already covered by tests here; the external logging systems just assume they are getting UTF-8, and keep it that way. Anyway, as we develop “live” tests for such implementations we can include some sanity checks for encoding handling. Certainly it is a far lower priority since any problems would not be regression risks.

@jglick jglick deleted the UTF-8-JENKINS-31096 branch August 7, 2018 15:14