Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add progress fields to rsync-s parser output #612

Open
bugs181 opened this issue Nov 15, 2024 · 13 comments
Open

Add progress fields to rsync-s parser output #612

bugs181 opened this issue Nov 15, 2024 · 13 comments
Labels
enhancement New feature or request

Comments

@bugs181
Copy link

bugs181 commented Nov 15, 2024

Trying to use jc to parse the output from the following command and the command just hangs with no output. Not sure what's going on.

Command:
rsync --timeout=60 --progress --bwlimit=2000 --recursive --mkpath --partial --append -i --stats --rsh="/usr/bin/sshpass -fsecrets ssh -o StrictHostKeyChecking=no" sshuser@sshhost:/remotePath /localPath/ | jc --rsync-s

I've tried both rsync and rsync-s with no luck.
I added --itemize-changes and --stats, with the same result.

Simulating the output from rsync, there is no output from jc.

echo 'sending incremental file list

backup-sync-file.r02
    132,349,952  88%    2.45MB/s    0:00:07' | jc --rsync-s -p

This is the regex I was using in javascript to parse the progress.

const regexMatch = /\s+(([0-9,]?)+\s+)(?<progress>.+?(?=%))/
const matches = data.match(regexMatch)

if (!matches) {
  lastItem = data // backup-sync-file.r02
  return
}

if (matches && matches.groups) {
  this.downloading = true
  this.emit('progress', { item: lastItem, value: Number(matches.groups.progress) })
}

rsync version 3.3.0 protocol version 31
jc version: 1.25.3

Edit: After trying many things, there was ONE time I got this:

{"type":"file","filename":"backup-sync-file.r03","metadata":">f+++++++++","update_type":"file received","file_type":"file","checksum_or_value_different":null,"size_different":null,"modification_time_different":null,"permissions_different":null,"owner_different":null,"group_different":null,"acl_different":null,"extended_attribute_different":null}
@kellyjonbrazil
Copy link
Owner

Thanks for reporting this! Could you simplify the rsync output by removing some options like --progress? I've only tested with -i, -iv, -ivv, -ivvv, -ivvv --log-file=xxx, and -avh.

https://github.com/kellyjonbrazil/jc/blob/master/tests/test_rsync.py

I can probably extend the output support if you could send me a larger sample.

Thanks!

@bugs181
Copy link
Author

bugs181 commented Nov 16, 2024

Sure, no problem!

These are some of the options I tried it with.

All of these have --recursive --mkpath --rsh="/usr/bin/sshpass -fsecrets ssh -o StrictHostKeyChecking=no"

$ rsync -v

receiving incremental file list
backups/backup-sync-file.cfg
backups/backup-sync-file.r00
$ rsync --stats

no output
$ rsync --info=progress2

33,292,524   0%    6.55MB/s    0:22:00  xfr#2, to-chk=60/63
$ rsync --info=stats

no output
$ rsync -axHSEAXhh --info=stats

no output
$ rsync --info=progress2 --human-readable

8.74G  98% 8142.89GB/s    0:00:00 (xfr#60, to-chk=0/63)
$ rsync --info=progress2 --human-readable --progress

receiving incremental file list
              0   0%    0.00kB/s    0:00:00 (xfr#0, to-chk=62/63)
backups/backup-sync-file.r01
         19.60M   0%    5.82MB/s    0:24:49
$ rsync --itemize-changes

receiving incremental file list
              0   0%    0.00kB/s    0:00:00 (xfr#0, to-chk=62/63)
>f+++++++++ backups/backup-sync-file.r01
         35.72M   0%    4.66MB/s    0:30:58
$ rsync -v --human-readable --stats --human-readable --itemize-changes --progress --out-format='[%i] %-100n'

[>f+++++++++] backups/backup-sync-file.r01
         26.88M  18%    3.67MB/s    0:00:32
$ rsync -v --human-readable --stats --human-readable --itemize-changes --progress --out-format='%n'

receiving incremental file list
backups/backup-sync-file.r01
         62.44M  43%    8.43MB/s    0:00:09

The option I'd REALLY like to see working is primarily:

$ rsync --info=progress2 --human-readable --progress

receiving incremental file list
              0   0%    0.00kB/s    0:00:00 (xfr#0, to-chk=62/63)
backups/backup-sync-file.r01
         19.60M   0%    5.82MB/s    0:24:49

I believe the xfr#0 shows which file count you're currently on. And to-chk is about how many more files/dirs there are to check.

@kellyjonbrazil
Copy link
Owner

This one seems to work as designed:

$ rsync --itemize-changes | jc --rsync

receiving incremental file list
              0   0%    0.00kB/s    0:00:00 (xfr#0, to-chk=62/63)
>f+++++++++ backups/backup-sync-file.r01
         35.72M   0%    4.66MB/s    0:30:58
[{"summary":{},"files":[{"filename":"backups/backup-sync-file.r01","metadata":">f+++++++++","update_type":"file received","file_type":"file","checksum_or_value_different":null,"size_different":null,"modification_time_different":null,"permissions_different":null,"owner_different":null,"group_different":null,"acl_different":null,"extended_attribute_different":null}]}]

Instead of using the --info flag, you can use various levels of verbosity (e.g. -ivvv). Also --human-readable and --progress are not supported. I'm not sure what the use-case would be for those when the output is being converted to JSON. Is there data missing with the supported command options?

@bugs181
Copy link
Author

bugs181 commented Nov 18, 2024

This one seems to work as designed:

$ rsync --itemize-changes | jc --rsync

receiving incremental file list
              0   0%    0.00kB/s    0:00:00 (xfr#0, to-chk=62/63)
>f+++++++++ backups/backup-sync-file.r01
         35.72M   0%    4.66MB/s    0:30:58
[{"summary":{},"files":[{"filename":"backups/backup-sync-file.r01","metadata":">f+++++++++","update_type":"file received","file_type":"file","checksum_or_value_different":null,"size_different":null,"modification_time_different":null,"permissions_different":null,"owner_different":null,"group_different":null,"acl_different":null,"extended_attribute_different":null}]}]

Instead of using the --info flag, you can use various levels of verbosity (e.g. -ivvv). Also --human-readable and --progress are not supported. I'm not sure what the use-case would be for those when the output is being converted to JSON. Is there data missing with the supported command options?

Sorry if I wasn't clear! I'm not opposed to using ANY or none of the flags. They're meaningless to me. My primary complaint is that the JSON output is incomplete.

Ideally, what we would be able to get are progress markers. I'd like to see these:

  • files complete / out of files checked
  • Current file being worked on
  • Progress of the current file
  • Progress of entire rsync command

Hope that clarifies the issue. My example although incomplete at least gives you current file being worked on and progress of current file.

@kellyjonbrazil
Copy link
Owner

Gotcha - I'm not sure what the use case is here, though.

Progress info is typically for human consumption, right? JSON is typically to hand off to another process. Why would you want to encode the progress into the serialized JSON? What would you do with that information in an automated fashion?

@bugs181
Copy link
Author

bugs181 commented Nov 18, 2024

Gotcha - I'm not sure what the use case is here, though.

Progress info is typically for human consumption, right? JSON is typically to hand off to another process. Why would you want to encode the progress into the serialized JSON? What would you do with that information in an automated fashion?

Perhaps I misunderstood the purpose of this tool. My understanding was that jc intends to take output from commands, and allow them to output meaningful standardized data (i.e JSON). I wasn't aware that jc wasn't meant to be used real-time and more for static information. I'd suggest that's a jc limitation if that isn't how it can operate. I was under the assumption it follows the *nix philosophy of pipe data in -> pipe data out. Which can always be streamed.

For me, i'd then use that JSON to present to the user in a webpage UI. As it stands, I'm doing all of the processing without the use of this tool. I'd prefer to use the right tool for the job - especially when tooling can output various different ways depending on platforms.

Here's an example of how I'm currently using rsync

    let lastItem
    child.stdout.on('data', (dataObj) => {
      const data = dataObj.toString()
      showDebug && console.log(`stdout: ${data}`)

      if (typeof data !== 'string')
        return

      // FIXME:
      /* rsync: [Receiver] failed to connect to ... (...): Network unreachable (101)
          rsync error: error in socket IO (code 10) at clientserver.c(138) [Receiver=3.2.4]
      */

      if (data.includes('error:')) {
        showDebug && console.log(`error: ${data}`)
        return this.error(data)
      }

      // https://regex101.com/
      // https://javascript.info/regexp-groups

      // Match rsync --progress output. Example: '\r     22,347,776   2%  193.51kB/s    1:27:09  '
      const regexMatch = /\s+(([0-9,]?)+\s+)(?<progress>.+?(?=%))/
      const matches = data.match(regexMatch)

      // TODO: Grab ETA download time, even if we can't use it, another UI might be able to. /stats
      // TODO: Grab speed also

      if (!matches) {
        lastItem = data
        return this.emit('log', data)
      }

      if (matches && matches.groups) {
        this.downloading = true
        this.emit('progress', { item: lastItem, value: Number(matches.groups.progress) })
      }
    })

What I'd ideally be able to do is pipe `jc` and refactor my code into something like this:
child.stdout.on('data', (jcOutput) => {
   const jc = JSON.parse(jcOutput)

   // Return progress result to user:
   event.emit('progress', { 
      file: jc.filename, 
      progress: jc.progress,
      totalFiles: jc.totalChunks,
      totalProgress: jc.totalProgress,
   })
})

@kellyjonbrazil
Copy link
Owner

kellyjonbrazil commented Nov 18, 2024

Standard jc parsers could not be used for this purpose because the final output of the input is converted in one go. By the time any progress information has been converted to JSON it is too late, so we don't typically convert such "interactive" output to JSON.

It is possible to do with the Streaming parsers, like --rsync-s since they output JSON Lines in realtime. It sounds like the request would be to add the progress fields to the --rsync-s (Streaming) parser so they could be presented downstream in a GUI.

@bugs181
Copy link
Author

bugs181 commented Nov 18, 2024

You would be correct. If you reference my initial post, you'll see I was attempting to use jc --rsync-s -p. If the only thing rsync-s would output would be the filename, then what would the point of that be? I'd imagine the majority of users wanting more progress information from a streamable.

@kellyjonbrazil
Copy link
Owner

kellyjonbrazil commented Nov 18, 2024

The point of the streaming parsers is to reduce memory consumption. If you are rsyncing millions of files, bundling those all up in a single JSON array could take Gigs of RAM. By streaming, the conversion process only takes Kilobytes, no matter how many files are synced. The end-result would be preserved as a record or saved to a database, etc. More of a batch or automated/non-interactive process.

This is an interesting new use-case. Seems like it wouldn't be too difficult to add the progress fields into the --rsync-s parser.

@bugs181
Copy link
Author

bugs181 commented Nov 18, 2024

I believe there's been a misunderstanding. All I'm wanting are progress markers. Not an entire file list. It appears your suggestion to add "progress fields" to the parser is what I'm asking for.

Edit:
I did find more information about how to-chk works.

to-chk or to-check gives you the number of files still to be checked, e.g. 0/1 means 0 of a total of 1 files in the queue still need to be checked.

During a normal transfer with, let's say 42 files, to-check will basically count down from 42 to 0 (to-check=38/42…to-check=2/42) until rsync is done syncing.

Note that for large amounts of files, the last number in to-check may also grow since rsync doesn't queue all files at once (to-check=38/42...to-check=2/56). That's because it uses a recursion algorithm:

Source

xfr field refers to the number of files that have been currently transferred for this run of rsync.

I hope that helps!

@kellyjonbrazil kellyjonbrazil added the enhancement New feature or request label Nov 18, 2024
@kellyjonbrazil kellyjonbrazil changed the title Rsync streaming output hangs Add progress fields to rsync-s parser output Nov 18, 2024
@kellyjonbrazil
Copy link
Owner

I was looking into how --progress and info=progress2 work and came across this video:

https://www.youtube.com/watch?v=-eELDwqIyGg

It looks like I can parse the progress info from --progress, but I won't be able to get the information from info=progress2 since that is not streaming the information line by line.

@kellyjonbrazil
Copy link
Owner

Unfortunately I can see from this video that even the --progress flag makes the progress line continually update during each file copy.

Some of the data will be useless, like % copied (will always be 100%) and ETA (will always be 0:00:00), but it looks like I could grab the transferred bytes, throughput speed, xfer#, and to-check data after the linefeed character is output.

https://www.youtube.com/watch?v=Pygr_TpZRpM

@bugs181
Copy link
Author

bugs181 commented Nov 20, 2024

Out of curiosity, are you sure it's not "line by line"? I think the terminal operator is presenting it as the same line - probably using some sort of terminal modifier ascii. In nodejs (javascript) land, stdout streams that line again and again and that's where the processing takes place.

Example:

// For  rsync --progress --partial --append
child.stdout.on('data', (dataObj) => {
  const data = dataObj.toString()

  // Output: 14,581,760   0%    2.20MB/s    0:13:22 
})

This regex:

const regexMatch = /\s+(([0-9,]?)+\s+)(?<progress>.+?(?=%))/
// Will correctly retrieve the progress percentage.  

I can play with the regex to retrieve the other information if you'd like some help with regex matching.
As for how to programmatically getting the stream and processing it, I'm not familiar enough with python to help there. I would assume that piping rsync into something else would likewise give you that line stream, but I could be wrong.

Looking at this stack question, I can see that they're piping rsync into zenity and doing some progress processing, similar to what you're trying to do. Source

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants