
Conversation

@gsaksena
Contributor

Pull Request Template

Describe your changes here.

Style Checklist

Please ensure that your pull request meets the following standards for quality.
Code should not be merged into the master branch until all of these criteria have been satisfied.

Comments

  • Each source file includes comments at the top describing its purpose
  • Each function includes a comment/docstring describing inputs and outputs, and any assumptions it makes
  • Variable and function names have semantic meaning, and are not reused with a different meaning within the same scope
  • “Magic” numbers, such as the index of a particular column, have a comment describing what they represent, or are declared as a global constant with a semantic name (e.g. TCGA_ID_COL = 16)
  • Commented-out code is removed

Style/Execution

  • Code contains no hard-coded paths
  • Code contains appropriate logging and/or debugging output
  • If possible, input data is validated early in the execution. If not, errors are sufficiently detailed to aid debugging.
  • Code uses a library (e.g. optparse, argparse) for command-line parsing

@gsaksena
Contributor Author

Items in the flow chart that are not in the code:

  1. doing the atomic file writes for the key metadata files, and renaming them to .error if needed.
  2. adding download/dice date to metadata file.

    pass
else:
    # other strings such as <yyyy-mm-dd>, 'latest', valid variable names, and everything else are not allowed
    raise ValueError("For gdc_mirror, date must be blank or 'pool'")
Collaborator

I thought the plan was to allow user-defined tags, and that anything besides 'latest' or a datestamp would work like 'pool'

Contributor Author
@gsaksena Oct 28, 2017

Dropped other user-defined tags because they are not needed to meet the requirement (YAGNI). They are easy to add later, and leaving them out makes the code simpler and easier to document today.
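
For context, a hedged sketch of the full check implied by the snippet at the head of this thread (the function name and structure are illustrative, not the actual GDCtools code):

def validate_mirror_datestamp(datestamp):
    # gdc_mirror accepts only a blank datestamp or the literal tag 'pool'
    if not datestamp:
        pass
    elif datestamp == 'pool':
        pass
    else:
        # other strings such as <yyyy-mm-dd>, 'latest', valid variable names,
        # and everything else are not allowed
        raise ValueError("For gdc_mirror, date must be blank or 'pool'")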

                 help='Download files even if already mirrored locally.' +
                      ' (DO NOT use during incremental mirroring)')
cli.add_argument('--append', default=False, action='store_true',
                 help='')
Collaborator

Please add a help description for --append

Collaborator

Also, when action is set to 'store_true', the default is automatically set to False.
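
To illustrate, a minimal sketch of that argparse behavior (the --append flag name comes from the snippet above; the help wording is a placeholder, not the text that was eventually added):

import argparse

cli = argparse.ArgumentParser()
# With action='store_true', argparse already defaults the value to False,
# so the explicit default=False in the snippet above is redundant.
cli.add_argument('--append', action='store_true',
                 help='Placeholder help text for --append.')

args = cli.parse_args([])            # flag omitted  -> args.append is False
assert args.append is False
args = cli.parse_args(['--append'])  # flag supplied -> args.append is True
assert args.append is True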

Contributor Author

Yes, docs in general.


@staticmethod
def __merge_mirror_metadata(old, new):
    pass
Collaborator

please remove the pass statement.

Contributor Author

yes

for file_item in old:
    key = file_item['file_id'] + file_item['md5sum']
    file_item_dict[key] = file_item

Collaborator

This really feels like it should be a dict comprehension:

file_item_dict = {(file_item['file_id'] + file_item['md5sum']): file_item for file_item in old}

Contributor Author
@gsaksena Oct 28, 2017

Could be. I prefer the idiom I used.

for file_item in new:
    key = file_item['file_id'] + file_item['md5sum']
    # TODO should there be a check whether the rest of the metadata matches?
    file_item_dict[key] = file_item
Collaborator

file_item_dict.update({(file_item['file_id'] + file_item['md5sum']): file_item for file_item in new})

The reason is that comprehensions are quite a bit faster than explicit loops. See https://stackoverflow.com/questions/16341775/what-is-the-advantage-of-a-list-comprehension-over-a-for-loop

Contributor Author

Clarity is more valuable than performance here.

for key in file_item_keys:
    file_item_list.append(file_item_dict[key])
return file_item_list

Collaborator

return [file_item_dict[key] for key in sorted(file_item_dict.keys())]

Collaborator

How does this handle when there is a new version of a file? Should these keys really include the md5sum, or should they just be filenames?

Contributor Author

If there is a new version of the file, there will definitely be a new md5sum, and if the GDC does their job right there will also be a new file_id. The sorting is just to keep the file easier for humans to work with; the key could also include filename, datatype, etc, in whatever order, in addition to the file_id and md5sum to force things to be unique.

If the same file is downloaded twice, the metadata from the second download will be recorded.

If the file changes, both versions of the file will be kept, and the onus is on the loadfile generation to either select the right one or complain. I haven't checked this case.
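
Pulling together the fragments quoted in this thread, a minimal sketch of the merge being described, written here as a free function (in the PR it is a private static method) and assuming old and new are lists of GDC file dicts:

def merge_mirror_metadata(old, new):
    """Merge two lists of GDC file dicts, keyed on file_id + md5sum.

    Entries from `new` win when both lists contain the same key; the output
    is sorted by key only to keep the metadata file stable for human readers.
    """
    file_item_dict = {}
    for file_item in old:
        key = file_item['file_id'] + file_item['md5sum']
        file_item_dict[key] = file_item
    for file_item in new:
        key = file_item['file_id'] + file_item['md5sum']
        # TODO: should there be a check whether the rest of the metadata matches?
        file_item_dict[key] = file_item
    return [file_item_dict[key] for key in sorted(file_item_dict)]

# e.g. a re-downloaded file (same file_id/md5sum) is recorded once, with the
# metadata from the second download:
old = [{'file_id': 'a1', 'md5sum': 'ff01', 'file_name': 'x.maf'}]
new = [{'file_id': 'a1', 'md5sum': 'ff01', 'file_name': 'x.maf'},
       {'file_id': 'b2', 'md5sum': 'ff02', 'file_name': 'y.maf'}]
assert len(merge_mirror_metadata(old, new)) == 2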

Collaborator

The loadfile generator checks for duplicate samples, but this will create duplicate aliquots, in which case one will simply be selected semi-randomly (the old method was to select the highest portion/plate value, which in these cases will all be the same), unless the replicate filter calls that an error (I haven't checked). Perhaps the keys should be the aliquot IDs; that way the new entry will replace the old, which happens naturally in the normal mode of operations.

Collaborator

The poorly named lib.meta.tcga_id function can be used to pull the proper ID. To handle clinical, the 'data_category' field could be used in conjunction with the ID to create unique keys.
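
A hedged sketch of the alternative keying proposed here, assuming (as elsewhere in this PR) that meta.tcga_id(file_dict) returns the entity barcode and that file dicts carry a 'data_category' field:

def merge_key(file_dict):
    # Keying on entity ID + data category means a re-diced aliquot replaces
    # its older entry rather than accumulating as a duplicate.
    # `meta` refers to the project's lib.meta module, as used in the dicer.
    return (meta.tcga_id(file_dict), file_dict.get('data_category', ''))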

Contributor

This appears to be an unresolved concern: duplicates can either slip through or be chosen arbitrarily. In master this concern does not exist, so this is grounds for not passing review.

@gsaksena
Contributor Author

I've checked in more code to address some of the comments.

I also fixed new issue #68 (adding new MAF datatypes), and added a better error message to make adding new datatypes easier in the future.

In addition, there is now some test coverage for the new features.

@gsaksena
Contributor Author

Issue #47 and issue #53 are being addressed in this branch.

@dheiman
Collaborator

dheiman commented Oct 30, 2017

Items in the flow chart that are not in the code:

  1. doing the atomic file writes for the key metadata files, and renaming them to .error if needed.
  2. adding download/dice date to metadata file.

Are these going to be put in the code?

@gsaksena
Contributor Author

Items in the flow chart that are not in the code:

  1. doing the atomic file writes for the key metadata files, and renaming them to .error if needed.
  2. adding download/dice date to metadata file.

Are these going to be put in the code?

No for #2. #1 is a separable feature that I'll get to or log as an issue depending on how other things go.

@gsaksena
Contributor Author

It would be better to file the .error renaming etc stuff as a new, separate issue. It has the potential to disrupt people's workflows.

        break
else:
    # TODO make this error message more helpful
    raise ValueError("Looking for latest datestamp, but no datestamps found in correct format")
Collaborator

This actually means that there are no dates whatsoever
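
For readers unfamiliar with Python's for/else, the else branch above runs only when the loop completes without hitting break. A hedged sketch of the pattern, with a more specific error message along the lines suggested here (directory layout and names are illustrative):

import os
import re

DATESTAMP_RE = re.compile(r'^\d{4}_\d{2}_\d{2}$')   # e.g. 2018_01_19

def latest_datestamp(mirror_root):
    for name in sorted(os.listdir(mirror_root), reverse=True):
        if DATESTAMP_RE.match(name):
            latest = name
            break
    else:
        # Reached only if the loop never hit break, i.e. there are no
        # datestamped directories at all under mirror_root.
        raise ValueError("Looking for latest datestamp, but no YYYY_MM_DD "
                         "directories were found under %s" % mirror_root)
    return latest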

# Append mode has been requested in config file, coerce to boolean
# TODO also validate legal false values
value = config.mirror.append.lower()
config.mirror.append = (value in ["1", "true", "on", "yes"])
Collaborator

Should probably throw an error if it's an unknown false value
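
A minimal sketch of the stricter coercion being suggested (the recognized strings follow the snippet above; the function name and everything else are illustrative):

def coerce_bool(value, option_name):
    # Coerce a config string to a boolean, rejecting unrecognized values
    # instead of silently treating them as False.
    value = value.strip().lower()
    if value in ("1", "true", "on", "yes"):
        return True
    if value in ("0", "false", "off", "no", ""):
        return False
    raise ValueError("Invalid boolean value for %s: %r" % (option_name, value))

# e.g.  config.mirror.append = coerce_bool(config.mirror.append, "mirror.append")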

@noblem
Contributor

noblem commented Jan 16, 2018

Gentlemen, what's the status of this branch and PR? I still see a handful or two of "TODO" statements in the code, some of which have Gordon's signature. I'm revisiting GDCtools because I'd like to add the counts data that it generates (after mirror, dicing, etc.) into a database.

@gsaksena
Contributor Author

gsaksena commented Jan 16, 2018 via email

@noblem
Contributor

noblem commented Jan 17, 2018

I see one issue:

Test that dicing throws error when mirror is corrupted

gdc_mirror.py: error: argument --config: can't open 'corrupt_mirror.cfg': [Errno 2] No such file or directory: 'corrupt_mirror.cfg'

Looks like the new cfg file was not checked in to your branch?

@noblem
Contributor

noblem commented Jan 17, 2018

While you're at it: the pool1.cfg file is also missing

@noblem
Contributor

noblem commented Jan 17, 2018

Gordon, would you mind throwing those 2 files "over the fence" and into the repo? Thanks!

@noblem
Contributor

noblem commented Jan 17, 2018

Alright, I've cleaned this up a wee bit and verified that the new tests indeed pass. As a final step I'd like to meet with you at 1pm tomorrow, David (as the primary reviewer) to discuss a few points before giving this a thumbs up for merging to master.

@gsaksena
Contributor Author

gsaksena commented Jan 17, 2018 via email

@noblem
Contributor

noblem commented Jan 17, 2018

Thanks Gordon, I chose the 1pm time to fit with David's schedule ... I suspect we'll be fine if you cannot attend.

Contributor
@noblem left a comment

We went through the code today with a pretty fine-toothed comb, and while we liked several of the implementation ideas & code snippets, others gave cause for concern. We're going to meet again tomorrow to finalize, and probably take bits and pieces of the branch back to master instead of taking the entire thing in one gulp.

dice_one_status = 'cached'

-    append_diced_metadata(file_dict, expected_paths,
+    duplicate_detected = append_diced_metadata(file_dict, expected_paths,
Contributor

This seems to return the opposite of what the spirit of the call suggests, by returning True for the bad condition, i.e. when the append failed (due to having a duplicate); this prevents the code from passing review.

# Write row with csv.DictWriter.writerow()
rowdict.update({
    'case_id'      : meta.case_id(file_dict),
    'tcga_barcode' : meta.tcga_id(file_dict),
Contributor

I recognize that this is old-ish code, but it should nevertheless aim to be project-agnostic rather than explicitly referencing TCGA.


# move completed metadata file to final location
# note it may include diced files that had errors
os.rename(diced_meta_file_partial, diced_meta_file)
Contributor
@noblem Jan 18, 2018

This code (and the comment directly above) should not execute when diced files had errors; by definition, the file should remain .partial; this is grounds for not passing review.

Seems like all that's needed is:

if 'error' not in prog_status_tally:
    os.rename(...)

Contributor Author

With the code as-is, a failure will cause attempts at running downstream steps to also fail. With your suggested change, the downstream step will pass but silently use the wrong (old) data.

Contributor
@noblem Jan 18, 2018

Thanks for explaining, but I don't understand the logic/idea: a .partial is supposed to indicate a failure state, and the absence of .partial represents a success state. But you seem to be saying the opposite. Downstream steps that observe upstream .partial (failure state) should fail immediately, no? How is the current implementation "detecting failure," then, from an apparently successful state (as indicated by the absence of .partial)?

Contributor Author
@gsaksena Jan 18, 2018

Downstream steps look only for the main output file, not .partial, .error, etc. This respects encapsulation. (Also, I was getting dizzy when considering how to correctly and intuitively handle all the permutations of various files existing vs. not.) A .partial is currently being used just to ensure that downstream steps never get a file that has been control-C'ed in the middle of being written. If some files are listed that do not exist, you will get a nonzero exit code, and the next step will also fail with a nonzero exit code.
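
A minimal sketch of the .partial convention described in this comment (file and function names are illustrative): downstream steps look only for the final name, and the rename is what makes the metadata appear atomically.

import json
import os

def write_metadata_atomically(metadata, final_path):
    # Write to a .partial file first; a downstream step that looks only for
    # final_path can never observe a half-written (e.g. Ctrl-C'd) file.
    partial_path = final_path + '.partial'
    with open(partial_path, 'w') as fh:
        json.dump(metadata, fh, indent=2)
    # os.rename is atomic when source and destination are on the same
    # filesystem, so the final file appears all at once or not at all.
    os.rename(partial_path, final_path)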

Contributor

So, in the case cited here, the metadata .partial is just atomicity for file-level operations, so we know if a catastrophic event occurred (like running out of disk space, or CTRL-C)?

Not quite sure what you're getting at about how not looking for .partial "respects encapsulation"?

But earlier you said that removing .partial ONLY when all files have diced causes downstream steps to "pass" by silently using the wrong (old) data? But if the only data that actually passed through the dicer is "old", then isn't that the correct thing to do?

Can you explain what a downstream failure looks like, then, if we keep the code "as is"? Does the gdc_loadfile (a) refuse to launch at all OR (b) launch and process everything it can, but skip the files that failed to dice (and flag them with warning messages, etc)?

Contributor
@noblem Jan 19, 2018

Actually, allowing the loadfile to be created when the underlying data is not there makes the entire situation worse: it will not be caught when gsutil is used to push data into the cloud (because gsutil acts like rsync, and doesn't know what should or shouldn't be there), but rather will cause the merge pipelines to mysteriously fail (and this is a vastly harder situation to debug).

Moreover, each of the tools output what version of data they're operating upon, e.g. from dicer log

2018-01-19 01:08:52,229[INFO]: Mirror date: 2018_01_19

so I don't see the problem with charging ahead on old data--one is clearly notified-- and if one chooses to not pay attention to that notification then ... well, one is not being a careful analyst.

But charging ahead on old data would be prevented entirely by using .error (or .partial) on all files (data and metadata/json), in the sense that if gdc_loadfile sees gdc_dice.error then it complains and stops; if gdc_dice sees gdc_mirror.error, it complains and stops; and so on. What am I missing here? You want to avoid users making inadvertent mistakes, and this is a way to help that...right? The only way to remedy a .error file is to remove it and attempt to re-do the step which errored. And if you don't want to clear the error, then your only option is to manually specify an older datestamp (or tag), in which case you are not confused about what data is being used.

Contributor Author

This failing case would be better if gdc_loadfile checked for the existence of every file before moving forward, but that would make the common (passing) case much slower.

In the end it is a matter of tradeoffs. Streamline and bulletproof the use cases you care about most, and for the less central use cases try not to unpleasantly surprise the user.

Contributor

We may be talking past one another, or there are (unstated?) assumptions lurking in our perspectives. In my view, if--when an error occurs--gdc_dice were to output metadata.json.error instead of metadata.json, then downstream tools would not have to do anything at all; they would just exit immediately. You seem to be opposed to this, and suggest that it's "better" to check every file, but I don't understand why. If there is no .error suffix then gdc_loadfile should not have to check any files, right? The absence of .error from gdc_dice is the contract to gdc_loadfile that everything within the metadata.json indeed exists ... as gdc_dice will have already iterated over the entire suite of files (by necessity).

Contributor Author

  1. With a .error file, we would need to handle the case where it coexists with a (no special extension) pass file and/or a .partial file... we don't have full control of the file system, and this case will arise somewhere, e.g. due to sloppy snapshot restores, if not from some race-condition bug in our own code. We want to make sure that at least we don't do the wrong thing, and we would want to have regression tests around it or the feature will surely regress. Seems hard to make reliable.

  2. With a .error file, or with a variant where the error condition is flagged by an extra row or column in the (no special extension) file, what is the user supposed to do with it? We aren't expecting the user to go in and look at or (shudder) modify the metadata file. There are no proposals to be able to -force the error file to be applied. Once the issue is fixed, a .error or an error-marked (no special extension) file is still worthless, only waiting around to be redone.

  3. The merger should reliably gripe if files are missing, either at the localization stage (if on Firecloud) or runtime (if running locally). The remedy (using the code in this PR) would be to fix the dicing somehow, upload the missing files, and move forward with the same loadfiles.

If you address issue 1, making the system logic sufficiently reliable for your satisfaction, I have no real problem with using .error files.

Contributor
@noblem Jan 24, 2018

Thank you, Gordon. Using 'date' as synonymous with 'tag' ...

  1. If a date.error file of any kind exists then downstream tools which would consume that respective file cannot use that date. Period. The tool will object to loading such a file and abort (see the sketch after this list), so the user must either (a) clear that error (wherever it may occur upstream) or (b) try an earlier date. In either case, the user has full knowledge that a problem exists and full control over the remedy.
  2. See answer to (1)
  3. Loadfiles should not be generated that cannot work. Period. This is a straightforward example of the principle that it saves time & effort to ascertain errors as far upstream as possible, and fail, rather than generate unusable results and punt them downstream where the problem is even harder to debug (e.g. gdc_dice or gdc_loadfile, versus FireCloud merger). Of course a merger should fail if a file doesn't exist; while that should be an almost impossible situation (e.g. gsutil lies that a file has been uploaded, when it hasn't), we should nevertheless guard against it in the merger code. What we should NOT do, though, is casually allow such a condition to arise by generating unusable load files; that situation can be easily identified--during GDCtools processing, where such problems would be catalyzed--and is more effectively diagnosed and remedied there instead of downstream where there are more layers of tech stack to complicate debugging.
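
A hedged sketch of the contract described in item 1 (the sentinel file naming is illustrative; GDCtools may spell it differently):

import os

def require_upstream_ok(metadata_path):
    # Refuse to run if the upstream step left an .error or .partial sentinel;
    # the user must clear the error upstream or choose an earlier date/tag.
    for suffix in ('.error', '.partial'):
        sentinel = metadata_path + suffix
        if os.path.exists(sentinel):
            raise SystemExit("Upstream step failed or is incomplete: %s exists" % sentinel)
    if not os.path.exists(metadata_path):
        raise SystemExit("Upstream metadata not found: %s" % metadata_path)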

                  diced_project_root, mfw,
                  dry_run=self.options.dry_run,
                  force=self.force)
prog_status_tally[dice_one_status] += 1
Contributor

I do like the tallying
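
For context, the tallying pattern being praised accumulates into a collections.Counter, roughly like this ('cached' and 'error' appear elsewhere in the diff; the other status names are illustrative):

import collections

prog_status_tally = collections.Counter()
for dice_one_status in ('diced', 'cached', 'cached', 'error'):
    prog_status_tally[dice_one_status] += 1

# Counter({'cached': 2, 'diced': 1, 'error': 1}); supports summary logging and
# checks such as:  'error' not in prog_status_tally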


# Mirror each category separately, recording metadata (file dicts)
file_metadata = []
proj_status_tally = collections.Counter()
Contributor

👍 like the tallying!

@gsaksena
Contributor Author

gsaksena commented Jan 18, 2018 via email
