
Improve IO Abstractions #718

@inishchith

Description

Summary

The task is to refactor the inputs and outputs directories into a single directory called io.

Basic Example

  • Everything lives inside io, with a single file per format: json.py, parquet.py, and so on.
  • Each file has a Reader and a Writer class, e.g. JsonReader / JsonWriter.
  • All readers and writers respect the parent class signature and have very simple usage, with decisions abstracted away: every reader has read() and every writer has write(); everything else is a private or utility method. Any extra tuning parameters are keyword arguments.
  • Imports become very simple, e.g. from application_sdk.io.json import JsonWriter, and all usages look like json_reader.read() or json_writer.write().
  • close() performs cleanup, uploads to the object store, and returns statistics.
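The proposal above can be sketched as a pair of small base classes plus one concrete pair per format. This is a minimal illustration only, not the actual application_sdk implementation: the class and method names come from the bullets above, but the plain-`json` bodies, the `path` constructor argument, and the `records_written` statistic are hypothetical stand-ins (a real implementation would return pandas/daft dataframes and upload to the object store on close()).

```python
import json
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable

class Reader(ABC):
    """Every reader exposes read(); extra tuning goes through kwargs."""

    def __init__(self, path: str, **kwargs: Any):
        self.path = path
        self._options = kwargs  # keyword-only tuning, per the proposal

    @abstractmethod
    def read(self) -> Any:
        """Return the data; the engine choice is abstracted away."""

class Writer(ABC):
    """Every writer exposes write() and close()."""

    def __init__(self, path: str, **kwargs: Any):
        self.path = path
        self._options = kwargs
        self._records_written = 0

    @abstractmethod
    def write(self, data: Any) -> None: ...

    def close(self) -> Dict[str, int]:
        # A real close() would also clean up and upload to the object store;
        # here it only returns illustrative statistics.
        return {"records_written": self._records_written}

class JsonReader(Reader):
    def read(self) -> Any:
        with open(self.path) as f:
            return json.load(f)

class JsonWriter(Writer):
    def write(self, data: Iterable[Dict[str, Any]]) -> None:
        records = list(data)
        with open(self.path, "w") as f:
            json.dump(records, f)
        self._records_written += len(records)
```

With this shape, callers never pick a method variant: `JsonWriter(path).write(rows)` then `close()`, and `JsonReader(path).read()`, regardless of format or engine.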

Motivation

  • We currently have two directories, inputs and outputs, each with files like json.py, parquet.py, and iceberg.py.
  • Inputs have methods like get_dataframe and get_daft_dataframe; some also have get_batched_dataframe and get_batched_daft_dataframe.
  • None of these really respect the method signatures of the parent/base class.
  • Imports look like from application_sdk.outputs.json import JsonOutput and usage looks like json_output.write_batched_daft_dataframe().

It is genuinely confusing which method to use when, and why the caller should have to choose between daft and pandas at all.

Drawbacks

open

Unresolved Questions

open

Reference Issues / PRs

A draft PR to work from: #715.


Labels: enhancement (New feature or request)
