Parquet output format #29

Open
igorbrigadir opened this issue Aug 20, 2021 · 5 comments
Labels
enhancement New feature or request

Comments

@igorbrigadir
Collaborator

Instead of CSVs, append the parsed dataframes to parquet https://stackoverflow.com/a/47839247/11090908

@edsu
Member

edsu commented Aug 20, 2021

Being able to output as parquet would be nice too--even if it's called twarc-csv :-)

@igorbrigadir
Collaborator Author

Yeah I'm actually considering a different command as an alias, just for it to make semantic sense / good docs, so these would be the same:

twarc2 dataframe --output-format parquet input.json output.parquet

twarc2 csv --output-format parquet input.json output.parquet

But not sure how useful that is. It'll purely be an alias for a docs entry and for the command line.

@edsu
Member

edsu commented Aug 21, 2021

I was going to say that pandas has many output formats. It might not be hard to add parquet, pickle, hdf, sql, excel, json, html, feather, latex, stata, gbq, markdown, ... :-) but like you said, figuring out the api is the hard part.
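One possible shape for that, sketched under assumptions: a lookup table from a format name to the pandas `DataFrame` writer method that takes a file path. The names `WRITERS` and `write_dataframe` are hypothetical, and only path-based writers are listed, since `to_sql` and `to_gbq` need a connection or table name rather than an output file.

```python
# Hypothetical mapping from an output format name to the pandas
# DataFrame method that writes it to a file path.
WRITERS = {
    "csv": "to_csv",
    "parquet": "to_parquet",
    "pickle": "to_pickle",
    "json": "to_json",
    "feather": "to_feather",
    "excel": "to_excel",
    "stata": "to_stata",
}


def write_dataframe(df, output_format, path, **kwargs):
    """Write df to path using the writer registered for output_format."""
    try:
        method_name = WRITERS[output_format]
    except KeyError:
        raise ValueError(f"unsupported output format: {output_format!r}") from None
    # Look the method up on the DataFrame and pass extra options through.
    return getattr(df, method_name)(path, **kwargs)
```

Extra keyword arguments are forwarded unchanged, so format-specific options would still need to be exposed somehow on the command line.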

@igorbrigadir
Collaborator Author

Yeah - still figuring out that part!

@igorbrigadir
Collaborator Author

Still haven't figured this out, but for now, you can use DataFrameConverter to get a pandas DataFrame object which you can convert yourself. I'll keep this open for implementing the actual command later.

Maybe an alias?

twarc2 dataframe input.jsonl output.parquet

or

twarc2 dataframe --output-format parquet input.jsonl output.parquet

or

twarc2 csv --output-format parquet input.jsonl output.parquet
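The three variants differ mainly in whether the format comes from an explicit `--output-format` value or from the output file name. A minimal sketch of one way to reconcile them, assuming an explicit value wins and the extension is the fallback; `resolve_output_format` and the extension table are hypothetical and not part of twarc-csv:

```python
from pathlib import Path

# Hypothetical mapping from file extension to output format name.
EXTENSION_FORMATS = {
    ".csv": "csv",
    ".parquet": "parquet",
    ".json": "json",
    ".pkl": "pickle",
    ".feather": "feather",
}


def resolve_output_format(output_path, output_format=None, default="csv"):
    """Pick the output format for a command invocation.

    An explicit output_format takes priority; otherwise the format is
    inferred from the output file extension, falling back to default.
    """
    if output_format:
        return output_format.lower()
    suffix = Path(output_path).suffix.lower()
    return EXTENSION_FORMATS.get(suffix, default)
```

With this rule, `resolve_output_format("output.parquet")` and `resolve_output_format("output.parquet", "parquet")` give the same result, which is what would make the first two variants interchangeable.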

@igorbrigadir igorbrigadir added the enhancement New feature or request label Nov 24, 2021