pyarrow roadmap #43
Storing the tokenlists internally as a pyarrow table might eventually make sense. I still want the library to connect to people's pandas workflows on the front end, like you suggest, but I'm down for tweaking the internals. When we return to this, it might also be worth considering how the development of pyarrow drives pandas, and whether any parts of the library should use …
Been flirting with the idea of a parquet-dataset backend... Just add 'id', 'library', and 'stubbycode' as dummy columns to the existing parquet files (which, because of run-length encoding, would take up trivial additional space on disk, though you'd want to be careful not to read them in), and then the existing stubbytree structure matches one of the pyarrow ParquetDataset layouts (not sure if it's Hive-compliant), and you could address any corpus as a whole, out of memory. Not sure what the advantage would be, but it would certainly make easy some things I like to do, such as "find every page number that has 'evolution' on it."
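As a rough sketch of that idea (the path and column names are placeholders, assuming 'id', 'library', and 'stubbycode' are stored as dummy columns in each file, not this library's actual layout):

```python
import pyarrow.dataset as ds

# Treat the whole stubbytree as one dataset of per-volume parquet files.
corpus = ds.dataset("stubbytree/", format="parquet")

# "Find every page number that has 'evolution' on it", corpus-wide, out of core.
pages = corpus.to_table(
    columns=["id", "page"],
    filter=ds.field("token") == "evolution",
)
```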
Yup, that's the intent. The standard filters have been partition-wide for both fastparquet and pyarrow, but the row-wise filters are awesome, and I wonder why pandas doesn't have them yet (maybe because they're experimental?).
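For reference, a hedged sketch of those row-wise filters through pyarrow.parquet (column names are placeholders; with the dataset-backed reader, filters= prunes individual rows rather than only whole partitions, which was the older behaviour):

```python
import pyarrow.parquet as pq

# Keep only rows for 'evolution' that occur more than once on a page.
table = pq.read_table(
    "stubbytree/",
    filters=[("token", "==", "evolution"), ("count", ">", 1)],
)
df = table.to_pandas()
```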
Pyarrow 7.0, released earlier this month, includes group_by, aggregate, and sum operations written in C++ and running--I think?--multithreaded naturally. I haven't used the specific functions here, but in general with pyarrow.compute these things tend to work faster, with less memory overhead, and more cleanly than pandas equivalents. polars or duckdb might be faster and/or more feature-complete, but IMO that wouldn't justify the overhead of even an optional dependency from this library.
Arrow is currently used for the parquet reading, and as described above is faster for JSON parsing than writing to pandas directly. So if @organisciak et al are amenable, when I get a chance I'm going to take a look at rewiring the current internals of this package to use only arrow dataframes. Public methods would still return pandas frames by default--it's very fast to cast from arrow to pandas--but where useful/possible I would want escape hatches to the underlying arrow tables, and the indexes would be slapped on after aggregation rather than integral to creating the aggregations.
Not yet clear to me if the arrow hatch would be:
1. Parallel methods with different names (maybe the same names, prefixed with arrow_);
2. An additional argument to the pandas-generating functions;
3. An additional argument to Volume, Parser, or some other class that causes the pandas-generating functions to change what they emit.
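A small illustration of the pyarrow 7.0 aggregation API mentioned at the top of that comment (toy data, not this package's internals):

```python
import pyarrow as pa

tokens = pa.table({
    "page":  [1, 1, 1, 2],
    "token": ["the", "the", "evolution", "the"],
    "count": [3, 1, 1, 5],
})

# C++ hash aggregation; no pandas index involved.
summed = tokens.group_by(["page", "token"]).aggregate([("count", "sum")])
# result columns: page, token, count_sum
```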
I welcome this.
Solution #1 sounds most user-friendly to me; changing the output by argument is already done in various places, but the library's grown complex enough that I'd avoid it for future features.
The massivetexts fork has been a de facto dev branch during the SaDDL project, so it may be worth checking whether anything from there needs to be merged.
An in-progress FYI here. I poked at this a little more and noticed an interesting possibility in pyarrow/parquet that I've been using in my work on https://github.com/bmschmidt/nonconsumptive. The format I've settled into there does some fancy formatting to stash pyarrow StructArrays--which are effectively data frames--inside individual cells. There I'm storing the results for multiple books in a single frame. But in feature-reader, it's possible to serialize to a one-row parquet file with multiple columns--some of them are just metadata, but the full feature count data can also live in a single cell.
This would make it possible, if done carefully, to create a parquet file that contains all of the information in the original EF file. That's sketched out in the pull request I just started: create a single-row dataframe with all of those pieces, including tokencounts. Somewhat to my surprise, I still can't beat bz2 compression--there's perhaps a 10-20% penalty still with …
It's also a format that allows putting multiple EF files into a single parquet file with fast random access, solving my inode problems with zip mucking. For my own purposes I'm pretty sure this is the only way I'll store EF files going forwards. First stab at a format, not including metadata, is this schema:
Downsides here: …
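For illustration only, a minimal sketch of the nested-column idea described above--one row per volume, with the token counts stashed in a single list-of-structs cell. The column names and identifier are placeholders, not the schema from the pull request:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Token counts as a nested list<struct> column, so a whole volume fits in one cell.
tokencount_type = pa.list_(pa.struct([
    ("page",  pa.int32()),
    ("token", pa.string()),
    ("count", pa.int32()),
]))

volume = pa.table(
    {
        "htid":        ["example.0001"],           # placeholder identifier
        "title":       ["An Example Volume"],
        "tokencounts": [[{"page": 1, "token": "the",       "count": 4},
                         {"page": 2, "token": "evolution", "count": 1}]],
    },
    schema=pa.schema([("htid", pa.string()),
                      ("title", pa.string()),
                      ("tokencounts", tokencount_type)]),
)

pq.write_table(volume, "example_volume.parquet")
```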
Breaking out of #42.
With parquet backing, this library is fast--fast enough, I think, to sometimes feed straight into tensorflow without any intermediaries using the tf arrow Dataset strategy.
Currently a lot of the roadblocks I'm seeing are around pandas--indexes, and pandas just being a little slow at this point. As @organisciak pointed out in that issue, _make_tokencount_df is faster going through arrow to pandas than using pandas directly (18ms vs 20ms on one book I'm testing with).
But it's just 13ms if you simply stay as a pyarrow table, which is a better fit for many applications anyway.
So for the next version (3.0), I'd suggest moving internal data transfer in this module entirely to pyarrow and having indexed pandas frames be just a user-level convenience. Most of the work in the chunking and tokencount dataframes can be done right now.
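A sketch of that split, with hypothetical function names standing in for the package's real methods: the internal path stays in arrow, and the index only appears at the user-facing boundary.

```python
import pandas as pd
import pyarrow as pa

def _tokenlist_arrow() -> pa.Table:
    # Stand-in for an arrow-native internal path (toy data).
    return pa.table({
        "page":  [1, 1, 2],
        "token": ["the", "evolution", "the"],
        "count": [4, 1, 5],
    })

def tokenlist_df() -> pd.DataFrame:
    # User-level convenience: cast once, then attach the index afterwards.
    return _tokenlist_arrow().to_pandas().set_index(["page", "token"])
```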
I would code this up right now except that there's one place we still need pandas, which is the frequent grouped sums (e.g., df.groupby(['page', 'token']).sum()). There are a bunch of pyarrow-based groupby implementations out there--https://github.com/TomScheffers/pyarrow_ops/tree/main/pyarrow_ops, https://github.com/TomScheffers/wombat, the https://github.com/jorgecarleitao/datafusion-python frontend to the Rust-based https://docs.rs/datafusion/3.0.0/datafusion/, etc.--but none have wide adoption, and it seems possible that pyarrow will deploy a C++-based partial SQL implementation that will be the natural fit. (The pyarrow.compute portion of the pyarrow suite already does some of the things this library uses, like lowercasing strings.) In six months or a year, I think this landscape will be a bit clearer, and it could speed up a lot of operations here by 50% or more.
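A toy example of the pyarrow.compute kernel that covers the lowercasing step mentioned there (not the library's actual code path):

```python
import pyarrow as pa
import pyarrow.compute as pc

tokens = pa.array(["The", "Evolution", "OF", "species"])
lowered = pc.utf8_lower(tokens)   # runs as a C++ kernel, no round-trip through pandas
```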