Plans for Data-Forge version 2 #108

Open
ashleydavis opened this issue Feb 13, 2021 · 20 comments
Labels
version-2 Saving this for Data-Forge version 2.

Comments

@ashleydavis
Member

ashleydavis commented Feb 13, 2021

This issue is to discuss plans for version 2.

These are just ideas for the moment. I haven't started on this and am not sure when I will.

Plans for v2:

  • Minimize breaking changes
  • How do I make Data-Forge easy to use and easier to get started with?
    • Lazy evaluation is good for performance, but it makes DF hard to understand, does lazy evaluation need to die?
    • If lazy evaluation were removed, the internals of DF could be massively simplified (getting rid of all the iterables/iterators).
    • If lazy evaluation were removed you could look at a Series or DataFrame to see the current data that's in there (instead of say, having to call toArray).
    • We could say that splitting so that it fits in memory should happen above DF and is not the responsibility of DF.
  • Move plugins to the same repo (plugins will be republished under the org, e.g. @data-forge/fs)
  • Revise, improve and integrate the documentation (supported by having all the code for plugins in the one repository)
  • Delegate all maths to a pluggable library.
    • This means we can swap between floating-point and decimal maths (for the people who need that)
  • Better support for statistics (e.g. linear regressions, correlation, etc). I'm already working through this in v1.
  • Revise and overhaul serialization (e.g. support serialization/deserialization of JavaScript date objects)
    • Better support for mixed data types in columns (serializing the column type doesn't work for this, might need to serialize per-element type, I like the way MongoDB serializes dates to JSON, "$date").
  • Investigate replacing iterators with generator functions. (I've investigated this now and it doesn't seem possible.)
  • Add map, filter and reduce functions (this is done now), deprecate select and where functions (make it more JavaScript-like)
  • Support streaming data (e.g. for processing massive CSV files)
    • Ideally DF would be async first and be used to define pipelines for async streaming data, but does async go against the goal of making DF easier to use? Is there a way that I can make it so that async usage is friendly?
    • I'm now thinking that async and parallelisation are higher level concerns that exist above DF and are not DF's responsibility.
  • Define a format/convention for running transformations (map?) and accumulations (reduce?) over a series / dataframe.
  • It would be great if somehow Series and DataFrame were integrated. After all, DataFrame is just a Series with columns attached. Having separate Series and DataFrame is good for easy-to-browse documentation, but it makes for a lot of duplicated code. If DataFrame could just derive from Series that would be quite nice, except they have differing functionality. This needs some thought.

Stretch goals:

  • Better performance (Using Tensorflow.js ???)
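
To make the per-element serialization idea above more concrete, here is a minimal sketch in plain TypeScript. It assumes a MongoDB-style `{ "$date": ... }` tag for each date value; `serializeRows` and `deserializeRows` are hypothetical helper names, not part of Data-Forge.

```typescript
// Hypothetical helpers, not part of Data-Forge: each Date value is tagged
// individually, so a column with mixed types survives a JSON round trip.
type Row = Record<string, unknown>;

function serializeRows(rows: Row[]): string {
    // JSON.stringify applies Date.prototype.toJSON before the replacer sees
    // the value, so the original value is read from the holder (`this`).
    return JSON.stringify(rows, function (this: any, key: string, value: unknown) {
        const original = this[key];
        return original instanceof Date ? { $date: original.toISOString() } : value;
    });
}

function deserializeRows(json: string): Row[] {
    return JSON.parse(json, (_key, value) => {
        const isTaggedDate =
            value !== null &&
            typeof value === "object" &&
            typeof value.$date === "string" &&
            Object.keys(value).length === 1;
        return isTaggedDate ? new Date(value.$date) : value;
    });
}

const rows: Row[] = [
    { when: new Date("2021-02-13T00:00:00.000Z"), value: 1 },
    { when: "unknown", value: 2 }, // Mixed types in the same column.
];
const roundTripped = deserializeRows(serializeRows(rows));
```

Because the type travels with each element rather than with the column, the string in the second row is left untouched while the date in the first row comes back as a `Date`.
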
@ashleydavis ashleydavis added the version-2 Saving this for Data-Forge version 2. label Feb 13, 2021
@nemosmithasf

Adding my voice towards better support for large datasets

@rat-matheson

rat-matheson commented Feb 15, 2021

Some great ideas there! I'm not well versed enough in DataForge to give strong suggestions but I have written a couple pipe/stream libraries in the past and have some thoughts.

  1. RE: pluggable library - does this mean using something like n-api to facilitate faster operations in C++? If it was possible to do performant vector and matrix math with data forge, it would become a real contender so that folks don't have to learn python or R. In conjunction with large datasets support, this would be a huge success.

  2. API design - A minimal library that is discoverable via intellisense and yet extendable without having to build a special version is ideal. A really great way to do this might be to keep DataForge core extremely minimal and then have a small number of apply, transform, or map functions to carry out plugin operations. I know from your comments above that you are already thinking about how to do this. I'm curious as to how you imagine it but here are some thoughts I have.

Example:

// DF today
let series = myDf.getSeries('someCol');
return <SummaryStats>{
    // I haven't looked at the implementation but I imagine each of these series min/max has to evaluate
    // over the entire collection (meaning loading it into memory).  The future idea below loads in memory just once
    // and could easily be streamed so that it is only partially loaded
    min: series.min(),
    max: series.max(),
    ...
}

// The issue with the above is that 'series' needs to know all the statistical functions I want as a user 
// and have an implementation for them.  In addition, each call is a separate evaluation across the entire dataset

//Future idea??
import { SummaryStatsFactory } from 'simple-statistics/data-forge';
import { getXPercential } from './custom-operations/getXPercential'

myDf.getSeries('someCol')
    // I'm imagining that once summarizeRows is finally evaluated, it returns a new DataFrame with the summaries as columns
    // And that I may want to join that DF with something else
    .summarizeRows([
        // So still discoverable but plugins can be external.  Could write wrappers to populate packages so that
        // users don't need to learn new interfaces
        SummaryStatsFactory.getAverage({name:'SomeColAverage'}), 
        SummaryStatsFactory.getMinimum(),

        // add a custom function
        getXPercential({ percential:0.8, name:'SomeCol80thPercentile' }),
        ...
    ])

    // I added this additional summarizeRows call to illustrate that it could be lazily evaluated.  It could return an interface that
    // keeps taking summary operations until something forces an evaluation 
    .summarizeRows(SummaryStatsFactory.getMaximum());

There's a lot to take away there but the main point I want to drive home is if it is possible to keep the core DataFrame and Series interface as simple as possible, and then use external libraries to do the manipulations.

  • Easier to learn DataForge because you just need to learn the key functions to get started (the challenge is determining what functions are absolutely key)
  • Easy to extend (basically just export a function or group some together)
  • Related operations can be grouped into factories like an ML factory, a stats factory, IO factory, etc
  • Can use adapters for existing libraries so that users don't have to learn a whole new set of functions...almost like @types/...
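
The single-pass idea above can be sketched without any library at all. The following is a minimal illustration, assuming a hypothetical `Summary` accumulator interface and a free-standing `summarize` function. None of these names exist in Data-Forge or simple-statistics.

```typescript
// Hypothetical interface: each summary is a small accumulator that an
// external package could export without the core knowing about it.
interface Summary<T> {
    name: string;
    add(value: T): void;
    result(): number;
}

const minimum = (name = "min"): Summary<number> => {
    let current = Infinity;
    return { name, add: v => { current = Math.min(current, v); }, result: () => current };
};

const maximum = (name = "max"): Summary<number> => {
    let current = -Infinity;
    return { name, add: v => { current = Math.max(current, v); }, result: () => current };
};

const average = (name = "average"): Summary<number> => {
    let sum = 0;
    let count = 0;
    return {
        name,
        add: v => { sum += v; count += 1; },
        result: () => (count === 0 ? NaN : sum / count),
    };
};

// Visits every value exactly once, however many summaries are requested.
// Accepting any Iterable means the values never need to be held in memory at once.
function summarize<T>(values: Iterable<T>, summaries: Summary<T>[]): Record<string, number> {
    for (const value of values) {
        for (const summary of summaries) summary.add(value);
    }
    const output: Record<string, number> = {};
    for (const summary of summaries) output[summary.name] = summary.result();
    return output;
}

const stats = summarize([3, 1, 4, 1, 5], [minimum(), maximum(), average("SomeColAverage")]);
// stats is { min: 1, max: 5, SomeColAverage: 2.8 }
```

A custom operation such as a percentile would simply be another function returning a `Summary`, which is what keeps the core interface small.
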

The naming I chose was poor. Pipe seems better, but DataForge has a few kinds of piping operations (such as summarizing and grouping), which makes it complicated.

  3. Marketing/Project coordination - I'm pretty blown away by DF and that someone out there has written a book on working with data in JavaScript. In the JS sphere, you must be near the top in terms of having credentials for pushing a JavaScript stack for data exploration and manipulation. There is a void in the JS community and we need a good opinionated push showing data loading, exploration, and ML in JS along with good performance (and the performance part doesn't seem possible...yet). You have the credentials such that you could certainly bring a group of people together and drive a coordinated effort to make JS a data science contender. I bet funding could even be possible if a good case was made.

I'm pretty excited about that possibility. Imagine if DF 2 is comparable to pandas in terms of performance. There are more JS programmers than python programmers. And R is a true mess for readability. JS is better for visualization given its close connection to html. Plus, the transition for exploration to production might be easier in JS than in python and definitely easier compared to R, MatLab, etc.

  4. RE: grouping....I haven't used data forge enough to know if this is possible but in my streaming library, it had both group and ungroup functions. This was super useful in terms of grouping some things, doing some work on the group, and then going back to regular operations across the whole set. I'll take a look at my API at some point and see how it relates to DF.

Added an example for future discussion in a separate thread here.

@ashleydavis
Member Author

Some good ideas there @rat-matheson.

  1. is definitely on the cards, and I'll add that to the main post. Having pluggable high-performance vector operations will definitely pave the way for the future.

  2. to some extent has already happened. DF is written in TypeScript which means that intellisense already works. DF already has multiple separate plugins (fs, indicators and plot). I'm not 100% happy with the version 1 plugin model though, I'd love to find a better way to implement DF plugins. Happy to hear more ideas on that!

  3. I'm happy to get any help possible on marketing! ;) Thanks so much for your encouragement.

  4. I don't think there is an equivalent to grouping in DF (hard to say because I'm not sure if I've used that). Please feel free to open another issue to discuss that and provide some examples of how it might work or be used.

@ashleydavis
Member Author

I've started a new discussion for a Data-Forge core library #111.

Please give feedback there!

@tomund

tomund commented May 6, 2021

Add support for Parquet

@ashleydavis
Member Author

Sounds cool. Is there a JavaScript library for it?

@fpecserke

fpecserke commented May 11, 2021

Hey! Love this. I have some ideas for marketing. (I'm not a very experienced programmer)

I'm thinking about using this library and therefore using JS instead of Python for my analysis.

I would have liked to see at first sight how often the library is updated and how many people use it. Preferably somewhere here: http://www.data-forge-js.com/

If potential new users see that the community is growing and the library is evolving, it will make them more prone to try it and participate in the growth and maybe development or funding.

I for example would love to send a small recurring donation (Patreon-style) if possible. This is something JavaScript was missing.

Let me know what you think of these ideas. Cheers!

EDIT:
I just found out about GitHub Sponsorship. It would be great if it was linked on the page as well as a possible way to support.

@ashleydavis
Member Author

Hi @Spyrator thanks for your feedback and your sponsorship! I really appreciate it.

@mationai

mationai commented Jun 24, 2021

Speaking of marketing and the website, here are my 2 cents on the website. The hero banner has to go. It takes up half the screen real estate, giving the user a reason to bail before knowing how good the lib is. The image and coloring also make it look un-modern, giving the library a very dated feel.

EDIT: Sorry for the slightly off-topic note, but since the website is not OSS, this repo seems like the only place to voice my thought on it.

@ashleydavis
Member Author

@fuzzthink is that something you can help with?

I'd be happy for someone else to rebuild the landing page. I can do HTML and CSS, but my design skills leave a lot to be desired!

@mationai

@ashleydavis I'll be happy to help, but I'm not sure if I have time for it. But if you open source it, and state that you welcome UX changes, maybe others will jump in. Once you create a repo for it, I can take a look and we can chat in Issues there to possibly get the ball rolling with the banner change first.

@ashleydavis
Member Author

It's already open source here: https://github.com/ashleydavis/data-forge-landing-page

But probably best to throw it away and start again!

I'm happy to create a new repo in the DF org if you want to have at it.

@e-tang

e-tang commented Apr 30, 2022

Hi Ashley,

For this one, "Better support for statistics (e.g. linear regressions, correlation, etc)", your answer is " I'm already working through this in v1". What does it exactly mean? Have they been added in the V1? It seems that I couldn't find the corresponding APIs.

@ashleydavis
Member Author

ashleydavis commented May 1, 2022

Yes @e-tang, I was working on this at the end of 2021. There is some documentation but it's far from complete and I hope to improve it and continue the work in the future. If you feel like contributing please do!

Here's the main page that links to other resources:
https://www.data-forge-js.com/

Here's the API documentation:
https://data-forge.github.io/data-forge-ts/

Here are the statistics functions I've added already:

@e-tang

e-tang commented May 2, 2022

Thanks Ashley,

It would be great if we can have correlation, regression, etc.

Are these on the agenda? It will be really nice to have them.

@ashleydavis
Member Author

Yeah I would love to add those, I'm just not sure at the moment when I'll have time to come back to it.

@bananensplit

Hey!
I think adding support for Moment.js would be a good idea. It sure would make working with datetimes, dates and durations a lot easier. I think it is already used in the parseDates function (as mentioned here), so why not just keep the Moment object and use it for further operations?

@ashleydavis
Member Author

@bananensplit it actually uses Day.js, which is a smaller replacement for Moment.js.

I don't think Data-Forge needs to change for this though. You can easily convert any series to a series of Moment objects. If you find any problem with that, please submit an issue.

@jeff-hykin

I know this is an old thread, but I think this feedback is still relevant for these:

  • marketing
  • documentation
  • plugins
  • Series and DataFrame integration
  • lazy evaluation
  • easier to use and get started with

Quick feedback

  • marketing: Data Forge is still hidden. I'm in several JS discord / Matrix communities, asked for client side dataframe libraries, searched for days for clientside libraries, and found nothing but garbage. Not sure how to help, but FYI discoverability is a major problem.

  • documentation:

    • Please link to Core Concepts on the github ReadMe.md and website homepage. It's nearly everything I ever wanted for documentation. Took me a while to find it, but once I did it answered every question from parsing to groupBy to deflate.

    • format/convention for running transformations

      Conventions are powerful 👍 It would be nice to have a minimal core api, and then a "userland" api that builds on top of core.

  • plugins: please beware (and do anything to avoid) creating peer dependencies

    • If plugin1 is forced to import dataForge, and plugin2 is also forced to import dataForge, and the developers don't coordinate (which is inevitable), it creates unsolvable dependency hell.
    • Ways to avoid: if plugins are (somehow) a function of a dataForge object (rather than trying to import dataForge), it completely avoids this issue
    • See CodeMirror 6.x as an example of what not to do
  • Series and DataFrame integration

    • While I appreciate minimalism, especially when Pandas, Danfo, and the like are bloated, I strongly believe combining Series and Dataframe would be a bad oversimplification
    • It doesn't make sense to take the .mean() of a dataframe, and it doesn't make sense to get the columnNames of a series
    • When rendering or printing the outputs, they should each have separate ways of being visualized; one with the column names at the top, the other with names (if they exist) localized to each item.
    • A series is basically an array with no internal structure, anything can be .pushed(), while a dataframe has a consistent internal structure. Operations such as .groupBy() make a clear transition between the two. It would be confusing to pretend they were no different.
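
The plugin suggestion above (a plugin as a function of the library object rather than something that imports the library) could look roughly like this. It is only a sketch: `Core`, `Plugin`, `use` and `statsPlugin` are all hypothetical names, and the real Data-Forge API is not modelled here.

```typescript
// A stand-in for the core library object; the real Data-Forge API is not modelled here.
interface Core {
    version: string;
    range(start: number, count: number): number[];
}

// A plugin is a function of the core object. It never imports the core itself,
// so it cannot pull in a second, mismatched copy as a peer dependency.
type Plugin<Extension extends object> = (core: Core) => Extension;

const statsPlugin: Plugin<{ sumOfRange(start: number, count: number): number }> = core => ({
    sumOfRange: (start, count) => core.range(start, count).reduce((a, b) => a + b, 0),
});

// The application decides which copy of the core every plugin receives.
function use<Extension extends object>(core: Core, plugin: Plugin<Extension>): Core & Extension {
    return { ...core, ...plugin(core) };
}

const core: Core = {
    version: "2.0.0-hypothetical",
    range: (start, count) => Array.from({ length: count }, (_, i) => start + i),
};

const extended = use(core, statsPlugin);
const total = extended.sumOfRange(1, 4); // 1 + 2 + 3 + 4 = 10
```

With this shape, two independently developed plugins only need to agree on the `Core` interface, not on an installed version of the library.
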

Design Feedback (Lazy eval + Ease of use)

While I'm an imperative programmer, I think moving away from lazy evaluation (internally) would be a HUGE mistake.

There is a whole universe between eager and lazy, and many applications require a mix (getters, iterators, signals, etc). From an engine's point of view, it's easy to implement an eager API on top of a lazy one. It only took me ~20 lines to make EagerDataFrame. It inherits from Dataframe and performs new EagerDataFrame(output.toArray()) after every method call.

However, the opposite direction, trying to implement a lazy API on top of an eager API, is a nightmare of complexity. I've done it for pandas.
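
As a rough illustration of the eager-on-top-of-lazy direction, here is a self-contained sketch. `LazySeq` is a deliberately tiny stand-in for a lazy DataFrame, and `EagerSeq` is a variation on the approach described above: it uses an overridable `wrap` hook instead of calling `toArray()` after each method. Neither class is Data-Forge code.

```typescript
// A tiny lazy sequence; a stand-in for a lazy DataFrame, not Data-Forge's real class.
class LazySeq<T> implements Iterable<T> {
    constructor(protected readonly source: Iterable<T>) {}

    [Symbol.iterator](): Iterator<T> {
        return this.source[Symbol.iterator]();
    }

    // Hook that decides how the result of each operation is wrapped.
    protected wrap<U>(items: Iterable<U>): LazySeq<U> {
        return new LazySeq(items);
    }

    select<U>(fn: (value: T) => U): LazySeq<U> {
        const source = this.source;
        return this.wrap({
            *[Symbol.iterator]() {
                for (const value of source) yield fn(value);
            },
        });
    }

    where(predicate: (value: T) => boolean): LazySeq<T> {
        const source = this.source;
        return this.wrap({
            *[Symbol.iterator]() {
                for (const value of source) if (predicate(value)) yield value;
            },
        });
    }

    toArray(): T[] {
        return Array.from(this);
    }
}

// Eager variant: same methods, but every result is materialised immediately.
class EagerSeq<T> extends LazySeq<T> {
    protected wrap<U>(items: Iterable<U>): LazySeq<U> {
        return new EagerSeq(Array.from(items));
    }
}

let calls = 0;
const double = (n: number): number => { calls += 1; return n * 2; };

const lazy = new LazySeq([1, 2, 3]).select(double);
const callsAfterLazySelect = calls;  // 0: nothing has run yet
const lazyResult = lazy.toArray();   // [2, 4, 6], runs double three times

calls = 0;
const eager = new EagerSeq([1, 2, 3]).select(double);
const callsAfterEagerSelect = calls; // 3: select already did the work
```

The eager class needs only one overridden method because the lazy base already knows how to produce every result; going the other way would mean re-deriving deferred execution for each operation.
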

Ease-of-use, I believe is an issue of transparency, control, expectations, and bulk.

  • if getColumns() is lazy, then getColumnNames() should also be lazy or renamed (ex: have it be a .columnNames getter)
  • I shouldn't be able to access df.content.columnNames on one data frame, then have it fail for df.select(row=>row.isGood).content.columnNames
  • I tried looking for a .eval() or .instantiate() so that I could do df.where(pred).eval().content.values. It would allow me to test if the lambda in .select() was going to fail or not
  • On the flip side -- where I think dataForge is doing great -- is loop iteration. for (let each of df). It feels eager, it feels easy to use, even though it's actually lazy.

Before considering changing the engine I'd strongly consider:

  • helpers like .$select() that are an eager version of their non-dollarsign counterparts
  • renaming .content to ._content to show that it's fickle
  • having modes like .$.where().select() where everything after the $ is eager
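
To show why an eager helper makes failures easier to locate, here is a small sketch of the `$select` and `instantiate` ideas. `Pipeline` is a hypothetical class written for this example. The method names are borrowed from the suggestions above and are not existing Data-Forge methods.

```typescript
// Hypothetical minimal pipeline, written only to illustrate the suggestions above.
class Pipeline<T> {
    constructor(private readonly produce: () => T[]) {}

    static from<T>(items: T[]): Pipeline<T> {
        return new Pipeline(() => items);
    }

    // Lazy: only records the transformation.
    select<U>(fn: (value: T) => U): Pipeline<U> {
        return new Pipeline(() => this.produce().map(value => fn(value)));
    }

    // Forces evaluation now and returns a pipeline over the materialised values.
    instantiate(): Pipeline<T> {
        const values = this.produce();
        return new Pipeline(() => values);
    }

    // Eager counterpart of select: any error in `fn` is thrown at the call site.
    $select<U>(fn: (value: T) => U): Pipeline<U> {
        return this.select(fn).instantiate();
    }

    toArray(): T[] {
        return this.produce();
    }
}

const parse = (text: string): number => {
    const n = Number(text);
    if (Number.isNaN(n)) throw new Error(`Not a number: ${text}`);
    return n;
};

const input = Pipeline.from(["1", "2", "oops"]);
const lazy = input.select(parse); // No error yet: parse has not run.

let eagerError = "";
try {
    input.$select(parse);         // Throws here, next to the faulty lambda.
} catch (error) {
    eagerError = (error as Error).message;
}
```

With the lazy `select`, the same error only appears later, when something such as `toArray()` finally consumes the pipeline, which can be far from the line that introduced the bug.
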

@ashleydavis
Member Author

Thanks so much for your input @jeff-hykin. Getting detailed feedback like this is very motivating and I agree with practically all of what you have said. This thread is old, but if and when DF v2 goes ahead, more planning will have to be done and I'll call on you then to help.

The hardest thing is the marketing! So if you do get ideas on that please continue this thread. Happy new year!

9 participants