📋 Database contribution Google Form fields #33

Open
JamesAlfonse opened this issue Nov 28, 2024 · 5 comments
Labels
question Further information is requested

Comments

@JamesAlfonse
Member

Recently, a contribution form link was added to the database website as suggested by greengiantyo. This form was created over a year ago, and I thought now would be a good opportunity to revisit it and improve upon it.

Some thoughts:

  • I don't want potential submitters to feel like there's an overwhelming number of fields to fill out
  • Emails are not required; there's an optional field for an alias
  • Should fields be voted on? Do we cap the number of fields, or vote on a number? Do we vote on potential field options?

Screenshot 2024-11-25 at 5 36 49 AM

@JamesAlfonse JamesAlfonse moved this to Uncategorized in Issues Tracker Nov 28, 2024
@JamesAlfonse JamesAlfonse added the question Further information is requested label Nov 28, 2024
@JamesAlfonse JamesAlfonse removed the status in Issues Tracker Nov 28, 2024
@tehchives
Collaborator

Agreed, great opportunity. I'm interested to hear what others have to say. For myself, I think the concern about it being overwhelming could be a valid one but I also think we've taken measures to try and combat that while keeping things simple. We only require the symbol / company name to help identify where the data needs to go, and the other context fields are filled in as much or as little as the submitter would like.

We have talked about various possible expansions for the type of data which is covered by the database, and if we did add more columns this particular layout isn't very scalable. However, there's a huge advantage to keeping the data separated by question and clearly labeled as it makes it easier to import to the live database.

@JFWooten4
Member

JFWooten4 commented Dec 2, 2024

Thanks for bringing this up, @JamesAlfonse—been pondering it for a while. Firstly, I do not think we need to re-link to the legacy Google sheet in the form description. I personally found this a distraction when contributing from the new site, since I’m the kind of person that always tries to read all the linked content. But in this case, as is common on lower-end hardware, the page just hangs after loading forever. If I were a new contributor, this would make me worry about putting in info that’s already known.

Modifications

That said, this is a nuanced challenge, as I am certainly not as familiar with database design as you are. I’ve tried to work out how to open the present .db file natively in a browser, without success. Since I’m most used to directly editing the source of a page (like a markdown Docusaurus file), I’m a little out of my depth when it comes to directly editing this DB content. Might I ask how the data can be managed versus how it’s used today? As I understand it, this is in an RDBMS. How might users get a link to directly propose an edit to one entry, if that’s even feasible? I’m used to working with NoSQL DBs where you just pull up the individual entry by partition key.

I bring it up since a unique identifier could automatically propagate whatever change form we use. For example, say you find a company after searching and click a pencil edit button next to a field or on a blank. This could bring you to the form with an object ID pre-populated, which could be mapped out by the backend to display a ticker, name, or symbol.
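A rough sketch of what such a pre-populated link could look like, assuming the form stays on Google Forms, which accepts pre-filled answers as `entry.<id>` query parameters on the `viewform` URL. The form ID, the two entry IDs, and the record ID below are hypothetical placeholders, not values from the actual form or database.

```python
from urllib.parse import urlencode

# Hypothetical placeholders: the real form ID and entry IDs would be copied
# from the form's "Get pre-filled link" option.
FORM_ID = "FORM_ID_PLACEHOLDER"
ENTRY_RECORD_ID = "entry.1000001"   # assumed question holding the DB record ID
ENTRY_FIELD_NAME = "entry.1000002"  # assumed question naming the field to edit


def build_prefilled_link(record_id: str, field_name: str) -> str:
    """Return a form URL with the record ID and target field already filled in."""
    base = f"https://docs.google.com/forms/d/e/{FORM_ID}/viewform"
    params = {
        "usp": "pp_url",  # marks the URL as a pre-filled link
        ENTRY_RECORD_ID: record_id,
        ENTRY_FIELD_NAME: field_name,
    }
    return f"{base}?{urlencode(params)}"


# A pencil button next to a blank "IR email" cell could point here.
link = build_prefilled_link("42", "IR email")
print(link)
```

The backend would still need to map the record ID back to a ticker or company name for display; that lookup is not shown here.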

Attributions

As for the identification, Bur and I had a decent chat on this with Ankit last DC episode. The group there had some concerns, as do I, about arbitrary edit recommendations from anonymous users or IP addresses, as the case may be. I personally think this makes the integration process much more challenging. Namely, we can all believe that an edit from Chives, for example, is probably true and shouldn’t require much scrutiny. However, we might want to double-check new sources from Green if we’re not familiar with an authoritative link between their GitHub account and Discord, for example.

If the changes stay anonymous (rather than pseudonymous), then we could risk all incoming edits taking 5–10 minutes of diligent fact-checking by community users with write access to the DB. However, what if we associate changes with something like a GitHub account where we can transparently archive and rely on the hard work behind past data improvements? Then we could identify users like yourself who routinely add new info on a good-faith effort. Could this basis practically streamline automating new change requests as an authoritative source, with an understanding that any bad actions could be referenced and remedied as they are permanently recorded in something like a Git history?
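To make the question concrete, here is a minimal sketch of how change requests might be routed by contributor history. Everything in it is an assumption for illustration: the `ChangeRequest` shape, the `TRUSTED_AFTER` threshold, and the idea that accepted-edit counts would be tallied from Git history.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical threshold: accepted edits needed before skipping the full check.
TRUSTED_AFTER = 5


@dataclass
class ChangeRequest:
    contributor: Optional[str]  # GitHub login, or None for an anonymous edit
    field: str
    new_value: str


def review_path(request: ChangeRequest, accepted_counts: dict) -> str:
    """Pick a review path from the contributor's record of accepted edits."""
    if request.contributor is None:
        return "manual_review"  # anonymous edits always get fact-checked
    if accepted_counts.get(request.contributor, 0) >= TRUSTED_AFTER:
        return "fast_track"     # known good-faith contributor
    return "manual_review"


# Example tallies, as if counted from merged commits (made-up logins).
history = {"long-time-contributor": 12, "new-contributor": 1}
request = ChangeRequest("long-time-contributor", "ir_email", "ir@example.com")
print(review_path(request, history))
```

Fast-tracked edits would still be recorded, so a bad change could be traced and reverted later.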

Fields

I don’t think we necessarily need to have the DUNA vote on DB fields, as that seems like a relatively nuanced implementation detail local to this specific repository. Rather, I think we should be able to adequately reach consensus through informed public dialogue in chats like these. If a material disagreement ever comes up between a large number of community members, then sure, that might be worthy of a vote. But it’s my interpretation of private communications from a number of community members that the majority of participants naturally defer to the domain expertise of the growing number of GitHub contributors, especially those who’ve proven the dedication to attain write access.[1]

I’m with Chives in that I’d love to hear a little more from the actual contributors like yourself entering this data more frequently. How have they found the present field layout? Are they generally updating relatively complete fields like home exchange or CUSIP? How do they feel about the common capitalization of shared variables? Albeit this could be challenging to do without direct attribution to generous anonymous contributors.

Footnotes

  1. I do think it’s a good idea to do user management permissions through voting.

@JamesAlfonse
Member Author

Appreciate your response @JFWooten4 and your attention to detail. I totally missed the broken link and have updated it accordingly; it now links to database.whydrs.org

As for the .db file, to provide some context: I originally chose this method months ago after a conversation with ChatGPT in which I asked for recommendations to replace Google Sheets. I found it easy to use as a file that can be updated consistently with scripts + GH Actions, while inherently providing version control via GH commit history. It's had its own issues (like the current one where it's unable to update tables with data from the SEC), and I am open to alternatives or better ways of implementation. The way that I've been troubleshooting the file is by downloading it from GitHub, opening it using DB Browser for SQLite, making modifications, then uploading the .db file back to GitHub.
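For comparison with the manual download/edit/upload loop, a single edit could also be applied by a small script using Python's built-in `sqlite3` module. This is only a sketch: the `companies` table, its columns, and the sample row are hypothetical, and it uses an in-memory database so it runs without the real file.

```python
import sqlite3

# In-memory stand-in; a real script would open the repository's .db file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE companies (ticker TEXT PRIMARY KEY, company_name TEXT, ir_email TEXT)"
)
conn.execute("INSERT INTO companies VALUES ('ABC', 'Example Corp', NULL)")

# Hypothetical whitelist of columns that may be edited by script.
EDITABLE_COLUMNS = {"company_name", "ir_email"}


def apply_edit(conn: sqlite3.Connection, ticker: str, column: str, value: str) -> int:
    """Update one field for one ticker and return the number of rows changed."""
    if column not in EDITABLE_COLUMNS:
        raise ValueError(f"column not editable: {column}")
    # The column name is checked against the whitelist above; values are bound
    # as parameters so they are never interpolated into the SQL text.
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            f"UPDATE companies SET {column} = ? WHERE ticker = ?", (value, ticker)
        )
    return cur.rowcount


changed = apply_edit(conn, "ABC", "ir_email", "ir@example.com")
row = conn.execute("SELECT ir_email FROM companies WHERE ticker = 'ABC'").fetchone()
print(changed, row[0])
```

Run from a GH Action, a script like this would leave each edit as its own commit to the .db file.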

Your idea of letting users easily "edit" the database with a pencil icon, or something similar, is a great one, and I'd love to be able to include something like that. I'm just having trouble envisioning how this would be executed, considering both the challenges you and Bur have brought up and how to provide a user-friendly proofreading experience for helpers who want to assist with reviewing contributions. I'm hoping this will become clearer over time. (Side note: one thought that just occurred to me is that we should label the columns that are filled out through scripts/automations vs. the ones that hold manually collected data.)
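One possible way to record that side note is a small lookup table inside the same SQLite file. The table name `column_sources`, the column names, and which columns count as automated are all assumptions made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real .db file

# Hypothetical metadata table: one row per data column, tagged by origin.
conn.execute(
    """
    CREATE TABLE column_sources (
        column_name TEXT PRIMARY KEY,
        source TEXT NOT NULL CHECK (source IN ('automated', 'manual'))
    )
    """
)
conn.executemany(
    "INSERT INTO column_sources VALUES (?, ?)",
    [
        ("cusip", "automated"),          # assumed to be filled by a script
        ("home_exchange", "automated"),  # assumed to be filled by a script
        ("ir_email", "manual"),          # assumed to be collected by hand
    ],
)


def manual_columns(conn: sqlite3.Connection) -> list:
    """List the columns that contributors are expected to fill in by hand."""
    rows = conn.execute(
        "SELECT column_name FROM column_sources "
        "WHERE source = 'manual' ORDER BY column_name"
    ).fetchall()
    return [name for (name,) in rows]


print(manual_columns(conn))
```

A contribution form could then offer only the manual columns as editable fields.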

For the linked Google Form, below is a screenshot of all the responses received. We had more success with responses on other forms that were more specialized (e.g. asking for IR emails specifically).

Screenshot 2024-12-02 at 9 46 02 PM

@JamesAlfonse JamesAlfonse moved this to Todo in Issues Tracker Dec 3, 2024
@tehchives
Collaborator

> If the changes stay anonymous (rather than pseudonymous), then we could risk all incoming edits taking 5–10 minutes of diligent fact-checking by community users with write access to the DB. However, what if we associate changes with something like a GitHub account where we can transparently archive and rely on the hard work behind past data improvements? Then we could identify users like yourself who routinely add new info on a good-faith effort. Could this basis practically streamline automating new change requests as an authoritative source, with an understanding that any bad actions could be referenced and remedied as they are permanently recorded in something like a Git history?

I think this is a direction that could be very beneficial to move in, in the context of eventually distributing tokens to contributors on a continuous basis. Anonymous contributions don't have to be disabled, but some kind of identification metric would be valuable. Initially @JamesAlfonse had built something of a static identifier into the Google Sheet, so that users could submit anonymously while the sheet could still track which contributions came from which contributor. Maybe we could (eventually) normalize also prompting users to supply a public key (optional) and/or an associated Git login (also optional), while stressing that we appreciate contributions to the database either way and that providing either or both of those datapoints allows for more direct feedback on the effort itself, beyond data entry.

I think a good and separate point being made here is how much more successful data entry was when it focused on one specific item. The IR fields, and the tracking metrics that were built to show progress, really seemed to gamify the effort and drive interest. Maybe we could do something similar on a smaller scale; an off-the-cuff example would be trying to get the info for tickers starting with J during the month of January and tracking it in a similar way.
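As a sketch of how progress on a letter-of-the-month drive could be tracked, the query below counts how many tickers starting with a given letter already have a field filled in. The `companies` table, the `ir_email` column, and the sample tickers are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real .db file
conn.execute("CREATE TABLE companies (ticker TEXT PRIMARY KEY, ir_email TEXT)")
conn.executemany(
    "INSERT INTO companies VALUES (?, ?)",
    [
        ("JAAA", "ir@example.com"),
        ("JBBB", None),
        ("JCCC", "ir2@example.com"),
        ("KAAA", None),
    ],
)

# Hypothetical whitelist of columns a campaign can target.
TRACKED_COLUMNS = {"ir_email"}


def campaign_progress(conn: sqlite3.Connection, letter: str, column: str = "ir_email"):
    """Return (filled, total) for tickers that start with the given letter."""
    if column not in TRACKED_COLUMNS:
        raise ValueError(f"column not tracked: {column}")
    # COUNT(column) skips NULLs, so it counts only rows with a value present.
    filled, total = conn.execute(
        f"SELECT COUNT({column}), COUNT(*) FROM companies WHERE ticker LIKE ?",
        (letter + "%",),
    ).fetchone()
    return filled, total


filled, total = campaign_progress(conn, "J")
print(f"J tickers with IR email: {filled}/{total}")
```

Note that an empty string would count as filled here, so blank submissions would need to be stored as NULL for the tally to be meaningful.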

> But it’s my interpretation of private communications from a number of community members that the majority of participants naturally defer to the domain expertise of the growing number of GitHub contributors, especially those who’ve proven the dedication to attain write access.

@JFWooten4 This is sensible, and it's an understanding I'm slowly reaching myself as time goes by! As long as people know how to get involved when they want to, it's not necessary to impede movement just because they aren't involved at this moment. We're fortunate to have developed a good reputation and trust from the community.

@JFWooten4
Member

JFWooten4 commented Dec 6, 2024

> It’s not necessary to impede movement just because they aren’t involved at this moment.

I wonder about the granularity we want to dive into for every item here. Of course, the last thing I think of is a formal “this is large enough for a vote” and “this is too small to discuss.” But I also highly value the independence of any contributors to figure out how they think to best approach a challenge.

More explicitly, I’ve seen that it’s common for reference implementations embodied in PRs to be generally accepted as a positive contribution no matter the idealistic alignment, so to speak, thereof. As we think about potentially new people joining in on native open-source work, I wonder where to draw the “line” for responsibility for domain expertise approval. Principles can ring truly central in decentralized systems, and I contemplate how the community might maintain agility while driving longstanding accountable development.

Projects
Status: Backlog
3 participants