Books: prevent and remove duplicates being saved to DB #115

Open
4 tasks
lefnire opened this issue Nov 19, 2020 · 0 comments
Labels
📚Books & Resources Recommends: books, therapists, etc. help wanted Extra attention is needed

Comments

@lefnire
Collaborator

lefnire commented Nov 19, 2020

Currently the books recommender searches the Libgen database dump for matches and saves the top k matches to the database. If a user thumbs a book, it gets perma-saved to the database; otherwise, the next recommender run wipes the previous matches and saves the new ones.

Libgen can have many duplicate entries for a single book. This is because there are different formats, editions, etc. I'm using book_id as the primary key, which is Libgen-specific, not book-specific. Instead we should use ISBN or another identifier as the primary key to prevent duplicates from being saved on recommendation.
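A minimal sketch of the idea, assuming each match is a dict with a `book_id` and an optional `isbn` string (the `isbn` field name, the multi-ISBN comma/semicolon format, and both helper names are hypothetical, not taken from the current code). It collapses the top-k matches to one row per normalized ISBN before saving, and falls back to the Libgen `book_id` when no usable ISBN is present:

```python
import re
from typing import Dict, List, Optional


def normalize_isbn(raw: Optional[str]) -> Optional[str]:
    """Return the first ISBN-10/13-looking value with punctuation removed, or None."""
    if not raw:
        return None
    # Assumption: one field may hold several ISBNs separated by commas or semicolons.
    for token in re.split(r"[,;]", raw):
        cleaned = re.sub(r"[^0-9Xx]", "", token).upper()
        if len(cleaned) in (10, 13):
            return cleaned
    return None


def dedupe_matches(matches: List[Dict]) -> List[Dict]:
    """Keep the first (highest-ranked) match per ISBN, preserving input order."""
    seen = set()
    unique = []
    for match in matches:
        # Fall back to the Libgen-specific id when the ISBN is missing or unusable.
        key = normalize_isbn(match.get("isbn")) or f"libgen:{match['book_id']}"
        if key in seen:
            continue
        seen.add(key)
        unique.append(match)
    return unique
```

This only checks the length of the identifier, not its check digit, and different editions of the same book usually carry different ISBNs, so it would not catch every duplicate on its own.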

  • Find a good unique ID for Libgen books. Likely ISBN, but those might be null, or maybe there's a better ID
  • Redo the current books table to use that ID as the primary key
  • Write a migration to remove duplicates from the currently-saved books, keeping the entry with the most interactions (thumbs, etc.)
  • Also consider a simple Levenshtein distance check on book_title + book_author in case the ISBNs are different but the books are duplicates. Sometimes the same book is uploaded twice, with a single character changed in the title
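The last two tasks could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: each book is assumed to be a dict with `title`, `author`, and an `interactions` count, and the helper names and the `max_distance=2` threshold are hypothetical. It computes Levenshtein distance on the lowercased title plus author, then greedily keeps the most-interacted entry of each near-duplicate group:

```python
from typing import Dict, List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def text_key(book: Dict) -> str:
    """Lowercase title + author with whitespace collapsed."""
    return " ".join(f"{book['title']} {book['author']}".lower().split())


def pick_keepers(books: List[Dict], max_distance: int = 2) -> List[Dict]:
    """Visit books from most to fewest interactions; drop near-duplicates of kept ones."""
    keepers: List[Dict] = []
    for book in sorted(books, key=lambda b: b["interactions"], reverse=True):
        key = text_key(book)
        if all(levenshtein(key, text_key(kept)) > max_distance for kept in keepers):
            keepers.append(book)
    return keepers
```

The comparison is quadratic in the number of books, so a migration would probably want to run it only within candidate groups (for example, books sharing an author) rather than across the whole table.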
@lefnire lefnire added help wanted Extra attention is needed 📚Books & Resources Recommends: books, therapists, etc. labels Nov 19, 2020
@lefnire lefnire moved this to Beta in Gnothi Nov 6, 2022
@lefnire lefnire added this to Gnothi Nov 6, 2022
Projects
Status: Next