
Scrape reddit comments #10

Open · wants to merge 13 commits into base: main

Conversation

@syltruong (Contributor) commented on Jul 15, 2023

  • Remove the unused twitter_agent module
  • 1 comment == 1 document (a rough sketch of this mapping, including parent links for a comment graph, follows this list)
  • Revise the Reddit prompt engineering to prioritise posts with high upvotes and high author karma. Reference on prompt engineering.
  • Research whether we can better exploit Document metadata and establish a comment graph (left as an Issue)
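A minimal sketch of the "1 comment == 1 document" idea, plus the parent links a comment graph would need; the praw credentials, the function name, and the metadata fields are assumptions, not this PR's code:

import praw
from langchain.schema import Document

reddit = praw.Reddit(
    client_id="...",  # placeholder credentials
    client_secret="...",
    user_agent="reddit-scraper-sketch",
)

def load_comment_documents(submission_id: str) -> list[Document]:
    submission = reddit.submission(id=submission_id)
    submission.comments.replace_more(limit=0)  # drop "load more comments" stubs
    documents = []
    for comment in submission.comments.list():
        documents.append(
            Document(
                page_content=comment.body,
                # id / parent_id are enough to rebuild the discussion graph later
                metadata={
                    "id": comment.id,
                    "parent_id": comment.parent_id,
                    "score": comment.score,
                },
            )
        )
    return documents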

@ahmedbesbes (Owner) commented on Jul 16, 2023

I know this is still WIP, but I gave it a try today and got this error:

[Screenshot of the error, 2023-07-16 10:42:34]

You should probably first check that the author is not None :)

@syltruong (Contributor, Author)

Thanks for trying the branch out! The error is noted; I am working on it.

@syltruong (Contributor, Author)

In the meantime, I am planning to revise the prompts and add some instructions along the lines of "prioritise comments with high votes from authors with high karma count".

I also wanted to know the rationale behind having both a "stuff" and a "chromadb" alternative: could we use chromadb all the time? Are there downsides to that?

def summarize(self):
    if self.loaded_documents is not None:
        num_tokens = self._get_number_of_tokens()
        if num_tokens <= 4097:
            method = "stuff"
        else:
            method = "chromadb"

        with self.console.status(
            "Generating a summary of the loaded tweets ... ⌛ \n",
            spinner="aesthetic",
            speed=1.5,
            spinner_style="red",
        ):
            if method == "stuff":
                summary = summarize_tweets(self.loaded_documents)

            elif method == "chromadb":
                response = self.chain(
                    {
                        "question": summarization_question_template,
                    },
                )
                summary = response["answer"]

@ahmedbesbes (Owner) commented on Jul 16, 2023

Good question. In this context, "chroma" and "stuff" indicate summarization methods, not storage methods.

In fact, I'm using two summarization methods depending on the number of tokens (which obviously depends on the number of ingested posts/tweets).

If the number of tweets/posts is low enough that the token count stays at or below 4097 (an OpenAI limitation), we use a classic summarization method based on the load_summarize_chain function: this works pretty well for a low number of posts (e.g. < 100 tweets or so).

Otherwise, we use Chromadb as a proxy to summarize the data (that's where the summarization_question_template comes into play).

We can maybe rename the "chroma" and "stuff" variables to something more explicit.
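For reference, a minimal sketch of what the two paths could look like with LangChain; the function name, the token counting, and the exact chain choices here are assumptions, not the repo's code:

from langchain.chains import RetrievalQAWithSourcesChain
from langchain.chains.summarize import load_summarize_chain
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

llm = ChatOpenAI(temperature=0)

def summarize_documents(documents, question):
    # Count tokens across all loaded documents to pick a strategy.
    num_tokens = sum(llm.get_num_tokens(doc.page_content) for doc in documents)

    if num_tokens <= 4097:
        # "stuff": put every document into a single prompt and summarize in one call.
        chain = load_summarize_chain(llm, chain_type="stuff")
        return chain.run(documents)

    # "chromadb": index the documents and answer a summarization-style question
    # over the retrieved chunks instead of the full corpus.
    vectordb = Chroma.from_documents(documents, OpenAIEmbeddings())
    qa_chain = RetrievalQAWithSourcesChain.from_chain_type(
        llm, retriever=vectordb.as_retriever()
    )
    return qa_chain({"question": question})["answer"]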

@ahmedbesbes (Owner)

Do you need help on this? @syltruong

@syltruong (Contributor, Author)

Hey @ahmedbesbes, sorry for being MIA on this. I still have some changes to push before marking this PR as ready for review. Should be done by EOD SGT :)

@ahmedbesbes (Owner) commented on Jul 22, 2023

No problem, take your time :) And thank you for your help.

@syltruong marked this pull request as ready for review on July 23, 2023 at 08:28.
@syltruong (Contributor, Author)

@ahmedbesbes this is ready for review


@ahmedbesbes (Owner) commented on this part of the summarization prompt template:

    {text}

    I want you to provide a short summary and produce three questions that cover the discussed topics.

How does the LLM know whether the post has a high number of upvotes or whether the redditor has high karma?
I think it's interesting to include this information, but I'm not sure the LLM uses it (only the text is used here).
Unless I'm missing something?
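One possible way to make that information visible to the model is to prepend it to each document's text before it reaches the prompt. A minimal sketch, assuming praw comment objects; the render_comment helper and the header format are hypothetical, not this PR's code:

from langchain.schema import Document

def render_comment(comment) -> Document:
    # Put the vote count and author karma into the text the LLM actually reads;
    # guard against deleted authors (comment.author can be None).
    karma = comment.author.comment_karma if comment.author is not None else "unknown"
    header = f"[upvotes: {comment.score}, author karma: {karma}]"
    return Document(
        page_content=f"{header}\n{comment.body}",
        metadata={"score": comment.score, "permalink": comment.permalink},
    )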


@ahmedbesbes (Owner) commented on Jul 23, 2023, on this part of the diff:

    elif method == PromptMethod.retrievalqa:
        ret = """\
    Given the following documents, I want you to provide a short summary and produce three questions that cover the discussed topics.

Same here, let's replace ret with template.


@ahmedbesbes (Owner) commented on this part of the diff:

    def get_summarization_template(self, method: PromptMethod) -> str:
        if method == PromptMethod.stuff:
            ret = """\

For the sake of being explicit: could you replace ret with template?
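For reference, a sketch of how the method could read after the rename. The two template bodies come from the diff excerpts above (assuming the {text} template belongs to the stuff branch); the enum is reconstructed from the values in the diff, and the else branch and the final return are assumptions:

from enum import Enum

class PromptMethod(str, Enum):
    stuff = "stuff"
    retrievalqa = "retrievalqa"

# Method shown in isolation; in the PR it lives on the agent class.
def get_summarization_template(self, method: PromptMethod) -> str:
    if method == PromptMethod.stuff:
        template = """\
{text}

I want you to provide a short summary and produce three questions that cover the discussed topics."""
    elif method == PromptMethod.retrievalqa:
        template = """\
Given the following documents, I want you to provide a short summary and produce three questions that cover the discussed topics."""
    else:
        raise ValueError(f"Unsupported prompt method: {method}")
    return template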

@ahmedbesbes (Owner)

Thank you @syltruong! Just a few comments above.

I tried the app and I have some remarks:

  • Loading the Reddit posts takes quite long. Have you noticed that? This probably deserves an issue of its own.

  • The sources are often not provided when using Reddit. This may be related to the summarization prompt. We can maybe tweak it a bit so that the generated questions have answers in the loaded data.

Example 1: [screenshot, 2023-07-23 10:45:30]
Example 2: [screenshot, 2023-07-23 10:41:58]
Example 3: [screenshot, 2023-07-23 10:41:06]

  • There's a warning regarding the praw version. Can we update the Poetry dependencies? [Screenshot of the praw version warning, 2023-07-23 11:42:51]
