
Scrape reddit comments #10

Open · wants to merge 13 commits into base: main

Conversation

@syltruong (Contributor) commented on Jul 15, 2023

  • Remove the unused twitter_agent module
  • 1 comment == 1 document (a rough sketch of this mapping, including parent links for a comment graph, follows this list)
  • Revise the Reddit prompt engineering to prioritise posts with high upvotes and high author karma. Reference on prompt engineering.
  • Research whether we can better exploit Document metadata and establish a comment graph (left as an Issue)
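A minimal sketch of the "1 comment == 1 document" idea, plus the parent links a comment graph would need; the praw credentials, the function name, and the metadata fields are assumptions, not this PR's code:

import praw
from langchain.schema import Document

reddit = praw.Reddit(
    client_id="...",  # placeholder credentials
    client_secret="...",
    user_agent="reddit-scraper-sketch",
)

def load_comment_documents(submission_id: str) -> list[Document]:
    submission = reddit.submission(id=submission_id)
    submission.comments.replace_more(limit=0)  # drop "load more comments" stubs
    documents = []
    for comment in submission.comments.list():
        documents.append(
            Document(
                page_content=comment.body,
                # id / parent_id are enough to rebuild the discussion graph later
                metadata={
                    "id": comment.id,
                    "parent_id": comment.parent_id,
                    "score": comment.score,
                },
            )
        )
    return documents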

@ahmedbesbes (Owner) commented on Jul 16, 2023

I know this is still WIP, but I gave it a try today and got this error:

[Screenshot of the error, 2023-07-16 10:42:34]

You should probably first check that the author is not None :)

@syltruong (Contributor, Author)

Thanks for trying the branch out! The error is noted; I am working on it.

@syltruong (Contributor, Author)

In the meantime, I am planning to revise the prompts and add some instructions along the lines of "prioritise comments with high votes from authors with high karma count".

I also wanted to know the rationale behind having both a "stuff" and a "chromadb" alternative: could we use chromadb all the time? Are there downsides to that?

def summarize(self):
    if self.loaded_documents is not None:
        num_tokens = self._get_number_of_tokens()
        if num_tokens <= 4097:
            method = "stuff"
        else:
            method = "chromadb"

        with self.console.status(
            "Generating a summary of the loaded tweets ... ⌛ \n",
            spinner="aesthetic",
            speed=1.5,
            spinner_style="red",
        ):
            if method == "stuff":
                summary = summarize_tweets(self.loaded_documents)

            elif method == "chromadb":
                response = self.chain(
                    {
                        "question": summarization_question_template,
                    },
                )
                summary = response["answer"]

@ahmedbesbes (Owner) commented on Jul 16, 2023

Good question. In this context, "chroma" and "stuff" indicate summarization methods, not storage methods.

In fact, I'm using two summarization methods depending on the number of tokens (which obviously depends on the number of ingested posts/tweets).

If the number of tweets/posts is low enough that the token count stays at or below 4097 (an OpenAI limitation), we use a classic summarization method based on the load_summarize_chain function: this works pretty well for a low number of posts (e.g. < 100 tweets or so).

Otherwise, we use Chromadb as a proxy to summarize the data (that's where the summarization_question_template comes into play).

We can maybe rename the "chroma" and "stuff" variables to something more explicit.
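For reference, a minimal sketch of what the two paths could look like with LangChain; the function name, the token counting, and the exact chain choices here are assumptions, not the repo's code:

from langchain.chains import RetrievalQAWithSourcesChain
from langchain.chains.summarize import load_summarize_chain
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

llm = ChatOpenAI(temperature=0)

def summarize_documents(documents, question):
    # Count tokens across all loaded documents to pick a strategy.
    num_tokens = sum(llm.get_num_tokens(doc.page_content) for doc in documents)

    if num_tokens <= 4097:
        # "stuff": put every document into a single prompt and summarize in one call.
        chain = load_summarize_chain(llm, chain_type="stuff")
        return chain.run(documents)

    # "chromadb": index the documents and answer a summarization-style question
    # over the retrieved chunks instead of the full corpus.
    vectordb = Chroma.from_documents(documents, OpenAIEmbeddings())
    qa_chain = RetrievalQAWithSourcesChain.from_chain_type(
        llm, retriever=vectordb.as_retriever()
    )
    return qa_chain({"question": question})["answer"]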

@ahmedbesbes (Owner)

Do you need help on this? @syltruong

@syltruong (Contributor, Author)

Hey @ahmedbesbes, sorry for being MIA on this. I still have some changes to push before marking this PR as ready for review. Should be done by EOD SGT :)

@ahmedbesbes (Owner) commented on Jul 22, 2023

No problem, take your time :) And thank you for your help.

@syltruong marked this pull request as ready for review on July 23, 2023 at 08:28.
@syltruong (Contributor, Author)

@ahmedbesbes this is ready for review


@ahmedbesbes (Owner) commented on this part of the summarization prompt template:

    {text}

    I want you to provide a short summary and produce three questions that cover the discussed topics.

How does the LLM know whether the post has a high number of upvotes or whether the redditor has high karma?
I think it's interesting to include this information, but I'm not sure the LLM uses it (only the text is used here).
Unless I'm missing something?
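One possible way to make that information visible to the model is to prepend it to each document's text before it reaches the prompt. A minimal sketch, assuming praw comment objects; the render_comment helper and the header format are hypothetical, not this PR's code:

from langchain.schema import Document

def render_comment(comment) -> Document:
    # Put the vote count and author karma into the text the LLM actually reads;
    # guard against deleted authors (comment.author can be None).
    karma = comment.author.comment_karma if comment.author is not None else "unknown"
    header = f"[upvotes: {comment.score}, author karma: {karma}]"
    return Document(
        page_content=f"{header}\n{comment.body}",
        metadata={"score": comment.score, "permalink": comment.permalink},
    )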


@ahmedbesbes (Owner) commented on Jul 23, 2023, on this part of the diff:

    elif method == PromptMethod.retrievalqa:
        ret = """\
    Given the following documents, I want you to provide a short summary and produce three questions that cover the discussed topics.

Same here, let's replace ret with template.


@ahmedbesbes (Owner) commented on this part of the diff:

    def get_summarization_template(self, method: PromptMethod) -> str:
        if method == PromptMethod.stuff:
            ret = """\

For the sake of being explicit: could you replace ret with template?
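For reference, a sketch of how the method could read after the rename. The two template bodies come from the diff excerpts above (assuming the {text} template belongs to the stuff branch); the enum is reconstructed from the values in the diff, and the else branch and the final return are assumptions:

from enum import Enum

class PromptMethod(str, Enum):
    stuff = "stuff"
    retrievalqa = "retrievalqa"

# Method shown in isolation; in the PR it lives on the agent class.
def get_summarization_template(self, method: PromptMethod) -> str:
    if method == PromptMethod.stuff:
        template = """\
{text}

I want you to provide a short summary and produce three questions that cover the discussed topics."""
    elif method == PromptMethod.retrievalqa:
        template = """\
Given the following documents, I want you to provide a short summary and produce three questions that cover the discussed topics."""
    else:
        raise ValueError(f"Unsupported prompt method: {method}")
    return template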

@ahmedbesbes (Owner)

Thank you @syltruong! Just a few comments above.

I tried the app and I have some remarks:

  • Loading the Reddit posts takes quite long. Have you noticed that? This probably deserves an issue of its own.

  • The sources are often not provided when using Reddit. This may be related to the summarization prompt. We can maybe tweak it a bit so that the generated questions have answers in the loaded data.

Example 1: [screenshot, 2023-07-23 10:45:30]
Example 2: [screenshot, 2023-07-23 10:41:58]
Example 3: [screenshot, 2023-07-23 10:41:06]

  • There's a warning regarding the praw version. Can we update the Poetry dependencies? [Screenshot of the praw version warning, 2023-07-23 11:42:51]
