diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..bdaab25
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1 @@
+env/
diff --git a/Procfile b/Procfile
new file mode 100644
index 0000000..0a74b74
--- /dev/null
+++ b/Procfile
@@ -0,0 +1 @@
+web: sh setup.sh && streamlit run main.py
\ No newline at end of file
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..d524ee7
--- /dev/null
+++ b/README.md
@@ -0,0 +1,39 @@
+# Goals of the Project
+From working on the paper presentations and summaries for 11-631, Data Science Seminar, I was struck by how hard it was to both build and look through the literature graph offered by Semantic Scholar to find and see the relations between relevant papers. From doing literature surveys before, I know from firsthand experience that there do not seem to be any tools that perform all of the following functions:
+
+1. Allow users to explore the graph of connections between papers, hopping from one paper to another. This allows users to find papers that are directly related to the paper in question, instead of worrying about the relevance of a paper in their collection.
+2. Allow users to save their history, recording the papers they visit as papers they might want to read later and allowing them to go back to the last paper they were looking at whenever possible.
+3. Allow users to visualize their reading list as a graph of walks from node to node, demonstrating how they came upon the papers in question and not just the order in which they found them.
+
+Such a tool is important to me, as it would allow me to do tasks like the following:
+
+1. Search the literature around any particular survey paper, and see which of the papers it cites are the most and least influential. This would help me condense future paper reading efforts, and ensure I do not read too many papers outside of what I find interesting and relevant to my research interests.
+2. 
Search the literature around a hallmark paper, and find papers that use its approach in novel ways. While it is easy to find papers that are cited by current work, it is not easy to do the opposite, and the tools that do offer this information do not give a clear graphical view of the paper-to-paper relations I am looking at.
+
+To solve this, I proposed and created a tool that does the following:
+
+1. Allows users to search through the space of papers on Semantic Scholar through the Semantic Scholar API, visually helping them perform the Depth-First or Breadth-First Search they might want to do to find papers of interest.
+2. Allows users to see the citation and year distributions of papers in relation to their paper and of papers on their reading list, which helps them confirm that they have enough papers from specific time-frames for literature reviews, and helps them visually assess a paper's impact, as an older paper that is referenced by many highly cited papers may have had a lasting impact on the literature.
+3. Allows users to download the reading list at any point, thus ensuring they can easily transition from using the tool to getting and reading the papers in question.
+
+# Design Decisions
+
+- I chose to present the user's exploration as a walk on the total literature graph, to make it clearer to the user which papers are on their reading list and how they might have stumbled upon a paper in their survey process.
+- Due to how Streamlit's Agraph component works, I had to make the UI for navigating the graph a separate dialog, instead of being able to click on the graph to move around on it. This also ensures that the graph does not get too messy during the survey process, as papers with over 1000 references/citations do exist in the dataset, which would be incredibly difficult to work with and visualize.
+- I chose to visualize the years and citation counts as histograms, as they tend to be easily understood and clearly show the years and citation counts of the papers on one's reading list, and of the papers that cite or are cited by the currently selected paper. I had thought of doing something else here, such as a stem-and-leaf plot or some sort of timeline plot, but both of these options would have made it harder to see the frequencies of each year and any gaps in between years, which was the intent.
+- I chose to make the basic information for the paper an expandable dialog, because a dialog that was fully expanded all the time would have made the navigation difficult to reach. I had considered putting the basic paper information after the navigation information, but that was also annoying, as it prevented me from being able to look at the paper in more depth before looking at the papers around that paper in the citation graph.
+
+The main way that I personally came to my design was through a lot of trial and error. I had an idea of what information I wanted to display, and how my general navigation should work, but I was unsure exactly how I was going to place it all concisely. I made a few wireframes, and then decided on the format in the submission.
+
+# Process
+
+I spent a total of 18-20 hours on the project:
+
+1. I started off with 2 hours of brainstorming what I was going to do, and figuring out my data source. I ended up with the Semantic Scholar API, which allowed me to look at the literature graph without needing to send the data over to the user, and allowed for the tool to be useful even if I do not update it for a long time.
+
+2. I then spent 4 hours exploring my data, and seeing how the formatting might affect my analysis. I had to work out how to handle cases where the Semantic Scholar Paper ID might not be present or the title is meaningless, and filter such entries accordingly. 
I also came to understand the rate limit better, and realized I would have to be careful not to query the API too much when I needed more data.
+
+3. I spent 4 more hours constructing the basics of my interface, finding the components I would use to make it (e.g., Streamlit AGraph for the graph interface), and seeing what was not intuitive with Streamlit (like the buttons, as they only refresh the page through a callback).
+
+4. I then spent 8 hours working on the interface, and testing it out to make sure that the functionality worked as intended, and that nothing was out of the ordinary. In this phase, I added features like the back-button and the list of papers to potentially explore, which were the papers that Semantic Scholar listed as "highly influenced" by the main paper. This helped improve the papers I was looking at, and made the interface feel better to look at and use.
+
diff --git a/main.py b/main.py
new file mode 100644
index 0000000..2c822df
--- /dev/null
+++ b/main.py
@@ -0,0 +1,247 @@
+import streamlit as st
+import pandas as pd
+import numpy as np
+import requests
+import json
+import matplotlib.pyplot as plt
+import plotly.express as px
+import graphviz as gz
+from streamlit_agraph import agraph, TripleStore, Config
+from enum import Enum
+
+# Function Definitions
+
+
+def updatePaper(lookatme, store):
+    for paper in lookatme:
+        r3 = querydata(paper[0], forward=False)
+        for elem in r3:
+            if elem["isInfluential"]:
+                store.add_triple(elem["citedPaper"]["title"], "Cited by", paper[1])
+
+
+
+def getBasicPaperData(paperid, verbose=True):
+    @st.cache
+    def get_data():
+        response = requests.get("https://api.semanticscholar.org/graph/v1/paper/"+paperid+"/?fields=url,title,abstract,venue,year,fieldsOfStudy,authors")
+        output = json.loads(response.text)
+        return output
+    output = get_data()
+    #output
+    if verbose:
+        if "title" not in output:
+            st.error("Error: You have something wrong with your Semantic Scholar ID")
+            return "ERROR"
+
        st.subheader(output["title"])
+        st.text(", ".join(elem["name"] for elem in output["authors"]))
+        st.text(output["venue"] + " " + str(output["year"]))
+        output["abstract"]
+        output["url"]
+    return output["title"]
+
+def returnBasicPaperData(paperid, verbose=True):
+    @st.cache
+    def get_data():
+        response = requests.get("https://api.semanticscholar.org/graph/v1/paper/"+paperid+"/?fields=title,url,venue,year,referenceCount,citationCount")
+        output = json.loads(response.text)
+        return output
+    return get_data()
+
+
+@st.cache
+def querydata(paperid, forward=True, maxpages=5):
+    if forward:
+        response = requests.get("https://api.semanticscholar.org/graph/v1/paper/"+paperid+"/citations?fields=title,citationCount,isInfluential,venue,year,influentialCitationCount")
+    else:
+        response = requests.get("https://api.semanticscholar.org/graph/v1/paper/"+paperid+"/references?fields=title,citationCount,isInfluential,venue,year,influentialCitationCount")
+    response = json.loads(response.text)
+    total = response["data"]
+    counter = 1
+    while "next" in response.keys():
+        if forward:
+            response = requests.get("https://api.semanticscholar.org/graph/v1/paper/"+paperid+"/citations?fields=title,citationCount,isInfluential,venue,year,influentialCitationCount&offset="+str(response["next"]))
+        else:
+            response = requests.get("https://api.semanticscholar.org/graph/v1/paper/"+paperid+"/references?fields=title,citationCount,isInfluential,venue,year,influentialCitationCount&offset="+str(response["next"]))
+        response = json.loads(response.text)
+        total = total + response["data"]
+        counter += 1
+        if counter > maxpages:
+            break
+    return total
+
+
+def display_graph(current_title):
+    graph = gz.Digraph()
+
+    store = TripleStore()
+    store.add_triple(current_title, "", current_title)
+
+
+    for _, t1, _, t2 in st.session_state['paper_list']:
+        store.add_triple(t1, "", t2)
+
+
+    agraph(list(store.getNodes()), (store.getEdges() ), config = Config(height=500, width=700, nodeHighlightBehavior=True, 
highlightColor="#F7A7A6", directed=True,
+        collapsible=True))
+
+    st.graphviz_chart(graph)
+
+
+def init_callback(user_input):
+    st.session_state['paper_hist'].append(user_input)
+    st.session_state['exploration_state'] = 1
+
+#TODO: Add input validation to the search box, using the /paper/id system.
+
+def add_fd_paper_callback(forward_info, user_input, current_title, choice_forward, move=True):
+    choice_forward = st.session_state.choice_forward
+    for elem in forward_info:
+        if elem["citingPaper"]["title"] == choice_forward:
+            st.session_state['paper_list'].append((user_input, current_title, elem["citingPaper"]["paperId"], choice_forward))
+            st.session_state['paper_list'] = list(set(st.session_state['paper_list']))
+            if move:
+                st.session_state['paper_hist'].append(elem["citingPaper"]["paperId"])
+
+            break
+
+def add_bk_paper_callback(backward_info, user_input, current_title, choice_backward, move=True):
+    choice_backward = st.session_state.choice_backward
+    for elem in backward_info:
+        if elem["citedPaper"]["title"] == choice_backward:
+            st.session_state['paper_list'].append((elem["citedPaper"]["paperId"], choice_backward, user_input, current_title))
+            st.session_state['paper_list'] = list(set(st.session_state['paper_list']))
+            if move:
+                st.session_state['paper_hist'].append(elem["citedPaper"]["paperId"])
+            break
+
+
+def back_button():
+    st.session_state['paper_hist'].pop()
+
+def paper_hist():
+    out = set()
+    for _,a,_,b in st.session_state["paper_list"]:
+        out.add(a)
+        out.add(b)
+    out = "\n".join(sorted(list(out)))
+    return out
+
+def id_list():
+    out = set()
+    for a,_,b,_ in st.session_state["paper_list"]:
+        out.add(a)
+        out.add(b)
+    out = list(out)
+    return out
+if 'paper_list' not in st.session_state:
+    # Tuples of the form (last paper id, last paper name, current paper id, current paper name)
+    st.session_state['paper_list'] = []
+
+if 'paper_hist' not in st.session_state:
+    st.session_state['paper_hist'] = []
+
+
+if 'exploration_state' not in 
st.session_state:
+    st.session_state['exploration_state'] = 0
+
+st.title('Paper Explorer')
+
+if st.session_state['exploration_state'] == 0:
+    st.header("Welcome to the Paper Explorer! ")
+    st.write("Start by going to Semantic Scholar, and looking for the initial seed paper you wish to look at.")
+    st.write("Get the Semantic Scholar ID for this paper, and place it here.")
+    user_input = st.text_input("Paper Id", "6a9fa4c579bfd4fe4b1b06f384b946c5c28e1c47")
+    st.write("Selected Paper:")
+    current_title = getBasicPaperData(user_input)
+    if current_title != "ERROR":
+        use_this_paper = st.button("Choose this paper!", on_click=init_callback, args=(user_input,))
+
+
+
+else:
+    user_input = st.session_state['paper_hist'][-1]
+    current_title = getBasicPaperData(user_input, verbose=False)
+    forward_info = querydata(user_input)
+    backward_info = querydata(user_input, forward=False)
+    display_graph(current_title)
+
+    forward_titles = sorted([elem["citingPaper"]["title"] for elem in forward_info if len(elem["citingPaper"]["title"]) > 5])
+    backward_titles = sorted([elem["citedPaper"]["title"] for elem in backward_info if len(elem["citedPaper"]["title"]) > 5])
+
+
+    st.header('Paper Information')
+    with st.expander("Basic Information"):
+        getBasicPaperData(paperid=user_input)
+    with st.expander("Most Influential Papers Citing Current Paper"):
+        cleaned = [elem for elem in forward_info if elem["isInfluential"] and elem["citingPaper"]["influentialCitationCount"] != None and elem["citingPaper"]["influentialCitationCount"] >= 0]
+        cleaned.sort(key=lambda elem: elem["citingPaper"]["influentialCitationCount"], reverse=True)
+        cleaned.sort(key=lambda elem: elem["citingPaper"]["citationCount"], reverse=True)
+        paper_names = st.container()
+        plist = min(10, len(cleaned))
+        for elem in range(plist):
+            paper_names.text(cleaned[elem]["citingPaper"]["title"],)
+
+    with st.expander("Paper Stats"):
+        citations = [elem["citingPaper"]["citationCount"] for elem in forward_info if 
elem["citingPaper"]["citationCount"] != None]
+        if citations != []:
+            st.markdown("**The most cited paper that cites this paper is:**")
+            [elem for elem in forward_info if elem['citingPaper']['citationCount'] != None and elem['citingPaper']['citationCount'] == max(citations)][0]['citingPaper']['title']
+
+        citations = [elem["citingPaper"]["citationCount"] for elem in forward_info if elem["citingPaper"]["citationCount"] != None]
+        if citations != []:
+            fig = px.histogram(citations, title="Citation Count on Papers that Cite This Paper")
+            st.plotly_chart(fig)
+
+        citations = [elem["citedPaper"]["citationCount"] for elem in backward_info if elem["citedPaper"]["citationCount"] != None]
+        if citations != []:
+            st.markdown("**The most cited paper that this paper cites is:**")
+            [elem for elem in backward_info if elem['citedPaper']['citationCount'] != None and elem['citedPaper']['citationCount'] == max(citations)][0]['citedPaper']['title']
+
+        citations = [elem["citedPaper"]["citationCount"] for elem in backward_info if elem["citedPaper"]["citationCount"] != None]
+        if citations != []:
+            fig = px.histogram(citations, title="Citation Count on the Papers that this Paper Cites")
+            st.plotly_chart(fig)
+
+
+        citations = [elem["citedPaper"]["year"] for elem in backward_info if elem["citedPaper"]["year"] != None] + [elem["citingPaper"]["year"] for elem in forward_info if elem["citingPaper"]["year"] != None]
+        if citations != []:
+            fig = px.histogram(citations, title="Years Published on Papers Cited and Citing the Paper")
+            st.plotly_chart(fig)
+
+    st.header('Navigation')
+    with st.form(key="tf2"):
+        choice_forward = st.selectbox("Papers Citing This Paper", forward_titles, key="choice_forward")
+        select_forward = st.form_submit_button("Add this paper to reading list", on_click=add_fd_paper_callback, args=(forward_info, user_input, current_title, choice_forward, False))
+        select_forward = st.form_submit_button("Go to this paper", on_click=add_fd_paper_callback, 
args=(forward_info, user_input, current_title, choice_forward, True))
+
+    with st.form(key="tf3"):
+        choice_backward = st.selectbox("Current Paper Citations", backward_titles, key="choice_backward")
+        select_backward = st.form_submit_button("Add this paper to reading list", on_click=add_bk_paper_callback, args=(backward_info, user_input, current_title, choice_backward, False))
+        select_backward = st.form_submit_button("Go to this paper", on_click=add_bk_paper_callback, args=(backward_info, user_input, current_title, choice_backward, True))
+    if len(st.session_state["paper_hist"]) > 1:
+        st.button("Go back to the last paper.", on_click=back_button)
+
+    st.header('Reading List Stats')
+    with st.expander('Expand'):
+        paper_id_list = id_list()
+        out_l = []
+        for id in paper_id_list:
+            if id != None:
+                data = returnBasicPaperData(id)
+                out_l.append(data)
+
+        years = [elem["year"] for elem in out_l]
+        fig = px.histogram(years, title="Year Published for Papers on Reading List")
+        st.plotly_chart(fig)
+
+        years = [elem["citationCount"] for elem in out_l]
+        fig = px.histogram(years, title="# of Citations for Papers on Reading List")
+        st.plotly_chart(fig)
+
+    read_hist_str = paper_hist()
+    st.download_button("Download your reading list!", data=read_hist_str, file_name='ReadingList.txt')
+
+st.write("View source code and writeup here: https://github.com/IDSF21/assignment-2-arav-agarwal2")
+
+
diff --git a/requirements.txt b/requirements.txt
new file mode 100644
index 0000000..3765ccb
--- /dev/null
+++ b/requirements.txt
@@ -0,0 +1,89 @@
+altair==4.1.0
+argon2-cffi==21.1.0
+astor==0.8.1
+attrs==21.2.0
+backcall==0.2.0
+backports.zoneinfo==0.2.1
+base58==2.1.0
+bleach==4.1.0
+blinker==1.4
+cachetools==4.2.4
+certifi==2021.5.30
+cffi==1.14.6
+charset-normalizer==2.0.6
+click==7.1.2
+cycler==0.10.0
+debugpy==1.4.3
+decorator==5.1.0
+defusedxml==0.7.1
+entrypoints==0.3
+gitdb==4.0.7
+GitPython==3.1.24
+graphviz==0.17
+idna==3.2
+ipykernel==6.4.1
+ipython==7.28.0
+ipython-genutils==0.2.0
+ipywidgets==7.6.5
+jedi==0.18.0
+Jinja2==3.0.1
+jsonschema==4.0.1
+jupyter-client==7.0.5
+jupyter-core==4.8.1
+jupyterlab-pygments==0.1.2
+jupyterlab-widgets==1.0.2
+kiwisolver==1.3.2
+MarkupSafe==2.0.1
+matplotlib==3.3.4
+matplotlib-inline==0.1.3
+mistune==0.8.4
+nbclient==0.5.4
+nbconvert==6.2.0
+nbformat==5.1.3
+nest-asyncio==1.5.1
+networkx==2.6.3
+notebook==6.4.4
+numpy==1.20.0
+packaging==21.0
+pandas==1.3.3
+pandocfilters==1.5.0
+parso==0.8.2
+pexpect==4.8.0
+pickleshare==0.7.5
+Pillow==8.3.2
+plotly==5.3.1
+prometheus-client==0.11.0
+prompt-toolkit==3.0.20
+protobuf==3.18.0
+ptyprocess==0.7.0
+pyarrow==5.0.0
+pycparser==2.20
+pydeck==0.7.0
+Pygments==2.10.0
+pyparsing==2.4.7
+pyrsistent==0.18.0
+python-dateutil==2.8.2
+pytz==2021.3
+pyzmq==22.3.0
+requests==2.26.0
+Send2Trash==1.8.0
+six==1.16.0
+smmap==4.0.0
+streamlit==0.89.0
+streamlit-agraph==0.0.35
+streamlit-wordcloud==0.1.0
+tenacity==8.0.1
+terminado==0.12.1
+testpath==0.5.0
+toml==0.10.2
+toolz==0.11.1
+tornado==6.1
+traitlets==5.1.0
+typing-extensions==3.10.0.2
+tzlocal==3.0
+urllib3==1.26.7
+validators==0.18.2
+watchdog==2.1.6
+wcwidth==0.2.5
+webencodings==0.5.1
+widgetsnbextension==3.5.1
diff --git a/setup.sh b/setup.sh
new file mode 100644
index 0000000..f0ab258
--- /dev/null
+++ b/setup.sh
@@ -0,0 +1,8 @@
+mkdir -p ~/.streamlit/
+echo "\
+[server]\n\
+headless = true\n\
+port = $PORT\n\
+enableCORS = false\n\
+\n\
+" > ~/.streamlit/config.toml
\ No newline at end of file
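
Review note on `querydata` in main.py: the pagination loop there follows the `next` offset returned by each response and gives up after a page cap. A minimal sketch of that pattern, decoupled from the network so it can be run offline, might look like the following. The names `collect_pages` and `fake_fetch` are hypothetical and do not appear in the commit, and the 25-record data set is invented for illustration. The sketch also assumes the cap should be exact: it stops after `max_pages` pages, whereas the loop in the commit appears to fetch one page more than `maxpages` before breaking.

```python
from typing import Callable, Dict, List, Optional


def collect_pages(fetch_page: Callable[[int], Dict], max_pages: int = 5) -> List[Dict]:
    """Follow `next` offsets until they run out or max_pages pages were read."""
    results: List[Dict] = []
    offset: Optional[int] = 0
    pages_read = 0
    while offset is not None and pages_read < max_pages:
        page = fetch_page(offset)
        results.extend(page.get("data", []))
        pages_read += 1
        # The last page carries no "next" key, which ends the loop.
        offset = page.get("next")
    return results


def fake_fetch(offset: int) -> Dict:
    """Hypothetical stand-in for the HTTP request: 25 records, 10 per page."""
    records = [{"paperId": str(i)} for i in range(25)]
    page: Dict = {"data": records[offset:offset + 10]}
    if offset + 10 < len(records):
        page["next"] = offset + 10
    return page


if __name__ == "__main__":
    print(len(collect_pages(fake_fetch)))               # all three pages
    print(len(collect_pages(fake_fetch, max_pages=2)))  # capped at two pages
```

Passing the fetch function in as an argument keeps the paging logic testable without touching the rate-limited API; in the app, the same role would be played by a function that wraps the `requests.get` call.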