2 changes: 1 addition & 1 deletion README.md
@@ -27,7 +27,7 @@ It is also important that you have a recent version of the database which can be
$ docker compose up --build -d
```

3. Access the application through your web browser by going to http://localhost:8000
3. Access the application through your web browser by going to http://localhost:8000. Additionally, the server's backend can be queried directly; see the examples in `notebooks`.
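
   As a minimal sketch, such a query could look like the following; the endpoint path is a hypothetical placeholder (the notebooks contain the actual routes):
   ```sh
   # Hypothetical endpoint -- see the notebooks for the real routes
   $ curl "http://localhost:8000/api/papers?query=citation"
   ```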

### Useful tips

2 changes: 2 additions & 0 deletions docker-compose.yml
@@ -21,6 +21,8 @@ services:

db:
build: ./services/db
ports:
- "5432:5432"
volumes:
- db_data:/var/lib/postgresql/data
env_file:
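
With port `5432` published, the database can be reached directly from the host. A minimal sketch, assuming the default `postgres` user and an `aip` database (both names are assumptions; check the `env_file` for the actual values):

```sh
# User and database name are assumptions -- see the env_file for actual values
$ psql -h localhost -p 5432 -U postgres -d aip
```
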
25 changes: 16 additions & 9 deletions docs/datasets/README.md
@@ -4,6 +4,8 @@ Disclaimer: The datasets are on the order of several 100 GBs and take a signific

In order to update the database you must download all 4 data sources and parse them.

**Update**: The Semantic Scholar API is now behind a license. Ensure your usage fits the license specification.
1. Create a new folder in the root of the repository where all the input files will be stored with the following command:
```sh
$ export DOWNLOAD_DATE=$(date +'%d_%m_%Y')
@@ -15,26 +17,31 @@ In order to update the database you must download all 4 data sources and parse t
$ wget -P datasets/dblp_$DOWNLOAD_DATE https://dblp.uni-trier.de/xml/{dblp.xml.gz,dblp.dtd}
```
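
   Since the download is large, it can be worth verifying the archive before parsing, for example:
   ```sh
   $ gzip -t datasets/dblp_$DOWNLOAD_DATE/dblp.xml.gz && echo "dblp.xml.gz OK"
   ```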

3. Download the [Open Academic Graph](https://www.aminer.org/open-academic-graph) dataset which contains the Aminer and MAG papers using the following commands:
3. **UPDATE**: The Aminer and MAG datasets now appear to be combined into a single graph; the name is kept so the existing scripts still function. Download the [Open Academic Graph](https://www.aminer.org/open-academic-graph) dataset, which contains the Aminer and MAG papers, using the following commands:
URL points to a 404.
```sh
$ wget -P datasets/aminer_$DOWNLOAD_DATE https://www.aminer.cn/download_data\?link\=oag-2-1/aminer/paper/aminer_papers_{0..5}.zip
$ wget -P datasets/mag_$DOWNLOAD_DATE https://www.aminer.cn/download_data\?link\=oag-2-1/mag/paper/mag_papers_{0..16}.zip
$ wget -P datasets/aminer_$DOWNLOAD_DATE https://opendata.aminer.cn/dataset/oag_publication_{1..14}.zip
```

This will download the following files:
![img1.png](images/img1.png)

4. Download the Semantic Scholar dataset by following the [instructions](https://api.semanticscholar.org/corpus/download/) to get the latest corpus and store the files in the `s2-corpus_$DOWNLOAD_DATE` directory.
4. Download the Semantic Scholar dataset by following the [instructions](https://api.semanticscholar.org/api-docs/datasets) to get the latest corpus and store the files in the `s2-corpus_$DOWNLOAD_DATE` directory. Note that this dataset has not been tested since the API was put behind a license.

5. While the files are downloading (this can take days), continue with the next steps that do not depend on them.

6. After downloading all the files, unzip them.
6. Install PostgreSQL and create a user with sufficient privileges to create a new database. After that, update `database_manager.py` with the appropriate credentials.
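
   A minimal sketch of that setup using the standard PostgreSQL client tools (`aip_user` is a hypothetical name; match whatever `database_manager.py` expects):
   ```sh
   # "aip_user" is a hypothetical role name -- align it with database_manager.py
   $ sudo -u postgres createuser --createdb --pwprompt aip_user
   ```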

7. After making sure all files are unzipped and stored in the same folder, change line 14 of `renew_data_locally.py`, located in the parser folder, to the correct path of the folder where you downloaded all the files.
7. After downloading all the files, unzip them.
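
   A sketch of unzipping everything in place, assuming the directory layout from the steps above:
   ```sh
   $ for f in datasets/*_$DOWNLOAD_DATE/*.zip; do unzip "$f" -d "$(dirname "$f")"; done
   $ gunzip datasets/dblp_$DOWNLOAD_DATE/dblp.xml.gz
   ```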

8. After making sure all files are unzipped and stored in the same folder, change line 14 of `renew_data_locally.py`, located in the parser folder, to the correct path of the folder where you downloaded all the files.

![img3.png](images/img3.png)

8. Finally, run the `renew_data_locally.py` file.
9. Finally, install the dependencies from `requirements.txt` into your Python environment and run the `renew_data_locally.py` file.
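
   For example, with a fresh virtual environment:
   ```sh
   $ python -m venv .venv && source .venv/bin/activate
   $ pip install -r requirements.txt
   $ python parser/renew_data_locally.py
   ```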

10. After re-parsing the whole database, make sure to add the version dates of all the downloaded sources into the database.

9. After re-parsing the whole database, make sure to add the version dates of all the downloaded sources into the database.
![img4.png](images/img4.png)

![img4.png](images/img4.png)
11. Dump the database into a backup file: `pg_dump aip > data.backup`.
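
    The backup can later be restored into a fresh database, for example (`aip_restored` is a hypothetical name):
    ```sh
    $ createdb aip_restored && psql aip_restored < data.backup
    ```
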
Binary file modified docs/datasets/images/img1.png
Binary file added docs/datasets/images/old_img1.png
5 changes: 5 additions & 0 deletions notebooks/README.md
@@ -0,0 +1,5 @@
## General usage
1. Modify `db_setup.py` to contain the necessary credentials for accessing the database.
2. Set up a Python environment for running Jupyter notebooks (recommended: VS Code or VSCodium).

**Note:** Some of the notebooks are outdated and may need manual adjustments to work.
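
A minimal sketch for running the notebooks outside an IDE (the `notebook` package is an assumption; any Jupyter frontend works):

```sh
$ pip install notebook && jupyter notebook notebooks/
```
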
213 changes: 96 additions & 117 deletions notebooks/author_citation_count_vs_clique_size.ipynb

