2 changes: 1 addition & 1 deletion README.md
@@ -27,7 +27,7 @@ It is also important that you have a recent version of the database which can be
$ docker compose up --build -d
```

3. Access the application through your web browser by going to http://localhost:8000
3. Access the application through your web browser by going to http://localhost:8000. Additionally, the server's backend can be queried directly; see the examples in `notebooks`.
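
   As a minimal sketch, such a query could look like the following; the endpoint path is a hypothetical placeholder (the notebooks contain the actual routes):
   ```sh
   # Hypothetical endpoint -- see the notebooks for the real routes
   $ curl "http://localhost:8000/api/papers?query=citation"
   ```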

### Useful tips

2 changes: 2 additions & 0 deletions docker-compose.yml
@@ -21,6 +21,8 @@ services:

db:
build: ./services/db
ports:
- "5432:5432"
volumes:
- db_data:/var/lib/postgresql/data
env_file:
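
With port `5432` published, the database can be reached directly from the host. A minimal sketch, assuming the default `postgres` user and an `aip` database (both names are assumptions; check the `env_file` for the actual values):

```sh
# User and database name are assumptions -- see the env_file for actual values
$ psql -h localhost -p 5432 -U postgres -d aip
```
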
25 changes: 16 additions & 9 deletions docs/datasets/README.md
@@ -4,6 +4,8 @@ Disclaimer: The datasets are on the order of several 100 GBs and take a signific

In order to update the database you must download all 4 data sources and parse them.

**Update**: The Semantic Scholar API is now behind a license. Ensure your usage fits the license specification.
1. Create a new folder in the root of the repository where all the input files will be stored with the following command:
```sh
$ export DOWNLOAD_DATE=$(date +'%d_%m_%Y')
@@ -15,26 +17,31 @@ In order to update the database you must download all 4 data sources and parse t
$ wget -P datasets/dblp_$DOWNLOAD_DATE https://dblp.uni-trier.de/xml/{dblp.xml.gz,dblp.dtd}
```
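
   Since the download is large, it can be worth verifying the archive before parsing, for example:
   ```sh
   $ gzip -t datasets/dblp_$DOWNLOAD_DATE/dblp.xml.gz && echo "dblp.xml.gz OK"
   ```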

3. Download the [Open Academic Graph](https://www.aminer.org/open-academic-graph) dataset which contains the Aminer and MAG papers using the following commands:
3. **UPDATE**: The Aminer and MAG datasets now appear to be combined into a single graph; the name is kept so the existing scripts still function. Download the [Open Academic Graph](https://www.aminer.org/open-academic-graph) dataset, which contains the Aminer and MAG papers, using the following commands:
URL points to a 404.
```sh
$ wget -P datasets/aminer_$DOWNLOAD_DATE https://www.aminer.cn/download_data\?link\=oag-2-1/aminer/paper/aminer_papers_{0..5}.zip
$ wget -P datasets/mag_$DOWNLOAD_DATE https://www.aminer.cn/download_data\?link\=oag-2-1/mag/paper/mag_papers_{0..16}.zip
$ wget -P datasets/aminer_$DOWNLOAD_DATE https://opendata.aminer.cn/dataset/oag_publication_{1..14}.zip
```

This will download the following files:
![img1.png](images/img1.png)

4. Download the Semantic Scholar dataset by following the [instructions](https://api.semanticscholar.org/corpus/download/) to get the latest corpus and store the files in the `s2-corpus_$DOWNLOAD_DATE` directory.
4. Download the Semantic Scholar dataset by following the [instructions](https://api.semanticscholar.org/api-docs/datasets) to get the latest corpus and store the files in the `s2-corpus_$DOWNLOAD_DATE` directory. Note that this dataset has not been tested since the API was put behind a license.

5. While the files are downloading (this can take days), continue with the next steps that do not depend on them.

6. After downloading all the files, unzip them.
6. Install PostgreSQL and create a user with sufficient privileges to create a new database. After that, update `database_manager.py` with the appropriate credentials.
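
   A minimal sketch of that setup using the standard PostgreSQL client tools (`aip_user` is a hypothetical name; match whatever `database_manager.py` expects):
   ```sh
   # "aip_user" is a hypothetical role name -- align it with database_manager.py
   $ sudo -u postgres createuser --createdb --pwprompt aip_user
   ```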

7. After making sure all files are unzipped and stored in the same folder, change line 14 of `renew_data_locally.py`, located in the parser folder, to the correct path of the folder where you downloaded all the files.
7. After downloading all the files, unzip them.
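
   A sketch of unzipping everything in place, assuming the directory layout from the steps above:
   ```sh
   $ for f in datasets/*_$DOWNLOAD_DATE/*.zip; do unzip "$f" -d "$(dirname "$f")"; done
   $ gunzip datasets/dblp_$DOWNLOAD_DATE/dblp.xml.gz
   ```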

8. After making sure all files are unzipped and stored in the same folder, change line 14 of `renew_data_locally.py`, located in the parser folder, to the correct path of the folder where you downloaded all the files.

![img3.png](images/img3.png)

8. Finally, run the `renew_data_locally.py` file.
9. Finally, install the dependencies from `requirements.txt` into your Python environment and run the `renew_data_locally.py` file.
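
   For example, with a fresh virtual environment:
   ```sh
   $ python -m venv .venv && source .venv/bin/activate
   $ pip install -r requirements.txt
   $ python parser/renew_data_locally.py
   ```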

10. After re-parsing the whole database, make sure to add the version dates of all the downloaded sources into the database.

9. After re-parsing the whole database, make sure to add the version dates of all the downloaded sources into the database.
![img4.png](images/img4.png)

![img4.png](images/img4.png)
11. Dump the database into a backup file: `pg_dump aip > data.backup`.
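
    The backup can later be restored into a fresh database, for example (`aip_restored` is a hypothetical name):
    ```sh
    $ createdb aip_restored && psql aip_restored < data.backup
    ```
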
Binary file modified docs/datasets/images/img1.png
Binary file added docs/datasets/images/old_img1.png
5 changes: 5 additions & 0 deletions notebooks/README.md
@@ -0,0 +1,5 @@
## General usage
1. Modify `db_setup.py` to contain the necessary credentials for accessing the database.
2. Set up a Python environment for running Jupyter notebooks (recommended: VS Code or VSCodium).

**Note:** Some of the notebooks are outdated and may need manual adjustments to work.
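
A minimal sketch for running the notebooks outside an IDE (the `notebook` package is an assumption; any Jupyter frontend works):

```sh
$ pip install notebook && jupyter notebook notebooks/
```
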
213 changes: 96 additions & 117 deletions notebooks/author_citation_count_vs_clique_size.ipynb

