Add MBin Magazine Support (+ remove KBin support) #204

Merged 15 commits on Jan 8, 2025
31 changes: 16 additions & 15 deletions README.md
@@ -1,13 +1,15 @@
[![publish-pages](https://github.com/tgxn/lemmy-explorer/actions/workflows/publish-pages.yaml/badge.svg)](https://github.com/tgxn/lemmy-explorer/actions/workflows/publish-pages.yaml)

# Lemmy Explorer https://lemmyverse.net/

Data Dumps: https://data.lemmyverse.net/

This project provides a simple way to explore Lemmy Instances and Communities.

![List of Communities](./docs/images/communities.png)

The project consists of four modules:

1. Crawler (NodeJS, Redis) `/crawler`
2. Frontend (ReactJS, MUI Joy, TanStack) `/frontend`
3. Deploy (Amazon CDK v2) `/cdk`
@@ -20,16 +22,18 @@ The project consists of four modules:
You can append `home_url` and (optionally) `home_type` to the URL to set the home instance and type.

`?home_url=lemmy.example.com`
`?home_url=kbin.example.com&home_type=kbin`
`?home_url=mbin.example.com&home_type=mbin`

> `home_type` supports "lemmy" and "kbin" (default is "lemmy")
> `home_type` supports "lemmy" and "mbin" (default is "lemmy")
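
A minimal sketch of how these parameters could be read on the client, assuming a browser environment; `home_url` and `home_type` are the documented parameter names, while the helper itself and its defaults are illustrative assumptions:

```typescript
// Hypothetical helper: read the home instance settings from the page URL.
type HomeType = "lemmy" | "mbin";

function getHomeInstance(search: string = window.location.search): { url: string | null; type: HomeType } {
  const params = new URLSearchParams(search);
  const url = params.get("home_url"); // e.g. "lemmy.example.com"
  // "lemmy" is the documented default when home_type is omitted or unrecognised.
  const type: HomeType = params.get("home_type") === "mbin" ? "mbin" : "lemmy";
  return { url, type };
}
```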

### Q: **How does discovery work?**

It uses a [seed list of communities](https://github.com/tgxn/lemmy-explorer/blob/main/crawler/src/lib/const.js#L47) and scans the equivalent of the `/instances` federation lists, and then creates jobs to scan each of those servers.
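
A rough sketch of that discovery loop is below; the helper names passed in are assumptions for illustration, not the crawler's actual functions (its real queue classes appear in the diffs further down):

```typescript
// Illustrative only: seed domains -> fetch each instance's federation list ->
// enqueue a crawl job for every domain not seen before.
async function discoverInstances(
  seedDomains: string[],
  fetchFederatedInstances: (domain: string) => Promise<string[]>, // assumed helper
  enqueueInstanceJob: (domain: string) => Promise<void>, // assumed helper
): Promise<void> {
  const seen = new Set<string>(seedDomains);

  for (const domain of seedDomains) {
    // Each known server lists the servers it federates with (the `/instances` equivalent).
    for (const linked of await fetchFederatedInstances(domain)) {
      if (!seen.has(linked)) {
        seen.add(linked);
        await enqueueInstanceJob(linked); // each new domain becomes its own crawl job
      }
    }
  }
}
```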

Additionally, instance tags and trust data are fetched from [Fediseer](https://gui.fediseer.com/).

### Q: **How does the NSFW filter work?**

The NSFW filter is a client-side filter that hides NSFW communities and instances from results by default.
The "NSFW Toggle" checkbox has three states that you can toggle through:
| State | Filter | Value |
@@ -38,17 +42,18 @@ The "NSFW Toggle" checkbox has three states that you can toggle through:
| One Click | Include NSFW | null |
| Two Clicks | NSFW Only | true |

When you try to switch to a non-SFW state, a popup will appear to confirm your choice. You can save your response in your browser's cache and it will be remembered.
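
A minimal sketch of how a tri-state toggle like this can drive the filtering, using the `false` / `null` / `true` values from the table above; the community shape and function name are illustrative, not the frontend's actual code:

```typescript
// Tri-state filter: false = exclude NSFW (default), null = include everything, true = NSFW only.
type NsfwFilter = false | null | true;

interface CommunityLike {
  name: string;
  nsfw: boolean;
}

function applyNsfwFilter<T extends CommunityLike>(items: T[], filter: NsfwFilter): T[] {
  if (filter === null) return items; // "Include NSFW": no filtering at all
  if (filter === false) return items.filter((item) => !item.nsfw); // default: SFW only
  return items.filter((item) => item.nsfw); // "NSFW Only"
}
```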

### Q: **How long till my instance shows up?**

How long it takes to discover a new instance varies, depending on whether you post content that's picked up by one of these servers.

Since the crawler looks at lists of federated instances, we can't discover instances that aren't on those lists.

Additionally, the lists are cached for 24 hours, so it can take up to 24 hours after discovery for an instance to show up.

### Q: **Can I use your data in my app/website/project?**

I do not own any of the data retrieved by the crawler; it is available from public endpoints on the source instances.

You are free to pull data from the GitHub pages site:
@@ -66,21 +71,21 @@ You can also download [Latest ZIP](https://nightly.link/tgxn/lemmy-explorer/work

The `dist-json-bundle.zip` file contains the data in JSON format:

- `communities.full.json` - list of all communities
- `instances.full.json` - list of all instances
- `overview.json` - metadata and counts
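
As an example, one of these files can be pulled with a short script; the exact URL below is an assumption based on the data site linked at the top of this README, so verify the real path before relying on it:

```typescript
// Sketch: download the community dump and print a count (Node 18+ has global fetch).
async function fetchCommunities(): Promise<void> {
  // Assumed URL -- check https://data.lemmyverse.net/ for the actual location.
  const res = await fetch("https://data.lemmyverse.net/data/communities.full.json");
  if (!res.ok) throw new Error(`Failed to fetch dump: ${res.status}`);
  const communities: unknown[] = await res.json();
  console.log(`Fetched ${communities.length} communities`);
}

fetchCommunities().catch(console.error);
```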

## Crawler

[Crawler README](./crawler/README.md)

## Frontend

[Frontend README](./frontend/README.md)

## Data Site

[Data Site README](./pages/README.md)

## Deploy

@@ -90,9 +95,6 @@ The deploy is an Amazon CDK v2 project that deploys the crawler and frontend to

then run `cdk deploy --all` to deploy the frontend to AWS.




## Similar Sites

- https://browse.feddit.de/
@@ -102,8 +104,8 @@ then run `cdk deploy --all` to deploy the frontend to AWS.
- https://browse.toast.ooo/
- https://lemmyfind.quex.cc/


## Lemmy Stats Pages

- https://lemmy.fediverse.observer/dailystats
- https://the-federation.info/platform/73
- https://fedidb.org/software/lemmy
@@ -117,4 +119,3 @@ then run `cdk deploy --all` to deploy the frontend to AWS.
# Credits

Logo made by Andy Cuccaro (@andycuccaro) under the CC-BY-SA 4.0 license.

33 changes: 17 additions & 16 deletions crawler/README.md
@@ -48,7 +48,7 @@ These immediately run a specific task.
| `--init` | Initialize queue with seed jobs |
| `--health` | Check worker health |
| `--aged` | Create jobs for aged instances and communities |
| `--kbin` | Create jobs for kbin communities |
| `--mbin` | Create jobs for mbin communities |
| `--uptime` | Immediately crawl uptime data |
| `--fedi` | Immediately crawl Fediseer data |

@@ -73,7 +73,7 @@ These start a worker that will run continuously, processing jobs from the releva
| `-w instance` | Crawl instances from the queue |
| `-w community` | Crawl communities from the queue |
| `-w single` | Crawl single communities from the queue |
| `-w kbin` | Crawl kbin communities from the queue |
| `-w mbin` | Crawl mbin communities from the queue |
| `-w cron` | Schedule all CRON jobs for aged instances and communities, etc |

#### **Examples**
@@ -94,7 +94,7 @@ These start a worker that will run a single job, then exit.
| `-m [i\|instance] <base_url>` | Crawl a single instance |
| `-m [c\|community] <base_url>` | Crawl a single instance's community list |
| `-m [s\|single] <base_url> <community_name>` | Crawl a single community, delete if not exists |
| `-m [k\|kbin] <base_url>` | Crawl a single community |
| `-m [m\|mbin] <base_url>` | Crawl a single mbin instance |

#### **Examples**

@@ -126,7 +126,7 @@ Crawlers are tasks created to perform an action, which could be crawling an inst
| `community` | Community Crawling |
| `fediseer` | Fediseer Crawling |
| `uptime` | Uptime Crawling |
| `kbin` | Kbin Crawling |
| `mbin` | MBin Crawling |

### Queues

@@ -137,7 +137,7 @@ Queues are where Tasks can be placed to be processed.
| `instance` | Crawl an instance |
| `community_list` | Crawl a community |
| `community_single` | Crawl a single community |
| `kbin` | Crawl a kbin community |
| `mbin` | Crawl a mbin instance |
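
Based on the queue usage visible in the `manual.ts` and `instance.ts` diffs further down, enqueueing an mbin job looks roughly like this; treat it as an approximation rather than the exact API:

```typescript
// Approximate producer-side usage, mirroring crawler code changed in this PR.
import MBinQueue from "../queue/mbin";

async function queueMBinInstance(baseUrl: string): Promise<void> {
  // `false` appears to mean "producer only" (do not start a worker that processes jobs).
  const mbinQueue = new MBinQueue(false);
  await mbinQueue.createJob(baseUrl);
}
```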

## Storage

@@ -146,16 +146,17 @@ Redis is used to store crawled data.
You can use `docker compose up -d` to start a local redis server.
Data is persisted to a `.data/redis` directory.

| Redis Key | Description |
| -------------- | ----------------------------------------------------- |
| `attributes:*` | Tracked attribute sets _(change over time)_ |
| `community:*` | Community details |
| `deleted:*` | Deleted data _(recycle bin if something broken)_ |
| `error:*` | Exception details |
| `fediverse:*` | Fediverse data |
| `instance:*` | Instance details |
| `last_crawl:*` | Last crawl time for instances and communities |
| `magazine:*` | Magazine data _(kbin magazines)_ |
| `uptime:*` | Uptime data _(fetched from `api.fediverse.observer`)_ |
| Redis Key | Description |
| ----------------- | ----------------------------------------------------- |
| `attributes:*` | Tracked attribute sets _(change over time)_ |
| `community:*` | Community details |
| `deleted:*` | Deleted data _(recycle bin if something broken)_ |
| `error:*` | Exception details |
| `fediverse:*` | Fediverse data |
| `instance:*` | Instance details |
| `last_crawl:*` | Last crawl time for instances and communities |
| `mbin_instance:*` | MBin Instances |
| `magazine:*` | Magazine data _(mbin magazines)_ |
| `uptime:*` | Uptime data _(fetched from `api.fediverse.observer`)_ |

Most of the keys have sub keys for the instance `base_url` or community `base_url:community_name`.
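
As a sketch of what reading this data back might look like, assuming the local Redis from `docker compose up -d`, the ioredis client, and JSON-encoded string values (the real storage layer may differ):

```typescript
// List all MBin instance keys and dump the first record.
import Redis from "ioredis";

async function listMBinInstances(): Promise<void> {
  const redis = new Redis(); // localhost:6379 by default
  const keys: string[] = [];

  // SCAN instead of KEYS so a large dataset does not block Redis.
  let cursor = "0";
  do {
    const [next, batch] = await redis.scan(cursor, "MATCH", "mbin_instance:*", "COUNT", 100);
    cursor = next;
    keys.push(...batch);
  } while (cursor !== "0");

  console.log(`Found ${keys.length} MBin instances`);
  if (keys.length > 0) {
    const raw = await redis.get(keys[0]);
    console.log(keys[0], raw ? JSON.parse(raw) : null);
  }

  await redis.quit();
}

listMBinInstances().catch(console.error);
```
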
8 changes: 4 additions & 4 deletions crawler/ecosystem.config.cjs
@@ -37,10 +37,10 @@ module.exports = {
},
{
...defaultOptions,
output: "./.data/logs/kbin.log",
name: "crawl-kbin",
args: ["-w", "kbin"],
instances: 4,
output: "./.data/logs/mbin.log",
name: "crawl-mbin",
args: ["-w", "mbin"],
instances: 2,
},
],
};
14 changes: 7 additions & 7 deletions crawler/src/bin/manual.ts
@@ -3,7 +3,7 @@ import logging from "../lib/logging";
import InstanceQueue from "../queue/instance";
import CommunityQueue from "../queue/community_list";
import SingleCommunityQueue from "../queue/community_single";
import KBinQueue from "../queue/kbin";
import MBinQueue from "../queue/mbin";

export default async function runManualWorker(workerName: string, firstParam: string, secondParam: string) {
// scan one instance
@@ -36,12 +36,12 @@ export default async function runManualWorker(workerName: string, firstParam: st
});
}

// scan one kbin
else if (workerName == "k" || workerName == "kbin") {
logging.info(`Running Singel Q Scan KBIN Crawl for ${firstParam}`);
const crawlKBinManual = new KBinQueue(true, "kbin_manual");
await crawlKBinManual.createJob(firstParam, (resultData) => {
logging.info("KBIN Crawl Complete");
// scan one mbin
else if (workerName == "m" || workerName == "mbin") {
logging.info(`Running Single MBin Crawl for ${firstParam}`);
const crawlMBinManual = new MBinQueue(true, "mbin_manual");
await crawlMBinManual.createJob(firstParam, (resultData) => {
logging.info("MBIN Crawl Complete");
process.exit(0);
});
}
20 changes: 10 additions & 10 deletions crawler/src/bin/task.ts
@@ -5,14 +5,14 @@ import storage from "../lib/crawlStorage";
import InstanceQueue from "../queue/instance";
import CommunityQueue from "../queue/community_list";
import SingleCommunityQueue from "../queue/community_single";
import KBinQueue from "../queue/kbin";
import MBinQueue from "../queue/mbin";

import CrawlOutput from "../output/output";
import { syncCheckpoint } from "../output/sync_s3";

import CrawlUptime from "../crawl/uptime";
import CrawlFediseer from "../crawl/fediseer";
import CrawlKBin from "../crawl/kbin";
import CrawlMBin from "../crawl/mbin";

import CrawlAged from "../util/aged";
import Failures from "../util/failures";
@@ -99,11 +99,11 @@ export default async function runTask(taskName: string) {
...commSingleCounts,
});

const kbinQHealthCrawl = new KBinQueue(false);
const kbinQHeCounts = await kbinQHealthCrawl.queue.checkHealth();
const mbinQHealthCrawl = new MBinQueue(false);
const mbinQHeCounts = await mbinQHealthCrawl.queue.checkHealth();
healthData.push({
queue: "KBinQueue",
...kbinQHeCounts,
queue: "MBinQueue",
...mbinQHeCounts,
});

console.info("Queue Health Metrics");
Expand All @@ -122,10 +122,10 @@ export default async function runTask(taskName: string) {

break;

// create jobs for all known kbin instances
case "kbin":
const kbinScan = new CrawlKBin();
await kbinScan.createJobsAllKBin();
// create jobs for all known mbin instances
case "mbin":
const mbinScan = new CrawlMBin();
await mbinScan.createJobsAllMBin();

break;

23 changes: 12 additions & 11 deletions crawler/src/bin/worker.ts
@@ -7,12 +7,13 @@ import storage from "../lib/crawlStorage";
import InstanceQueue from "../queue/instance";
import CommunityQueue from "../queue/community_list";
import SingleCommunityQueue from "../queue/community_single";
import KBinQueue from "../queue/kbin";
import MBinQueue from "../queue/mbin";

// used to create scheduled instance checks
import CrawlAged from "../util/aged";
import CrawlFediseer from "../crawl/fediseer";
import CrawlUptime from "../crawl/uptime";
import CrawlKBin from "../crawl/kbin";
import CrawlMBin from "../crawl/mbin";

import { syncCheckpoint } from "../output/sync_s3";

@@ -35,9 +36,9 @@ export default async function startWorker(startWorkerName: string) {
} else if (startWorkerName == "single") {
logging.info("Starting SingleCommunityQueue Processor");
new SingleCommunityQueue(true);
} else if (startWorkerName == "kbin") {
logging.info("Starting KBinQueue Processor");
new KBinQueue(true);
} else if (startWorkerName == "mbin") {
logging.info("Starting MBinQueue Processor");
new MBinQueue(true);
}

// cron worker
@@ -70,14 +71,14 @@ export default async function startWorker(startWorkerName: string) {
});
}

// shares CRON_SCHEDULES.KBIN
logging.info("Creating KBin Cron Task", CRON_SCHEDULES.KBIN);
cron.schedule(CRON_SCHEDULES.KBIN, async (time) => {
console.log("Running KBin Cron Task", time);
// shares CRON_SCHEDULES.MBIN
logging.info("Creating MBin Cron Task", CRON_SCHEDULES.MBIN);
cron.schedule(CRON_SCHEDULES.MBIN, async (time) => {
console.log("Running MBin Cron Task", time);
await storage.connect();

const kbinScan = new CrawlKBin();
await kbinScan.createJobsAllKBin();
const mbinScan = new CrawlMBin();
await mbinScan.createJobsAllMBin();

await storage.close();
});
3 changes: 2 additions & 1 deletion crawler/src/crawl/fediseer.ts
@@ -1,6 +1,7 @@
import logging from "../lib/logging";
import storage from "../lib/crawlStorage";
import { IFediseerInstanceData, IFediseerTag } from "../lib/storage/fediseer";

import { IFediseerInstanceData, IFediseerTag } from "../../../types/storage";

import CrawlClient from "../lib/CrawlClient";

18 changes: 9 additions & 9 deletions crawler/src/crawl/instance.ts
@@ -5,7 +5,7 @@ import {
IErrorDataKeyValue,
ILastCrawlData,
ILastCrawlDataKeyValue,
} from "../lib/storage/tracking";
} from "../../../types/storage";

import { CRAWL_AGED_TIME } from "../lib/const";
import { HTTPError, CrawlError, CrawlTooRecentError } from "../lib/error";
@@ -19,21 +19,21 @@ import InstanceQueue from "../queue/instance";

import CrawlClient from "../lib/CrawlClient";

import KBinQueue from "../queue/kbin";
import MBinQueue from "../queue/mbin";

export default class InstanceCrawler {
private crawlDomain: string;
private logPrefix: string;

private kbinQueue: KBinQueue;
private mbinQueue: MBinQueue;

private client: CrawlClient;

constructor(crawlDomain: string) {
this.crawlDomain = crawlDomain;
this.logPrefix = `[Instance] [${this.crawlDomain}]`;

this.kbinQueue = new KBinQueue(false);
this.mbinQueue = new MBinQueue(false);

this.client = new CrawlClient();
}
@@ -109,10 +109,10 @@ export default class InstanceCrawler {
// store all fediverse instance software for easy metrics
await storage.fediverse.upsert(this.crawlDomain, nodeInfo.software);

// scan kbin instances that are found
if (nodeInfo.software.name == "kbin") {
console.log(`${this.crawlDomain}: found kbin instance - creating job`);
await this.kbinQueue.createJob(this.crawlDomain);
// scan mbin instances that are found
if (nodeInfo.software.name == "mbin") {
console.log(`${this.crawlDomain}: found mbin instance - creating job`);
await this.mbinQueue.createJob(this.crawlDomain);
}

// only allow lemmy instances
@@ -325,7 +325,7 @@ export const instanceProcessor: IJobProcessor = async ({ baseUrl }) => {
if (
knownFediverseServer.name !== "lemmy" &&
knownFediverseServer.name !== "lemmybb" &&
knownFediverseServer.name !== "kbin" &&
// knownFediverseServer.name !== "mbin" &&
knownFediverseServer.time &&
Date.now() - knownFediverseServer.time < CRAWL_AGED_TIME.FEDIVERSE // re-scan fedi servers to check their status
) {