Merge pull request #204 from tgxn/kbin_mbin
Add MBin Magazine Support (+ remove KBin support)
tgxn authored Jan 8, 2025
2 parents 0c6ca95 + 2d552e8 commit 3db14c2
Showing 44 changed files with 1,281 additions and 917 deletions.
31 changes: 16 additions & 15 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,13 +1,15 @@
[![publish-pages](https://github.com/tgxn/lemmy-explorer/actions/workflows/publish-pages.yaml/badge.svg)](https://github.com/tgxn/lemmy-explorer/actions/workflows/publish-pages.yaml)

# Lemmy Explorer https://lemmyverse.net/

Data Dumps: https://data.lemmyverse.net/

This project provides a simple way to explore Lemmy Instances and Communities.

![List of Communities](./docs/images/communities.png)

The project consists of four modules:

1. Crawler (NodeJS, Redis) `/crawler`
2. Frontend (ReactJS, MUI Joy, TanStack) `/frontend`
3. Deploy (Amazon CDK v2) `/cdk`
@@ -20,16 +22,18 @@ The project consists of four modules:
You can append `home_url` and (optionally) `home_type` to the URL to set the home instance and type.

`?home_url=lemmy.example.com`
`?home_url=kbin.example.com&home_type=kbin`
`?home_url=mbin.example.com&home_type=mbin`

> `home_type` supports "lemmy" and "kbin" (default is "lemmy")
> `home_type` supports "lemmy" and "mbin" (default is "lemmy")
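
As a rough TypeScript sketch (not code from this repository), a link with these parameters could be built like so; the instance domains are placeholders:

```typescript
// Build a lemmyverse.net link that pre-sets the visitor's home instance.
// The instance domains below are placeholders, not real servers.
function buildExplorerLink(homeUrl: string, homeType: "lemmy" | "mbin" = "lemmy"): string {
  const params = new URLSearchParams({ home_url: homeUrl });
  if (homeType !== "lemmy") params.set("home_type", homeType);
  return `https://lemmyverse.net/?${params.toString()}`;
}

console.log(buildExplorerLink("lemmy.example.com"));
// https://lemmyverse.net/?home_url=lemmy.example.com
console.log(buildExplorerLink("mbin.example.com", "mbin"));
// https://lemmyverse.net/?home_url=mbin.example.com&home_type=mbin
```
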
### Q: **How does discovery work?**

It uses a [seed list of communities](https://github.com/tgxn/lemmy-explorer/blob/main/crawler/src/lib/const.js#L47) and scans the equivalent of the `/instances` federation lists, and then creates jobs to scan each of those servers.

Additionally, instance tags and trust data are fetched from [Fediseer](https://gui.fediseer.com/).
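
As a rough illustration of the discovery step described above (not the crawler's actual implementation), scanning one seed instance's federation list might look like this; it assumes Lemmy's public `GET /api/v3/federated_instances` endpoint and a hypothetical `enqueueInstanceJob` helper:

```typescript
// Illustrative sketch only; enqueueInstanceJob() is a hypothetical helper,
// standing in for whatever adds an instance crawl job to the queue.
declare function enqueueInstanceJob(domain: string): Promise<void>;

async function discoverFrom(seedBaseUrl: string): Promise<void> {
  // Lemmy lists the instances it federates with on this public endpoint.
  const res = await fetch(`https://${seedBaseUrl}/api/v3/federated_instances`);
  const data = await res.json();

  // Newer Lemmy versions return objects with a `domain` field;
  // older versions returned plain strings, so handle both shapes.
  const linked: Array<string | { domain: string }> = data?.federated_instances?.linked ?? [];

  for (const entry of linked) {
    const domain = typeof entry === "string" ? entry : entry.domain;
    await enqueueInstanceJob(domain);
  }
}
```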

### Q: **How does the NSFW filter work?**

The NSFW filter is a client-side filter that removes NSFW communities and instances from results by default.
The "NSFW Toggle" checkbox has three states that you can toggle through:
| State | Filter | Value |
@@ -38,17 +42,18 @@ The "NSFW Toggle" checkbox has three states that you can toggle through:
| One Click | Include NSFW | null |
| Two Clicks | NSFW Only | true |

When you try to switch to a non-SFW state, a popup will appear to confirm your choice. You can save your response in your browser's cache and it will be remembered.
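
A minimal sketch of how that three-state value could drive the client-side filtering (illustrative only; names like `applyNsfwFilter` are not from the codebase):

```typescript
// Illustrative only: the three toggle states map to false / null / true.
type NsfwFilterState = false | null | true;

interface Community {
  name: string;
  nsfw: boolean;
}

function applyNsfwFilter(items: Community[], state: NsfwFilterState): Community[] {
  if (state === false) return items.filter((c) => !c.nsfw); // default: hide NSFW
  if (state === true) return items.filter((c) => c.nsfw); // NSFW only
  return items; // null: include everything
}
```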

### Q: **How long till my instance shows up?**

How long it takes to discover a new instance varies, depending on whether you post content that's picked up by one of these servers.

Since the crawler looks at lists of federated instances, we can't discover instances that aren't on those lists.

Additionally, the lists are cached for 24 hours, so it can take up to 24 hours for an instance to show up after it's been discovered.

### Q: **Can I use your data in my app/website/project?**

I do not own any of the data retrieved by the crawler; it is available from public endpoints on the source instances.

You are free to pull data from the GitHub Pages site:
@@ -66,21 +71,21 @@ You can also download [Latest ZIP](https://nightly.link/tgxn/lemmy-explorer/work

The `dist-json-bundle.zip` file contains the data in JSON format:

- `communities.full.json` - list of all communities
- `instances.full.json` - list of all instances
- `overview.json` - metadata and counts
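
For example, a small TypeScript sketch of pulling one of these files; the exact path on `data.lemmyverse.net` is an assumption and may differ:

```typescript
// Sketch only: the file path below is assumed, check data.lemmyverse.net for the real layout.
async function fetchOverview(): Promise<unknown> {
  const res = await fetch("https://data.lemmyverse.net/overview.json");
  if (!res.ok) throw new Error(`Request failed: ${res.status}`);
  return res.json();
}

fetchOverview().then((overview) => console.log(overview));
```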

## Crawler

[Crawler README](./crawler/README.md)

## Frontend

[Frontend README](./frontend/README.md)

## Data Site

[Data Site README](./pages/README.md)

## Deploy

@@ -90,9 +95,6 @@ The deploy is an Amazon CDK v2 project that deploys the crawler and frontend to

then run `cdk deploy --all` to deploy the frontend to AWS.




## Similar Sites

- https://browse.feddit.de/
@@ -102,8 +104,8 @@ then run `cdk deploy --all` to deploy the frontend to AWS.
- https://browse.toast.ooo/
- https://lemmyfind.quex.cc/


## Lemmy Stats Pages

- https://lemmy.fediverse.observer/dailystats
- https://the-federation.info/platform/73
- https://fedidb.org/software/lemmy
@@ -117,4 +119,3 @@ then run `cdk deploy --all` to deploy the frontend to AWS.
# Credits

Logo made by Andy Cuccaro (@andycuccaro) under the CC-BY-SA 4.0 license.

33 changes: 17 additions & 16 deletions crawler/README.md
Original file line number Diff line number Diff line change
@@ -48,7 +48,7 @@ These immediately run a specific task.
| `--init` | Initialize queue with seed jobs |
| `--health` | Check worker health |
| `--aged` | Create jobs for aged instances and communities |
| `--kbin` | Create jobs for kbin communities |
| `--mbin` | Create jobs for mbin communities |
| `--uptime` | Immediately crawl uptime data |
| `--fedi` | Immediately crawl Fediseer data |

@@ -73,7 +73,7 @@ These start a worker that will run continuously, processing jobs from the releva
| `-w instance` | Crawl instances from the queue |
| `-w community` | Crawl communities from the queue |
| `-w single` | Crawl single communities from the queue |
| `-w kbin` | Crawl kbin communities from the queue |
| `-w mbin` | Crawl mbin communities from the queue |
| `-w cron` | Schedule all CRON jobs for aged instances and communities, etc |

#### **Examples**
@@ -94,7 +94,7 @@ These start a worker that will run a single job, then exit.
| `-m [i\|instance] <base_url>` | Crawl a single instance |
| `-m [c\|community] <base_url>` | Crawl a single instance's community list |
| `-m [s\|single] <base_url> <community_name>` | Crawl a single community, delete if not exists |
| `-m [k\|kbin] <base_url>` | Crawl a single community |
| `-m [m\|mbin] <base_url>` | Crawl a single mbin instance |

#### **Examples**

@@ -126,7 +126,7 @@ Crawlers are tasks created to perform an action, which could be crawling an inst
| `community` | Community Crawling |
| `fediseer` | Fediseer Crawling |
| `uptime` | Uptime Crawling |
| `kbin` | Kbin Crawling |
| `mbin` | MBin Crawling |

### Queues

@@ -137,7 +137,7 @@ Queues are where Tasks can be placed to be processed.
| `instance` | Crawl an instance |
| `community_list` | Crawl a community |
| `community_single` | Crawl a single community |
| `kbin` | Crawl a kbin community |
| `mbin` | Crawl a mbin instance |

## Storage

@@ -146,16 +146,17 @@ Redis is used to store crawled data.
You can use `docker compose up -d` to start a local redis server.
Data is persisted to a `.data/redis` directory.

| Redis Key | Description |
| -------------- | ----------------------------------------------------- |
| `attributes:*` | Tracked attribute sets _(change over time)_ |
| `community:*` | Community details |
| `deleted:*` | Deleted data _(recycle bin if something broken)_ |
| `error:*` | Exception details |
| `fediverse:*` | Fediverse data |
| `instance:*` | Instance details |
| `last_crawl:*` | Last crawl time for instances and communities |
| `magazine:*` | Magazine data _(kbin magazines)_ |
| `uptime:*` | Uptime data _(fetched from `api.fediverse.observer`)_ |
| Redis Key | Description |
| ----------------- | ----------------------------------------------------- |
| `attributes:*` | Tracked attribute sets _(change over time)_ |
| `community:*` | Community details |
| `deleted:*` | Deleted data _(recycle bin if something broken)_ |
| `error:*` | Exception details |
| `fediverse:*` | Fediverse data |
| `instance:*` | Instance details |
| `last_crawl:*` | Last crawl time for instances and communities |
| `mbin_instance:*` | MBin Instances |
| `magazine:*` | Magazine data _(mbin magazines)_ |
| `uptime:*` | Uptime data _(fetched from `api.fediverse.observer`)_ |

Most of the keys have sub keys for the instance `base_url` or community `base_url:community_name`.
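
For illustration, concrete keys could look like the following; the domains and names are placeholders, and the exact sub-key shapes are assumptions based on the table above:

```
instance:lemmy.example.com
community:lemmy.example.com:some_community
mbin_instance:mbin.example.com
magazine:mbin.example.com:some_magazine
```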
8 changes: 4 additions & 4 deletions crawler/ecosystem.config.cjs
Original file line number Diff line number Diff line change
@@ -37,10 +37,10 @@ module.exports = {
},
{
...defaultOptions,
output: "./.data/logs/kbin.log",
name: "crawl-kbin",
args: ["-w", "kbin"],
instances: 4,
output: "./.data/logs/mbin.log",
name: "crawl-mbin",
args: ["-w", "mbin"],
instances: 2,
},
],
};
14 changes: 7 additions & 7 deletions crawler/src/bin/manual.ts
Original file line number Diff line number Diff line change
@@ -3,7 +3,7 @@ import logging from "../lib/logging";
import InstanceQueue from "../queue/instance";
import CommunityQueue from "../queue/community_list";
import SingleCommunityQueue from "../queue/community_single";
import KBinQueue from "../queue/kbin";
import MBinQueue from "../queue/mbin";

export default async function runManualWorker(workerName: string, firstParam: string, secondParam: string) {
// scan one instance
@@ -36,12 +36,12 @@ export default async function runManualWorker(workerName: string, firstParam: st
});
}

// scan one kbin
else if (workerName == "k" || workerName == "kbin") {
logging.info(`Running Singel Q Scan KBIN Crawl for ${firstParam}`);
const crawlKBinManual = new KBinQueue(true, "kbin_manual");
await crawlKBinManual.createJob(firstParam, (resultData) => {
logging.info("KBIN Crawl Complete");
// scan one mbin
else if (workerName == "m" || workerName == "mbin") {
logging.info(`Running Single MBin Crawl for ${firstParam}`);
const crawlMBinManual = new MBinQueue(true, "mbin_manual");
await crawlMBinManual.createJob(firstParam, (resultData) => {
logging.info("MBIN Crawl Complete");
process.exit(0);
});
}
20 changes: 10 additions & 10 deletions crawler/src/bin/task.ts
Original file line number Diff line number Diff line change
@@ -5,14 +5,14 @@ import storage from "../lib/crawlStorage";
import InstanceQueue from "../queue/instance";
import CommunityQueue from "../queue/community_list";
import SingleCommunityQueue from "../queue/community_single";
import KBinQueue from "../queue/kbin";
import MBinQueue from "../queue/mbin";

import CrawlOutput from "../output/output";
import { syncCheckpoint } from "../output/sync_s3";

import CrawlUptime from "../crawl/uptime";
import CrawlFediseer from "../crawl/fediseer";
import CrawlKBin from "../crawl/kbin";
import CrawlMBin from "../crawl/mbin";

import CrawlAged from "../util/aged";
import Failures from "../util/failures";
@@ -99,11 +99,11 @@ export default async function runTask(taskName: string) {
...commSingleCounts,
});

const kbinQHealthCrawl = new KBinQueue(false);
const kbinQHeCounts = await kbinQHealthCrawl.queue.checkHealth();
const mbinQHealthCrawl = new MBinQueue(false);
const mbinQHeCounts = await mbinQHealthCrawl.queue.checkHealth();
healthData.push({
queue: "KBinQueue",
...kbinQHeCounts,
queue: "MBinQueue",
...mbinQHeCounts,
});

console.info("Queue Health Metrics");
@@ -122,10 +122,10 @@ export default async function runTask(taskName: string) {

break;

// create jobs for all known kbin instances
case "kbin":
const kbinScan = new CrawlKBin();
await kbinScan.createJobsAllKBin();
// create jobs for all known mbin instances
case "mbin":
const mbinScan = new CrawlMBin();
await mbinScan.createJobsAllMBin();

break;

23 changes: 12 additions & 11 deletions crawler/src/bin/worker.ts
Original file line number Diff line number Diff line change
@@ -7,12 +7,13 @@ import storage from "../lib/crawlStorage";
import InstanceQueue from "../queue/instance";
import CommunityQueue from "../queue/community_list";
import SingleCommunityQueue from "../queue/community_single";
import KBinQueue from "../queue/kbin";
import MBinQueue from "../queue/mbin";

// used to create scheduled instance checks
import CrawlAged from "../util/aged";
import CrawlFediseer from "../crawl/fediseer";
import CrawlUptime from "../crawl/uptime";
import CrawlKBin from "../crawl/kbin";
import CrawlMBin from "../crawl/mbin";

import { syncCheckpoint } from "../output/sync_s3";

@@ -35,9 +36,9 @@ export default async function startWorker(startWorkerName: string) {
} else if (startWorkerName == "single") {
logging.info("Starting SingleCommunityQueue Processor");
new SingleCommunityQueue(true);
} else if (startWorkerName == "kbin") {
logging.info("Starting KBinQueue Processor");
new KBinQueue(true);
} else if (startWorkerName == "mbin") {
logging.info("Starting MBinQueue Processor");
new MBinQueue(true);
}

// cron worker
@@ -70,14 +71,14 @@ export default async function startWorker(startWorkerName: string) {
});
}

// shares CRON_SCHEDULES.KBIN
logging.info("Creating KBin Cron Task", CRON_SCHEDULES.KBIN);
cron.schedule(CRON_SCHEDULES.KBIN, async (time) => {
console.log("Running KBin Cron Task", time);
// shares CRON_SCHEDULES.MBIN
logging.info("Creating MBin Cron Task", CRON_SCHEDULES.MBIN);
cron.schedule(CRON_SCHEDULES.MBIN, async (time) => {
console.log("Running MBin Cron Task", time);
await storage.connect();

const kbinScan = new CrawlKBin();
await kbinScan.createJobsAllKBin();
const mbinScan = new CrawlMBin();
await mbinScan.createJobsAllMBin();

await storage.close();
});
3 changes: 2 additions & 1 deletion crawler/src/crawl/fediseer.ts
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
import logging from "../lib/logging";
import storage from "../lib/crawlStorage";
import { IFediseerInstanceData, IFediseerTag } from "../lib/storage/fediseer";

import { IFediseerInstanceData, IFediseerTag } from "../../../types/storage";

import CrawlClient from "../lib/CrawlClient";

18 changes: 9 additions & 9 deletions crawler/src/crawl/instance.ts
Original file line number Diff line number Diff line change
@@ -5,7 +5,7 @@ import {
IErrorDataKeyValue,
ILastCrawlData,
ILastCrawlDataKeyValue,
} from "../lib/storage/tracking";
} from "../../../types/storage";

import { CRAWL_AGED_TIME } from "../lib/const";
import { HTTPError, CrawlError, CrawlTooRecentError } from "../lib/error";
@@ -19,21 +19,21 @@ import InstanceQueue from "../queue/instance";

import CrawlClient from "../lib/CrawlClient";

import KBinQueue from "../queue/kbin";
import MBinQueue from "../queue/mbin";

export default class InstanceCrawler {
private crawlDomain: string;
private logPrefix: string;

private kbinQueue: KBinQueue;
private mbinQueue: MBinQueue;

private client: CrawlClient;

constructor(crawlDomain: string) {
this.crawlDomain = crawlDomain;
this.logPrefix = `[Instance] [${this.crawlDomain}]`;

this.kbinQueue = new KBinQueue(false);
this.mbinQueue = new MBinQueue(false);

this.client = new CrawlClient();
}
@@ -109,10 +109,10 @@ export default class InstanceCrawler {
// store all fediverse instance software for easy metrics
await storage.fediverse.upsert(this.crawlDomain, nodeInfo.software);

// scan kbin instances that are found
if (nodeInfo.software.name == "kbin") {
console.log(`${this.crawlDomain}: found kbin instance - creating job`);
await this.kbinQueue.createJob(this.crawlDomain);
// scan mbin instances that are found
if (nodeInfo.software.name == "mbin") {
console.log(`${this.crawlDomain}: found mbin instance - creating job`);
await this.mbinQueue.createJob(this.crawlDomain);
}

// only allow lemmy instances
@@ -325,7 +325,7 @@ export const instanceProcessor: IJobProcessor = async ({ baseUrl }) => {
if (
knownFediverseServer.name !== "lemmy" &&
knownFediverseServer.name !== "lemmybb" &&
knownFediverseServer.name !== "kbin" &&
// knownFediverseServer.name !== "mbin" &&
knownFediverseServer.time &&
Date.now() - knownFediverseServer.time < CRAWL_AGED_TIME.FEDIVERSE // re-scan fedi servers to check their status
) {